# Meaning Kernel Extraction Pipeline

## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams

## Overview

This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.

## Features

- **PDF Processing**: Converts PDF pages to images for analysis
- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
- **Structure Analysis**: Analyzes diagram type, dimensions, orientation
- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
- **Confidence Scoring**: Each kernel includes confidence metrics
- **Batch Processing**: Supports single files and directories
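
The OCR and confidence-scoring features could fit together roughly as follows. This is a minimal sketch, not the pipeline's actual code: `build_text_kernel` and the kernel field names are hypothetical, and the input is assumed to be shaped like the dictionary returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` (parallel `text` and `conf` lists, with a `conf` of `-1` for boxes that hold no word).

```python
from statistics import mean


def build_text_kernel(ocr_data, confidence_threshold=50, min_text_length=10):
    """Build a text kernel from word-level OCR data, or return None.

    `ocr_data` is assumed to look like the dict that
    pytesseract.image_to_data(..., output_type=Output.DICT) returns.
    The kernel field names are hypothetical.
    """
    words, confidences = [], []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        # Tesseract reports -1 for boxes that contain no recognized word.
        if text.strip() and conf >= confidence_threshold:
            words.append(text.strip())
            confidences.append(conf)

    content = " ".join(words)
    if len(content) < min_text_length:
        return None  # too little reliable text to be worth keeping

    return {
        "type": "text",
        "content": content,
        "confidence": round(mean(confidences) / 100, 2),  # scaled to 0..1
    }


sample = {
    "text": ["", "Encoder", "feeds", "the", "decoder", "x#"],
    "conf": ["-1", "96", "91", "88", "93", "12"],
}
kernel = build_text_kernel(sample)
print(kernel["content"], kernel["confidence"])
```

The default thresholds mirror the `ocr_confidence_threshold` and `min_text_length` values from the example configuration; averaging word confidences is one plausible way to score a kernel, not necessarily the one the script uses.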

## Installation

```bash
# Required dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
```

## Usage

```bash
# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png

# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output

# Run tests
python3 scripts/meaning-kernels/test_extraction.py
```
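
A command-line interface with this shape can be sketched with `argparse`. The sketch below is an illustration only: the `--output` long option, the default output directory, the accepted file suffixes, and the helper names `build_parser` and `collect_inputs` are assumptions, since the README only shows a positional input and `-o`.

```python
import argparse
from pathlib import Path


def build_parser():
    """Parser with one positional input and an optional output directory."""
    parser = argparse.ArgumentParser(
        description="Extract meaning kernels from a PDF, an image, or a directory."
    )
    parser.add_argument("input", type=Path, help="PDF file, image file, or directory")
    parser.add_argument(
        "-o", "--output", type=Path, default=Path("output"),
        help="directory for page images and reports (default is an assumption)",
    )
    return parser


def collect_inputs(path, suffixes=(".pdf", ".png", ".jpg", ".jpeg")):
    """Return the path itself, or the matching files inside a directory."""
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.suffix.lower() in suffixes)
    return [path]


args = build_parser().parse_args(["paper.pdf", "-o", "./output"])
print(args.input, args.output)
```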

## Output Structure

```
output_directory/
├── page_001.png              # Converted page images
├── page_002.png
├── meaning_kernels.json      # Structured kernel data
├── meaning_kernels.md        # Human-readable report
└── extraction_stats.json     # Processing statistics
```
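
Writing the three report files might look like the following. Only the file names come from the layout above; the function name `write_outputs`, the kernel fields, the statistics keys, and the Markdown report layout are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_outputs(output_dir, kernels, stats):
    """Write the JSON data, the Markdown report, and the statistics file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    (output_dir / "meaning_kernels.json").write_text(
        json.dumps(kernels, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    (output_dir / "extraction_stats.json").write_text(
        json.dumps(stats, indent=2), encoding="utf-8"
    )

    # Hypothetical report layout: one section per kernel.
    lines = ["# Meaning Kernels", ""]
    for kernel in kernels:
        lines += [f"## {kernel['type']} ({kernel['source']})", "", kernel["content"], ""]
    (output_dir / "meaning_kernels.md").write_text("\n".join(lines), encoding="utf-8")


kernels = [{"type": "text", "source": "page_001.png", "content": "Encoder feeds the decoder"}]
with tempfile.TemporaryDirectory() as tmp:
    write_outputs(tmp, kernels, {"pages": 1, "kernels": len(kernels)})
    written = sorted(p.name for p in Path(tmp).iterdir())
print(written)
```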

## Kernel Types

### 1. Text Kernels
Extracted from OCR processing of diagrams.

### 2. Structure Kernels
Diagram structure analysis (type, dimensions, aspect ratio).
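
As an illustration of the geometric part of a structure kernel, the sketch below derives aspect ratio and orientation from pixel dimensions. The function name and field names are hypothetical, and diagram-type detection is deliberately left out because the README does not say how it is done.

```python
def build_structure_kernel(width, height):
    """Describe an image's basic geometry (hypothetical field names)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    aspect_ratio = width / height
    if aspect_ratio > 1:
        orientation = "landscape"
    elif aspect_ratio < 1:
        orientation = "portrait"
    else:
        orientation = "square"

    return {
        "type": "structure",
        "width": width,
        "height": height,
        "aspect_ratio": round(aspect_ratio, 2),
        "orientation": orientation,
    }


kernel = build_structure_kernel(1600, 900)
print(kernel["aspect_ratio"], kernel["orientation"])
```

With Pillow, the dimensions can be read from `Image.open(path).size`, which is a `(width, height)` tuple.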

### 3. Summary Kernels
Combined analysis summary including extracted text and structure.

### 4. Philosophical Kernels
Detected philosophical themes using keyword analysis.
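
Keyword-based theme detection can be as simple as counting whole-word matches. The following is a sketch under assumptions: `detect_themes` is a hypothetical name, the default keywords are taken from the example configuration, and the real script may match or score keywords differently.

```python
import re


def detect_themes(text, keywords=("truth", "knowledge", "wisdom", "meaning")):
    """Count whole-word, case-insensitive keyword hits in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {keyword: tokens.count(keyword) for keyword in keywords}
    # Keep only the keywords that actually occur.
    return {keyword: n for keyword, n in counts.items() if n > 0}


themes = detect_themes("Knowledge without wisdom: the meaning of meaning.")
print(themes)
```

Because matching is on whole words, "meaningful" does not count as a hit for "meaning", which also illustrates why this approach is lexical rather than semantic.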

## Configuration

Create a JSON config file:

```json
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
```
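
One way to consume such a file is to overlay it on a set of defaults. This is a sketch, not the script's loader: `load_config` is a hypothetical helper, treating the example values as defaults is an assumption, and the README does not state how the config path is passed to the script.

```python
import json
import tempfile
from pathlib import Path

# Assumed defaults, copied from the example configuration.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def load_config(path=None):
    """Return the defaults, overridden by the keys found in a JSON file."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        overrides = json.loads(Path(path).read_text(encoding="utf-8"))
        unknown = set(overrides) - set(DEFAULT_CONFIG)
        if unknown:
            # Fail loudly on typos instead of silently ignoring them.
            raise KeyError(f"unknown config keys: {sorted(unknown)}")
        config.update(overrides)
    return config


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text('{"ocr_confidence_threshold": 70}', encoding="utf-8")
    config = load_config(config_path)
print(config["ocr_confidence_threshold"], config["min_text_length"])
```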

## Limitations

- OCR quality depends on diagram clarity
- Structure analysis provides basic metadata only
- Philosophical extraction is keyword-based (not semantic)
- Large PDFs can be resource-intensive
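
For large PDFs, one possible mitigation is to convert pages in small batches rather than all at once. The sketch below is an assumption about how that could be done, not a description of the script: `page_batches` and `convert_in_batches` are hypothetical helpers, and the batch size and DPI are arbitrary. It relies on the `first_page` and `last_page` parameters of `pdf2image.convert_from_path` and on `pdf2image.pdfinfo_from_path` for the page count.

```python
def page_batches(total_pages, batch_size=10):
    """Yield inclusive, 1-based (first_page, last_page) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path, batch_size=10, dpi=200):
    """Yield (page_number, PIL image) pairs without rendering the whole PDF at once."""
    # Imported lazily so page_batches works even when pdf2image is missing.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for offset, page in enumerate(pages):
            yield first + offset, page


print(list(page_batches(25, batch_size=10)))
```

Only one batch of rendered pages is held in memory at a time, at the cost of invoking Poppler once per batch.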

## Dependencies Notice

If you see warnings like:

```
Warning: pytesseract not available. Install with: pip install pytesseract
Warning: pdf2image not available. Install with: pip install pdf2image
```

Install the Python packages from `requirements.txt` and the system dependencies (Tesseract and Poppler) as shown in the Installation section.
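
To check an environment before running the pipeline, a small script along these lines can report what is missing. It is a sketch with assumed names (`missing_dependencies`, the two lookup tables); the wording of the system-tool warning is invented for illustration, and `pdftoppm` is used as the marker for Poppler because it is one of the executables Poppler ships.

```python
import importlib.util
import shutil

# Import name -> pip package name. Pillow is imported as "PIL".
PYTHON_PACKAGES = {"PIL": "Pillow", "pytesseract": "pytesseract", "pdf2image": "pdf2image"}
# Executable -> what it is needed for.
SYSTEM_TOOLS = {"tesseract": "OCR", "pdftoppm": "PDF to image conversion"}


def missing_dependencies():
    """Return one warning string per missing package or executable."""
    warnings = []
    for module, package in PYTHON_PACKAGES.items():
        if importlib.util.find_spec(module) is None:
            warnings.append(
                f"Warning: {package} not available. Install with: pip install {package}"
            )
    for tool, purpose in SYSTEM_TOOLS.items():
        if shutil.which(tool) is None:
            # Hypothetical message format for system tools.
            warnings.append(f"Warning: {tool} not found on PATH (needed for {purpose})")
    return warnings


for warning in missing_dependencies():
    print(warning)
```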