diff --git a/scripts/meaning-kernels/README.md b/scripts/meaning-kernels/README.md
new file mode 100644
index 00000000..6aeb05bb
--- /dev/null
+++ b/scripts/meaning-kernels/README.md
@@ -0,0 +1,105 @@
+# Meaning Kernel Extraction Pipeline
+
+## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams
+
+## Overview
+
+This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.
+
+## Features
+
+- **PDF Processing**: Converts PDF pages to images for analysis
+- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
+- **Structure Analysis**: Analyzes diagram type, dimensions, orientation
+- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
+- **Confidence Scoring**: Each kernel includes confidence metrics
+- **Batch Processing**: Supports single files and directories
+
+## Installation
+
+```bash
+# Required dependencies
+pip install Pillow pytesseract pdf2image
+
+# System dependencies (macOS)
+brew install tesseract poppler
+
+# System dependencies (Ubuntu/Debian)
+sudo apt-get install tesseract-ocr poppler-utils
+```
+
+## Usage
+
+```bash
+# Process a single PDF
+python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf
+
+# Process a single image
+python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png
+
+# Process a directory
+python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/
+
+# Specify output directory
+python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output
+
+# Run tests
+python3 scripts/meaning-kernels/test_extraction.py
+```
+
+## Output Structure
+
+```
+output_directory/
+├── page_001.png              # Converted page images
+├── page_002.png
+├── meaning_kernels.json      # Structured kernel data
+├── meaning_kernels.md        # Human-readable report
+└── extraction_stats.json     # Processing statistics
+```
+
+## Kernel Types
+
+### 1. Text Kernels
+Extracted from OCR processing of diagrams.
+
+### 2. Structure Kernels
+Diagram structure analysis (type, dimensions, aspect ratio).
+
+### 3. Summary Kernels
+Combined analysis summary including extracted text and structure.
+
+### 4. Philosophical Kernels
+Detected philosophical themes using keyword analysis.
+
+## Configuration
+
+Create a JSON config file:
+
+```json
+{
+  "ocr_confidence_threshold": 50,
+  "min_text_length": 10,
+  "diagram_types": ["flowchart", "hierarchy", "network"],
+  "extract_philosophical": true,
+  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
+}
+```
+
+## Limitations
+
+- OCR quality depends on diagram clarity
+- Structure analysis provides basic metadata only
+- Philosophical extraction is keyword-based (not semantic)
+- Large PDFs can be resource-intensive
+
+## Dependencies Notice
+
+If you see warnings like:
+
+```
+Warning: pytesseract not available. Install with: pip install pytesseract
+Warning: pdf2image not available. Install with: pip install pdf2image
+```
+
+Install the Python packages from `requirements.txt` and the system OCR engine as shown above.
diff --git a/scripts/meaning-kernels/requirements.txt b/scripts/meaning-kernels/requirements.txt
new file mode 100644
index 00000000..0b892c27
--- /dev/null
+++ b/scripts/meaning-kernels/requirements.txt
@@ -0,0 +1,4 @@
+# OCR and PDF processing dependencies for meaning kernel extraction
+Pillow>=10.0.0
+pytesseract>=0.3.10
+pdf2image>=1.16.3