fix: add OCR dependencies for meaning kernel extraction

Add requirements.txt and README.md under scripts/meaning-kernels/ to document and enable installation of pytesseract, pdf2image, and Pillow. This addresses the missing-dependency warnings emitted when running the enhanced meaning kernel extraction pipeline (#493). Closes #563.
2026-04-29 07:17:13 -04:00
2 changed files with 109 additions and 0 deletions
--- a/scripts/meaning-kernels/README.md
+++ b/scripts/meaning-kernels/README.md
@@ -0,0 +1,105 @@
+# Meaning Kernel Extraction Pipeline
+
+## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams
+
+## Overview
+
+This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.
+
+## Features
+
+- **PDF Processing**: Converts PDF pages to images for analysis
+- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
+- **Structure Analysis**: Analyzes diagram type, dimensions, orientation
+- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
+- **Confidence Scoring**: Each kernel includes confidence metrics
+- **Batch Processing**: Supports single files and directories
+
+## Installation
+
+```bash
+# Required Python dependencies (Pillow, pytesseract, pdf2image)
+pip install -r scripts/meaning-kernels/requirements.txt
+
+# System dependencies (macOS)
+brew install tesseract poppler
+
+# System dependencies (Ubuntu/Debian)
+sudo apt-get install tesseract-ocr poppler-utils
+```
+
+## Usage
+
+```bash
+# Process a single PDF
+python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf
+
+# Process a single image
+python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png
+
+# Process a directory
+python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/
+
+# Specify output directory
+python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output
+
+# Run tests
+python3 scripts/meaning-kernels/test_extraction.py
+```
+
+## Output Structure
+
+```
+output_directory/
+├── page_001.png              # Converted page images
+├── page_002.png
+├── meaning_kernels.json      # Structured kernel data
+├── meaning_kernels.md        # Human-readable report
+└── extraction_stats.json     # Processing statistics
+```
+
+## Kernel Types
+
+### 1. Text Kernels
+Extracted from OCR processing of diagrams.
+
+### 2. Structure Kernels
+Diagram structure analysis (type, dimensions, aspect ratio).
+
+### 3. Summary Kernels
+Combined analysis summary including extracted text and structure.
+
+### 4. Philosophical Kernels
+Philosophical themes detected by keyword matching.
+
+## Configuration
+
+Create a JSON config file:
+
+```json
+{
+  "ocr_confidence_threshold": 50,
+  "min_text_length": 10,
+  "diagram_types": ["flowchart", "hierarchy", "network"],
+  "extract_philosophical": true,
+  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
+}
+```
+
+## Limitations
+
+- OCR quality depends on diagram clarity
+- Structure analysis provides basic metadata only
+- Philosophical extraction is keyword-based (not semantic)
+- Large PDFs can be resource-intensive, since every page is converted to an image
+
+## Dependencies Notice
+
+If you see warnings like:
+
+```
+Warning: pytesseract not available. Install with: pip install pytesseract
+Warning: pdf2image not available. Install with: pip install pdf2image
+```
+
+Install the Python packages from `requirements.txt` and the system packages (Tesseract for OCR, Poppler for PDF conversion) as described in the Installation section.
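The warnings quoted in the Dependencies Notice can be reproduced ahead of a run with a small pre-flight check. The following is a minimal sketch, not part of this change: the function name `missing_dependencies` and the two lookup tables are hypothetical, and it assumes that the OCR engine is exposed as a `tesseract` executable and that Poppler provides `pdftoppm`, which is what pdf2image calls by default.

```python
import importlib.util
import shutil

# Importable module name -> pip package name (hypothetical table for this sketch).
PYTHON_MODULES = {"PIL": "Pillow", "pytesseract": "pytesseract", "pdf2image": "pdf2image"}
# Executable name -> system package that usually provides it (assumption).
SYSTEM_BINARIES = {"tesseract": "Tesseract", "pdftoppm": "Poppler"}


def missing_dependencies(modules=PYTHON_MODULES, binaries=SYSTEM_BINARIES):
    """Return one message per Python module or executable that cannot be found."""
    problems = []
    for module, package in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            problems.append(f"{module} not available. Install with: pip install {package}")
    for binary, system_package in binaries.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(binary) is None:
            problems.append(f"{binary} not found on PATH. Install the {system_package} system package")
    return problems


if __name__ == "__main__":
    for message in missing_dependencies():
        print(f"Warning: {message}")
```

An empty result means both the Python packages and the system tools were located. It does not prove that OCR will succeed on a given diagram.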
--- a/scripts/meaning-kernels/requirements.txt
+++ b/scripts/meaning-kernels/requirements.txt
@@ -0,0 +1,4 @@
+# OCR and PDF processing dependencies for meaning kernel extraction
+Pillow>=10.0.0
+pytesseract>=0.3.10
+pdf2image>=1.16.3
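Whether an existing environment already satisfies the minimum versions above can be checked with the standard library alone. This is a rough sketch under stated assumptions: `parse_minimums`, `version_tuple`, and `check` are hypothetical helpers, only `name>=x.y.z` lines are understood, and versions are compared as plain integer tuples, which ignores pre-release tags and treats `10` as older than `10.0.0`. The third-party `packaging` library handles those cases properly.

```python
import re
from importlib import metadata

# Matches lines such as "Pillow>=10.0.0"; comments and blank lines do not match.
REQUIREMENT = re.compile(r"^\s*([A-Za-z0-9_.-]+)\s*>=\s*([0-9]+(?:\.[0-9]+)*)\s*$")


def version_tuple(text):
    """Turn the leading numeric part of a version string into a tuple of ints."""
    match = re.match(r"[0-9]+(?:\.[0-9]+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def parse_minimums(lines):
    """Map package name -> minimum version tuple for every 'name>=version' line."""
    minimums = {}
    for line in lines:
        match = REQUIREMENT.match(line)
        if match:
            minimums[match.group(1)] = version_tuple(match.group(2))
    return minimums


def check(minimums, installed_version=metadata.version):
    """Return package name -> 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version_tuple(installed_version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= minimum else "too old"
    return report


if __name__ == "__main__":
    with open("scripts/meaning-kernels/requirements.txt") as handle:
        for package, status in check(parse_minimums(handle)).items():
            print(f"{package}: {status}")
```

The `installed_version` parameter exists only so the lookup can be swapped out, for example in a test. By default it queries the active environment through `importlib.metadata.version`.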