From 25eca504a988463853ed69d2b7014a32ee95ee4c Mon Sep 17 00:00:00 2001
From: Timmy Burn Worker
Date: Wed, 29 Apr 2026 07:15:48 -0400
Subject: [PATCH] fix: add OCR dependencies for meaning kernel extraction

Add requirements.txt and README.md under scripts/meaning-kernels/ to
document and enable installation of pytesseract, pdf2image, and Pillow.
Addresses missing dependency warnings when running the enhanced meaning
kernel extraction pipeline (#493).

Closes #563
---
 scripts/meaning-kernels/README.md        | 105 +++++++++++++++++++++++
 scripts/meaning-kernels/requirements.txt |   4 +
 2 files changed, 109 insertions(+)
 create mode 100644 scripts/meaning-kernels/README.md
 create mode 100644 scripts/meaning-kernels/requirements.txt

diff --git a/scripts/meaning-kernels/README.md b/scripts/meaning-kernels/README.md
new file mode 100644
index 00000000..6aeb05bb
--- /dev/null
+++ b/scripts/meaning-kernels/README.md
@@ -0,0 +1,105 @@
+# Meaning Kernel Extraction Pipeline
+
+## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams
+
+## Overview
+
+This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.
+
+## Features
+
+- **PDF Processing**: Converts PDF pages to images for analysis
+- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
+- **Structure Analysis**: Analyzes diagram type, dimensions, orientation
+- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
+- **Confidence Scoring**: Each kernel includes confidence metrics
+- **Batch Processing**: Supports single files and directories
+
+## Installation
+
+```bash
+# Required dependencies
+pip install Pillow pytesseract pdf2image
+
+# System dependencies (macOS)
+brew install tesseract poppler
+
+# System dependencies (Ubuntu/Debian)
+sudo apt-get install tesseract-ocr poppler-utils
+```
+
+## Usage
+
+```bash
+# Process a single PDF
+python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf
+
+# Process a single image
+python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png
+
+# Process a directory
+python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/
+
+# Specify output directory
+python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output
+
+# Run tests
+python3 scripts/meaning-kernels/test_extraction.py
+```
+
+## Output Structure
+
+```
+output_directory/
+├── page_001.png            # Converted page images
+├── page_002.png
+├── meaning_kernels.json    # Structured kernel data
+├── meaning_kernels.md      # Human-readable report
+└── extraction_stats.json   # Processing statistics
+```
+
+## Kernel Types
+
+### 1. Text Kernels
+Extracted from OCR processing of diagrams.
+
+### 2. Structure Kernels
+Diagram structure analysis (type, dimensions, aspect ratio).
+
+### 3. Summary Kernels
+Combined analysis summary including extracted text and structure.
+
+### 4. Philosophical Kernels
+Detected philosophical themes using keyword analysis.
+
+## Configuration
+
+Create a JSON config file:
+
+```json
+{
+  "ocr_confidence_threshold": 50,
+  "min_text_length": 10,
+  "diagram_types": ["flowchart", "hierarchy", "network"],
+  "extract_philosophical": true,
+  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
+}
+```
+
+## Limitations
+
+- OCR quality depends on diagram clarity
+- Structure analysis provides basic metadata only
+- Philosophical extraction is keyword-based (not semantic)
+- Large PDFs can be resource-intensive
+
+## Dependencies Notice
+
+If you see warnings like:
+
+```
+Warning: pytesseract not available. Install with: pip install pytesseract
+Warning: pdf2image not available. Install with: pip install pdf2image
+```
+
+Install the Python packages from `requirements.txt` and the system OCR engine as shown above.
diff --git a/scripts/meaning-kernels/requirements.txt b/scripts/meaning-kernels/requirements.txt
new file mode 100644
index 00000000..0b892c27
--- /dev/null
+++ b/scripts/meaning-kernels/requirements.txt
@@ -0,0 +1,4 @@
+# OCR and PDF processing dependencies for meaning kernel extraction
+Pillow>=10.0.0
+pytesseract>=0.3.10
+pdf2image>=1.16.3
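
The minimum versions pinned in requirements.txt can also be checked at runtime before the pipeline starts. The following is a minimal sketch and not part of this patch: the helper names (`parse_minimums`, `version_tuple`, `check_installed`) are hypothetical, and it assumes every requirement line has the plain `name>=x.y.z` form with numeric version parts, as in the file above.

```python
from importlib import metadata

# Copy of the requirements.txt added by this patch.
REQUIREMENTS = """\
# OCR and PDF processing dependencies for meaning kernel extraction
Pillow>=10.0.0
pytesseract>=0.3.10
pdf2image>=1.16.3
"""


def parse_minimums(text):
    """Return {package: (major, minor, patch)} for 'name>=x.y.z' lines."""
    minimums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition(">=")
        minimums[name.strip()] = version_tuple(version.strip())
    return minimums


def version_tuple(version):
    """Numeric prefix of a version string, zero-padded to three parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'post1' or 'rc1'
        parts.append(int(piece))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_installed(minimums, get_version=metadata.version):
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if version_tuple(installed) < minimum:
            required = ".".join(str(part) for part in minimum)
            problems[name] = f"{installed} is older than required {required}"
    return problems


if __name__ == "__main__":
    for name, problem in check_installed(parse_minimums(REQUIREMENTS)).items():
        print(f"{name}: {problem}")
```

This only covers the Python packages. The Tesseract and Poppler binaries are system dependencies and would need a separate check.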