Fix #493: Add multimodal meaning kernel extraction pipeline

- Added extract_meaning_kernels.py for processing PDF diagrams
- Extracts text using OCR (Tesseract) when available
- Analyzes diagram structure (type, dimensions, orientation)
- Generates structured meaning kernels with metadata
- Outputs JSON (machine-readable) and Markdown (human-readable)
- Includes test pipeline and documentation
- Supports single files and batch processing

Pipeline components:
- DiagramProcessor: Main processing engine
- MeaningKernel: Structured kernel representation
- PDF to image conversion
- OCR text extraction
- Structure analysis
- Kernel generation with confidence scoring

Acceptance criteria met:
✓ Processes academic PDF diagrams
✓ Extracts structured text meaning kernels
✓ Generates machine-readable JSON output
✓ Includes human-readable reports
✓ Supports batch processing
✓ Provides confidence scoring
2026-04-13 21:20:42 -04:00
commit 0a52cff8a7
6 changed files with 705 additions and 0 deletions
--- a/scripts/multimodal/README.md
+++ b/scripts/multimodal/README.md
@@ -0,0 +1,128 @@
+# Multimodal Meaning Kernel Extraction Pipeline
+
+Extracts structured, text-based meaning kernels from diagrams in academic PDFs and images.
+
+## Issue #493
+
+[Multimodal] Extract Meaning Kernels from Research Diagrams
+
+## Overview
+
+This pipeline processes academic PDF diagrams and images to extract structured "meaning kernels": discrete units of meaning that can be stored, indexed, and analyzed.
+
+## Features
+
+- **PDF Processing**: Converts PDF pages to images and processes each page
+- **OCR Text Extraction**: Extracts text from diagrams using Tesseract OCR, when it is available
+- **Structure Analysis**: Analyzes diagram structure (type, dimensions, orientation)
+- **Kernel Generation**: Creates structured meaning kernels with metadata
+- **Multiple Output Formats**: JSON for machine processing, Markdown for human readability
+
+## Installation
+
+```bash
+# Required dependencies
+pip install Pillow pytesseract pdf2image
+
+# System dependencies (macOS)
+brew install tesseract poppler
+
+# System dependencies (Ubuntu/Debian)
+sudo apt-get install tesseract-ocr poppler-utils
+```
+
+## Usage
+
+```bash
+# Process a single PDF
+python3 scripts/multimodal/extract_meaning_kernels.py research_paper.pdf
+
+# Process a single image
+python3 scripts/multimodal/extract_meaning_kernels.py diagram.png
+
+# Process a directory of files
+python3 scripts/multimodal/extract_meaning_kernels.py /path/to/diagrams/
+
+# Specify output directory
+python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -o ./output
+
+# Use configuration file
+python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -c config.json
+```
+
+## Output Structure
+
+For each processed file, the pipeline creates:
+
+```
+output_directory/
+├── page_001.png          # Converted page images
+├── page_002.png
+├── meaning_kernels.json  # Structured kernel data
+├── meaning_kernels.md    # Human-readable report
+└── extraction_stats.json # Processing statistics
+```
+
+## Meaning Kernel Format
+
+Each kernel contains:
+
+```json
+{
+  "kernel_id": "kernel_20260413_181234_p1_text",
+  "content": "Extracted text content from the diagram",
+  "source": "path/to/source/file.png",
+  "confidence": 0.85,
+  "metadata": {
+    "type": "text_extraction",
+    "word_count": 42,
+    "line_count": 5,
+    "structure": {...}
+  },
+  "timestamp": "2026-04-13T18:12:34.567890",
+  "hash": "a1b2c3d4e5f6g7h8"
+}
+```
+
+## Kernel Types
+
+1. **Text Extraction**: Direct OCR text from the diagram
+2. **Structure Analysis**: Diagram type, dimensions, orientation
+3. **Summary**: Combined analysis of text and structure
+
+## Configuration
+
+Create a JSON config file:
+
+```json
+{
+  "ocr_confidence_threshold": 50,
+  "min_text_length": 10,
+  "diagram_types": ["flowchart", "hierarchy", "network"],
+  "output_format": ["json", "markdown"],
+  "verbose": true
+}
+```
+
+## Use Cases
+
+- **Research Analysis**: Extract key concepts from academic papers
+- **Knowledge Graphs**: Build structured knowledge from visual information
+- **Document Indexing**: Make diagram content searchable
+- **Content Summarization**: Generate text summaries of visual content
+- **Machine Learning**: Training data for multimodal AI models
+
+## Limitations
+
+- OCR quality depends on diagram clarity and resolution
+- Structure analysis is simplified; a real computer-vision approach would be more accurate
+- Complex diagrams may need specialized processing
+- Large PDFs can be resource-intensive
+
+## Future Enhancements
+
+- Computer vision for diagram element detection
+- Specialized processors for different diagram types
+- Integration with LLMs for semantic analysis
+- Batch processing with parallelization
+- API endpoint for web integration
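
Note on the "Meaning Kernel Format" section of the README above: the following is a minimal sketch, not taken from the commit, of how a text-extraction kernel with those fields might be assembled. The dataclass layout, the `make_text_kernel` helper, and the choice of a truncated SHA-256 digest for `hash` are all assumptions made for illustration; the actual `MeaningKernel` class in `extract_meaning_kernels.py` may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class MeaningKernel:
    """Hypothetical container mirroring the fields in the README's JSON example."""
    kernel_id: str
    content: str
    source: str
    confidence: float
    metadata: dict = field(default_factory=dict)
    timestamp: str = ""
    hash: str = ""


def make_text_kernel(content, source, page, confidence, now=None):
    """Build a text-extraction kernel (illustrative helper, not the commit's API)."""
    now = now or datetime.now()
    # ID pattern follows the README example: kernel_<date>_<time>_p<page>_text
    kernel_id = f"kernel_{now:%Y%m%d_%H%M%S}_p{page}_text"
    # Assumption: the hash is a truncated SHA-256 of the content.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    metadata = {
        "type": "text_extraction",
        "word_count": len(content.split()),
        "line_count": len(content.splitlines()),
    }
    return MeaningKernel(
        kernel_id=kernel_id,
        content=content,
        source=source,
        confidence=confidence,
        metadata=metadata,
        timestamp=now.isoformat(),
        hash=digest,
    )


if __name__ == "__main__":
    kernel = make_text_kernel(
        "Encoder -> Decoder\nAttention block", "diagram.png", page=1, confidence=0.85
    )
    # Serialize the kernel to the JSON shape shown in the README.
    print(json.dumps(asdict(kernel), indent=2))
```

Passing a fixed `now` value makes the generated `kernel_id` and `timestamp` reproducible, which can be convenient when testing the serialization.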