- Added extract_meaning_kernels.py for processing PDF diagrams
- Extracts text using OCR (Tesseract) when available
- Analyzes diagram structure (type, dimensions, orientation)
- Generates structured meaning kernels with metadata
- Outputs JSON (machine-readable) and Markdown (human-readable)
- Includes test pipeline and documentation
- Supports single files and batch processing

Pipeline components:
- DiagramProcessor: Main processing engine
- MeaningKernel: Structured kernel representation
- PDF to image conversion
- OCR text extraction
- Structure analysis
- Kernel generation with confidence scoring

Acceptance criteria met:
✓ Processes academic PDF diagrams
✓ Extracts structured text meaning kernels
✓ Generates machine-readable JSON output
✓ Includes human-readable reports
✓ Supports batch processing
✓ Provides confidence scoring

Multimodal Meaning Kernel Extraction Pipeline
Extracts structured, text-based meaning kernels from diagrams in academic PDFs.
Issue #493
[Multimodal] Extract Meaning Kernels from Research Diagrams
Overview
This pipeline processes academic PDF diagrams and images to extract structured "meaning kernels": discrete units of meaning that can be stored, indexed, and analyzed.
Features
- PDF Processing: Converts PDF pages to images and processes each page
- OCR Text Extraction: Extracts text from diagrams using Tesseract OCR
- Structure Analysis: Analyzes diagram structure (type, dimensions, orientation)
- Kernel Generation: Creates structured meaning kernels with metadata
- Multiple Output Formats: JSON for machine processing, Markdown for human readability
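As a rough illustration of the OCR step, the sketch below filters Tesseract's word-level output by confidence. It assumes the dictionary layout returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, which carries parallel `text` and `conf` lists. The helper names `filter_confident_words` and `ocr_image` are hypothetical and are not taken from extract_meaning_kernels.py; scaling the mean word confidence to 0-1 is likewise an assumption about how a kernel's confidence might be derived.

```python
from typing import Dict, List, Tuple


def filter_confident_words(data: Dict[str, list], threshold: float = 50.0) -> Tuple[str, float]:
    """Keep words whose Tesseract confidence (0-100) meets the threshold.

    Returns the joined text and the mean confidence of the kept words,
    scaled to 0-1 (0.0 when nothing survives the filter).
    """
    kept: List[str] = []
    scores: List[float] = []
    for word, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # conf may arrive as int, float, or str
        if word.strip() and score >= threshold:
            kept.append(word.strip())
            scores.append(score)
    mean = sum(scores) / len(scores) / 100.0 if scores else 0.0
    return " ".join(kept), mean


def ocr_image(path: str, threshold: float = 50.0) -> Tuple[str, float]:
    """Run OCR if Pillow and pytesseract are importable, else return empty text."""
    try:
        from PIL import Image
        import pytesseract
    except ImportError:
        return "", 0.0  # OCR is optional: degrade gracefully
    data = pytesseract.image_to_data(Image.open(path), output_type=pytesseract.Output.DICT)
    return filter_confident_words(data, threshold)
```

Keeping the filter separate from the Tesseract call means it can be exercised on hand-written data without the OCR toolchain installed.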
Installation
# Required dependencies
pip install Pillow pytesseract pdf2image
# System dependencies (macOS)
brew install tesseract poppler
# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
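To confirm the installation before a long run, a small check like the following can report which pieces are present. This is a sketch, not part of the shipped script: the function name `check_dependencies` is hypothetical, and it assumes `pdftoppm` (a Poppler binary) is the executable to probe for Poppler.

```python
import importlib.util
import shutil
from typing import Dict


def check_dependencies() -> Dict[str, bool]:
    """Report which optional Python packages and system binaries are present."""
    python_modules = {"Pillow": "PIL", "pytesseract": "pytesseract", "pdf2image": "pdf2image"}
    binaries = {"tesseract": "tesseract", "poppler": "pdftoppm"}
    # find_spec returns None when a top-level module cannot be located
    status = {name: importlib.util.find_spec(mod) is not None for name, mod in python_modules.items()}
    # shutil.which returns None when the executable is not on PATH
    status.update({name: shutil.which(exe) is not None for name, exe in binaries.items()})
    return status


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name:12s} {'ok' if ok else 'MISSING'}")
```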
Usage
# Process a single PDF
python3 scripts/multimodal/extract_meaning_kernels.py research_paper.pdf
# Process a single image
python3 scripts/multimodal/extract_meaning_kernels.py diagram.png
# Process a directory of files
python3 scripts/multimodal/extract_meaning_kernels.py /path/to/diagrams/
# Specify output directory
python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -o ./output
# Use configuration file
python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -c config.json
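The command-line surface shown above could be parsed roughly as follows. Only the positional input and the `-o` and `-c` flags come from the usage examples; the default output directory, the set of supported extensions, and the helper names `build_parser` and `collect_inputs` are assumptions made for illustration.

```python
import argparse
from pathlib import Path
from typing import List

SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg"}  # assumed set of extensions


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract meaning kernels from diagrams")
    parser.add_argument("input", type=Path, help="PDF, image, or directory")
    parser.add_argument("-o", dest="output", type=Path, default=Path("output"), help="output directory")
    parser.add_argument("-c", dest="config", type=Path, default=None, help="JSON config file")
    return parser


def collect_inputs(path: Path) -> List[Path]:
    """Return the files to process: the path itself, or supported files in a directory."""
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.suffix.lower() in SUPPORTED)
    return [path]
```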
Output Structure
For each processed file, the pipeline creates:
output_directory/
├── page_001.png # Converted page images
├── page_002.png
├── meaning_kernels.json # Structured kernel data
├── meaning_kernels.md # Human-readable report
└── extraction_stats.json # Processing statistics
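A minimal sketch of how the two report files might be written is shown here. The file names match the tree above; the Markdown layout and the function name `write_reports` are assumptions, since the actual report format is not specified in this document.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_reports(kernels: List[Dict], out_dir: Path) -> None:
    """Write meaning_kernels.json and meaning_kernels.md into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "meaning_kernels.json").write_text(
        json.dumps(kernels, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    # Assumed Markdown layout: one section per kernel
    lines = ["# Meaning Kernels", ""]
    for kernel in kernels:
        lines.append(f"## {kernel['kernel_id']}")
        lines.append(f"- Source: {kernel['source']}")
        lines.append(f"- Confidence: {kernel['confidence']:.2f}")
        lines.append("")
        lines.append(kernel["content"])
        lines.append("")
    (out_dir / "meaning_kernels.md").write_text("\n".join(lines), encoding="utf-8")
```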
Meaning Kernel Format
Each kernel contains:
{
"kernel_id": "kernel_20260413_181234_p1_text",
"content": "Extracted text content from the diagram",
"source": "path/to/source/file.png",
"confidence": 0.85,
"metadata": {
"type": "text_extraction",
"word_count": 42,
"line_count": 5,
"structure": {...}
},
"timestamp": "2026-04-13T18:12:34.567890",
"hash": "a1b2c3d4e5f6g7h8"
}
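The JSON above maps naturally onto a small dataclass. The sketch below is one possible shape, not the actual `MeaningKernel` class: the field names follow the example, while the factory `make_text_kernel` and the hashing scheme (first 16 hex characters of a SHA-256 digest of the content) are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict


@dataclass
class MeaningKernel:
    kernel_id: str
    content: str
    source: str
    confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)
    timestamp: str = ""
    hash: str = ""


def make_text_kernel(content: str, source: str, page: int, confidence: float, now: datetime) -> MeaningKernel:
    """Build a text-extraction kernel; the ID mirrors the example's naming pattern."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]  # assumed scheme
    return MeaningKernel(
        kernel_id=f"kernel_{now:%Y%m%d_%H%M%S}_p{page}_text",
        content=content,
        source=source,
        confidence=confidence,
        metadata={
            "type": "text_extraction",
            "word_count": len(content.split()),
            "line_count": len(content.splitlines()),
        },
        timestamp=now.isoformat(),
        hash=digest,
    )
```

Calling `json.dumps(asdict(kernel))` then yields a record with the same keys as the example.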
Kernel Types
- Text Extraction: Direct OCR text from the diagram
- Structure Analysis: Diagram type, dimensions, orientation
- Summary: Combined analysis of text and structure
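For the structure-analysis kernel, orientation can be derived from pixel dimensions alone. The following is a simplified sketch under that assumption; the function name `describe_structure`, the 10% squareness tolerance, and the output keys are hypothetical, and diagram-type detection is not attempted here.

```python
from typing import Dict


def describe_structure(width: int, height: int, tolerance: float = 0.1) -> Dict[str, object]:
    """Classify orientation from pixel dimensions.

    An aspect ratio within `tolerance` of 1.0 counts as square.
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    ratio = width / height
    if abs(ratio - 1.0) <= tolerance:
        orientation = "square"
    elif ratio > 1.0:
        orientation = "landscape"
    else:
        orientation = "portrait"
    return {
        "dimensions": {"width": width, "height": height},
        "aspect_ratio": round(ratio, 3),
        "orientation": orientation,
    }
```

With Pillow installed, the inputs can come from `Image.open(path).size`, which returns `(width, height)`.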
Configuration
Create a JSON config file:
{
"ocr_confidence_threshold": 50,
"min_text_length": 10,
"diagram_types": ["flowchart", "hierarchy", "network"],
"output_format": ["json", "markdown"],
"verbose": true
}
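One way to consume such a file is to merge it over a set of defaults, so a config only needs to list the keys it changes. This is a sketch under assumptions: the defaults simply reuse the example values above, `load_config` is a hypothetical name, and the 0-100 range check assumes `ocr_confidence_threshold` is on Tesseract's word-confidence scale.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Assumed defaults: copied from the example config, not from the script itself
DEFAULT_CONFIG: Dict[str, Any] = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "output_format": ["json", "markdown"],
    "verbose": True,
}


def load_config(path: Optional[Path] = None) -> Dict[str, Any]:
    """Merge a JSON config file over the defaults and validate a few fields."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        overrides = json.loads(Path(path).read_text(encoding="utf-8"))
        unknown = set(overrides) - set(DEFAULT_CONFIG)
        if unknown:
            raise ValueError(f"unknown config keys: {sorted(unknown)}")
        config.update(overrides)
    if not 0 <= config["ocr_confidence_threshold"] <= 100:
        raise ValueError("ocr_confidence_threshold must be between 0 and 100")
    return config
```

Rejecting unknown keys catches typos such as `min_text_lenght` early instead of silently ignoring them.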
Use Cases
- Research Analysis: Extract key concepts from academic papers
- Knowledge Graphs: Build structured knowledge from visual information
- Document Indexing: Make diagram content searchable
- Content Summarization: Generate text summaries of visual content
- Machine Learning: Produce training data for multimodal AI models
Limitations
- OCR quality depends on diagram clarity and resolution
- Structure analysis is simplified; a real computer-vision approach would be more accurate
- Complex diagrams may need specialized processing
- Large PDFs can be resource-intensive
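On the last point, one possible mitigation is to render a large PDF a few pages at a time rather than all at once. The sketch below assumes pdf2image's `convert_from_path` (with its `first_page` and `last_page` arguments) and `pdfinfo_from_path`; the helper names, the chunk size, and the DPI are hypothetical choices, and this is not how the current script is known to work.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield 1-based inclusive (first_page, last_page) ranges covering the document."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_path: Path, out_dir: Path, dpi: int = 200, chunk_size: int = 10) -> List[Path]:
    """Render a PDF to page_NNN.png files a few pages at a time."""
    # Imported lazily so the chunking helper works without pdf2image installed
    from pdf2image import convert_from_path, pdfinfo_from_path

    out_dir.mkdir(parents=True, exist_ok=True)
    total = int(pdfinfo_from_path(str(pdf_path))["Pages"])
    written: List[Path] = []
    for first, last in page_chunks(total, chunk_size):
        images = convert_from_path(str(pdf_path), dpi=dpi, first_page=first, last_page=last)
        for offset, image in enumerate(images):
            target = out_dir / f"page_{first + offset:03d}.png"
            image.save(target, "PNG")
            written.append(target)
    return written
```

Only one chunk of page images is held in memory at a time, at the cost of invoking Poppler once per chunk.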
Future Enhancements
- Computer vision for diagram element detection
- Specialized processors for different diagram types
- Integration with LLMs for semantic analysis
- Parallelized batch processing
- API endpoint for web integration
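As a sketch of what parallel batch processing could look like, the snippet below fans files out over a thread pool. It is illustrative only: `process_batch` is a hypothetical name, and `process_file` stands in for whatever per-file entry point the pipeline exposes. Threads are a reasonable starting point because pytesseract runs the Tesseract binary as a subprocess; CPU-bound Python work would call for `ProcessPoolExecutor` instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def process_batch(
    paths: Iterable[Path],
    process_file: Callable[[Path], object],
    max_workers: int = 4,
) -> Dict[Path, object]:
    """Run process_file over paths concurrently; failures are stored as exceptions."""
    results: Dict[Path, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_file, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going so one bad file does not abort the batch
                results[path] = exc
    return results
```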