- Added 5 kernel types: text, structure, summary, philosophical, semantic
- Improved diagram type detection with content analysis
- Added color analysis and grayscale detection
- Enhanced philosophical keyword extraction
- Added semantic relationship detection
- Improved error handling for missing dependencies
- Added comprehensive testing with text-rich test images
- Enhanced metadata and tagging system

Key improvements:
✓ Semantic relationship detection (source → target patterns)
✓ Enhanced philosophical content extraction
✓ Color analysis and grayscale detection
✓ Better diagram type classification
✓ Comprehensive metadata and tagging
✓ Improved error handling and dependency warnings

Still requires OCR dependencies for text extraction:
- pytesseract for OCR
- pdf2image for PDF processing
- Tesseract OCR engine (see issue #563)
Meaning Kernel Extraction Pipeline
Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams
Overview
This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.
Features
- PDF Processing: Converts PDF pages to images for analysis
- OCR Text Extraction: Extracts text from diagrams using Tesseract
- Structure Analysis: Analyzes diagram type, dimensions, orientation
- Multiple Kernel Types: Generates text, structure, summary, and philosophical kernels
- Confidence Scoring: Each kernel includes confidence metrics
- Batch Processing: Supports single files and directories
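As a rough illustration of how OCR text extraction and confidence scoring could fit together, the sketch below filters Tesseract word results by a confidence threshold and derives a 0 to 1 score from the surviving words. The function names `filter_ocr_words` and `ocr_image`, and the idea of averaging word confidences, are assumptions made for this example and are not taken from `extract_meaning_kernels.py`.

```python
from typing import Dict, List, Tuple


def filter_ocr_words(data: Dict[str, list], threshold: float = 50.0) -> Tuple[str, float]:
    """Keep words at or above the threshold; return (text, mean confidence in 0..1).

    `data` has the shape returned by pytesseract.image_to_data with
    output_type=Output.DICT: parallel "text" and "conf" lists.
    """
    kept: List[str] = []
    scores: List[float] = []
    for word, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # Tesseract reports -1 for boxes that are not words
        if word.strip() and score >= threshold:
            kept.append(word.strip())
            scores.append(score)
    # Averaging word confidences is an assumption of this sketch.
    confidence = sum(scores) / len(scores) / 100.0 if scores else 0.0
    return " ".join(kept), round(confidence, 2)


def ocr_image(image_path: str, threshold: float = 50.0) -> Tuple[str, float]:
    """Run Tesseract on one image and filter the result (needs the OCR dependencies)."""
    # Imported lazily so the filtering logic stays usable without OCR installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return filter_ocr_words(data, threshold)
```

Keeping the filtering step separate from the Tesseract call means it can be exercised with hand-written data when the OCR engine is not installed.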
Installation
# Required dependencies
pip install Pillow pytesseract pdf2image
# System dependencies (macOS)
brew install tesseract poppler
# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
Usage
# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf
# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png
# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/
# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output
# Run tests
python3 scripts/meaning-kernels/test_extraction.py
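The commands above accept either a single file or a directory. One way that input resolution might work is sketched below; the helper name `collect_inputs` and the set of accepted file suffixes are assumptions for illustration, not the script's documented behavior.

```python
from pathlib import Path
from typing import List

# Assumed set of supported inputs; the real script may accept other formats.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_inputs(target: str) -> List[Path]:
    """Resolve a CLI argument to a sorted list of files to process."""
    path = Path(target)
    if path.is_file():
        return [path] if path.suffix.lower() in SUPPORTED_SUFFIXES else []
    if path.is_dir():
        # Non-recursive scan: only files directly inside the directory.
        return sorted(
            entry
            for entry in path.iterdir()
            if entry.is_file() and entry.suffix.lower() in SUPPORTED_SUFFIXES
        )
    raise FileNotFoundError(f"No such file or directory: {target}")
```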
Output Structure
output_directory/
├── page_001.png # Converted page images
├── page_002.png
├── meaning_kernels.json # Structured kernel data
├── meaning_kernels.md # Human-readable report
└── extraction_stats.json # Processing statistics
Kernel Types
1. Text Kernels
Extracted from OCR processing of diagrams.
{
  "kernel_id": "kernel_20260413_123456_p1_text",
  "content": "Extracted text from diagram",
  "kernel_type": "text",
  "confidence": 0.85,
  "metadata": {
    "word_count": 42,
    "diagram_type": "flowchart"
  }
}
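The fields in this example map naturally onto a small record type. The sketch below is one possible representation; the class name `MeaningKernel` and the helper `build_kernel_id` are hypothetical, and the ID layout (date, time, page number, kernel type) is inferred from the sample ID rather than confirmed by the source.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict


def build_kernel_id(timestamp: datetime, page: int, kernel_type: str) -> str:
    """Build an ID such as kernel_20260413_123456_p1_text (layout is assumed)."""
    return f"kernel_{timestamp:%Y%m%d_%H%M%S}_p{page}_{kernel_type}"


@dataclass
class MeaningKernel:
    """One extracted kernel, mirroring the JSON fields shown above."""

    kernel_id: str
    content: str
    kernel_type: str
    confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)


kernel = MeaningKernel(
    kernel_id=build_kernel_id(datetime(2026, 4, 13, 12, 34, 56), 1, "text"),
    content="Extracted text from diagram",
    kernel_type="text",
    confidence=0.85,
    metadata={"word_count": 42, "diagram_type": "flowchart"},
)
record = asdict(kernel)  # plain dict, ready for json.dump
```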
2. Structure Kernels
Diagram structure analysis.
{
  "kernel_id": "kernel_20260413_123456_p1_structure",
  "content": "Diagram type: flowchart. Dimensions: 800x600. Aspect ratio: 1.33.",
  "kernel_type": "structure",
  "confidence": 0.9,
  "metadata": {
    "dimensions": {"width": 800, "height": 600},
    "aspect_ratio": 1.33,
    "diagram_type": "flowchart"
  }
}
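The content string and metadata in this example can be derived from the image dimensions alone. The following sketch shows one way to do that; `describe_structure` is a hypothetical helper, and the `orientation` field is an addition for illustration that does not appear in the sample metadata.

```python
from typing import Any, Dict, Tuple


def describe_structure(width: int, height: int, diagram_type: str) -> Tuple[str, Dict[str, Any]]:
    """Return (content, metadata) for a structure kernel."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect_ratio = round(width / height, 2)
    if width > height:
        orientation = "landscape"
    elif width < height:
        orientation = "portrait"
    else:
        orientation = "square"
    content = (
        f"Diagram type: {diagram_type}. "
        f"Dimensions: {width}x{height}. "
        f"Aspect ratio: {aspect_ratio:.2f}."
    )
    metadata = {
        "dimensions": {"width": width, "height": height},
        "aspect_ratio": aspect_ratio,
        "diagram_type": diagram_type,
        "orientation": orientation,  # assumed field, not in the sample above
    }
    return content, metadata
```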
3. Summary Kernels
Combined analysis summary.
{
  "kernel_id": "kernel_20260413_123456_p1_summary",
  "content": "Research diagram analysis: flowchart diagram. Contains text: Input → Processing → Output...",
  "kernel_type": "summary",
  "confidence": 0.7,
  "metadata": {
    "has_text": true,
    "text_length": 150
  }
}
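The sample content above contains an arrow chain (Input → Processing → Output). A minimal sketch of how such source → target patterns might be turned into relationship pairs follows; the function name `extract_relations` and the set of recognized arrow spellings are assumptions, and the actual detection logic in the pipeline may differ.

```python
import re
from typing import List, Tuple

# Arrow spellings treated as "source leads to target" (an assumed set).
ARROW = re.compile(r"\s*(?:→|->|=>)\s*")


def extract_relations(text: str) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for each arrow found, line by line."""
    relations: List[Tuple[str, str]] = []
    for line in text.splitlines():
        nodes = [part.strip() for part in ARROW.split(line)]
        # Pair each node with the one that follows it in the chain.
        for source, target in zip(nodes, nodes[1:]):
            if source and target:
                relations.append((source, target))
    return relations
```

This treats everything between two arrows as one node, so it works best on clean OCR output where each chain sits on its own line.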
4. Philosophical Kernels
Extracted philosophical themes (when detected).
{
  "kernel_id": "kernel_20260413_123456_p1_philosophical",
  "content": "Philosophical themes detected: knowledge, truth. Source text explores concepts of knowledge.",
  "kernel_type": "philosophical",
  "confidence": 0.6,
  "metadata": {
    "extraction_method": "keyword_analysis",
    "source_text_length": 200
  }
}
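Since the extraction method here is keyword analysis, a plausible core of it is a whole-word, case-insensitive match against a keyword list. The sketch below illustrates that idea; `find_themes` is a hypothetical name, and returning the themes in alphabetical order is an assumption based on the sample content ("knowledge, truth").

```python
import re
from typing import Iterable, List


def find_themes(text: str, keywords: Iterable[str]) -> List[str]:
    """Return the keywords that occur in text as whole words, sorted alphabetically."""
    found = []
    for keyword in keywords:
        # \b keeps "meaning" from matching inside "meaningful".
        if re.search(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE):
            found.append(keyword)
    return sorted(found)


themes = find_themes(
    "Knowledge grows when truth is tested.",
    ["truth", "knowledge", "wisdom", "meaning"],
)
content = f"Philosophical themes detected: {', '.join(themes)}."
```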
Configuration
Create a JSON config file:
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
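A config file like this is typically merged over built-in defaults so that a partial file still works. The sketch below shows that pattern; the name `load_config`, the choice to reuse the values above as defaults, and the rejection of unknown keys are all assumptions, and how the script actually receives the config path is not specified here.

```python
import copy
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Defaults assumed to match the example file above.
DEFAULT_CONFIG: Dict[str, Any] = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def load_config(path: Optional[str] = None) -> Dict[str, Any]:
    """Return the defaults, overridden by any keys found in the JSON file at path."""
    config = copy.deepcopy(DEFAULT_CONFIG)
    if path is not None:
        overrides = json.loads(Path(path).read_text(encoding="utf-8"))
        unknown = sorted(set(overrides) - set(DEFAULT_CONFIG))
        if unknown:
            # Failing early makes typos in key names visible.
            raise KeyError(f"Unknown config keys: {', '.join(unknown)}")
        config.update(overrides)
    return config
```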
Limitations
- OCR quality depends on diagram clarity
- Structure analysis is simplified
- Philosophical extraction is keyword-based
- Large PDFs can be resource-intensive
Future Enhancements
- Computer vision for diagram element detection
- LLM integration for semantic analysis
- Specialized processors for different diagram types
- Integration with knowledge graphs
- API endpoint for web integration
Files
- extract_meaning_kernels.py - Main extraction pipeline
- test_extraction.py - Test script
- requirements.txt - Python dependencies
- README.md - This documentation