# Meaning Kernel Extraction Pipeline

## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams

## Overview

This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.

## Features

- **PDF Processing**: Converts PDF pages to images for analysis
- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
- **Structure Analysis**: Analyzes diagram type, dimensions, orientation
- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
- **Confidence Scoring**: Each kernel includes confidence metrics
- **Batch Processing**: Supports single files and directories

## Installation

```bash
# Required dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
```

## Usage

```bash
# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png

# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output

# Run tests
python3 scripts/meaning-kernels/test_extraction.py
```

## Output Structure

```
output_directory/
├── page_001.png            # Converted page images
├── page_002.png
├── meaning_kernels.json    # Structured kernel data
├── meaning_kernels.md      # Human-readable report
└── extraction_stats.json   # Processing statistics
```

## Kernel Types

### 1. Text Kernels

Extracted from OCR processing of diagrams.

```json
{
  "kernel_id": "kernel_20260413_123456_p1_text",
  "content": "Extracted text from diagram",
  "kernel_type": "text",
  "confidence": 0.85,
  "metadata": {
    "word_count": 42,
    "diagram_type": "flowchart"
  }
}
```

### 2. Structure Kernels

Diagram structure analysis.

```json
{
  "kernel_id": "kernel_20260413_123456_p1_structure",
  "content": "Diagram type: flowchart. Dimensions: 800x600. Aspect ratio: 1.33.",
  "kernel_type": "structure",
  "confidence": 0.9,
  "metadata": {
    "dimensions": {"width": 800, "height": 600},
    "aspect_ratio": 1.33,
    "diagram_type": "flowchart"
  }
}
```

### 3. Summary Kernels

Combined analysis summary.

```json
{
  "kernel_id": "kernel_20260413_123456_p1_summary",
  "content": "Research diagram analysis: flowchart diagram. Contains text: Input → Processing → Output...",
  "kernel_type": "summary",
  "confidence": 0.7,
  "metadata": {
    "has_text": true,
    "text_length": 150
  }
}
```

### 4. Philosophical Kernels

Extracted philosophical themes (when detected).

```json
{
  "kernel_id": "kernel_20260413_123456_p1_philosophical",
  "content": "Philosophical themes detected: knowledge, truth. Source text explores concepts of knowledge.",
  "kernel_type": "philosophical",
  "confidence": 0.6,
  "metadata": {
    "extraction_method": "keyword_analysis",
    "source_text_length": 200
  }
}
```

## Configuration

Create a JSON config file:

```json
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
```

## Limitations

- OCR quality depends on diagram clarity
- Structure analysis is simplified
- Philosophical extraction is keyword-based
- Large PDFs can be resource-intensive

## Future Enhancements

- Computer vision for diagram element detection
- LLM integration for semantic analysis
- Specialized processors for different diagram types
- Integration with knowledge graphs
- API endpoint for web integration

## Files

- `extract_meaning_kernels.py` - Main extraction pipeline
- `test_extraction.py` - Test script
- `requirements.txt` - Python dependencies
- `README.md` - This documentation
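
## Example: Keyword-Based Kernel Sketch

The keyword-based philosophical extraction noted under Limitations, combined with the `min_text_length`, `extract_philosophical`, and `philosophical_keywords` config keys, could look roughly like the sketch below. It is an illustration only and not necessarily how `extract_meaning_kernels.py` works: the function name `build_philosophical_kernel`, the whole-word matching rule, the fixed `0.6` confidence, and the way the `kernel_id` timestamp is built are all assumptions inferred from the sample JSON in this README.

```python
import re
from datetime import datetime

# Hypothetical defaults that mirror the sample config in this README.
DEFAULT_CONFIG = {
    "min_text_length": 10,
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def build_philosophical_kernel(text, page, config=DEFAULT_CONFIG, now=None):
    """Return a philosophical kernel dict, or None if nothing qualifies.

    Hypothetical helper: the real pipeline may match and score differently.
    """
    if not config.get("extract_philosophical", True):
        return None
    if len(text) < config.get("min_text_length", 0):
        return None

    # Assumed rule: case-insensitive, whole-word keyword matching.
    words = set(re.findall(r"[a-z]+", text.lower()))
    themes = sorted(
        kw for kw in config["philosophical_keywords"] if kw.lower() in words
    )
    if not themes:
        return None

    # Assumed ID layout, copied from the sample kernels above.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return {
        "kernel_id": f"kernel_{stamp}_p{page}_philosophical",
        "content": f"Philosophical themes detected: {', '.join(themes)}.",
        "kernel_type": "philosophical",
        "confidence": 0.6,  # placeholder taken from the README sample
        "metadata": {
            "extraction_method": "keyword_analysis",
            "source_text_length": len(text),
        },
    }


if __name__ == "__main__":
    ocr_text = "Knowledge and truth in networks"
    print(build_philosophical_kernel(ocr_text, page=1))
```

Returning `None` when no keyword matches keeps the "when detected" behaviour explicit, so a caller can simply skip pages without philosophical content.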