# Meaning Kernel Extraction Pipeline

## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams

## Overview

This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.

## Features

- **PDF Processing**: Converts PDF pages to images for analysis
- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
- **Structure Analysis**: Analyzes diagram type, dimensions, orientation
- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
- **Confidence Scoring**: Each kernel includes confidence metrics
- **Batch Processing**: Supports single files and directories

## Installation

```bash
# Required dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
```

## Usage

```bash
# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png

# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output

# Run tests
python3 scripts/meaning-kernels/test_extraction.py
```

## Output Structure

```
output_directory/
├── page_001.png              # Converted page images
├── page_002.png
├── meaning_kernels.json      # Structured kernel data
├── meaning_kernels.md        # Human-readable report
└── extraction_stats.json     # Processing statistics
```

## Kernel Types

### 1. Text Kernels

Extracted from OCR processing of diagrams.

### 2. Structure Kernels

Diagram structure analysis (type, dimensions, aspect ratio).

### 3. Summary Kernels

Combined analysis summary including extracted text and structure.

### 4. Philosophical Kernels

Detected philosophical themes using keyword analysis.

## Configuration

Create a JSON config file:

```json
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
```

## Limitations

- OCR quality depends on diagram clarity
- Structure analysis provides basic metadata only
- Philosophical extraction is keyword-based (not semantic)
- Large PDFs can be resource-intensive

## Dependencies Notice

If you see warnings like:

```
Warning: pytesseract not available. Install with: pip install pytesseract
Warning: pdf2image not available. Install with: pip install pdf2image
```

Install the Python packages from `requirements.txt` and the system OCR engine as shown above.
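
The warnings quoted in the Dependencies Notice suggest that the script treats `pytesseract` and `pdf2image` as optional imports. The sketch below shows one way such a check could be written. The helper name `is_available` and the flags `HAS_TESSERACT` and `HAS_PDF2IMAGE` are hypothetical and not taken from the actual script.

```python
import importlib.util
import sys


def is_available(module_name: str, pip_name: str) -> bool:
    """Return True if a top-level module can be imported; otherwise warn.

    Hypothetical helper: the real script may perform this check differently.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    print(
        f"Warning: {module_name} not available. "
        f"Install with: pip install {pip_name}",
        file=sys.stderr,
    )
    return False


# Flags that later code could consult before attempting OCR or PDF conversion.
HAS_TESSERACT = is_available("pytesseract", "pytesseract")
HAS_PDF2IMAGE = is_available("pdf2image", "pdf2image")
```

Note that `pytesseract` is only a wrapper: even when the Python package imports cleanly, the Tesseract binary from the Installation section must still be present on the system.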
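
To make the Configuration section and the keyword-based philosophical extraction more concrete, here is a minimal sketch of how the JSON config could be loaded over defaults and how its keyword list could be matched against OCR text. The function names `load_config` and `find_philosophical_keywords` are hypothetical. Applying `min_text_length` to this step is also an assumption, since the README does not say which stages use it.

```python
import json
import re

# Defaults mirror the example config in the Configuration section.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def load_config(path=None):
    """Return the defaults, overridden by values from a JSON file if given."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        with open(path, encoding="utf-8") as handle:
            config.update(json.load(handle))
    return config


def find_philosophical_keywords(text, config):
    """Return configured keywords that appear as whole words in the text.

    Assumption: text shorter than min_text_length is skipped entirely.
    """
    if not config["extract_philosophical"]:
        return []
    if len(text) < config["min_text_length"]:
        return []
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [kw for kw in config["philosophical_keywords"] if kw in words]


config = load_config()
themes = find_philosophical_keywords("The pursuit of truth and knowledge", config)
print(themes)
```

Because matching is on whole words only, a diagram label such as "meaningful" would not trigger the "meaning" keyword here, which illustrates the keyword-based (not semantic) limitation listed above.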
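
The structure kernels are described as covering dimensions, aspect ratio, and orientation. A rough sketch of that metadata, computed from pixel dimensions alone, might look like the following. The function name `describe_structure`, the returned field names, and the rule that equal sides count as "square" are illustrative assumptions, not the pipeline's actual output schema.

```python
def describe_structure(width: int, height: int) -> dict:
    """Derive basic structure metadata from image dimensions in pixels.

    Hypothetical helper; field names are illustrative only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    if width > height:
        orientation = "landscape"
    elif width < height:
        orientation = "portrait"
    else:
        orientation = "square"

    return {
        "width": width,
        "height": height,
        "aspect_ratio": round(width / height, 3),
        "orientation": orientation,
    }


info = describe_structure(800, 600)
print(info)
```

With Pillow installed, the dimensions could come from `Image.open(path).size`, which returns a `(width, height)` tuple.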