timmy-config/scripts/multimodal
Alexander Whitestone 0a52cff8a7 Fix #493: Add multimodal meaning kernel extraction pipeline
- Added extract_meaning_kernels.py for processing PDF diagrams
- Extracts text using OCR (Tesseract) when available
- Analyzes diagram structure (type, dimensions, orientation)
- Generates structured meaning kernels with metadata
- Outputs JSON (machine-readable) and Markdown (human-readable)
- Includes test pipeline and documentation
- Supports single files and batch processing

Pipeline components:
- DiagramProcessor: Main processing engine
- MeaningKernel: Structured kernel representation
- PDF to image conversion
- OCR text extraction
- Structure analysis
- Kernel generation with confidence scoring

Acceptance criteria met:
✓ Processes academic PDF diagrams
✓ Extracts structured text meaning kernels
✓ Generates machine-readable JSON output
✓ Includes human-readable reports
✓ Supports batch processing
✓ Provides confidence scoring
2026-04-13 21:20:42 -04:00

Multimodal Meaning Kernel Extraction Pipeline

Extracts structured meaning kernels from academic PDF diagrams into text format.

Issue #493

[Multimodal] Extract Meaning Kernels from Research Diagrams

Overview

This pipeline processes academic PDF diagrams and standalone images to extract structured "meaning kernels": discrete units of meaning that can be stored, indexed, and analyzed.

Features

  • PDF Processing: Converts PDF pages to images and processes each page
  • OCR Text Extraction: Extracts text from diagrams using Tesseract OCR
  • Structure Analysis: Analyzes diagram structure (type, dimensions, orientation)
  • Kernel Generation: Creates structured meaning kernels with metadata
  • Multiple Output Formats: JSON for machine processing, Markdown for human readability
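The OCR step can be sketched as a small confidence filter. This is a minimal illustration rather than the code in extract_meaning_kernels.py: the function name filter_confident_words is hypothetical, and it assumes the ocr_confidence_threshold setting refers to Tesseract's per-word confidence on a 0-100 scale.

```python
def filter_confident_words(ocr_data, threshold=50):
    """Join the words whose OCR confidence is at least `threshold` (0-100).

    `ocr_data` is a dict with parallel "text" and "conf" lists, the shape
    returned by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        # conf may arrive as int, float, or str depending on the
        # pytesseract version; -1 marks rows without a recognized word.
        if text.strip() and float(conf) >= threshold:
            words.append(text.strip())
    return " ".join(words)
```

With pytesseract and Pillow installed, the input dict would come from pytesseract.image_to_data(Image.open("diagram.png"), output_type=pytesseract.Output.DICT).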

Installation

# Required dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
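To confirm the installation before running the pipeline, a short check like the following may help. It is a sketch, not part of the repository: check_dependencies is a hypothetical helper, and it assumes pdf2image relies on the pdftoppm binary that ships with poppler.

```python
import importlib.util
import shutil

# Import names for the pip packages listed above (Pillow imports as "PIL").
PYTHON_MODULES = ["PIL", "pytesseract", "pdf2image"]
# Executables provided by the system packages: tesseract itself and
# pdftoppm from poppler.
SYSTEM_BINARIES = ["tesseract", "pdftoppm"]


def check_dependencies():
    """Return a dict mapping each dependency name to True or False."""
    status = {}
    for module in PYTHON_MODULES:
        status[module] = importlib.util.find_spec(module) is not None
    for binary in SYSTEM_BINARIES:
        status[binary] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name}: {'ok' if available else 'MISSING'}")
```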

Usage

# Process a single PDF
python3 scripts/multimodal/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/multimodal/extract_meaning_kernels.py diagram.png

# Process a directory of files
python3 scripts/multimodal/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -o ./output

# Use configuration file
python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -c config.json
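Since the script accepts either a single file or a directory, input resolution might look roughly like the sketch below. The helper collect_inputs and the set of supported extensions are assumptions made for illustration; the actual script may accept other formats or recurse into subdirectories.

```python
from pathlib import Path

# Assumed set of supported extensions; the real script may differ.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_inputs(path):
    """Return a sorted list of supported files for a file or directory."""
    target = Path(path)
    if target.is_file():
        return [target] if target.suffix.lower() in SUPPORTED_SUFFIXES else []
    if target.is_dir():
        # Non-recursive: only files directly inside the directory.
        return sorted(
            p for p in target.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    raise FileNotFoundError(f"No such file or directory: {target}")
```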

Output Structure

For each processed file, the pipeline creates:

output_directory/
├── page_001.png          # Converted page images
├── page_002.png
├── meaning_kernels.json  # Structured kernel data
├── meaning_kernels.md    # Human-readable report
└── extraction_stats.json # Processing statistics
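The two kernel files in the tree above could be produced along these lines. This is a sketch under assumptions: write_outputs is a hypothetical name, the Markdown layout is invented for illustration, and extraction_stats.json is left out.

```python
import json
from pathlib import Path


def write_outputs(kernels, output_dir):
    """Write a list of kernel dicts as JSON and as a Markdown report."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Machine-readable output.
    json_path = out / "meaning_kernels.json"
    json_path.write_text(json.dumps(kernels, indent=2), encoding="utf-8")

    # Human-readable output: one section per kernel (assumed layout).
    lines = ["# Meaning Kernels", ""]
    for kernel in kernels:
        lines.append(f"## {kernel['kernel_id']}")
        lines.append("")
        lines.append(f"- Source: {kernel['source']}")
        lines.append(f"- Confidence: {kernel['confidence']:.2f}")
        lines.append("")
        lines.append(kernel["content"])
        lines.append("")
    md_path = out / "meaning_kernels.md"
    md_path.write_text("\n".join(lines), encoding="utf-8")
    return json_path, md_path
```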

Meaning Kernel Format

Each kernel contains:

{
  "kernel_id": "kernel_20260413_181234_p1_text",
  "content": "Extracted text content from the diagram",
  "source": "path/to/source/file.png",
  "confidence": 0.85,
  "metadata": {
    "type": "text_extraction",
    "word_count": 42,
    "line_count": 5,
    "structure": {...}
  },
  "timestamp": "2026-04-13T18:12:34.567890",
  "hash": "a1b2c3d4e5f6a7b8"
}
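A kernel with these fields could be assembled as in the sketch below. The function make_text_kernel is hypothetical; the ID scheme is inferred from the example above, and the hash is assumed to be a SHA-256 digest of the content truncated to 16 hex characters, which the README does not confirm.

```python
import hashlib
from datetime import datetime


def make_text_kernel(content, source, confidence, page=1, now=None):
    """Build a text-extraction kernel dict in the format shown above."""
    now = now or datetime.now()
    # Assumption: the hash is a truncated SHA-256 digest of the content.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return {
        # Assumed ID scheme: kernel_<date>_<time>_p<page>_text
        "kernel_id": f"kernel_{now:%Y%m%d_%H%M%S}_p{page}_text",
        "content": content,
        "source": source,
        "confidence": confidence,
        "metadata": {
            "type": "text_extraction",
            "word_count": len(content.split()),
            "line_count": len(content.splitlines()),
        },
        "timestamp": now.isoformat(),
        "hash": digest,
    }
```

Passing a fixed `now` keeps the kernel ID and timestamp reproducible, which is convenient in tests.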

Kernel Types

  1. Text Extraction: Direct OCR text from the diagram
  2. Structure Analysis: Diagram type, dimensions, orientation
  3. Summary: Combined analysis of text and structure
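For the structure analysis kernel, dimensions and orientation can be derived from the image size alone. The sketch below is illustrative only: describe_structure is a hypothetical name, and the 10% tolerance used to call an image "square" is an assumption, not a value taken from the pipeline.

```python
def describe_structure(width, height, tolerance=0.1):
    """Return dimensions, aspect ratio, and orientation for an image size."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    if ratio > 1 + tolerance:
        orientation = "landscape"
    elif ratio < 1 - tolerance:
        orientation = "portrait"
    else:
        # Within the tolerance band around 1:1.
        orientation = "square"
    return {
        "width": width,
        "height": height,
        "aspect_ratio": round(ratio, 2),
        "orientation": orientation,
    }
```

With Pillow installed, the inputs would typically come from Image.open(path).size, which returns a (width, height) tuple.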

Configuration

Create a JSON config file:

{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "output_format": ["json", "markdown"],
  "verbose": true
}
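One plausible way to consume such a file is to merge it over built-in defaults, so a config only needs to list the keys it changes. In this sketch load_config is a hypothetical helper, and the defaults simply mirror the example values above; the script's real defaults are not documented here.

```python
import json

# Assumed defaults, copied from the example config above.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "output_format": ["json", "markdown"],
    "verbose": True,
}


def load_config(path=None):
    """Return the defaults, overridden by the JSON file at `path` if given."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        with open(path, encoding="utf-8") as handle:
            overrides = json.load(handle)
        # Reject unknown keys so that typos fail loudly.
        unknown = set(overrides) - set(DEFAULT_CONFIG)
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        config.update(overrides)
    return config
```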

Use Cases

  • Research Analysis: Extract key concepts from academic papers
  • Knowledge Graphs: Build structured knowledge from visual information
  • Document Indexing: Make diagram content searchable
  • Content Summarization: Generate text summaries of visual content
  • Machine Learning: Generate training data for multimodal AI models

Limitations

  • OCR quality depends on diagram clarity and resolution
  • Structure analysis is simplified; proper computer vision (CV) techniques would be more accurate
  • Complex diagrams may need specialized processing
  • Large PDFs can be resource-intensive

Future Enhancements

  • Computer vision for diagram element detection
  • Specialized processors for different diagram types
  • Integration with LLMs for semantic analysis
  • Batch processing with parallelization
  • API endpoint for web integration