# Multimodal Meaning Kernel Extraction Pipeline

Extracts structured meaning kernels from diagrams in academic PDFs and images, and writes them as JSON and Markdown.

## Issue #493

[Multimodal] Extract Meaning Kernels from Research Diagrams

## Overview

This pipeline processes academic PDF diagrams and images to extract structured "meaning kernels": discrete units of meaning that can be stored, indexed, and analyzed.

## Features

- **PDF Processing**: Converts PDF pages to images and processes each page
- **OCR Text Extraction**: Extracts text from diagrams using Tesseract OCR
- **Structure Analysis**: Analyzes diagram structure (type, dimensions, orientation)
- **Kernel Generation**: Creates structured meaning kernels with metadata
- **Multiple Output Formats**: JSON for machine processing, Markdown for human readability
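
The OCR step listed above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `filter_ocr_words` and `ocr_image` are hypothetical names, and scaling the mean Tesseract confidence (0 to 100) down to 0 to 1 is an assumption made to match the `confidence` field shown later in this README.

```python
from __future__ import annotations

from typing import Any


def filter_ocr_words(data: dict[str, list[Any]], min_conf: float = 50.0) -> tuple[str, float]:
    """Keep words at or above min_conf; return (text, mean confidence scaled to 0..1)."""
    words, confs = [], []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # Tesseract reports -1 for boxes that are not words
        if word.strip() and conf >= min_conf:
            words.append(word.strip())
            confs.append(conf)
    if not words:
        return "", 0.0
    return " ".join(words), round(sum(confs) / len(confs) / 100.0, 2)


def ocr_image(path: str, min_conf: float = 50.0) -> tuple[str, float]:
    """Run Tesseract on one image and filter the result (hypothetical helper)."""
    # Imported lazily so filter_ocr_words stays usable without Tesseract installed.
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    with Image.open(path) as image:
        data = pytesseract.image_to_data(image, output_type=Output.DICT)
    return filter_ocr_words(data, min_conf)
```

Keeping the filtering separate from the Tesseract call makes the threshold logic easy to test with a hand-written dictionary.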

## Installation

```bash
# Python dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
```

## Usage

```bash
# Process a single PDF
python3 scripts/multimodal/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/multimodal/extract_meaning_kernels.py diagram.png

# Process a directory of files
python3 scripts/multimodal/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -o ./output

# Use configuration file
python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -c config.json
```
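
For orientation, the command-line interface above could be parsed roughly like this. It is a sketch under assumptions: only `-o` and `-c` appear in this README, so the long option names, the default output directory, and the `SUPPORTED` extension set are illustrative guesses rather than the script's real behavior.

```python
from __future__ import annotations

import argparse
from pathlib import Path

# Assumed set of accepted file extensions (not confirmed by the script).
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the usage examples above."""
    parser = argparse.ArgumentParser(description="Extract meaning kernels from diagrams")
    parser.add_argument("input", type=Path, help="PDF, image, or directory of files")
    # Long option names and the default below are hypothetical.
    parser.add_argument("-o", "--output", type=Path, default=Path("output"))
    parser.add_argument("-c", "--config", type=Path, default=None)
    return parser


def collect_inputs(path: Path) -> list[Path]:
    """Expand a directory into its supported files; pass single files through."""
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.suffix.lower() in SUPPORTED)
    return [path]
```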

## Output Structure

For each processed file, the pipeline creates:

```
output_directory/
├── page_001.png          # Converted page images
├── page_002.png
├── meaning_kernels.json  # Structured kernel data
├── meaning_kernels.md    # Human-readable report
└── extraction_stats.json # Processing statistics
```
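
A minimal sketch of how the three report files in this layout might be written. The file names come from the tree above; the function name `write_outputs`, the Markdown layout, and the single `kernel_count` statistic are assumptions for illustration.

```python
from __future__ import annotations

import json
from pathlib import Path


def write_outputs(kernels: list[dict], out_dir: Path) -> dict:
    """Write the JSON data, the Markdown report, and a stats file; return the stats."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Machine-readable kernel data.
    (out_dir / "meaning_kernels.json").write_text(
        json.dumps(kernels, indent=2), encoding="utf-8"
    )

    # Human-readable report: one section per kernel (assumed layout).
    lines = ["# Meaning Kernels", ""]
    for kernel in kernels:
        lines += [f"## {kernel['kernel_id']}", "", kernel["content"], ""]
    (out_dir / "meaning_kernels.md").write_text("\n".join(lines), encoding="utf-8")

    # Hypothetical statistics; the real file may record other fields.
    stats = {"kernel_count": len(kernels)}
    (out_dir / "extraction_stats.json").write_text(
        json.dumps(stats, indent=2), encoding="utf-8"
    )
    return stats
```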

## Meaning Kernel Format

Each kernel contains:

```json
{
  "kernel_id": "kernel_20260413_181234_p1_text",
  "content": "Extracted text content from the diagram",
  "source": "path/to/source/file.png",
  "confidence": 0.85,
  "metadata": {
    "type": "text_extraction",
    "word_count": 42,
    "line_count": 5,
    "structure": {...}
  },
  "timestamp": "2026-04-13T18:12:34.567890",
  "hash": "a1b2c3d4e5f6g7h8"
}
```
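
The fields above map naturally onto a small dataclass. The sketch below is an assumption about how such a kernel could be built, not the script's actual code: `make_kernel` is a hypothetical helper, and deriving `hash` from the first 16 hex characters of a SHA-256 digest of the content is a guess at one reasonable scheme.

```python
from __future__ import annotations

import hashlib
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class MeaningKernel:
    kernel_id: str
    content: str
    source: str
    confidence: float
    metadata: dict = field(default_factory=dict)
    timestamp: str = ""
    hash: str = ""


def make_kernel(content: str, source: str, page: int, confidence: float,
                suffix: str = "text", kernel_type: str = "text_extraction",
                now: datetime | None = None) -> MeaningKernel:
    """Build a kernel whose ID follows the kernel_<date>_<time>_p<page>_<suffix> pattern."""
    now = now or datetime.now()
    return MeaningKernel(
        kernel_id=f"kernel_{now:%Y%m%d_%H%M%S}_p{page}_{suffix}",
        content=content,
        source=source,
        confidence=confidence,
        metadata={
            "type": kernel_type,
            "word_count": len(content.split()),
            "line_count": len(content.splitlines()),
        },
        timestamp=now.isoformat(),
        # Assumed hashing scheme: truncated SHA-256 of the content.
        hash=hashlib.sha256(content.encode("utf-8")).hexdigest()[:16],
    )
```

Calling `asdict(kernel)` yields a plain dictionary that can be passed straight to `json.dumps`.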

## Kernel Types

1. **Text Extraction**: Direct OCR text from the diagram
2. **Structure Analysis**: Diagram type, dimensions, orientation
3. **Summary**: Combined analysis of text and structure
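
As an illustration of the structure-analysis kernel, orientation can be derived from page dimensions alone. This is a simplified sketch: the function name `analyze_structure` and the 0.9 and 1.1 aspect-ratio thresholds are assumptions, and diagram-type detection is not covered here.

```python
def analyze_structure(width: int, height: int) -> dict:
    """Classify orientation from pixel dimensions (thresholds are assumed)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    if ratio > 1.1:
        orientation = "landscape"
    elif ratio < 0.9:
        orientation = "portrait"
    else:
        orientation = "square"
    return {
        "width": width,
        "height": height,
        "aspect_ratio": round(ratio, 2),
        "orientation": orientation,
    }
```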

## Configuration

Create a JSON config file and pass it with `-c`:

```json
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "output_format": ["json", "markdown"],
  "verbose": true
}
```
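
One way to load such a file is to merge it over a set of defaults. The sketch below assumes the values shown above are also the defaults and that unknown keys should be rejected; neither is confirmed by the script, and `load_config` is a hypothetical name.

```python
from __future__ import annotations

import json
from pathlib import Path

# Assumed defaults, copied from the example config above.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "output_format": ["json", "markdown"],
    "verbose": True,
}


def load_config(path: str | Path | None = None) -> dict:
    """Return the defaults, overridden by any keys found in the JSON file at path."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        user = json.loads(Path(path).read_text(encoding="utf-8"))
        unknown = set(user) - set(DEFAULT_CONFIG)
        if unknown:
            # Fail early on typos rather than silently ignoring them.
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        config.update(user)
    return config
```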

## Use Cases

- **Research Analysis**: Extract key concepts from academic papers
- **Knowledge Graphs**: Build structured knowledge from visual information
- **Document Indexing**: Make diagram content searchable
- **Content Summarization**: Generate text summaries of visual content
- **Machine Learning**: Generate training data for multimodal AI models

## Limitations

- OCR quality depends on diagram clarity and resolution
- Structure analysis is a simplified heuristic; a dedicated computer-vision model would be more accurate
- Complex diagrams may need specialized processing
- Large PDFs can be resource-intensive
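
One possible mitigation for large PDFs is to convert pages in batches so that only a few page images are held in memory at a time. The sketch below uses `pdf2image`'s `pdfinfo_from_path` and the `first_page`/`last_page` arguments of `convert_from_path`; the helper names, batch size, and DPI are assumptions, not settings the pipeline is known to use.

```python
from __future__ import annotations

from typing import Iterator


def page_ranges(total_pages: int, batch_size: int = 10) -> Iterator[tuple[int, int]]:
    """Yield 1-based, inclusive (first, last) page ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path: str, out_dir: str, batch_size: int = 10, dpi: int = 200) -> None:
    """Render a PDF to page_NNN.png files one batch at a time (hypothetical helper)."""
    # Imported lazily so page_ranges can be used without poppler installed.
    from pathlib import Path
    from pdf2image import convert_from_path, pdfinfo_from_path

    total = pdfinfo_from_path(pdf_path)["Pages"]
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for first, last in page_ranges(total, batch_size):
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for offset, page in enumerate(pages):
            page.save(target / f"page_{first + offset:03d}.png")
```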

## Future Enhancements

- Computer vision for diagram element detection
- Specialized processors for different diagram types
- Integration with LLMs for semantic analysis
- Batch processing with parallelization
- API endpoint for web integration