scripts/meaning-kernels/README.md

# Meaning Kernel Extraction Pipeline

## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams

## Overview

This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.

## Features

- **PDF Processing**: Converts PDF pages to images for analysis
- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
- **Structure Analysis**: Analyzes diagram type, dimensions, and orientation
- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
- **Confidence Scoring**: Each kernel includes confidence metrics
- **Batch Processing**: Supports single files and directories

## Installation

```bash
# Required dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
```

## Usage

```bash
# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png

# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output

# Run tests
python3 scripts/meaning-kernels/test_extraction.py
```
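
The single-file and directory invocations above imply an input-collection step. The following is a minimal sketch of how that step could look, not the code in `extract_meaning_kernels.py`. The function name `collect_inputs` and the set of accepted extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of supported extensions; the real script may accept others.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def collect_inputs(target):
    """Return a sorted list of supported files for a file or directory path."""
    path = Path(target)
    if path.is_file():
        # A single file is accepted only if its extension is supported.
        return [path] if path.suffix.lower() in SUPPORTED_EXTENSIONS else []
    if path.is_dir():
        # Non-recursive scan: only direct children of the directory.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
        )
    raise FileNotFoundError(f"No such file or directory: {target}")
```

Sorting the result keeps page and file order stable between runs, which makes the generated reports easier to diff.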

## Output Structure

```
output_directory/
├── page_001.png              # Converted page images
├── page_002.png
├── meaning_kernels.json      # Structured kernel data
├── meaning_kernels.md        # Human-readable report
└── extraction_stats.json     # Processing statistics
```
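
After a run, a quick check that the report files listed above were written can catch a silently failed extraction. This is a sketch under the assumption that all three report files are produced on every run; the helper name `missing_outputs` is hypothetical, and page images are left out because they presumably exist only for PDF inputs.

```python
from pathlib import Path

# File names taken from the output listing above.
EXPECTED_FILES = (
    "meaning_kernels.json",
    "meaning_kernels.md",
    "extraction_stats.json",
)


def missing_outputs(output_dir):
    """Return the expected report files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means every expected report file is present.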

## Kernel Types

### 1. Text Kernels
Extracted from OCR processing of diagrams.
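
One plausible way to turn raw OCR output into a text kernel is to drop low-confidence words before joining the rest. The sketch below works on a dict with parallel `text` and `conf` lists, which is the shape `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns. The function name `filter_ocr_words` is hypothetical, and using 50 as the default threshold (mirroring `ocr_confidence_threshold` in the example config) is an assumption about how the pipeline applies that setting.

```python
def filter_ocr_words(ocr_data, threshold=50):
    """Keep words whose OCR confidence meets the threshold.

    ocr_data holds parallel "text" and "conf" lists, as produced by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        # Tesseract reports -1 for layout boxes that contain no word;
        # conf may arrive as a string or a number depending on the version.
        if text.strip() and float(conf) >= threshold:
            words.append(text.strip())
    return " ".join(words)


# Hand-written sample data, not real OCR output.
sample = {"text": ["", "Input", "Layer", "x~q"], "conf": ["-1", 96, 91.5, 32]}
print(filter_ocr_words(sample))  # Input Layer
```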

### 2. Structure Kernels
Diagram structure analysis (type, dimensions, aspect ratio).
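
The dimension-based part of this analysis can be sketched in a few lines. With Pillow, `Image.open(path).size` yields the `(width, height)` pair used as input here. The function name `describe_dimensions`, the returned field names, and the 10% tolerance for calling an image square are all assumptions, not values read from the pipeline.

```python
def describe_dimensions(width, height):
    """Derive aspect ratio and orientation from pixel dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    # Assumed tolerance: within 10% of 1:1 counts as square.
    if ratio > 1.1:
        orientation = "landscape"
    elif ratio < 0.9:
        orientation = "portrait"
    else:
        orientation = "square"
    return {
        "width": width,
        "height": height,
        "aspect_ratio": round(ratio, 2),
        "orientation": orientation,
    }


print(describe_dimensions(1600, 900)["orientation"])  # landscape
```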

### 3. Summary Kernels
Combined analysis summary including extracted text and structure.

### 4. Philosophical Kernels
Detected philosophical themes using keyword analysis.
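
Keyword analysis of this kind can be as simple as counting whole-word matches in the extracted text. The sketch below assumes case-insensitive, exact-word matching against the keyword list from the example config; the function name `detect_themes` is hypothetical and the pipeline's actual matching rules may differ.

```python
import re
from collections import Counter

# Keyword list copied from the example config in this README.
DEFAULT_KEYWORDS = ("truth", "knowledge", "wisdom", "meaning")


def detect_themes(text, keywords=DEFAULT_KEYWORDS):
    """Count whole-word, case-insensitive keyword hits in text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {kw: tokens[kw] for kw in keywords if tokens[kw] > 0}


print(detect_themes("Knowledge without wisdom; meaningful knowledge."))
# {'knowledge': 2, 'wisdom': 1}
```

Note that "meaningful" does not count as a hit for "meaning". Exact-word matching is non-semantic in this sense: it misses inflected forms and synonyms, and it cannot tell whether a matched word is used in a philosophical sense.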

## Configuration

Create a JSON config file:

```json
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
```
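
A config file like this is typically merged over built-in defaults so that a partial file still yields a complete configuration. The following sketch assumes that behavior; the function name `load_config` is hypothetical, and how the script actually receives the config path is not documented here.

```python
import json
from pathlib import Path

# Defaults mirror the example config above.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def load_config(path=None):
    """Return the defaults, overridden by any keys found in the JSON file."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        user = json.loads(Path(path).read_text(encoding="utf-8"))
        # Reject misspelled keys instead of silently ignoring them.
        unknown = set(user) - set(DEFAULT_CONFIG)
        if unknown:
            raise KeyError(f"Unknown config keys: {sorted(unknown)}")
        config.update(user)
    return config
```

Failing on unknown keys is a design choice in this sketch: a typo such as `min_text_lenght` would otherwise leave the default in place without any warning.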

## Limitations

- OCR quality depends on diagram clarity
- Structure analysis provides basic metadata only
- Philosophical extraction is keyword-based (not semantic)
- Large PDFs can be resource-intensive

## Dependencies Notice

If you see warnings like:

```
Warning: pytesseract not available. Install with: pip install pytesseract
Warning: pdf2image not available. Install with: pip install pdf2image
```

Install the Python packages from `requirements.txt`, then install the system packages (Tesseract and Poppler) listed under Installation.
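
To find out which pieces are missing before running the pipeline, a standalone check along these lines can help. It is a sketch, not part of the pipeline: the function name `check_dependencies` is hypothetical, and it assumes Poppler is detectable through its `pdftoppm` binary, which `pdf2image` relies on.

```python
import importlib.util
import shutil

# Import name -> pip package name.
PYTHON_MODULES = {
    "PIL": "Pillow",
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
}
# Executable name -> human-readable label.
SYSTEM_BINARIES = {"tesseract": "Tesseract OCR", "pdftoppm": "Poppler"}


def check_dependencies():
    """Return a list of human-readable descriptions of missing dependencies."""
    missing = []
    for module, package in PYTHON_MODULES.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(f"Python package: {package}")
    for binary, label in SYSTEM_BINARIES.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(binary) is None:
            missing.append(f"System tool: {label} ({binary})")
    return missing


if __name__ == "__main__":
    for item in check_dependencies():
        print("Missing", item)
```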

Commit: fix: add OCR dependencies for meaning kernel extraction. Add requirements.txt and README.md under scripts/meaning-kernels/ to document and enable installation of pytesseract, pdf2image, and Pillow. Addresses missing dependency warnings when running the enhanced meaning kernel extraction pipeline (#493). Closes #563. 2026-04-29 07:15:48 -04:00
