From 25eca504a988463853ed69d2b7014a32ee95ee4c Mon Sep 17 00:00:00 2001
From: Timmy Burn Worker <timmy@example.com>
Date: Wed, 29 Apr 2026 07:15:48 -0400
Subject: [PATCH] fix: add OCR dependencies for meaning kernel extraction

Add requirements.txt and README.md under scripts/meaning-kernels/
to document and enable installation of pytesseract, pdf2image,
and Pillow. Addresses missing dependency warnings when
running the enhanced meaning kernel extraction pipeline (#493).

Closes #563
---
 scripts/meaning-kernels/README.md        | 105 +++++++++++++++++++++++
 scripts/meaning-kernels/requirements.txt |   4 +
 2 files changed, 109 insertions(+)
 create mode 100644 scripts/meaning-kernels/README.md
 create mode 100644 scripts/meaning-kernels/requirements.txt

diff --git a/scripts/meaning-kernels/README.md b/scripts/meaning-kernels/README.md
new file mode 100644
index 00000000..6aeb05bb
--- /dev/null
+++ b/scripts/meaning-kernels/README.md
@@ -0,0 +1,105 @@
+# Meaning Kernel Extraction Pipeline
+
+## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams
+
+## Overview
+
+This pipeline extracts structured meaning kernels from academic PDF diagrams and images. It processes visual content to generate machine-readable text representations.
+
+## Features
+
+- **PDF Processing**: Converts PDF pages to images for analysis
+- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
+- **Structure Analysis**: Analyzes diagram type, dimensions, orientation
+- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
+- **Confidence Scoring**: Each kernel includes confidence metrics
+- **Batch Processing**: Supports single files and directories
+
+## Installation
+
+```bash
+# Python dependencies (run from the repository root)
+pip install -r scripts/meaning-kernels/requirements.txt
+
+# System dependencies (macOS)
+brew install tesseract poppler
+
+# System dependencies (Ubuntu/Debian)
+sudo apt-get install tesseract-ocr poppler-utils
+```
+
+## Usage
+
+```bash
+# Process a single PDF
+python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf
+
+# Process a single image
+python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png
+
+# Process a directory
+python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/
+
+# Specify output directory
+python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output
+
+# Run tests
+python3 scripts/meaning-kernels/test_extraction.py
+```
+
+## Output Structure
+
+```
+output_directory/
+├── page_001.png              # Converted page images
+├── page_002.png
+├── meaning_kernels.json      # Structured kernel data
+├── meaning_kernels.md        # Human-readable report
+└── extraction_stats.json     # Processing statistics
+```
+
+## Kernel Types
+
+### 1. Text Kernels
+Extracted from OCR processing of diagrams.
+
+### 2. Structure Kernels
+Diagram structure analysis (type, dimensions, aspect ratio).
+
+### 3. Summary Kernels
+Combined analysis summary including extracted text and structure.
+
+### 4. Philosophical Kernels
+Detected philosophical themes using keyword analysis.
+
+## Configuration
+
+Create a JSON config file:
+
+```json
+{
+  "ocr_confidence_threshold": 50,
+  "min_text_length": 10,
+  "diagram_types": ["flowchart", "hierarchy", "network"],
+  "extract_philosophical": true,
+  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
+}
+```
+
+## Limitations
+
+- OCR quality depends on diagram clarity
+- Structure analysis provides basic metadata only
+- Philosophical extraction is keyword-based (not semantic)
+- Large PDFs can be resource-intensive, since every page is converted to an image
+
+## Dependencies Notice
+
+If you see warnings like:
+
+```
+Warning: pytesseract not available. Install with: pip install pytesseract
+Warning: pdf2image not available. Install with: pip install pdf2image
+```
+
+Install the Python packages from `requirements.txt` and the system packages (Tesseract for OCR, Poppler for PDF conversion) as described under Installation.
diff --git a/scripts/meaning-kernels/requirements.txt b/scripts/meaning-kernels/requirements.txt
new file mode 100644
index 00000000..0b892c27
--- /dev/null
+++ b/scripts/meaning-kernels/requirements.txt
@@ -0,0 +1,4 @@
+# OCR and PDF processing dependencies for meaning kernel extraction
+Pillow>=10.0.0
+pytesseract>=0.3.10
+pdf2image>=1.16.3