1 Commits

Author SHA1 Message Date
Timmy Burn Worker
25eca504a9 fix: add OCR dependencies for meaning kernel extraction
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 28s
Smoke Test / smoke (pull_request) Failing after 27s
Validate Config / YAML Lint (pull_request) Failing after 18s
Validate Config / JSON Validate (pull_request) Successful in 24s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 1m6s
Validate Config / Python Test Suite (pull_request) Has been skipped
Validate Config / Cron Syntax Check (pull_request) Successful in 13s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 10s
Validate Config / Shell Script Lint (pull_request) Failing after 44s
Validate Config / Playbook Schema Validation (pull_request) Successful in 15s
Architecture Lint / Lint Repository (pull_request) Failing after 15s
PR Checklist / pr-checklist (pull_request) Successful in 6m1s
Add requirements.txt and README.md under scripts/meaning-kernels/
to document and enable installation of pytesseract, pdf2image,
and Pillow. Addresses missing dependency warnings when
running the enhanced meaning kernel extraction pipeline (#493).

Closes #563
2026-04-29 07:17:13 -04:00
2 changed files with 109 additions and 0 deletions

scripts/meaning-kernels/README.md

@@ -0,0 +1,105 @@
# Meaning Kernel Extraction Pipeline
## Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams
## Overview
This pipeline extracts structured meaning kernels from diagrams in academic PDFs and from standalone images. It runs OCR and basic structure analysis on each image, then writes the results as machine-readable JSON alongside a human-readable Markdown report.
## Features
- **PDF Processing**: Converts PDF pages to images for analysis
- **OCR Text Extraction**: Extracts text from diagrams using Tesseract
- **Structure Analysis**: Analyzes diagram type, dimensions, orientation
- **Multiple Kernel Types**: Generates text, structure, summary, and philosophical kernels
- **Confidence Scoring**: Each kernel includes confidence metrics
- **Batch Processing**: Supports single files and directories
## Installation
```bash
# Python dependencies (also listed in requirements.txt with minimum versions)
pip install Pillow pytesseract pdf2image
# System dependencies (macOS): Tesseract for OCR, Poppler for PDF rendering
brew install tesseract poppler
# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
```
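After installing, it can help to confirm that the system executables are actually on `PATH`. The following is a minimal sketch, not part of the pipeline: the helper name `missing_binaries` is hypothetical, and it assumes the relevant executables are `tesseract` (called by pytesseract) and `pdftoppm` (shipped with Poppler and used by pdf2image).

```python
import shutil

# Executables assumed to be needed on PATH:
# `tesseract` (called by pytesseract) and `pdftoppm` (part of Poppler, used by pdf2image).
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def missing_binaries(names=REQUIRED_BINARIES):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All system dependencies found.")
```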
## Usage
```bash
# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf
# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png
# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/
# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output
# Run tests
python3 scripts/meaning-kernels/test_extraction.py
```
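The commands above accept either a single file or a directory. One way such input resolution could work is sketched below; `collect_inputs` and the extension set are assumptions for illustration, and the real script may accept other formats or recurse into subdirectories.

```python
from pathlib import Path

# Assumed set of supported extensions; the real script may accept others.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_inputs(target):
    """Return a sorted list of supported files for a file or directory path."""
    path = Path(target)
    if path.is_dir():
        # Non-recursive: only files directly inside the directory are considered.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
        return [path]
    # Missing paths and unsupported file types yield no work items.
    return []
```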
## Output Structure
```
output_directory/
├── page_001.png # Converted page images
├── page_002.png
├── meaning_kernels.json # Structured kernel data
├── meaning_kernels.md # Human-readable report
└── extraction_stats.json # Processing statistics
```
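A quick way to sanity-check a run is to compare the output directory against the layout above. This sketch uses only the file names shown in that layout; the function name `summarize_output` and the shape of its return value are hypothetical.

```python
from pathlib import Path

# Report file names taken from the documented output layout.
EXPECTED_REPORTS = (
    "meaning_kernels.json",
    "meaning_kernels.md",
    "extraction_stats.json",
)


def summarize_output(output_dir):
    """Report which documented artifacts exist in `output_dir`."""
    root = Path(output_dir)
    return {
        # Converted page images, e.g. page_001.png, page_002.png
        "pages": sorted(p.name for p in root.glob("page_*.png")),
        # Documented report files that were not written
        "missing_reports": [
            name for name in EXPECTED_REPORTS if not (root / name).is_file()
        ],
    }
```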
## Kernel Types
### 1. Text Kernels
Extracted from OCR processing of diagrams.
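As an illustration of how low-confidence OCR output might be discarded, the sketch below filters word-level results by confidence. It assumes data in the shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, which includes parallel `text` and `conf` lists. The function name is hypothetical, and whether the pipeline applies `ocr_confidence_threshold` in exactly this way is an assumption.

```python
def filter_confident_words(ocr_data, threshold=50):
    """Join the words whose OCR confidence is at least `threshold`."""
    words = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        # Confidence may arrive as int, float, or numeric string;
        # Tesseract reports -1 for boxes that contain no recognized word.
        if text.strip() and float(conf) >= threshold:
            words.append(text.strip())
    return " ".join(words)


# Hand-written sample in the same shape as the pytesseract result.
sample = {
    "text": ["", "Input", "layer", "n0ise"],
    "conf": [-1, 96, "88.5", 12],
}
print(filter_confident_words(sample))  # prints: Input layer
```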
### 2. Structure Kernels
Diagram structure analysis (type, dimensions, aspect ratio).
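The dimension-based part of this metadata can be derived from pixel sizes alone. The sketch below is illustrative: the function name and the output field names are assumptions, not the pipeline's actual schema, and diagram type detection is not covered.

```python
def describe_dimensions(width, height):
    """Derive aspect ratio and orientation from pixel dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width > height:
        orientation = "landscape"
    elif height > width:
        orientation = "portrait"
    else:
        orientation = "square"
    return {
        "width": width,
        "height": height,
        "aspect_ratio": round(width / height, 3),
        "orientation": orientation,
    }
```

With Pillow, the inputs would typically come from `Image.open(path).size`, which returns a `(width, height)` tuple.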
### 3. Summary Kernels
Combined analysis summary including extracted text and structure.
### 4. Philosophical Kernels
Detected philosophical themes using keyword analysis.
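Keyword analysis of this kind can be as simple as counting whole-word matches. The following is a minimal sketch under that assumption; `detect_themes` is a hypothetical name, and the pipeline's actual matching rules (stemming, phrase matching, scoring) are not documented here.

```python
import re

# Default list mirrors the `philosophical_keywords` example in the Configuration section.
DEFAULT_KEYWORDS = ("truth", "knowledge", "wisdom", "meaning")


def detect_themes(text, keywords=DEFAULT_KEYWORDS):
    """Count whole-word, case-insensitive keyword hits in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {kw: tokens.count(kw) for kw in keywords}
    # Only report keywords that actually occur.
    return {kw: n for kw, n in counts.items() if n > 0}


print(detect_themes("Knowledge without wisdom is not truth. Knowledge alone misleads."))
# prints: {'truth': 1, 'knowledge': 2, 'wisdom': 1}
```

Because matching is on whole words, a word such as "meaningful" does not count as a hit for "meaning", which illustrates why this approach is not semantic.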
## Configuration
Create a JSON config file:
```json
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
```
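A config file like this is usually merged over built-in defaults so that partial files remain valid. The sketch below shows one way to do that; `load_config` is a hypothetical helper, the defaults are copied from the example above, and how the script actually receives the config path is not specified in this README.

```python
import copy
import json
from pathlib import Path

# Defaults copied from the example config; treating them as defaults is an assumption.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def load_config(path=None):
    """Return defaults overlaid with values from a JSON file, rejecting unknown keys."""
    # Deep copy so callers cannot mutate the shared default lists.
    config = copy.deepcopy(DEFAULT_CONFIG)
    if path is not None:
        overrides = json.loads(Path(path).read_text(encoding="utf-8"))
        unknown = set(overrides) - set(DEFAULT_CONFIG)
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        config.update(overrides)
    return config
```

Rejecting unknown keys is a design choice that surfaces typos such as `ocr_confidence_treshold` early instead of silently ignoring them.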
## Limitations
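- OCR quality depends on diagram clarity (resolution, contrast, font size)
- Structure analysis provides basic metadata only (type, dimensions, aspect ratio)
- Philosophical extraction is keyword-based, not semantic
- Large PDFs can be resource-intensive, since each page is converted to an image before OCR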
## Dependencies Notice
If you see warnings like:
```
Warning: pytesseract not available. Install with: pip install pytesseract
Warning: pdf2image not available. Install with: pip install pdf2image
```
Install the Python packages with `pip install -r scripts/meaning-kernels/requirements.txt`, then install the system packages (Tesseract and Poppler) listed under Installation.
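Warnings like the ones above typically come from an optional-import guard. The sketch below reproduces the same message format without importing the packages; the helper name `dependency_warnings` and the mapping are illustrative, not the script's actual code.

```python
import importlib.util

# Import name -> pip package name, matching the entries in requirements.txt.
OPTIONAL_PACKAGES = {
    "PIL": "Pillow",
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
}


def dependency_warnings(packages=OPTIONAL_PACKAGES):
    """Return one warning line per package that cannot be imported."""
    return [
        f"Warning: {pip_name} not available. Install with: pip install {pip_name}"
        for module, pip_name in packages.items()
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    for line in dependency_warnings():
        print(line)
```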

scripts/meaning-kernels/requirements.txt

@@ -0,0 +1,4 @@
# OCR and PDF processing dependencies for meaning kernel extraction
Pillow>=10.0.0
pytesseract>=0.3.10
pdf2image>=1.16.3