timmy-config/scripts/meaning-kernels/README.md
Timmy Burn Worker 25eca504a9
fix: add OCR dependencies for meaning kernel extraction
Add requirements.txt and README.md under scripts/meaning-kernels/
to document and enable installation of pytesseract, pdf2image,
and Pillow. Addresses missing dependency warnings when
running the enhanced meaning kernel extraction pipeline (#493).

Closes #563
2026-04-29 07:17:13 -04:00

Meaning Kernel Extraction Pipeline

Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams

Overview

This pipeline extracts structured meaning kernels from diagrams in academic PDFs and from standalone images. It converts visual content into machine-readable text representations.

Features

  • PDF Processing: Converts PDF pages to images for analysis
  • OCR Text Extraction: Extracts text from diagrams using Tesseract
  • Structure Analysis: Analyzes diagram type, dimensions, orientation
  • Multiple Kernel Types: Generates text, structure, summary, and philosophical kernels
  • Confidence Scoring: Each kernel includes confidence metrics
  • Batch Processing: Supports single files and directories

Installation

# Required dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
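After installing, it can help to confirm that the system executables are actually on PATH. The following is a minimal sketch, not part of the pipeline: the helper name `missing_binaries` is hypothetical, and it assumes that `tesseract` (used by pytesseract) and `pdftoppm` (shipped with Poppler and used by pdf2image) are the two binaries that matter.

```python
import shutil

# Executables assumed to back the Python packages:
# `tesseract` for pytesseract, `pdftoppm` (part of Poppler) for pdf2image.
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def missing_binaries(names=REQUIRED_BINARIES):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

An empty list means both executables were found.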

Usage

# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png

# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output

# Run tests
python3 scripts/meaning-kernels/test_extraction.py
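The script accepts either a single file or a directory. One way such input handling could work is sketched below. The function name `collect_inputs` and the extension list are assumptions made for illustration; the actual script may accept other formats or recurse into subdirectories.

```python
from pathlib import Path

# Assumed set of supported extensions; the real script may accept others.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_inputs(path):
    """Return a sorted list of supported files for a file or directory path."""
    p = Path(path)
    if p.is_file():
        # A single file is accepted only if its extension is supported.
        return [p] if p.suffix.lower() in SUPPORTED_SUFFIXES else []
    if p.is_dir():
        # Non-recursive: only direct children of the directory are considered.
        return sorted(
            child for child in p.iterdir()
            if child.is_file() and child.suffix.lower() in SUPPORTED_SUFFIXES
        )
    raise FileNotFoundError(f"No such file or directory: {path}")
```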

Output Structure

output_directory/
├── page_001.png              # Converted page images
├── page_002.png
├── meaning_kernels.json      # Structured kernel data
├── meaning_kernels.md        # Human-readable report
└── extraction_stats.json     # Processing statistics
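To check that a run produced the files listed above, a small helper along these lines may be useful. It is a sketch only: `summarize_output` is a hypothetical name, and it inspects file names without assuming anything about the contents of the JSON files.

```python
from pathlib import Path

# File names taken from the output layout shown above.
EXPECTED_FILES = (
    "meaning_kernels.json",
    "meaning_kernels.md",
    "extraction_stats.json",
)


def summarize_output(output_dir):
    """List the page images present and the expected report files that are missing."""
    out = Path(output_dir)
    pages = sorted(p.name for p in out.glob("page_*.png"))
    missing = [name for name in EXPECTED_FILES if not (out / name).is_file()]
    return {"pages": pages, "missing": missing}
```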

Kernel Types

1. Text Kernels

Extracted from OCR processing of diagrams.
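As an illustration of how OCR output might be turned into a text kernel, the sketch below filters words by confidence and drops results that are too short. It assumes the input has the shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, which provides parallel `text` and `conf` lists, and that the `ocr_confidence_threshold` and `min_text_length` settings from the Configuration section behave as shown. The function name `filter_words` is hypothetical, and the actual script may work differently.

```python
def filter_words(data, confidence_threshold=50, min_text_length=10):
    """Join confidently recognized words; return "" if the result is too short."""
    kept = []
    for text, conf in zip(data["text"], data["conf"]):
        word = text.strip()
        # `conf` may arrive as a string or a number; -1 marks non-word entries,
        # which fall below any non-negative threshold.
        if word and float(conf) >= confidence_threshold:
            kept.append(word)
    joined = " ".join(kept)
    return joined if len(joined) >= min_text_length else ""


# Hand-written sample in the assumed shape (not real OCR output).
sample = {
    "text": ["", "Knowledge", "graph", "x1"],
    "conf": ["-1", "96", 88.5, "12"],
}
print(filter_words(sample))  # prints "Knowledge graph"
```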

2. Structure Kernels

Diagram structure analysis (type, dimensions, aspect ratio).
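The metadata involved can be derived from the image dimensions alone. A minimal sketch follows; the function name `describe_structure`, the field names, and the three-way orientation rule are assumptions for illustration, and diagram type detection is not covered.

```python
def describe_structure(width, height):
    """Derive basic structure metadata from pixel dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width > height:
        orientation = "landscape"
    elif height > width:
        orientation = "portrait"
    else:
        orientation = "square"
    return {
        "width": width,
        "height": height,
        "aspect_ratio": round(width / height, 3),
        "orientation": orientation,
    }
```

With Pillow, the dimensions are available as `Image.open(path).size`, which returns a `(width, height)` tuple.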

3. Summary Kernels

Combined analysis summary including extracted text and structure.

4. Philosophical Kernels

Detected philosophical themes using keyword analysis.
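Keyword-based detection can be as simple as counting whole-word matches. The sketch below assumes case-insensitive matching against the keyword list from the example configuration; `find_themes` is a hypothetical name and the actual script may score matches differently.

```python
import re
from collections import Counter

# Keywords taken from the example configuration in this README.
DEFAULT_KEYWORDS = ("truth", "knowledge", "wisdom", "meaning")


def find_themes(text, keywords=DEFAULT_KEYWORDS):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {kw: words[kw.lower()] for kw in keywords if words[kw.lower()] > 0}
```

Because only whole words are matched, "truthful" does not count as a hit for "truth". This is the kind of gap that keyword matching leaves compared with semantic analysis.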

Configuration

Create a JSON config file:

{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
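A config file like this could be loaded and merged over defaults as sketched below. Treating the example values as defaults and rejecting unknown keys are both assumptions made for illustration; `load_config` is a hypothetical helper, not the script's documented interface.

```python
import json

# Assumed defaults, copied from the example configuration above.
DEFAULTS = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def load_config(path=None):
    """Return the defaults, overridden by the JSON file at `path` if given."""
    config = dict(DEFAULTS)
    if path is not None:
        with open(path, encoding="utf-8") as fh:
            overrides = json.load(fh)
        # Fail early on typos instead of silently ignoring them.
        unknown = set(overrides) - set(DEFAULTS)
        if unknown:
            raise KeyError(f"Unknown config keys: {sorted(unknown)}")
        config.update(overrides)
    return config
```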

Limitations

  • OCR quality depends on diagram clarity
  • Structure analysis provides basic metadata only
  • Philosophical extraction is keyword-based (not semantic)
  • Large PDFs can be resource-intensive, since each page is converted to an image before analysis

Dependencies Notice

If you see warnings like:

Warning: pytesseract not available. Install with: pip install pytesseract
Warning: pdf2image not available. Install with: pip install pdf2image

Install the Python packages from requirements.txt, then install the system dependencies (Tesseract for OCR, Poppler for PDF conversion) as shown in the Installation section.
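To see ahead of a run which Python packages are missing, a check along these lines can be used. It is a sketch, not the script's own logic: `dependency_warnings` is a hypothetical name, and the messages only imitate the format of the warnings quoted above.

```python
import importlib.util

# Import name -> pip package name.
OPTIONAL_PACKAGES = {
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
    "PIL": "Pillow",
}


def dependency_warnings(packages=OPTIONAL_PACKAGES):
    """Return one warning line per package that cannot be imported."""
    warnings = []
    for module_name, pip_name in packages.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module_name) is None:
            warnings.append(
                f"Warning: {module_name} not available. "
                f"Install with: pip install {pip_name}"
            )
    return warnings


if __name__ == "__main__":
    for line in dependency_warnings():
        print(line)
```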