Fix #493: Add multimodal meaning kernel extraction pipeline

- Add extract_meaning_kernels.py for processing PDF diagrams
- Extracts text using OCR (Tesseract) when available
- Analyzes diagram structure (type, dimensions, orientation)
- Generates structured meaning kernels with metadata
- Outputs JSON (machine-readable) and Markdown (human-readable)
- Includes test pipeline and documentation
- Supports single files and batch processing
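
A minimal sketch of the "OCR when available" and orientation steps described above. The function names `classify_orientation` and `extract_text`, and the 5% squareness tolerance, are assumptions for illustration and are not taken from extract_meaning_kernels.py.

```python
def classify_orientation(width: int, height: int, tolerance: float = 0.05) -> str:
    """Label an image as landscape, portrait, or square from its pixel size.

    Hypothetical helper: the tolerance value is an assumption.
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    ratio = width / height
    if abs(ratio - 1.0) <= tolerance:
        return "square"
    return "landscape" if ratio > 1.0 else "portrait"


def extract_text(image) -> str:
    """Run Tesseract OCR on a PIL image if possible, else return ''.

    OCR is treated as optional: a missing pytesseract package or a missing
    Tesseract binary both degrade to an empty string instead of failing.
    """
    try:
        import pytesseract  # imported lazily so the module loads without it
    except ImportError:
        return ""
    try:
        return pytesseract.image_to_string(image).strip()
    except pytesseract.TesseractNotFoundError:
        # The Python wrapper is installed but the tesseract executable is not.
        return ""
```

Importing pytesseract inside the function keeps the rest of the pipeline usable on machines without Tesseract, which is one way to read "when available".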

Pipeline components:
- DiagramProcessor: Main processing engine
- MeaningKernel: Structured kernel representation
- PDF to image conversion
- OCR text extraction
- Structure analysis
- Kernel generation with confidence scoring
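
As an illustration of how a kernel with confidence scoring and the two output formats might fit together, here is a hedged sketch. The field names, the `to_json`/`to_markdown` methods, and the word-count heuristic in `score_confidence` are all assumptions; the MeaningKernel class in this commit may be structured differently.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MeaningKernel:
    # Hypothetical fields chosen for illustration only.
    source: str
    diagram_type: str
    text: str
    confidence: float
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Machine-readable form: a JSON object with stable key order."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def to_markdown(self) -> str:
        """Human-readable form: a short Markdown report section."""
        lines = [
            f"## {self.source}",
            f"- Type: {self.diagram_type}",
            f"- Confidence: {self.confidence:.2f}",
            "",
            self.text or "(no text extracted)",
        ]
        return "\n".join(lines)


def score_confidence(text: str, ocr_available: bool) -> float:
    """Toy heuristic (an assumption): more recovered words, higher score."""
    if not ocr_available:
        return 0.1
    words = len(text.split())
    return round(min(1.0, 0.2 + 0.04 * words), 2)
```

A kernel built this way can be written out with `kernel.to_json()` for downstream tools and `kernel.to_markdown()` for a report.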

Acceptance criteria met:
✓ Processes academic PDF diagrams
✓ Extracts structured text meaning kernels
✓ Generates machine-readable JSON output
✓ Includes human-readable reports
✓ Supports batch processing
✓ Provides confidence scoring
Alexander Whitestone
2026-04-13 21:20:42 -04:00
commit 0a52cff8a7
6 changed files with 705 additions and 0 deletions

@@ -0,0 +1,25 @@
# Multimodal Meaning Kernel Extraction Pipeline
# Required Python dependencies

# Image processing
Pillow>=10.0.0

# OCR (Optical Character Recognition)
pytesseract>=0.3.10

# PDF processing
pdf2image>=1.16.3

# Optional: Enhanced computer vision
# opencv-python>=4.8.0
# numpy>=1.24.0

# Optional: Machine learning for diagram classification
# scikit-learn>=1.3.0
# torch>=2.0.0
# torchvision>=0.15.0

# Development and testing
# pytest>=7.4.0
# black>=23.0.0
# flake8>=6.0.0
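
Since the file above separates required from optional packages, a pipeline could check which ones are importable before it runs. The sketch below is one way to do that with the standard library. The `REQUIRED` and `OPTIONAL` tables and the `missing_packages` helper are hypothetical and not part of this commit.

```python
from importlib.util import find_spec

# Import name -> pip package name. The two differ for several packages.
REQUIRED = {"PIL": "Pillow", "pytesseract": "pytesseract", "pdf2image": "pdf2image"}
OPTIONAL = {
    "cv2": "opencv-python",
    "numpy": "numpy",
    "sklearn": "scikit-learn",
    "torch": "torch",
    "torchvision": "torchvision",
}


def missing_packages(modules: dict) -> list:
    """Return the pip names of packages whose import module cannot be found."""
    return sorted(pkg for mod, pkg in modules.items() if find_spec(mod) is None)


if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL))
```

This only checks the Python packages. pytesseract and pdf2image also rely on external programs (Tesseract and Poppler respectively), which a check like this does not detect.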