Add requirements.txt and README.md under scripts/meaning-kernels/ to document and enable installation of pytesseract, pdf2image, and Pillow. Addresses missing dependency warnings when running the enhanced meaning kernel extraction pipeline (#493). Closes #563
Meaning Kernel Extraction Pipeline
Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams
Overview
This pipeline extracts structured meaning kernels from diagrams in academic PDFs and from standalone image files. It converts visual content into machine-readable text representations.
Features
- PDF Processing: Converts PDF pages to images for analysis
- OCR Text Extraction: Extracts text from diagrams using Tesseract
- Structure Analysis: Analyzes diagram type, dimensions, and orientation
- Multiple Kernel Types: Generates text, structure, summary, and philosophical kernels
- Confidence Scoring: Each kernel includes confidence metrics
- Batch Processing: Supports single files and directories
Installation
# Required dependencies
pip install Pillow pytesseract pdf2image
# System dependencies (macOS)
brew install tesseract poppler
# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
Usage
# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf
# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png
# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/
# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output
# Run tests
python3 scripts/meaning-kernels/test_extraction.py
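The commands above accept either a single file or a directory. A minimal sketch of how such an input path could be expanded into a list of files is shown here; `collect_inputs` and `SUPPORTED_SUFFIXES` are hypothetical names, and the set of accepted extensions is an assumption, not taken from `extract_meaning_kernels.py`.

```python
from pathlib import Path

# Assumed set of supported extensions; the real script may accept others.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_inputs(path):
    """Return a sorted list of processable files for a file or directory path."""
    p = Path(path)
    if p.is_dir():
        # Non-recursive: only direct children of the directory are considered.
        return sorted(
            child
            for child in p.iterdir()
            if child.is_file() and child.suffix.lower() in SUPPORTED_SUFFIXES
        )
    if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES:
        return [p]
    # Missing paths and unsupported file types yield nothing to process.
    return []
```

Whether the actual pipeline descends into subdirectories is not stated; this sketch deliberately does not.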
Output Structure
output_directory/
├── page_001.png # Converted page images
├── page_002.png
├── meaning_kernels.json # Structured kernel data
├── meaning_kernels.md # Human-readable report
└── extraction_stats.json # Processing statistics
Kernel Types
1. Text Kernels
Extracted from OCR processing of diagrams.
2. Structure Kernels
Diagram structure analysis (type, dimensions, aspect ratio).
3. Summary Kernels
Combined analysis summary including extracted text and structure.
4. Philosophical Kernels
Detected philosophical themes using keyword analysis.
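As an illustration of what keyword analysis can look like, the sketch below matches whole words case-insensitively. The function name `find_philosophical_themes` is hypothetical, and the matching rule is an assumption; the pipeline's own implementation may differ.

```python
import re


def find_philosophical_themes(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    # Split the text into lowercase alphabetic words.
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Preserve the order of the keyword list in the result.
    return [kw for kw in keywords if kw.lower() in words]


themes = find_philosophical_themes(
    "Knowledge flows from observation toward meaning.",
    ["truth", "knowledge", "wisdom", "meaning"],
)
print(themes)  # ['knowledge', 'meaning']
```

Because this compares whole words only, a text containing "meaningful" but not "meaning" would produce no match for "meaning".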
Configuration
Create a JSON config file:
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
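A config file like this could be loaded and merged over defaults roughly as follows. This is a sketch only: `load_config` and `DEFAULT_CONFIG` are hypothetical names, and both the fallback values and the 0 to 100 range check on `ocr_confidence_threshold` are assumptions (Tesseract reports confidences on that scale).

```python
import json
from pathlib import Path

# Fallback values copied from the example config; using them as defaults is an assumption.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def load_config(path=None):
    """Return the defaults, overridden by keys from the JSON file at path (if any)."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        user = json.loads(Path(path).read_text(encoding="utf-8"))
        # Reject misspelled keys early instead of silently ignoring them.
        unknown = set(user) - set(DEFAULT_CONFIG)
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        config.update(user)
    threshold = config["ocr_confidence_threshold"]
    if not 0 <= threshold <= 100:
        raise ValueError("ocr_confidence_threshold must be between 0 and 100")
    return config
```

Rejecting unknown keys is a design choice made here so that a typo such as `ocr_confidence_treshold` fails loudly rather than leaving the default in effect.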
Limitations
- OCR quality depends on diagram clarity
- Structure analysis provides basic metadata only
- Philosophical extraction is keyword-based (not semantic)
- Large PDFs can be resource-intensive
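To make the "basic metadata only" limitation concrete, the sketch below derives the kind of fields a structure kernel might carry from nothing more than pixel dimensions. `describe_structure` is a hypothetical helper, and the field names are assumptions rather than the pipeline's actual output schema. With Pillow, the dimensions could come from `Image.open(path).size`, which returns `(width, height)`.

```python
def describe_structure(width, height):
    """Derive basic structural metadata from image dimensions in pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect_ratio = width / height
    if aspect_ratio > 1:
        orientation = "landscape"
    elif aspect_ratio < 1:
        orientation = "portrait"
    else:
        orientation = "square"
    return {
        "width": width,
        "height": height,
        "aspect_ratio": round(aspect_ratio, 3),
        "orientation": orientation,
    }


print(describe_structure(1600, 900))
# {'width': 1600, 'height': 900, 'aspect_ratio': 1.778, 'orientation': 'landscape'}
```

Nothing here inspects the diagram's content, which is why classification into types such as flowchart or network needs more than this metadata.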
Dependencies Notice
If you see warnings like:
Warning: pytesseract not available. Install with: pip install pytesseract
Warning: pdf2image not available. Install with: pip install pdf2image
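Install the Python packages listed in scripts/meaning-kernels/requirements.txt, then install the system dependencies shown above: Tesseract (the OCR engine used by pytesseract) and Poppler (required by pdf2image).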
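A quick way to see which pieces are missing before running the pipeline is sketched below. It uses only the standard library; `missing_dependencies` is a hypothetical helper and not part of the pipeline. The check for `pdftoppm` relies on that binary being shipped with Poppler.

```python
import importlib.util
import shutil

# Import name -> pip package name (Pillow is imported as PIL).
PYTHON_MODULES = {
    "PIL": "Pillow",
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
}
# Executable -> system package that provides it.
SYSTEM_BINARIES = {
    "tesseract": "tesseract",
    "pdftoppm": "poppler",
}


def missing_dependencies():
    """Return (missing pip packages, missing system packages) as two lists."""
    missing_py = [
        pkg
        for module, pkg in PYTHON_MODULES.items()
        if importlib.util.find_spec(module) is None
    ]
    missing_sys = [
        pkg for exe, pkg in SYSTEM_BINARIES.items() if shutil.which(exe) is None
    ]
    return missing_py, missing_sys


if __name__ == "__main__":
    py, system = missing_dependencies()
    print("Missing Python packages:", py or "none")
    print("Missing system packages:", system or "none")
```

The system package names printed here follow the Homebrew naming; on Ubuntu/Debian the corresponding packages are tesseract-ocr and poppler-utils.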