Fix #493: Add multimodal meaning kernel extraction pipeline
- Added extract_meaning_kernels.py for processing PDF diagrams
- Extracts text using OCR (Tesseract) when available
- Analyzes diagram structure (type, dimensions, orientation)
- Generates structured meaning kernels with metadata
- Outputs JSON (machine-readable) and Markdown (human-readable)
- Includes test pipeline and documentation
- Supports single files and batch processing

Pipeline components:
- DiagramProcessor: Main processing engine
- MeaningKernel: Structured kernel representation
- PDF to image conversion
- OCR text extraction
- Structure analysis
- Kernel generation with confidence scoring

Acceptance criteria met:
✓ Processes academic PDF diagrams
✓ Extracts structured text meaning kernels
✓ Generates machine-readable JSON output
✓ Includes human-readable reports
✓ Supports batch processing
✓ Provides confidence scoring
128
scripts/multimodal/README.md
Normal file
@@ -0,0 +1,128 @@
# Multimodal Meaning Kernel Extraction Pipeline

Extracts structured, text-based meaning kernels from diagrams in academic PDFs.

## Issue #493

[Multimodal] Extract Meaning Kernels from Research Diagrams

## Overview

This pipeline processes academic PDF diagrams and images to extract structured "meaning kernels": discrete units of meaning that can be stored, indexed, and analyzed.

## Features

- **PDF Processing**: Converts PDF pages to images and processes each page
- **OCR Text Extraction**: Extracts text from diagrams using Tesseract OCR
- **Structure Analysis**: Analyzes diagram structure (type, dimensions, orientation)
- **Kernel Generation**: Creates structured meaning kernels with metadata
- **Multiple Output Formats**: JSON for machine processing, Markdown for human readability
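
The first two stages listed above (page conversion, then OCR) can be sketched roughly as follows. This is a minimal illustration, not the code in `extract_meaning_kernels.py`: the function `extract_page_texts` and its `render`/`ocr` parameters are hypothetical names, and the default callables assume `pdf2image` and `pytesseract` are installed.

```python
from typing import Callable, List, Optional


def _default_render(pdf_path: str) -> list:
    # Lazy import so this module still loads when pdf2image is missing.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=200)


def _default_ocr(image) -> str:
    # Lazy import so this module still loads when pytesseract is missing.
    import pytesseract
    return pytesseract.image_to_string(image)


def extract_page_texts(
    pdf_path: str,
    render: Optional[Callable[[str], list]] = None,
    ocr: Optional[Callable[[object], str]] = None,
) -> List[dict]:
    """Convert each PDF page to an image, then OCR it (hypothetical helper)."""
    render = render or _default_render
    ocr = ocr or _default_ocr
    results = []
    for page_number, image in enumerate(render(pdf_path), start=1):
        text = ocr(image).strip()
        results.append({"page": page_number, "text": text})
    return results
```

Passing the renderer and OCR function as parameters keeps the flow testable without Tesseract or poppler, which may be useful given that OCR is described as optional.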

## Installation

```bash
# Required dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
```
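
To confirm the installation before running the pipeline, a small check along these lines may help. It is a sketch only: `check_dependencies` is a hypothetical helper, not part of the script. It relies on Pillow being imported as `PIL` and on `pdftoppm` being one of the binaries that poppler installs and `pdf2image` calls.

```python
import importlib.util
import shutil


def check_dependencies() -> dict:
    """Report which Python packages and system binaries can be found."""
    # Import names differ from pip names: Pillow is imported as "PIL".
    python_modules = ["PIL", "pytesseract", "pdf2image"]
    # "tesseract" performs OCR; "pdftoppm" ships with poppler.
    binaries = ["tesseract", "pdftoppm"]
    status = {}
    for name in python_modules:
        status[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```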

## Usage

```bash
# Process a single PDF
python3 scripts/multimodal/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/multimodal/extract_meaning_kernels.py diagram.png

# Process a directory of files
python3 scripts/multimodal/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -o ./output

# Use configuration file
python3 scripts/multimodal/extract_meaning_kernels.py paper.pdf -c config.json
```
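
A command line like the one above could be parsed roughly as shown here. Only the `-o` and `-c` flags come from the usage examples; the long option names, the default output directory, and the set of supported file extensions are assumptions made for illustration.

```python
import argparse
from pathlib import Path
from typing import List

# Assumed set of supported extensions; the real script may differ.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract meaning kernels")
    parser.add_argument("input", help="PDF, image, or directory to process")
    # Long option names and defaults are assumptions for illustration.
    parser.add_argument("-o", "--output", default="./output", help="output directory")
    parser.add_argument("-c", "--config", default=None, help="JSON config file")
    return parser


def collect_inputs(path: Path) -> List[Path]:
    """Return the files to process: the path itself, or its supported children."""
    if path.is_dir():
        return sorted(
            p for p in path.iterdir() if p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    return [path]
```

Treating a directory as a list of files lets single-file and batch runs share one processing loop.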

## Output Structure

For each processed file, the pipeline creates:

```
output_directory/
├── page_001.png            # Converted page images
├── page_002.png
├── meaning_kernels.json    # Structured kernel data
├── meaning_kernels.md      # Human-readable report
└── extraction_stats.json   # Processing statistics
```
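
Writing the two kernel files might look like the sketch below. The file names match the tree above, but `write_outputs` is a hypothetical helper and the layout of the Markdown report is an assumption, since the README does not specify it.

```python
import json
from pathlib import Path
from typing import List


def write_outputs(kernels: List[dict], output_dir: str) -> None:
    """Write kernels as JSON plus a short Markdown report (hypothetical helper)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Machine-readable output.
    (out / "meaning_kernels.json").write_text(
        json.dumps(kernels, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Human-readable output: one section per kernel (assumed layout).
    lines = ["# Meaning Kernels", ""]
    for kernel in kernels:
        lines.append(f"## {kernel['kernel_id']}")
        lines.append(f"- Source: {kernel['source']}")
        lines.append(f"- Confidence: {kernel['confidence']:.2f}")
        lines.append("")
        lines.append(kernel["content"])
        lines.append("")
    (out / "meaning_kernels.md").write_text("\n".join(lines), encoding="utf-8")
```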

## Meaning Kernel Format

Each kernel contains:

```json
{
  "kernel_id": "kernel_20260413_181234_p1_text",
  "content": "Extracted text content from the diagram",
  "source": "path/to/source/file.png",
  "confidence": 0.85,
  "metadata": {
    "type": "text_extraction",
    "word_count": 42,
    "line_count": 5,
    "structure": {...}
  },
  "timestamp": "2026-04-13T18:12:34.567890",
  "hash": "a1b2c3d4e5f6g7h8"
}
```
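
In Python, a record with these fields could be modeled as a dataclass. The sketch below borrows the `MeaningKernel` name from the commit message, but it is not the actual class: the hashing scheme (a truncated SHA-256 of the content) and the helper `make_kernel_id` are assumptions inferred from the example values.

```python
import hashlib
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class MeaningKernel:
    kernel_id: str
    content: str
    source: str
    confidence: float
    metadata: dict = field(default_factory=dict)
    timestamp: str = ""
    hash: str = ""

    def __post_init__(self) -> None:
        if not self.timestamp:
            self.timestamp = datetime.now().isoformat()
        if not self.hash:
            # Assumption: truncated SHA-256 of the content; the real scheme
            # is not documented in the README.
            digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()
            self.hash = digest[:16]

    def to_dict(self) -> dict:
        return asdict(self)


def make_kernel_id(when: datetime, page: int, kind: str) -> str:
    """Build an ID shaped like the example above (hypothetical helper)."""
    return f"kernel_{when:%Y%m%d_%H%M%S}_p{page}_{kind}"
```

A content-derived hash of this kind would let downstream tools detect duplicate kernels across runs, though whether the pipeline uses it that way is not stated.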

## Kernel Types

1. **Text Extraction**: Direct OCR text from the diagram
2. **Structure Analysis**: Diagram type, dimensions, orientation
3. **Summary**: Combined analysis of text and structure
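
As an illustration of the simplified structure analysis, orientation can be derived from image dimensions alone. This is a sketch under that assumption: `analyze_structure` and its output keys are hypothetical, and the real script may classify diagrams differently.

```python
def analyze_structure(width: int, height: int) -> dict:
    """Derive simple structural facts from image dimensions alone."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width > height:
        orientation = "landscape"
    elif height > width:
        orientation = "portrait"
    else:
        orientation = "square"
    return {
        "dimensions": {"width": width, "height": height},
        "aspect_ratio": round(width / height, 2),
        "orientation": orientation,
    }
```

With Pillow, the inputs would typically come from `Image.open(path).size`, which returns a `(width, height)` tuple.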

## Configuration

Create a JSON config file:

```json
{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "output_format": ["json", "markdown"],
  "verbose": true
}
```
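
One plausible way to consume such a file is to overlay it on defaults and then apply the two text-related settings to OCR output. The sketch below is an assumption about how the keys might be interpreted, not the script's actual behavior. `filter_words` expects a dict with `text` and `conf` lists, the shape returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`.

```python
import json
from pathlib import Path
from typing import Optional

# Values mirror the example config; treating them as defaults is an assumption.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "output_format": ["json", "markdown"],
    "verbose": True,
}


def load_config(path: Optional[str] = None) -> dict:
    """Overlay values from a JSON file on top of the defaults."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        config.update(json.loads(Path(path).read_text(encoding="utf-8")))
    return config


def filter_words(ocr_data: dict, config: dict) -> str:
    """Keep words at or above the threshold; drop results that are too short."""
    kept = [
        word.strip()
        for word, conf in zip(ocr_data["text"], ocr_data["conf"])
        if word.strip() and float(conf) >= config["ocr_confidence_threshold"]
    ]
    text = " ".join(kept)
    return text if len(text) >= config["min_text_length"] else ""
```

Tesseract reports per-word confidence on a 0 to 100 scale (with -1 for non-text boxes), which is consistent with a threshold of 50 in the example config.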

## Use Cases

- **Research Analysis**: Extract key concepts from academic papers
- **Knowledge Graphs**: Build structured knowledge from visual information
- **Document Indexing**: Make diagram content searchable
- **Content Summarization**: Generate text summaries of visual content
- **Machine Learning**: Produce training data for multimodal AI models

## Limitations

- OCR quality depends on diagram clarity and resolution
- Structure analysis is simplified; a dedicated computer-vision approach would be more accurate
- Complex diagrams may need specialized processing
- Large PDFs can be resource-intensive

## Future Enhancements

- Computer vision for diagram element detection
- Specialized processors for different diagram types
- Integration with LLMs for semantic analysis
- Batch processing with parallelization
- API endpoint for web integration