Files

Alexander Whitestone efdc0dc886 Improve #493 : Enhanced meaning kernel extraction pipeline

- Added 5 kernel types: text, structure, summary, philosophical, semantic
- Improved diagram type detection with content analysis
- Added color analysis and grayscale detection
- Enhanced philosophical keyword extraction
- Added semantic relationship detection
- Improved error handling for missing dependencies
- Added comprehensive testing with text-rich test images
- Enhanced metadata and tagging system

Key improvements:
✓ Semantic relationship detection (source → target patterns)
✓ Enhanced philosophical content extraction
✓ Color analysis and grayscale detection
✓ Better diagram type classification
✓ Comprehensive metadata and tagging
✓ Improved error handling and dependency warnings

Still requires OCR dependencies for text extraction:
- pytesseract for OCR
- pdf2image for PDF processing
- Tesseract OCR engine (see issue #563)

2026-04-14 11:44:55 -04:00

__pycache__ - Improve #493 : Enhanced meaning kernel extraction pipeline - 2026-04-14 11:44:55 -04:00
extract_meaning_kernels.py - Improve #493 : Enhanced meaning kernel extraction pipeline - 2026-04-14 11:44:55 -04:00
README.md - Fix #493 : Extract meaning kernels from research diagrams - 2026-04-13 22:32:17 -04:00
requirements.txt - Fix #493 : Extract meaning kernels from research diagrams - 2026-04-13 22:32:17 -04:00
test_extraction.py - Improve #493 : Enhanced meaning kernel extraction pipeline - 2026-04-14 11:44:55 -04:00

README.md

Meaning Kernel Extraction Pipeline

Issue #493: [Multimodal] Extract Meaning Kernels from Research Diagrams

Overview

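This pipeline extracts structured meaning kernels from diagrams in academic PDFs and from standalone images. It converts PDF pages to images, runs OCR and a simple structure analysis on each one, and writes the resulting kernels as JSON alongside a human-readable Markdown report.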

Features

- PDF Processing: Converts PDF pages to images for analysis
- OCR Text Extraction: Extracts text from diagrams using Tesseract
- Structure Analysis: Analyzes diagram type, dimensions, and orientation
- Multiple Kernel Types: Generates text, structure, summary, and philosophical kernels (commit efdc0dc886 also adds a semantic kernel type, which this README does not yet describe)
- Confidence Scoring: Each kernel includes a confidence score (for example, 0.85)
- Batch Processing: Accepts a single file or a directory of files

Installation

# Required dependencies
pip install Pillow pytesseract pdf2image

# System dependencies (macOS)
brew install tesseract poppler

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
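
The commit notes mention dependency warnings for missing OCR components. The sketch below shows one way such a check could look, using only the standard library. The function name `check_dependencies` is hypothetical and is not taken from `extract_meaning_kernels.py`; the package and binary names come from the installation commands above, and `pdftoppm` is the poppler executable that pdf2image calls.

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a list of human-readable warnings for missing dependencies.

    Hypothetical helper: the real script may report problems differently.
    """
    warnings = []

    # Python packages: import name -> pip package name.
    python_packages = {
        "PIL": "Pillow",
        "pytesseract": "pytesseract",
        "pdf2image": "pdf2image",
    }
    for module_name, pip_name in python_packages.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module_name) is None:
            warnings.append(f"Missing Python package: pip install {pip_name}")

    # System binaries: executable name -> what provides it.
    system_binaries = {
        "tesseract": "Tesseract OCR engine",
        "pdftoppm": "poppler (used by pdf2image)",
    }
    for binary, description in system_binaries.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(binary) is None:
            warnings.append(f"Missing system binary '{binary}': {description}")

    return warnings


if __name__ == "__main__":
    for warning in check_dependencies():
        print("WARNING:", warning)
```

An empty list means every dependency was found, so a caller could decide to skip OCR and still produce structure kernels when only the OCR parts are missing.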

Usage

# Process a single PDF
python3 scripts/meaning-kernels/extract_meaning_kernels.py research_paper.pdf

# Process a single image
python3 scripts/meaning-kernels/extract_meaning_kernels.py diagram.png

# Process a directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py /path/to/diagrams/

# Specify output directory
python3 scripts/meaning-kernels/extract_meaning_kernels.py paper.pdf -o ./output

# Run tests
python3 scripts/meaning-kernels/test_extraction.py
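
The commands above accept either a single file or a directory. A minimal sketch of how the input path might be expanded into a list of files is shown here. The helper name `collect_inputs` and the set of extensions are assumptions made for illustration; the actual script may support a different list.

```python
from pathlib import Path

# Assumed set of supported extensions; the script's real list may differ.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_inputs(path):
    """Return a sorted list of supported files for a file or directory path."""
    target = Path(path)
    if target.is_file():
        # A single file is returned only if its extension is supported.
        return [target] if target.suffix.lower() in SUPPORTED_EXTENSIONS else []
    if target.is_dir():
        # Non-recursive scan: only direct children of the directory.
        return sorted(
            p for p in target.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
        )
    raise FileNotFoundError(f"No such file or directory: {target}")
```

The scan is deliberately non-recursive, which keeps cache directories such as `__pycache__` out of the input list.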

Output Structure

output_directory/
├── page_001.png              # Converted page images
├── page_002.png
├── meaning_kernels.json      # Structured kernel data
├── meaning_kernels.md        # Human-readable report
└── extraction_stats.json     # Processing statistics
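
To make the layout above concrete, the following sketch writes the three report files from a list of kernel dictionaries. The function name `write_outputs`, the Markdown layout, and the single `kernel_count` statistic are assumptions; the README does not document the real report format or the fields in `extraction_stats.json`.

```python
import json
from pathlib import Path


def write_outputs(kernels, output_dir):
    """Write kernels (a list of dicts) as JSON plus a short Markdown report."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Structured kernel data; ensure_ascii=False keeps characters like "→" readable.
    json_path = out / "meaning_kernels.json"
    json_path.write_text(
        json.dumps(kernels, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Human-readable report: one section per kernel (assumed layout).
    lines = ["# Meaning Kernels", ""]
    for kernel in kernels:
        lines.append(f"## {kernel['kernel_id']}")
        lines.append(f"- Type: {kernel['kernel_type']}")
        lines.append(f"- Confidence: {kernel['confidence']:.2f}")
        lines.append("")
        lines.append(kernel["content"])
        lines.append("")
    md_path = out / "meaning_kernels.md"
    md_path.write_text("\n".join(lines), encoding="utf-8")

    # Assumed shape for the statistics file; the real fields are not documented.
    stats_path = out / "extraction_stats.json"
    stats_path.write_text(
        json.dumps({"kernel_count": len(kernels)}, indent=2), encoding="utf-8"
    )
    return json_path, md_path, stats_path
```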

Kernel Types

1. Text Kernels

Extracted from OCR processing of diagrams.

{
  "kernel_id": "kernel_20260413_123456_p1_text",
  "content": "Extracted text from diagram",
  "kernel_type": "text",
  "confidence": 0.85,
  "metadata": {
    "word_count": 42,
    "diagram_type": "flowchart"
  }
}
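
All kernel examples in this README share the same five fields. One way to model that record in Python is a small dataclass, sketched below. The class name `MeaningKernel` and the helper `make_kernel_id` are hypothetical; the ID format is inferred from the sample value `kernel_20260413_123456_p1_text` (date, time, page number, kernel type) and may not match how the script builds IDs.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class MeaningKernel:
    """One extracted kernel, mirroring the JSON fields shown in this README."""
    kernel_id: str
    content: str
    kernel_type: str
    confidence: float
    metadata: dict = field(default_factory=dict)


def make_kernel_id(timestamp, page, kernel_type):
    """Build an ID such as kernel_20260413_123456_p1_text (inferred format)."""
    return f"kernel_{timestamp:%Y%m%d_%H%M%S}_p{page}_{kernel_type}"


if __name__ == "__main__":
    kernel = MeaningKernel(
        kernel_id=make_kernel_id(datetime(2026, 4, 13, 12, 34, 56), 1, "text"),
        content="Extracted text from diagram",
        kernel_type="text",
        confidence=0.85,
        metadata={"word_count": 42, "diagram_type": "flowchart"},
    )
    # asdict() yields a plain dict that json.dumps can serialize directly.
    print(asdict(kernel))
```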

2. Structure Kernels

Describe the diagram's type, dimensions, and aspect ratio.

{
  "kernel_id": "kernel_20260413_123456_p1_structure",
  "content": "Diagram type: flowchart. Dimensions: 800x600. Aspect ratio: 1.33.",
  "kernel_type": "structure",
  "confidence": 0.9,
  "metadata": {
    "dimensions": {"width": 800, "height": 600},
    "aspect_ratio": 1.33,
    "diagram_type": "flowchart"
  }
}
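
The structure fields above can be derived from the image size alone. The sketch below reproduces the example's `content` string and metadata from a width and height. The function name `describe_structure` and the `orientation` rule (landscape when wider than tall) are assumptions; in the real pipeline the dimensions would presumably come from the Pillow image and the diagram type from its own classifier.

```python
def describe_structure(width, height, diagram_type="unknown"):
    """Return content and metadata for a structure kernel (illustrative only)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    aspect_ratio = round(width / height, 2)

    # Assumed orientation rule based on the raw dimensions.
    if width > height:
        orientation = "landscape"
    elif width < height:
        orientation = "portrait"
    else:
        orientation = "square"

    content = (
        f"Diagram type: {diagram_type}. "
        f"Dimensions: {width}x{height}. "
        f"Aspect ratio: {aspect_ratio:.2f}."
    )
    metadata = {
        "dimensions": {"width": width, "height": height},
        "aspect_ratio": aspect_ratio,
        "diagram_type": diagram_type,
        "orientation": orientation,  # not part of the README example
    }
    return content, metadata
```

Calling `describe_structure(800, 600, "flowchart")` yields the same `content` string as the example above, with an aspect ratio of 1.33.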

3. Summary Kernels

Combine the detected diagram type with a preview of the extracted text.

{
  "kernel_id": "kernel_20260413_123456_p1_summary",
  "content": "Research diagram analysis: flowchart diagram. Contains text: Input → Processing → Output...",
  "kernel_type": "summary",
  "confidence": 0.7,
  "metadata": {
    "has_text": true,
    "text_length": 150
  }
}
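
The arrow-separated text in the summary example ("Input → Processing → Output") is the kind of source → target pattern that the commit notes describe under semantic relationship detection. The README does not document how the script detects these, so the sketch below is only one plausible approach: split each line on arrows and pair up neighbouring labels. The name `extract_relations` and the accepted arrow spellings (`→` and `->`) are assumptions.

```python
import re

# Matches a Unicode arrow or an ASCII "->", with optional surrounding spaces.
ARROW = re.compile(r"\s*(?:→|->)\s*")


def extract_relations(text):
    """Return (source, target) pairs for arrow-separated labels on each line."""
    relations = []
    for line in text.splitlines():
        # Split on arrows and drop empty pieces (e.g. a line starting with an arrow).
        parts = [part.strip() for part in ARROW.split(line)]
        parts = [part for part in parts if part]
        # Pair each label with the one that follows it on the same line.
        relations.extend(zip(parts, parts[1:]))
    return relations


if __name__ == "__main__":
    print(extract_relations("Input → Processing → Output"))
```

Lines without an arrow produce no pairs, so ordinary prose in the OCR output is ignored.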

4. Philosophical Kernels

Extracted philosophical themes (when detected).

{
  "kernel_id": "kernel_20260413_123456_p1_philosophical",
  "content": "Philosophical themes detected: knowledge, truth. Source text explores concepts of knowledge.",
  "kernel_type": "philosophical",
  "confidence": 0.6,
  "metadata": {
    "extraction_method": "keyword_analysis",
    "source_text_length": 200
  }
}
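
Since the metadata names the method as `keyword_analysis`, a minimal version of that step could look like the sketch below. The function names are hypothetical, the keyword list is taken from the sample configuration in this README, and the fixed confidence of 0.6 is copied from the example above because the real scoring rule is not documented.

```python
import re

# Keywords taken from the sample configuration in this README.
DEFAULT_KEYWORDS = ("truth", "knowledge", "wisdom", "meaning")


def detect_themes(text, keywords=DEFAULT_KEYWORDS):
    """Return the keywords that occur as whole words in text, in keyword order."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [keyword for keyword in keywords if keyword in words]


def philosophical_kernel(text, keywords=DEFAULT_KEYWORDS):
    """Return a philosophical kernel dict, or None when no keyword is found."""
    themes = detect_themes(text, keywords)
    if not themes:
        return None
    return {
        "content": "Philosophical themes detected: " + ", ".join(themes) + ".",
        "kernel_type": "philosophical",
        "confidence": 0.6,  # placeholder copied from the README example
        "metadata": {
            "extraction_method": "keyword_analysis",
            "source_text_length": len(text),
        },
    }
```

Matching on whole words means "meaningful" does not count as a hit for "meaning"; a stemmer would be needed to catch such variants.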

Configuration

Create a JSON config file:

{
  "ocr_confidence_threshold": 50,
  "min_text_length": 10,
  "diagram_types": ["flowchart", "hierarchy", "network"],
  "extract_philosophical": true,
  "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"]
}
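
The README does not say how the config file is passed to the script or which values apply when a key is omitted. As an illustration only, the sketch below treats the values above as defaults and overlays whatever keys a user file provides. The function name `load_config` and the choice to reject unknown keys are assumptions.

```python
import copy
import json
from pathlib import Path

# Assumed defaults: the values from the sample config in this README.
DEFAULT_CONFIG = {
    "ocr_confidence_threshold": 50,
    "min_text_length": 10,
    "diagram_types": ["flowchart", "hierarchy", "network"],
    "extract_philosophical": True,
    "philosophical_keywords": ["truth", "knowledge", "wisdom", "meaning"],
}


def load_config(path=None):
    """Return the defaults, overridden by keys from a JSON file if one is given."""
    config = copy.deepcopy(DEFAULT_CONFIG)
    if path is None:
        return config

    user_config = json.loads(Path(path).read_text(encoding="utf-8"))

    # Reject misspelled keys early instead of silently ignoring them.
    unknown = set(user_config) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")

    config.update(user_config)
    return config
```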

Limitations

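- OCR quality depends on diagram clarity; low-resolution or noisy scans yield less reliable text
- Structure analysis is simplified: it reports a coarse diagram type, dimensions, and aspect ratio rather than detecting individual diagram elements
- Philosophical extraction is keyword-based, so themes phrased without the configured keywords are missed
- Large PDFs can be resource-intensive, since every page is converted to an image before analysis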

Future Enhancements

- Computer vision for diagram element detection
- LLM integration for semantic analysis
- Specialized processors for different diagram types
- Integration with knowledge graphs
- API endpoint for web integration

Files

- extract_meaning_kernels.py - Main extraction pipeline
- test_extraction.py - Test script
- requirements.txt - Python dependencies
- README.md - This documentation