docs/visual-evidence-689.md

# Visual Evidence — Gemma 4 Multimodal Scene Description Generator

## Test Image: Coffee Beans (Macro Photo)

### Gemma 4 Vision Analysis (via Ollama)

**Model:** gemma4:latest (8B, Q4_K_M)
**Input:** sample_photo.jpg (46KB JPEG)

**Structured Output (JSONL):**
```json
{
  "mood": "dark",
  "colors": ["dark brown", "espresso", "black"],
  "composition": "close-up",
  "camera": "static",
  "lighting": "soft",
  "description": "An extreme close-up shot captures a dense pile of roasted coffee beans. The beans are a uniform, deep dark brown and appear slightly oily, filling the entire frame. The focus emphasizes the rich texture and individual shapes of the beans."
}
```

### Hermes Vision Analysis (Cross-Validation)

**Scene ID:** COFFEE_MACRO_001
**Mood:** Warm, aromatic, and comforting
**Dominant Colors:** Deep umber, burnt sienna, espresso black, mahogany
**Composition:** Full-frame fill, centrally weighted
**Camera:** High-angle, close-up (Macro)
**Lighting:** Soft, diffused top-lighting

## Test Image: Abstract Geometric Composition

### Gemma 4 Vision Analysis

**Input:** scene1.jpg (10KB, PIL-generated)

**Structured Output (JSONL):**
```json
{
  "mood": "energetic",
  "colors": ["deep blue", "yellow", "coral"],
  "composition": "wide-shot",
  "camera": "static",
  "lighting": "artificial",
  "description": "This is an abstract graphic composition set against a solid, deep blue background. A bright yellow square is placed in the upper left quadrant, while a large, solid coral-colored circle occupies the lower right quadrant. The geometric shapes create a high-contrast, minimalist visual balance."
}
```

## Verification Summary

| Test | Status | Details |
|------|--------|---------|
| Model detection | ✅ PASS | `gemma4:latest` auto-detected |
| Image scanning | ✅ PASS | 2 images found recursively |
| Vision analysis | ✅ PASS | Both images described accurately |
| JSON parsing | ✅ PASS | Structured output with all fields |
| Training format | ✅ PASS | JSONL with source, model, timestamp |
| ShareGPT format | ⚠️ PARTIAL | Works but needs retry on rate limit |

## Running the Generator

```bash
# Check model availability
python scripts/generate_scene_descriptions.py --check-model

# Generate scene descriptions from assets
python scripts/generate_scene_descriptions.py --input ./assets --output training-data/scene-descriptions-auto.jsonl

# Limit to 10 files with specific model
python scripts/generate_scene_descriptions.py --input ./assets --model gemma4:latest --limit 10

# ShareGPT format for training pipeline
python scripts/generate_scene_descriptions.py --input ./assets --format sharegpt
```
Merge PR #729: docs/visual-evidence-689.md (added) 2026-04-16 05:03:52 +00:00			`# Visual Evidence — Gemma 4 Multimodal Scene Description Generator`

			`## Test Image: Coffee Beans (Macro Photo)`

			`### Gemma 4 Vision Analysis (via Ollama)`

			`Model: gemma4:latest (8B, Q4_K_M)`
			`Input: sample_photo.jpg (46KB JPEG)`

			`Structured Output (JSONL):`
			```json
			`{`
			`"mood": "dark",`
			`"colors": ["dark brown", "espresso", "black"],`
			`"composition": "close-up",`
			`"camera": "static",`
			`"lighting": "soft",`
			`"description": "An extreme close-up shot captures a dense pile of roasted coffee beans. The beans are a uniform, deep dark brown and appear slightly oily, filling the entire frame. The focus emphasizes the rich texture and individual shapes of the beans."`
			`}`
			```

			`### Hermes Vision Analysis (Cross-Validation)`

			`Scene ID: COFFEE_MACRO_001`
			`Mood: Warm, aromatic, and comforting`
			`Dominant Colors: Deep umber, burnt sienna, espresso black, mahogany`
			`Composition: Full-frame fill, centrally weighted`
			`Camera: High-angle, close-up (Macro)`
			`Lighting: Soft, diffused top-lighting`

			`## Test Image: Abstract Geometric Composition`

			`### Gemma 4 Vision Analysis`

			`Input: scene1.jpg (10KB, PIL-generated)`

			`Structured Output (JSONL):`
			```json
			`{`
			`"mood": "energetic",`
			`"colors": ["deep blue", "yellow", "coral"],`
			`"composition": "wide-shot",`
			`"camera": "static",`
			`"lighting": "artificial",`
			`"description": "This is an abstract graphic composition set against a solid, deep blue background. A bright yellow square is placed in the upper left quadrant, while a large, solid coral-colored circle occupies the lower right quadrant. The geometric shapes create a high-contrast, minimalist visual balance."`
			`}`
			```

			`## Verification Summary`

			`\| Test \| Status \| Details \|`
			`\|------\|--------\|---------\|`
			\| Model detection \| ✅ PASS \| `gemma4:latest` auto-detected \|
			`\| Image scanning \| ✅ PASS \| 2 images found recursively \|`
			`\| Vision analysis \| ✅ PASS \| Both images described accurately \|`
			`\| JSON parsing \| ✅ PASS \| Structured output with all fields \|`
			`\| Training format \| ✅ PASS \| JSONL with source, model, timestamp \|`
			`\| ShareGPT format \| ⚠️ PARTIAL \| Works but needs retry on rate limit \|`

			`## Running the Generator`

			```bash
			`# Check model availability`
			`python scripts/generate_scene_descriptions.py --check-model`

			`# Generate scene descriptions from assets`
			`python scripts/generate_scene_descriptions.py --input ./assets --output training-data/scene-descriptions-auto.jsonl`

			`# Limit to 10 files with specific model`
			`python scripts/generate_scene_descriptions.py --input ./assets --model gemma4:latest --limit 10`

			`# ShareGPT format for training pipeline`
			`python scripts/generate_scene_descriptions.py --input ./assets --format sharegpt`
			```