1 Commits

Author SHA1 Message Date
Alexander Whitestone
dda1e71029 wip: add the-testament genome analysis for #675
Some checks failed
Smoke Test / smoke (pull_request) Failing after 16s
2026-04-14 23:54:45 -04:00

the-testament-GENOME.md Normal file

@@ -0,0 +1,666 @@
# GENOME.md — the-testament
Generated: 2026-04-15
Repo: Timmy_Foundation/the-testament
Analysis issue: timmy-home #675
---
## Project Overview
The Testament is not a conventional software repo and not just a manuscript dump.
It is a hybrid publishing system with four layers:
1. narrative source files
2. build/packaging pipelines
3. presentation surfaces
4. verification/quality gates
At the content layer, the repo holds a five-part novel with 18 chapter manuscripts, front/back matter, character sheets, worldbuilding notes, cover copy, soundtrack notes, and other companion artifacts.
At the software layer, it ships a small publishing toolchain that compiles the manuscript into:
- combined markdown
- EPUB
- HTML
- PDF
- web-reader JSON
- checksum manifest
It also includes:
- a static promotional/reader website (`website/index.html`)
- an interactive companion experience (`game/the-door.py` / `game/the-door.html`)
- audiobook helper scripts (`audiobook/`)
- validation and smoke-check automation (`scripts/` + `.gitea/workflows/`)
This makes the repo best understood as a sovereign multimedia book production system centered on a novel.
Runtime-confirmed facts from direct verification:
- `scripts/build-verify.py --json` passes and reports 18 chapters
- the verifier reports ~18,884 manuscript words in chapters and ~19,227 words in concatenated output
- `bash scripts/smoke.sh` passes and successfully builds markdown/epub/html
- `python3 build/build.py --md` succeeds
- `python3 compile_all.py --check` currently crashes due to a qrcode version lookup bug
---
## Quick Facts
Repository composition from direct scan:
- 18 chapter manuscripts in `chapters/`
- top-level content/support directories include:
  - `chapters/`
  - `build/`
  - `website/`
  - `audiobook/`
  - `game/`
  - `characters/`
  - `worldbuilding/`
  - `cover/`
  - `music/`
- primary code entrypoints are Python scripts plus a static HTML site
- no dedicated `tests/` directory
- validation is script-driven rather than unit-test-driven
Approximate non-output code inventory from `pygount` scan:
- ~3.6K lines of code-equivalent across Python/HTML/CSS/YAML/Bash/JSON
- code mass is concentrated in:
  - `compile_all.py`
  - `build/build.py`
  - `compile.py`
  - `scripts/build-verify.py`
  - `website/index.html`
  - `game/the-door.py`
---
## Architecture
```mermaid
flowchart TD
A[chapters/*.md] --> B[compile_markdown]
C[front-matter.md / build/frontmatter.md] --> B
D[back-matter.md / build/backmatter.md] --> B
E[build/metadata.yaml] --> F[pandoc/reportlab packaging]
G[book-style.css] --> F
H[cover/cover-art.jpg] --> F
B --> I[testament-complete.md]
I --> F
F --> J[testament.epub]
F --> K[testament.html]
F --> L[testament.pdf]
A --> M[compile_chapters_json / website/build-chapters.py]
M --> N[website/chapters.json]
I --> O[generate_manifest]
J --> O
K --> O
L --> O
N --> O
O --> P[build-manifest.json]
A --> Q[scripts/index_generator.py]
R[characters/*.md] --> Q
Q --> S[KNOWLEDGE_GRAPH.md]
A --> T[build/semantic_linker.py]
T --> U[build/cross_refs.json]
A --> V[audiobook/extract_text.py]
V --> W[text excerpts]
W --> X[audiobook/generate_samples.sh]
X --> Y[audiobook sample files]
Y --> Z[audiobook/create_manifest.py]
Z --> AA[audiobook/manifest.md]
AB[scripts/build-verify.py] --> A
AB --> I
AC[scripts/smoke.sh] --> AB
AD[.gitea workflows] --> AC
AE[website/index.html] --> AF[static landing/reading experience]
AG[game/the-door.py / game/the-door.html] --> AH[interactive companion artifact]
```
---
## Entry Points
### Primary build entrypoint
1. `compile_all.py`
This is the canonical unified pipeline.
It builds:
- combined markdown
- EPUB
- PDF
- HTML
- `website/chapters.json`
- `build-manifest.json`
It also exposes:
- `--check`
- `--clean`
- format-specific flags (`--md`, `--epub`, `--pdf`, `--html`, `--json`)
### Legacy build entrypoints
2. `build/build.py`
3. `compile.py`
These overlap with the unified pipeline and still work as alternate build surfaces.
`build/build.py` is the more structured legacy path.
`compile.py` is a simpler older compiler that still shells out to `scripts/index_generator.py` before building.
### Verification entrypoints
4. `scripts/build-verify.py`
5. `scripts/smoke.sh`
6. `.gitea/workflows/build.yml`
7. `.gitea/workflows/smoke.yml`
8. `.gitea/workflows/validate.yml`
These form the repo's test/CI surface.
There are no unit tests; these scripts are the executable contract.
### Website/content export entrypoints
9. `website/build-chapters.py`
10. `website/index.html`
`build-chapters.py` converts chapter markdown into HTML snippets inside `website/chapters.json`.
`website/index.html` is a large static HTML/CSS/JS page used as the web-facing presentation layer.
### Audiobook entrypoints
11. `audiobook/extract_text.py`
12. `audiobook/create_manifest.py`
13. `audiobook/generate_samples.sh`
These scripts support excerpt extraction, sample generation, and audiobook manifest creation.
### Companion/interactive entrypoints
14. `game/the-door.py`
15. `game/the-door.html`
These are sidecar experiences, not part of the core build pipeline, but they are part of the repo architecture.
### Knowledge/indexing entrypoints
16. `scripts/index_generator.py`
17. `build/semantic_linker.py`
These create graph-like auxiliary artifacts from the manuscript corpus.
---
## Data Flow
### Main book build flow
```text
chapter markdown + front matter + back matter
compile_markdown()
combined manuscript: testament-complete.md
format-specific compilers
├─ pandoc -> EPUB
├─ pandoc -> standalone HTML
├─ xelatex / weasyprint / reportlab -> PDF
└─ metadata/css/cover integrated where available
optional output hashing
build-manifest.json
```
### Website/export flow
```text
chapters/*.md
website/build-chapters.py or compile_all.py::compile_chapters_json()
extract heading + convert paragraphs/quotes/headings to HTML fragments
website/chapters.json
```
Important nuance:
- `website/chapters.json` is produced by the toolchain
- current `website/index.html` appears to be a static landing/presentation page
- no direct `fetch('chapters.json')` usage was found in the current website HTML
So the JSON output is a generated artifact for a web-reader/export path, but not obviously consumed by the checked-in landing page itself.
### Verification flow
```text
chapter files + required support files
scripts/build-verify.py
├─ count files
├─ validate heading format
├─ compute word counts
├─ check markdown integrity
├─ concatenate outputs
└─ write build-report.json when asked
```
### Knowledge graph / semantic link flow
```text
characters/*.md + chapters/*.md
scripts/index_generator.py
KNOWLEDGE_GRAPH.md
chapters/*.md
build/semantic_linker.py
build/cross_refs.json
```
### Audiobook flow
```text
chapter markdown
audiobook/extract_text.py
trimmed text excerpt
audiobook/generate_samples.sh
audio sample files
audiobook/create_manifest.py
audiobook/manifest.md
```
---
## Key Abstractions
### 1. Chapter corpus
The core domain object of the repo is the ordered chapter set:
- `chapters/chapter-01.md` ... `chapters/chapter-18.md`
- exact numbering matters
- heading format matters
- concatenation order matters
Almost every script assumes this ordered corpus is the canonical source of truth.
### 2. Part boundaries (`PARTS`)
All three build scripts (`compile.py`, `build/build.py`, and `compile_all.py`) define a `PARTS` mapping.
This injects higher-level narrative structure into the build output by adding part headers and descriptions at fixed chapter boundaries.
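A minimal sketch of how such a mapping can drive header injection. The part titles, descriptions, boundary chapters, and the `assemble` helper are all hypothetical; the real `PARTS` contents are not reproduced in this document.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping: first chapter of a part -> (part title, description).
# The real PARTS values in the repo are not shown here.
PARTS: Dict[int, Tuple[str, str]] = {
    1: ("Part One", "Placeholder description"),
    4: ("Part Two", "Placeholder description"),
}


def assemble(chapters: List[Tuple[int, str]]) -> str:
    """Concatenate (number, markdown) pairs in order, adding part headers at boundaries."""
    blocks: List[str] = []
    for number, text in sorted(chapters):
        if number in PARTS:
            title, description = PARTS[number]
            blocks.append(f"# {title}\n\n*{description}*")
        blocks.append(text.strip())
    return "\n\n".join(blocks) + "\n"
```

Keying the mapping on the first chapter of each part keeps the boundary logic to a single dictionary lookup per chapter.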
### 3. Compiled manuscript
`testament-complete.md` is the normalized intermediate artifact.
It is the manuscript assembly layer from which downstream formats are built.
This is the closest thing the repo has to an internal IR (intermediate representation).
### 4. Multi-backend packaging
The build system supports multiple packaging backends:
- pandoc for EPUB and HTML
- xelatex for PDF when available
- weasyprint fallback
- reportlab fallback for fully local pure-Python PDF generation
This is a resilience pattern: the repo prefers multiple production paths rather than a single brittle dependency chain.
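The fallback order above could be expressed roughly as follows. This is a sketch, not the repo's actual selection code: `pick_pdf_backend` is a hypothetical name, and it assumes xelatex is detected as an executable on `PATH` while weasyprint and reportlab are detected as importable Python modules.

```python
import importlib.util
import shutil
from typing import Callable, Optional


def pick_pdf_backend(
    has_command: Callable[[str], bool] = lambda name: shutil.which(name) is not None,
    has_module: Callable[[str], bool] = lambda name: importlib.util.find_spec(name) is not None,
) -> Optional[str]:
    """Return the first usable PDF backend in the documented preference order."""
    if has_command("xelatex"):
        return "xelatex"
    if has_module("weasyprint"):
        return "weasyprint"
    if has_module("reportlab"):
        return "reportlab"
    return None  # no PDF backend available
```

Passing the two probes as parameters makes the order testable without installing or uninstalling any toolchain.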
### 5. Manifested outputs
`build-manifest.json` stores output metadata and SHA256 checksums.
That turns built artifacts into auditable objects rather than opaque files.
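A sketch of how per-file manifest entries might be produced. The field names `path`, `size_bytes`, and `sha256` come from the manifest contract; the `describe` and `build_manifest` helpers and the list-of-entries shape are assumptions, not the repo's `generate_manifest` implementation.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def describe(path: Path) -> Dict[str, object]:
    """Build one manifest entry, hashing the file in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return {
        "path": str(path),
        "size_bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
    }


def build_manifest(paths: Iterable[Path]) -> List[Dict[str, object]]:
    """Describe only the outputs that actually exist on disk."""
    return [describe(p) for p in paths if p.is_file()]
```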
### 6. Verification-as-tests
Because there is no `tests/` suite, `scripts/build-verify.py` is effectively the main automated specification for integrity.
It asserts:
- chapter count
- naming/ordering
- heading format
- word-count sanity
- markdown integrity
- concatenation success
- required support files
### 7. Companion surfaces
The repo has non-manuscript presentation surfaces:
- static website
- interactive game/experience (`The Door`)
- audiobook assets and scripts
These make the repo a narrative system, not just a book build.
### 8. Knowledge graph / semantic linking
The repo contains lightweight symbolic tooling:
- regex-based character-to-chapter index generation
- capitalized-phrase cross-reference detection between chapters
This is a GOFAI-like layer over literary content.
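To make the capitalized-phrase idea concrete, here is a rough sketch. It is not the logic in `build/semantic_linker.py`: the regex, the helper names, and the sample phrase are illustrative assumptions, and the common-word filtering the real script performs is omitted.

```python
import re
from typing import Dict, List, Set

# Two or more consecutive capitalized words, e.g. "Stone Bridge" (hypothetical phrase).
PHRASE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def capitalized_phrases(text: str) -> Set[str]:
    """Return the distinct multi-word capitalized phrases found in text."""
    return set(PHRASE.findall(text))


def shared_phrases(chapters: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each phrase to the chapters it appears in, keeping phrases seen in 2+ chapters."""
    seen: Dict[str, Set[int]] = {}
    for number, text in chapters.items():
        for phrase in capitalized_phrases(text):
            seen.setdefault(phrase, set()).add(number)
    return {p: sorted(ns) for p, ns in seen.items() if len(ns) > 1}
```

A pattern this simple will also pick up sentence-initial word pairs, which is why some form of common-word filtering matters in practice.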
---
## API Surface
This repo's API surface is mostly CLI-based rather than network-based.
### Canonical CLI surface
#### `compile_all.py`
Commands:
- `python3 compile_all.py`
- `python3 compile_all.py --md`
- `python3 compile_all.py --epub`
- `python3 compile_all.py --pdf`
- `python3 compile_all.py --html`
- `python3 compile_all.py --json`
- `python3 compile_all.py --check`
- `python3 compile_all.py --clean`
Outputs:
- `testament-complete.md`
- `testament.epub`
- `testament.html`
- `testament.pdf`
- `website/chapters.json`
- `build-manifest.json`
#### `build/build.py`
Commands:
- `python3 build/build.py --md`
- `python3 build/build.py --epub`
- `python3 build/build.py --pdf`
- `python3 build/build.py --html`
- default full build behavior
#### `compile.py`
Commands documented:
- `python3 compile.py`
- `python3 compile.py --md`
- `python3 compile.py --epub`
- `python3 compile.py --html`
- `python3 compile.py --check`
Observed quirk:
- `scripts/smoke.sh` calls `python3 compile.py --validate`
- no `--validate` handling exists in source
- the script still exits 0 because `compile.py` ignores unknown args and runs its default build path
That is a real contract quirk/drift worth remembering.
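For contrast, a strict parser would surface this drift immediately. The sketch below assumes `argparse` and registers only the flags documented above; how `compile.py` actually reads its arguments is not shown in this document, and `build_parser` / `exit_code_for` are hypothetical helpers.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Parser that knows only the documented compile.py flags."""
    parser = argparse.ArgumentParser(prog="compile.py")
    for flag in ("--md", "--epub", "--html", "--check"):
        parser.add_argument(flag, action="store_true")
    return parser


def exit_code_for(argv: List[str]) -> int:
    """Return the exit code a strict argparse CLI would produce for argv."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:  # argparse exits on unrecognized arguments
        return int(exc.code or 0)
    return 0
```

With this shape, `--validate` is rejected with a non-zero exit code instead of silently triggering a default build, so a smoke script that invokes it would fail loudly.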
#### `scripts/build-verify.py`
Commands:
- `python3 scripts/build-verify.py`
- `python3 scripts/build-verify.py --ci`
- `python3 scripts/build-verify.py --json`
#### Other tooling
- `python3 website/build-chapters.py`
- `python3 scripts/index_generator.py`
- `python3 build/semantic_linker.py`
- `python3 audiobook/extract_text.py <input.md> <output.txt>`
- `python3 audiobook/create_manifest.py`
- `bash audiobook/generate_samples.sh`
- `bash scripts/smoke.sh`
- `python3 game/the-door.py`
### Data contracts
#### Chapter heading contract
`build-verify.py` expects each chapter to start with:
- `# Chapter N — Title`
#### File naming contract
- chapter files must match `chapter-XX.md`
- exactly 18 chapters are expected by the verifier
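The heading and naming contracts can be sketched as two small checks. The regexes and helper names are assumptions rather than the verifier's actual code, and the sketch additionally assumes the heading number must match the file's chapter number.

```python
import re
from typing import List

HEADING = re.compile(r"^# Chapter (\d+) — .+$")
FILENAME = re.compile(r"^chapter-(\d{2})\.md$")
EXPECTED_CHAPTERS = 18


def check_names(names: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the naming contract holds."""
    problems = [f"bad name: {n}" for n in names if not FILENAME.match(n)]
    numbers = sorted(int(m.group(1)) for n in names if (m := FILENAME.match(n)))
    if numbers != list(range(1, EXPECTED_CHAPTERS + 1)):
        problems.append(f"expected chapters 1..{EXPECTED_CHAPTERS}, found {numbers}")
    return problems


def check_heading(number: int, first_line: str) -> bool:
    """True when the first line is '# Chapter N — Title' with N equal to number."""
    match = HEADING.match(first_line.strip())
    return bool(match) and int(match.group(1)) == number
```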
#### Output manifest contract
`build-manifest.json` includes, per file:
- path
- size_bytes
- sha256
#### Website chapters JSON contract
Entries include:
- `number`
- `title`
- `html`
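A simplified sketch of producing one such entry from a chapter file. It handles only paragraphs and blockquotes (inner headings and italics are left out), and `chapter_entry` is a hypothetical helper, not the code in `website/build-chapters.py`.

```python
import html
import re
from typing import Dict, Union

HEADING = re.compile(r"^# Chapter (\d+) — (.+)$")


def chapter_entry(markdown: str) -> Dict[str, Union[int, str]]:
    """Turn chapter markdown into a {number, title, html} entry."""
    lines = markdown.strip().splitlines()
    match = HEADING.match(lines[0])
    if match is None:
        raise ValueError("first line is not a chapter heading")
    fragments = []
    for block in "\n".join(lines[1:]).split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith(">"):
            # Drop the quote markers and join the quoted lines.
            quote = " ".join(line.lstrip("> ").strip() for line in block.splitlines())
            fragments.append(f"<blockquote>{html.escape(quote)}</blockquote>")
        else:
            # Collapse soft line breaks inside a paragraph.
            fragments.append(f"<p>{html.escape(' '.join(block.split()))}</p>")
    return {
        "number": int(match.group(1)),
        "title": match.group(2),
        "html": "\n".join(fragments),
    }
```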
---
## Test Coverage Gaps
### Current state
There is no unit-test suite and no `tests/` directory.
Coverage is currently provided by:
- shell smoke checks
- build verification script
- CI workflow checks
That means the repo has verification, but not isolated regression tests.
### What is already covered by script-based checks
- chapter count and naming
- heading format
- minimum word-count sanity
- markdown delimiter/link integrity
- concatenation success
- required-file existence
- basic syntax parsing for Python/YAML/shell/JSON
- secret-pattern grep scanning
### Highest-value missing tests
1. `compile_all.py` dependency-check behavior
   - there should be a regression test for `--check`
   - current runtime already revealed a concrete failure when `qrcode.__version__` is missing
2. `compile_chapters_json()` correctness
   - verify all 18 chapters are emitted
   - verify blockquotes/headings/italics render as expected
   - verify title extraction stays stable
3. Manifest generation
   - verify `build-manifest.json` includes every built artifact actually present
   - verify sha256 and size fields are correct
4. Build backend selection
   - verify fallback order for PDF generation behaves correctly when xelatex/weasyprint/reportlab availability changes
5. `scripts/index_generator.py`
   - verify character mention detection and markdown output determinism
6. `build/semantic_linker.py`
   - verify the proper-noun extraction and common-word filtering do not produce obviously bad edges
7. Website/output parity
   - verify `website/chapters.json` matches chapter headings and ordering from source manuscripts
8. Companion experience smoke tests
   - `game/the-door.py` has no automated behavior coverage
   - `game/the-door.html` has no structural or syntax verification
### Recommended first tests
If this repo gets a `tests/` directory, start here:
1. `test_compile_all_check_does_not_crash`
2. `test_build_chapters_emits_18_ordered_entries`
3. `test_manifest_contains_existing_outputs`
4. `test_build_verify_rejects_missing_chapter`
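Because the build surface is CLI-based, the first of these tests can be a thin subprocess wrapper. The helpers below are a hypothetical sketch; they only detect an uncaught Python exception and make no assumption about what exit code `--check` returns when a dependency is merely missing.

```python
import subprocess
from typing import List, Tuple


def run_cli(args: List[str], timeout: float = 120.0) -> Tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout + result.stderr


def crashed(args: List[str]) -> bool:
    """True when the command died with an uncaught Python exception."""
    _, output = run_cli(args)
    return "Traceback (most recent call last)" in output
```

A `test_compile_all_check_does_not_crash` test could then assert `not crashed([sys.executable, "compile_all.py", "--check"])`, assuming the test runner is invoked from the repo root.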
---
## Security Considerations
### 1. Shelling out to external toolchains
The build system uses subprocess execution for:
- pandoc
- xelatex
- weasyprint-related flows
- helper scripts
This is reasonable for a publishing repo, but it means path handling and shell assumptions matter.
### 2. Remote font dependency in website HTML
`website/index.html` imports Google Fonts via CSS `@import`.
That means the website is not fully sovereign/local-first at render time.
If strict offline/local hosting matters, font bundling would be required.
### 3. Secret scanning exists, but is grep-based
Both CI and `scripts/smoke.sh` perform simple pattern scanning.
That is better than nothing, but it is heuristic rather than structured secret detection.
### 4. Artifact integrity is a strength
`build-manifest.json` with SHA256 hashes is a strong integrity pattern.
It gives the repo a lightweight provenance layer for distributables.
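On the consuming side, a manifest like this can be re-checked before distribution. A minimal sketch, assuming the manifest has already been loaded into a list of entries with the `path`, `size_bytes`, and `sha256` fields; `verify_entries` is a hypothetical helper and the on-disk JSON layout is not confirmed here.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def verify_entries(entries: List[Dict[str, object]]) -> List[str]:
    """Return paths that are missing or whose size or SHA256 no longer matches."""
    mismatched = []
    for entry in entries:
        path = Path(str(entry["path"]))
        if not path.is_file():
            mismatched.append(str(path))
            continue
        data = path.read_bytes()
        same_size = len(data) == entry["size_bytes"]
        same_hash = hashlib.sha256(data).hexdigest() == entry["sha256"]
        if not (same_size and same_hash):
            mismatched.append(str(path))
    return mismatched
```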
### 5. Build check path currently has a reliability bug
Runtime-confirmed:
- `python3 compile_all.py --check` crashes with:
  - `AttributeError: module 'qrcode' has no attribute '__version__'`
This is not a remote exploit issue, but it is an operational integrity issue because the advertised safe preflight check is not robust.
Follow-up issue filed:
- the-testament #51
- https://forge.alexanderwhitestone.com/Timmy_Foundation/the-testament/issues/51
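One way a preflight check can avoid this class of crash is to treat `__version__` as optional and fall back to installed package metadata. This is a sketch of that approach, not the fix tracked in the issue above; `module_version` is a hypothetical helper.

```python
import importlib
from importlib import metadata
from typing import Optional


def module_version(module_name: str, dist_name: Optional[str] = None) -> Optional[str]:
    """Best-effort version lookup that never raises for a missing __version__.

    Returns None when the module is not installed, and "unknown" when it is
    importable but exposes no version information.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    version = getattr(module, "__version__", None)
    if version is not None:
        return str(version)
    try:
        # Fall back to the installed distribution's metadata.
        return metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        return "unknown"
```

Distinguishing "not installed" from "installed, version unknown" lets the check report a missing dependency without failing on packages that simply do not define `__version__`.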
---
## Drift / Contradictions
### 1. README vs runtime word count
README says:
- ~70,000 word target
- ~19,000 words drafted
Runtime verification says:
- ~18,884 words in chapter corpus
- ~19,227 words in concatenated output
This is close enough to be directionally aligned, but the verifier is the stronger factual source for current draft size.
### 2. `compile_all.py --check` is documented but currently broken
Documented behavior:
- dependency verification
Observed behavior:
- crashes on qrcode version lookup
### 3. `scripts/smoke.sh` depends on undocumented `compile.py --validate`
- `compile.py` docs do not list `--validate`
- source contains no explicit `--validate` path
- smoke still passes because `compile.py` ignores unknown flags and runs its default build path
This is a subtle contract mismatch.
### 4. `website/chapters.json` generation is present, but current website landing page does not appear to consume it directly
That suggests either:
- a future/planned reader path
- an external consumer
- or leftover infrastructure from an earlier website design
---
## Practical Mental Model
Think of the-testament as three repos living inside one repository:
1. the manuscript repo
   - chapters
   - front/back matter
   - worldbuilding
   - character sheets
2. the publishing pipeline repo
   - compile scripts
   - verification scripts
   - CI workflows
   - manifest generation
3. the companion media repo
   - website
   - audiobook helpers
   - interactive game experience
   - soundtrack/cover assets
The connective tissue is the manuscript corpus. Almost everything else either:
- transforms it
- packages it
- validates it
- or re-presents it in another medium
---
## Source Files of Highest Importance
1. `compile_all.py`
   - canonical unified pipeline
   - best single source of repo architecture
2. `scripts/build-verify.py`
   - real executable quality contract
3. `build/build.py`
   - structured legacy builder still in active use
4. `compile.py`
   - older build entrypoint still referenced by smoke flow
5. `website/index.html`
   - primary web presentation artifact
6. `website/build-chapters.py`
   - chapter-to-web JSON transform
7. `build/metadata.yaml`
   - publication metadata contract
8. `build/semantic_linker.py`
   - symbolic/literary relationship extraction
---
## Recommended Next Refactors
1. Make `compile_all.py` the only documented build entrypoint
   - de-emphasize or retire duplicated legacy flows once parity is confirmed
2. Add real regression tests around build helpers
   - especially `compile_all.py --check`
   - chapter JSON generation
   - manifest generation
3. Clarify the role of `website/chapters.json`
   - either wire it into the site, document its consumer, or remove the dead path
4. Fix the undocumented `compile.py --validate` dependency in smoke
   - either implement the flag or stop invoking it
5. Decide whether the companion game and website should remain in the same repo or be treated as first-class subprojects with their own tests
---
## Bottom Line
the-testament is a sovereign novel-production repo with a manuscript at the center and a light but real software system around it.
Its architecture is not application-server-centric.
It is pipeline-centric:
- content in
- validated compilation
- multi-format outputs
- integrity metadata
- companion experiences around the text
The strongest technical asset is the layered publishing pipeline plus manuscript verification.
The biggest weakness is the absence of dedicated regression tests around the build system itself.
Source basis for this genome:
- README and manuscript structure docs
- direct source inspection of `compile_all.py`, `build/build.py`, `compile.py`, website/audiobook/indexing/verification scripts
- runtime verification of build and validation commands
- repo scan of content/build/workflow layout