Add session_metadata.py — structured session metadata extractor (#6) #52

Closed
Rockachopa wants to merge 0 commits from feat/session-metadata into main
Owner

Closes #6

Adds session_metadata.py module that works alongside the existing session_reader.py to provide structured session metadata extraction.

Design

This module complements (does not replace) session_reader.py:

  • session_reader.py: Core JSONL reading functions (read_session, extract_conversation, truncate_for_context, messages_to_text) that harvester.py depends on
  • session_metadata.py: Higher-level metadata extraction (SessionSummary dataclass, repo classification, outcome classification, duration estimation)

Features

  • SessionSummary dataclass with session metrics
  • Repo classification from message content (regex patterns for common repos)
  • Outcome classification (success/partial/failure/unknown)
  • Duration estimation from timestamps
  • Token count estimation
  • Uses session_reader.read_session() for file reading
  • CLI: python session_metadata.py path/to/session.jsonl

Compatibility

  • No changes to session_reader.py
  • No changes to harvester.py
  • Works alongside existing pipeline
  • SessionSummary provides structured data for knowledge store
Closes #6 Adds session_metadata.py module that works alongside the existing session_reader.py to provide structured session metadata extraction. ## Design This module complements (does not replace) session_reader.py: - **session_reader.py**: Core JSONL reading functions (read_session, extract_conversation, truncate_for_context, messages_to_text) that harvester.py depends on - **session_metadata.py**: Higher-level metadata extraction (SessionSummary dataclass, repo classification, outcome classification, duration estimation) ## Features - SessionSummary dataclass with session metrics - Repo classification from message content (regex patterns for common repos) - Outcome classification (success/partial/failure/unknown) - Duration estimation from timestamps - Token count estimation - Uses session_reader.read_session() for file reading - CLI: `python session_metadata.py path/to/session.jsonl` ## Compatibility - No changes to session_reader.py - No changes to harvester.py - Works alongside existing pipeline - SessionSummary provides structured data for knowledge store
Rockachopa added 1 commit 2026-04-14 19:07:33 +00:00
Timmy requested changes 2026-04-14 20:12:31 +00:00
Dismissed
Timmy left a comment
Owner

Review: session_metadata.py (#6)

Overall this is a clean, well-structured metadata extractor. A few issues to address:

Bug — read_session import mismatch:
from session_reader import read_session imports a bare function, but session_reader.py (from PR #29) defines read_session as a method on the SessionReader class, not as a module-level function. This import will fail at runtime with ImportError. Either add a module-level read_session() wrapper to session_reader.py, or change this to instantiate SessionReader and call reader.read_session(path).

Bug — read_session return type assumption:
The code assumes read_session() returns a flat list of message dicts, but SessionReader.read_session() returns a dict with a messages key. The subsequent for entry in messages loop would iterate over dict keys instead of message objects.

Minor — duplicate import re:
re is imported at the top of the file, but also imported again inside extract_session_metadata in the regex blocks. Remove the inner imports.

Minor — timestamp parsing:
The timestamp parsing loop with multiple strptime formats is fine but could use datetime.fromisoformat() as a fallback (like session_reader.py does) to handle timezone-aware ISO strings.

Positive:

  • Good use of dataclass for structured output
  • Sensible deduplication and limiting of actions/errors
  • CLI with argparse is well done
  • Outcome classification logic is reasonable

The import/API mismatch with session_reader is a blocking issue — this will crash on any invocation. Please fix the integration with SessionReader before merge.

## Review: session_metadata.py (#6) Overall this is a clean, well-structured metadata extractor. A few issues to address: **Bug — `read_session` import mismatch:** `from session_reader import read_session` imports a bare function, but `session_reader.py` (from PR #29) defines `read_session` as a method on the `SessionReader` class, not as a module-level function. This import will fail at runtime with `ImportError`. Either add a module-level `read_session()` wrapper to `session_reader.py`, or change this to instantiate `SessionReader` and call `reader.read_session(path)`. **Bug — `read_session` return type assumption:** The code assumes `read_session()` returns a flat list of message dicts, but `SessionReader.read_session()` returns a dict with a `messages` key. The subsequent `for entry in messages` loop would iterate over dict keys instead of message objects. **Minor — duplicate `import re`:** `re` is imported at the top of the file, but also imported again inside `extract_session_metadata` in the regex blocks. Remove the inner imports. **Minor — timestamp parsing:** The timestamp parsing loop with multiple `strptime` formats is fine but could use `datetime.fromisoformat()` as a fallback (like session_reader.py does) to handle timezone-aware ISO strings. **Positive:** - Good use of dataclass for structured output - Sensible deduplication and limiting of actions/errors - CLI with argparse is well done - Outcome classification logic is reasonable The import/API mismatch with session_reader is a blocking issue — this will crash on any invocation. Please fix the integration with SessionReader before merge.
Timmy reviewed 2026-04-14 22:12:31 +00:00
Timmy left a comment
Owner

Review: session_metadata.py

Overall: Well-structured module that complements session_reader.py with higher-level metadata extraction. Good separation of concerns.

Strengths:

  • Clean dataclass design with SessionSummary — good use of typed fields with sensible defaults
  • Robust timestamp parsing with multiple format fallback
  • Error deduplication and action deduplication with limits (5 and 10 respectively) — prevents unbounded growth
  • Proper error handling for missing files, returning a well-formed error summary rather than crashing

Issues to address:

  1. Token estimation is naivelen(content.split()) counts words, not tokens. The comment says ~4 chars/token but the code counts words. Either use len(content) // 4 or rename the field to word_count_estimate.
  2. Repo detection stops at first match — the break after finding a repo match means if the first content match is a false positive (e.g., a URL in an error message), the real repo is never found. Consider scanning all messages and taking the most frequent match.
  3. Outcome classification is fragile — checking for words like "done" or "success" in the last assistant message will false-positive on messages like "I was unable to get this done" or "This was not a success". Consider requiring these at the start of a sentence or using a more structured signal.
  4. Missing __main__ guard — the script has CLI functionality but no if __name__ == "__main__": guard, which means importing it will trigger argparse.
  5. total_tokens_estimate counted as words — the field name says tokens but the accumulator counts split() results (words). This will undercount by ~1.3x.

Minor:

  • The asdict import is unused in the visible code — confirm it is used in the full file or remove it.
  • Consider adding a to_json() convenience method on SessionSummary.

Approve with the token estimation fix.

## Review: session_metadata.py **Overall**: Well-structured module that complements session_reader.py with higher-level metadata extraction. Good separation of concerns. **Strengths:** - Clean dataclass design with `SessionSummary` — good use of typed fields with sensible defaults - Robust timestamp parsing with multiple format fallback - Error deduplication and action deduplication with limits (5 and 10 respectively) — prevents unbounded growth - Proper error handling for missing files, returning a well-formed error summary rather than crashing **Issues to address:** 1. **Token estimation is naive** — `len(content.split())` counts words, not tokens. The comment says ~4 chars/token but the code counts words. Either use `len(content) // 4` or rename the field to `word_count_estimate`. 2. **Repo detection stops at first match** — the `break` after finding a repo match means if the first content match is a false positive (e.g., a URL in an error message), the real repo is never found. Consider scanning all messages and taking the most frequent match. 3. **Outcome classification is fragile** — checking for words like "done" or "success" in the last assistant message will false-positive on messages like "I was unable to get this done" or "This was not a success". Consider requiring these at the start of a sentence or using a more structured signal. 4. **Missing `__main__` guard** — the script has CLI functionality but no `if __name__ == "__main__":` guard, which means importing it will trigger argparse. 5. **`total_tokens_estimate` counted as words** — the field name says tokens but the accumulator counts `split()` results (words). This will undercount by ~1.3x. **Minor:** - The `asdict` import is unused in the visible code — confirm it is used in the full file or remove it. - Consider adding a `to_json()` convenience method on `SessionSummary`. Approve with the token estimation fix.
Timmy requested changes 2026-04-15 00:18:45 +00:00
Timmy left a comment
Owner

The session_metadata.py module is well-structured with a clear dataclass, proper error handling for missing files, and sensible defaults. A few issues to address:

  1. Import dependency: from session_reader import read_session will fail unless session_reader.py is on sys.path. The script does not add its own directory to sys.path like other scripts in this repo do (e.g., harvester.py uses sys.path.insert(0, str(SCRIPT_DIR))). This will cause an ImportError when run from outside the scripts/ directory.

  2. Outcome classification is fragile: Checking for the presence of words like "done" or "completed" in the last assistant message is prone to false positives (e.g., "I could not get this done"). Consider more robust heuristics or at minimum checking for negative modifiers.

  3. Error message truncation inconsistency: Errors are filtered to len(error_msg) < 200 then truncated to [:200] — the filter makes the truncation redundant. Pick one approach.

  4. Token estimation by word count: total_tokens += len(content.split()) counts words, not tokens. The field is named total_tokens_estimate which is fine, but word count significantly underestimates token count (typically 1.3x). Consider using a chars/4 heuristic for consistency with bootstrapper.py.

The import path issue (#1) is a real bug that will break usage. Please fix that before merge.

The session_metadata.py module is well-structured with a clear dataclass, proper error handling for missing files, and sensible defaults. A few issues to address: 1. **Import dependency**: `from session_reader import read_session` will fail unless session_reader.py is on sys.path. The script does not add its own directory to sys.path like other scripts in this repo do (e.g., harvester.py uses `sys.path.insert(0, str(SCRIPT_DIR))`). This will cause an ImportError when run from outside the scripts/ directory. 2. **Outcome classification is fragile**: Checking for the presence of words like "done" or "completed" in the last assistant message is prone to false positives (e.g., "I could not get this done"). Consider more robust heuristics or at minimum checking for negative modifiers. 3. **Error message truncation inconsistency**: Errors are filtered to `len(error_msg) < 200` then truncated to `[:200]` — the filter makes the truncation redundant. Pick one approach. 4. **Token estimation by word count**: `total_tokens += len(content.split())` counts words, not tokens. The field is named `total_tokens_estimate` which is fine, but word count significantly underestimates token count (typically 1.3x). Consider using a chars/4 heuristic for consistency with bootstrapper.py. The import path issue (#1) is a real bug that will break usage. Please fix that before merge.
Author
Owner

Closing: unmergeable due to conflicts or branch protection. Reopen if needed.

Closing: unmergeable due to conflicts or branch protection. Reopen if needed.
Rockachopa closed this pull request 2026-04-16 01:54:58 +00:00
Author
Owner

Closing: unmergeable due to conflicts or branch protection. Reopen if needed.

Closing: unmergeable due to conflicts or branch protection. Reopen if needed.
Author
Owner

Closing: unmergeable due to conflicts or branch protection. Reopen if needed.

Closing: unmergeable due to conflicts or branch protection. Reopen if needed.
Rockachopa reopened this pull request 2026-04-16 02:04:26 +00:00
Rockachopa closed this pull request 2026-04-16 02:14:33 +00:00
Rockachopa reopened this pull request 2026-04-16 02:41:55 +00:00
Rockachopa closed this pull request 2026-04-16 02:52:44 +00:00

Pull request closed

Sign in to join this conversation.