Files
hermes-agent/optional-skills/security/oss-forensics/references/github-archive-guide.md
Teknium c30505dddd feat: add OSS Security Forensics skill (Skills Hub) (#1482)
* feat: add OSS Security Forensics skill (Skills Hub)

Salvaged from PR #1066 by zagiscoming. Adds a 7-phase multi-agent
investigation framework for GitHub supply chain attack forensics.

Skill contents (optional-skills/security/oss-forensics/):
- SKILL.md: 420-line investigation framework with 8 anti-hallucination
  guardrails, 5 specialist investigators, ethical use guidelines,
  and API rate limiting guidance
- evidence-store.py: CLI evidence manager with add/list/verify/query/
  export/summary + SHA-256 integrity + chain of custody
- references/: evidence types, GH Archive BigQuery guide (expanded with
  12 event types and 6 query templates), recovery techniques (4 methods),
  investigation templates (5 attack patterns)
- templates/: forensic report template (151 lines), malicious package
  report template

Changes from original PR:
- Dropped unrelated core tool changes (delegate_tool.py role parameter,
  AGENTS.md, README.md modifications)
- Removed duplicate skills/security/oss-forensics/ placement
- Fixed github-archive-guide.md (missing from optional-skills/, expanded
  from 33 to 160+ lines with all 12 event types and query templates)
- Added ethical use guidelines and API rate limiting sections
- Rewrote tests to match the v2 evidence store API (12 tests, all pass)

Closes #384

* fix: use python3 and SKILL_DIR paths throughout oss-forensics skill

- Replace all 'python' invocations with 'python3' for portability
  (Ubuntu doesn't ship 'python' by default)
- Replace relative '../scripts/' and '../templates/' paths with
  SKILL_DIR/scripts/ and SKILL_DIR/templates/ convention
- Add path convention note before Phase 0 explaining SKILL_DIR
- Fix double --- separator (cosmetic)
- Applies to SKILL.md, evidence-store.py docstring,
  recovery-techniques.md, and forensic-report.md template

---------

Co-authored-by: zagiscoming <zagiscoming@users.noreply.github.com>
2026-03-15 21:59:53 -07:00

5.9 KiB

GitHub Archive Query Guide (BigQuery)

GitHub Archive records every public event on GitHub as immutable JSON records. This data is accessible via Google BigQuery and is the most reliable source for forensic investigation — events cannot be deleted or modified after recording.

Public Dataset

  • Project: githubarchive
  • Tables: day.YYYYMMDD, month.YYYYMM, year.YYYY
  • Cost: $6.25 per TiB scanned. Always run dry runs first.
  • Access: Requires a Google Cloud account with BigQuery enabled. Free tier includes 1 TiB/month of queries.

The 12 GitHub Event Types

Event Type What It Records Forensic Value
PushEvent Commits pushed to a branch Force-push detection, commit timeline, author attribution
PullRequestEvent PR opened, closed, merged, reopened Deleted PR recovery, review timeline
IssuesEvent Issue opened, closed, reopened, labeled Deleted issue recovery, social engineering traces
IssueCommentEvent Comments on issues and PRs Deleted comment recovery, communication patterns
CreateEvent Branch, tag, or repository creation Suspicious branch creation, tag timing
DeleteEvent Branch or tag deletion Evidence of cleanup after compromise
MemberEvent Collaborator added or removed Permission changes, access escalation
PublicEvent Repository made public Accidental exposure of private repos
WatchEvent User stars a repository Actor reconnaissance patterns
ForkEvent Repository forked Exfiltration of code before cleanup
ReleaseEvent Release published, edited, deleted Malicious release injection, deleted release recovery
WorkflowRunEvent GitHub Actions workflow triggered CI/CD abuse, unauthorized workflow runs

Query Templates

Basic: All Events for a Repository

SELECT
  created_at,
  type,
  actor.login,
  repo.name,
  payload
FROM
  `githubarchive.day.20240101`  -- Adjust date
WHERE
  repo.name = 'owner/repo'
  AND type IN ('PushEvent', 'DeleteEvent', 'MemberEvent')
ORDER BY
  created_at ASC

Force-Push Detection

Force-pushes produce PushEvents where commits are overwritten. Key indicators:

  • payload.distinct_size = 0 with payload.size > 0 → commits were erased
  • payload.before contains the SHA before the rewrite (recoverable)
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.before') AS before_sha,
  JSON_EXTRACT_SCALAR(payload, '$.head') AS after_sha,
  JSON_EXTRACT_SCALAR(payload, '$.size') AS total_commits,
  JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS distinct_commits,
  JSON_EXTRACT_SCALAR(payload, '$.ref') AS branch_ref
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'PushEvent'
  AND repo.name = 'owner/repo'
  AND CAST(JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS INT64) = 0
ORDER BY
  created_at ASC

Deleted Branch/Tag Detection

SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.ref') AS deleted_ref,
  JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'DeleteEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC

Collaborator Permission Changes

SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.member.login') AS member
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'MemberEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC

CI/CD Workflow Activity

SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') AS workflow_name,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') AS conclusion,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') AS head_sha
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'WorkflowRunEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC

Actor Activity Profiling

SELECT
  type,
  COUNT(*) AS event_count,
  MIN(created_at) AS first_event,
  MAX(created_at) AS last_event
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202301' AND '202412'
  AND actor.login = 'suspicious-username'
GROUP BY type
ORDER BY event_count DESC

Cost Optimization (MANDATORY)

  1. Always dry run first: Add --dry_run flag to bq query to see estimated bytes scanned before executing.
  2. Use _TABLE_SUFFIX: Narrow the date range as much as possible. day.* tables are cheapest for narrow windows; month.* for broader sweeps.
  3. Select only needed columns: Avoid SELECT *. The payload column is large — only select specific JSON paths.
  4. Add LIMIT: Use LIMIT 1000 during exploration. Remove only for final exhaustive queries.
  5. Column filtering in WHERE: Filter on indexed columns (type, repo.name, actor.login) before payload extraction.

Cost estimation: A single month of GH Archive data is ~1-2 TiB uncompressed. Querying a specific repo + event type with _TABLE_SUFFIX typically scans 1-10 GiB ($0.006-$0.06).


Accessing via Hermes

Option A: BigQuery CLI (if gcloud is installed)

bq query --use_legacy_sql=false --format=json "YOUR QUERY"

Option B: Python (via execute_code)

from google.cloud import bigquery
client = bigquery.Client()
query = "YOUR QUERY"
results = client.query(query).result()
for row in results:
    print(dict(row))

Option C: No GCP credentials available If BigQuery is unavailable, document this limitation in the report. Use the other 4 investigators (Git, GitHub API, Wayback Machine, IOC Enrichment) — they cover most investigation needs without BigQuery.