# GitHub Archive Query Guide (BigQuery)

GitHub Archive records every public event on GitHub as immutable JSON records. This data is accessible via Google BigQuery and is the most reliable source for forensic investigation: events cannot be deleted or modified after recording.
## Public Dataset

- **Project:** `githubarchive`
- **Tables:** `day.YYYYMMDD`, `month.YYYYMM`, `year.YYYY`
- **Cost:** $6.25 per TiB scanned. Always run dry runs first.
- **Access:** Requires a Google Cloud account with BigQuery enabled. Free tier includes 1 TiB/month of queries.
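The `day.YYYYMMDD` naming convention can be generated programmatically when scripting multi-day scans; a minimal sketch (the `day_tables` helper is illustrative, not part of the skill):

```python
from datetime import date, timedelta

def day_tables(start: date, end: date) -> list[str]:
    """Fully qualified githubarchive day-table names for an inclusive date range."""
    tables = []
    d = start
    while d <= end:
        tables.append(f"githubarchive.day.{d.strftime('%Y%m%d')}")
        d += timedelta(days=1)
    return tables

# Three day tables spanning New Year 2024
print(day_tables(date(2023, 12, 31), date(2024, 1, 2)))
```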
## The 12 GitHub Event Types
| Event Type | What It Records | Forensic Value |
|---|---|---|
| `PushEvent` | Commits pushed to a branch | Force-push detection, commit timeline, author attribution |
| `PullRequestEvent` | PR opened, closed, merged, reopened | Deleted PR recovery, review timeline |
| `IssuesEvent` | Issue opened, closed, reopened, labeled | Deleted issue recovery, social engineering traces |
| `IssueCommentEvent` | Comments on issues and PRs | Deleted comment recovery, communication patterns |
| `CreateEvent` | Branch, tag, or repository creation | Suspicious branch creation, tag timing |
| `DeleteEvent` | Branch or tag deletion | Evidence of cleanup after compromise |
| `MemberEvent` | Collaborator added or removed | Permission changes, access escalation |
| `PublicEvent` | Repository made public | Accidental exposure of private repos |
| `WatchEvent` | User stars a repository | Actor reconnaissance patterns |
| `ForkEvent` | Repository forked | Exfiltration of code before cleanup |
| `ReleaseEvent` | Release published, edited, deleted | Malicious release injection, deleted release recovery |
| `WorkflowRunEvent` | GitHub Actions workflow triggered | CI/CD abuse, unauthorized workflow runs |
## Query Templates
### Basic: All Events for a Repository

```sql
SELECT
  created_at,
  type,
  actor.login,
  repo.name,
  payload
FROM
  `githubarchive.day.20240101` -- Adjust date
WHERE
  repo.name = 'owner/repo'
  AND type IN ('PushEvent', 'DeleteEvent', 'MemberEvent')
ORDER BY
  created_at ASC
```
### Force-Push Detection

Force-pushes produce PushEvents where commits are overwritten. Key indicators:

- `payload.distinct_size = 0` with `payload.size > 0` → commits were erased
- `payload.before` contains the SHA before the rewrite (recoverable)
```sql
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.before') AS before_sha,
  JSON_EXTRACT_SCALAR(payload, '$.head') AS after_sha,
  JSON_EXTRACT_SCALAR(payload, '$.size') AS total_commits,
  JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS distinct_commits,
  JSON_EXTRACT_SCALAR(payload, '$.ref') AS branch_ref
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'PushEvent'
  AND repo.name = 'owner/repo'
  AND CAST(JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS INT64) = 0
ORDER BY
  created_at ASC
```
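The same indicator can also be applied offline to raw event JSON (for example, events exported from an earlier query); a minimal sketch, where `is_force_push` is an illustrative helper, not part of the skill's tooling:

```python
import json

def is_force_push(event: dict) -> bool:
    """Flag PushEvents whose pushed commits were all erased (distinct_size = 0, size > 0)."""
    if event.get("type") != "PushEvent":
        return False
    payload = event.get("payload", {})
    # payload may arrive as a JSON string when exported from BigQuery
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload.get("distinct_size") == 0 and payload.get("size", 0) > 0

event = {
    "type": "PushEvent",
    "payload": {"size": 3, "distinct_size": 0, "before": "abc123", "ref": "refs/heads/main"},
}
print(is_force_push(event))  # → True
```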
### Deleted Branch/Tag Detection

```sql
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.ref') AS deleted_ref,
  JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'DeleteEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC
```
### Collaborator Permission Changes

```sql
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.member.login') AS member
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'MemberEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC
```
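Results from this query can be triaged against a list of collaborators known to be legitimate at the start of the investigation window; a minimal sketch, assuming row dicts shaped like the query output (the allowlist and `suspicious_member_adds` helper are illustrative):

```python
KNOWN_COLLABORATORS = {"alice", "bob"}  # illustrative allowlist

def suspicious_member_adds(rows: list[dict]) -> list[dict]:
    """Return 'added' MemberEvents whose member is not on the allowlist."""
    return [
        r for r in rows
        if r.get("action") == "added" and r.get("member") not in KNOWN_COLLABORATORS
    ]

rows = [
    {"created_at": "2024-02-01T10:00:00Z", "actor": "alice", "action": "added", "member": "bob"},
    {"created_at": "2024-02-03T02:14:00Z", "actor": "bob", "action": "added", "member": "attacker-account"},
]
print(suspicious_member_adds(rows))  # flags only the 'attacker-account' add
```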
### CI/CD Workflow Activity

```sql
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') AS workflow_name,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') AS conclusion,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') AS head_sha
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'WorkflowRunEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC
```
### Actor Activity Profiling

```sql
SELECT
  type,
  COUNT(*) AS event_count,
  MIN(created_at) AS first_event,
  MAX(created_at) AS last_event
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202301' AND '202412'
  AND actor.login = 'suspicious-username'
GROUP BY type
ORDER BY event_count DESC
```
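The same profile can be recomputed locally from a list of exported events, which is useful when cross-checking BigQuery output against events gathered by the other investigators; a minimal sketch using `collections.Counter`:

```python
from collections import Counter

def activity_profile(events: list[dict]) -> list[tuple[str, int]]:
    """Event-type counts sorted by frequency, mirroring the GROUP BY above."""
    counts = Counter(e["type"] for e in events)
    return counts.most_common()

events = [
    {"type": "WatchEvent"}, {"type": "PushEvent"},
    {"type": "WatchEvent"}, {"type": "ForkEvent"},
]
print(activity_profile(events))  # → [('WatchEvent', 2), ('PushEvent', 1), ('ForkEvent', 1)]
```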
## Cost Optimization (MANDATORY)

- **Always dry run first:** Add the `--dry_run` flag to `bq query` to see estimated bytes scanned before executing.
- **Use `_TABLE_SUFFIX`:** Narrow the date range as much as possible. `day.*` tables are cheapest for narrow windows; `month.*` for broader sweeps.
- **Select only needed columns:** Avoid `SELECT *`. The `payload` column is large; select only specific JSON paths.
- **Add LIMIT:** Use `LIMIT 1000` during exploration. Remove only for final exhaustive queries.
- **Column filtering in WHERE:** Filter on indexed columns (`type`, `repo.name`, `actor.login`) before payload extraction.
**Cost estimation:** A single month of GH Archive data is ~1-2 TiB uncompressed. Querying a specific repo + event type with `_TABLE_SUFFIX` typically scans 1-10 GiB ($0.006-$0.06).
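This arithmetic can be sanity-checked before running anything; a minimal sketch at the $6.25/TiB on-demand rate stated above (the `query_cost_usd` helper is illustrative):

```python
def query_cost_usd(bytes_scanned: int, usd_per_tib: float = 6.25) -> float:
    """On-demand BigQuery cost for a given number of bytes scanned."""
    tib = bytes_scanned / (1024 ** 4)
    return round(tib * usd_per_tib, 4)

# A 10 GiB scan (typical narrow repo + event-type query)
print(query_cost_usd(10 * 1024 ** 3))  # → 0.061
```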
## Accessing via Hermes

### Option A: BigQuery CLI (if gcloud is installed)

```bash
bq query --use_legacy_sql=false --format=json "YOUR QUERY"
```
### Option B: Python (via execute_code)

```python
from google.cloud import bigquery

client = bigquery.Client()
query = "YOUR QUERY"
results = client.query(query).result()
for row in results:
    print(dict(row))
```
### Option C: No GCP credentials available

If BigQuery is unavailable, document this limitation in the report. Use the other 4 investigators (Git, GitHub API, Wayback Machine, IOC Enrichment); they cover most investigation needs without BigQuery.
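When BigQuery credentials are missing but network access is available, GH Archive also serves raw hourly dumps at `https://data.gharchive.org/YYYY-MM-DD-H.json.gz`, one JSON event per line. A hedged sketch of filtering such a dump for a target repo; the download is shown only as a comment, and `filter_repo_events` is an illustrative helper, not part of the skill:

```python
import gzip
import json

def filter_repo_events(lines, repo: str) -> list[dict]:
    """Keep only events for the target repo from newline-delimited event JSON."""
    events = []
    for line in lines:
        event = json.loads(line)
        if event.get("repo", {}).get("name") == repo:
            events.append(event)
    return events

# Download example (not executed here):
#   curl -O https://data.gharchive.org/2024-01-01-0.json.gz
# with gzip.open("2024-01-01-0.json.gz", "rt") as f:
#     events = filter_repo_events(f, "owner/repo")

sample = [
    '{"type": "PushEvent", "repo": {"name": "owner/repo"}}',
    '{"type": "WatchEvent", "repo": {"name": "other/repo"}}',
]
print(len(filter_repo_events(sample, "owner/repo")))  # → 1
```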