Files
hermes-agent/optional-skills/security/oss-forensics/references/github-archive-guide.md
Teknium c30505dddd feat: add OSS Security Forensics skill (Skills Hub) (#1482)
* feat: add OSS Security Forensics skill (Skills Hub)

Salvaged from PR #1066 by zagiscoming. Adds a 7-phase multi-agent
investigation framework for GitHub supply chain attack forensics.

Skill contents (optional-skills/security/oss-forensics/):
- SKILL.md: 420-line investigation framework with 8 anti-hallucination
  guardrails, 5 specialist investigators, ethical use guidelines,
  and API rate limiting guidance
- evidence-store.py: CLI evidence manager with add/list/verify/query/
  export/summary + SHA-256 integrity + chain of custody
- references/: evidence types, GH Archive BigQuery guide (expanded with
  12 event types and 6 query templates), recovery techniques (4 methods),
  investigation templates (5 attack patterns)
- templates/: forensic report template (151 lines), malicious package
  report template

Changes from original PR:
- Dropped unrelated core tool changes (delegate_tool.py role parameter,
  AGENTS.md, README.md modifications)
- Removed duplicate skills/security/oss-forensics/ placement
- Fixed github-archive-guide.md (missing from optional-skills/, expanded
  from 33 to 160+ lines with all 12 event types and query templates)
- Added ethical use guidelines and API rate limiting sections
- Rewrote tests to match the v2 evidence store API (12 tests, all pass)

Closes #384

* fix: use python3 and SKILL_DIR paths throughout oss-forensics skill

- Replace all 'python' invocations with 'python3' for portability
  (Ubuntu doesn't ship 'python' by default)
- Replace relative '../scripts/' and '../templates/' paths with
  SKILL_DIR/scripts/ and SKILL_DIR/templates/ convention
- Add path convention note before Phase 0 explaining SKILL_DIR
- Fix double --- separator (cosmetic)
- Applies to SKILL.md, evidence-store.py docstring,
  recovery-techniques.md, and forensic-report.md template

---------

Co-authored-by: zagiscoming <zagiscoming@users.noreply.github.com>
2026-03-15 21:59:53 -07:00

185 lines
5.9 KiB
Markdown

# GitHub Archive Query Guide (BigQuery)
GitHub Archive records every public event on GitHub as immutable JSON records. This data is accessible via Google BigQuery and is the most reliable source for forensic investigation — events cannot be deleted or modified after recording.
## Public Dataset
- **Project**: `githubarchive`
- **Tables**: `day.YYYYMMDD`, `month.YYYYMM`, `year.YYYY`
- **Cost**: $6.25 per TiB scanned. Always run dry runs first.
- **Access**: Requires a Google Cloud account with BigQuery enabled. Free tier includes 1 TiB/month of queries.
---
## The 12 GitHub Event Types
| Event Type | What It Records | Forensic Value |
|------------|-----------------|----------------|
| `PushEvent` | Commits pushed to a branch | Force-push detection, commit timeline, author attribution |
| `PullRequestEvent` | PR opened, closed, merged, reopened | Deleted PR recovery, review timeline |
| `IssuesEvent` | Issue opened, closed, reopened, labeled | Deleted issue recovery, social engineering traces |
| `IssueCommentEvent` | Comments on issues and PRs | Deleted comment recovery, communication patterns |
| `CreateEvent` | Branch, tag, or repository creation | Suspicious branch creation, tag timing |
| `DeleteEvent` | Branch or tag deletion | Evidence of cleanup after compromise |
| `MemberEvent` | Collaborator added or removed | Permission changes, access escalation |
| `PublicEvent` | Repository made public | Accidental exposure of private repos |
| `WatchEvent` | User stars a repository | Actor reconnaissance patterns |
| `ForkEvent` | Repository forked | Exfiltration of code before cleanup |
| `ReleaseEvent` | Release published, edited, deleted | Malicious release injection, deleted release recovery |
| `WorkflowRunEvent` | GitHub Actions workflow triggered | CI/CD abuse, unauthorized workflow runs |
---
## Query Templates
### Basic: All Events for a Repository
```sql
SELECT
created_at,
type,
actor.login,
repo.name,
payload
FROM
`githubarchive.day.20240101` -- Adjust date
WHERE
repo.name = 'owner/repo'
AND type IN ('PushEvent', 'DeleteEvent', 'MemberEvent')
ORDER BY
created_at ASC
```
### Force-Push Detection
Force-pushes produce PushEvents where commits are overwritten. Key indicators:
- `payload.distinct_size = 0` with `payload.size > 0` → commits were erased
- `payload.before` contains the SHA before the rewrite (recoverable)
```sql
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.before') AS before_sha,
JSON_EXTRACT_SCALAR(payload, '$.head') AS after_sha,
JSON_EXTRACT_SCALAR(payload, '$.size') AS total_commits,
JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS distinct_commits,
JSON_EXTRACT_SCALAR(payload, '$.ref') AS branch_ref
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'PushEvent'
AND repo.name = 'owner/repo'
AND CAST(JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS INT64) = 0
ORDER BY
created_at ASC
```
### Deleted Branch/Tag Detection
```sql
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.ref') AS deleted_ref,
JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'DeleteEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
```
### Collaborator Permission Changes
```sql
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
JSON_EXTRACT_SCALAR(payload, '$.member.login') AS member
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'MemberEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
```
### CI/CD Workflow Activity
```sql
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') AS workflow_name,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') AS conclusion,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') AS head_sha
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'WorkflowRunEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
```
### Actor Activity Profiling
```sql
SELECT
type,
COUNT(*) AS event_count,
MIN(created_at) AS first_event,
MAX(created_at) AS last_event
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202301' AND '202412'
AND actor.login = 'suspicious-username'
GROUP BY type
ORDER BY event_count DESC
```
---
## Cost Optimization (MANDATORY)
1. **Always dry run first**: Add `--dry_run` flag to `bq query` to see estimated bytes scanned before executing.
2. **Use `_TABLE_SUFFIX`**: Narrow the date range as much as possible. `day.*` tables are cheapest for narrow windows; `month.*` for broader sweeps.
3. **Select only needed columns**: Avoid `SELECT *`. The `payload` column is large — only select specific JSON paths.
4. **Add LIMIT**: Use `LIMIT 1000` during exploration. Remove only for final exhaustive queries.
5. **Column filtering in WHERE**: Filter on indexed columns (`type`, `repo.name`, `actor.login`) before payload extraction.
**Cost estimation**: A single month of GH Archive data is ~1-2 TiB uncompressed. Querying a specific repo + event type with `_TABLE_SUFFIX` typically scans 1-10 GiB ($0.006-$0.06).
---
## Accessing via Hermes
**Option A: BigQuery CLI** (if `gcloud` is installed)
```bash
bq query --use_legacy_sql=false --format=json "YOUR QUERY"
```
**Option B: Python** (via `execute_code`)
```python
from google.cloud import bigquery
client = bigquery.Client()
query = "YOUR QUERY"
results = client.query(query).result()
for row in results:
print(dict(row))
```
**Option C: No GCP credentials available**
If BigQuery is unavailable, document this limitation in the report. Use the other 4 investigators (Git, GitHub API, Wayback Machine, IOC Enrichment) — they cover most investigation needs without BigQuery.