* feat: add OSS Security Forensics skill (Skills Hub) Salvaged from PR #1066 by zagiscoming. Adds a 7-phase multi-agent investigation framework for GitHub supply chain attack forensics. Skill contents (optional-skills/security/oss-forensics/): - SKILL.md: 420-line investigation framework with 8 anti-hallucination guardrails, 5 specialist investigators, ethical use guidelines, and API rate limiting guidance - evidence-store.py: CLI evidence manager with add/list/verify/query/ export/summary + SHA-256 integrity + chain of custody - references/: evidence types, GH Archive BigQuery guide (expanded with 12 event types and 6 query templates), recovery techniques (4 methods), investigation templates (5 attack patterns) - templates/: forensic report template (151 lines), malicious package report template Changes from original PR: - Dropped unrelated core tool changes (delegate_tool.py role parameter, AGENTS.md, README.md modifications) - Removed duplicate skills/security/oss-forensics/ placement - Fixed github-archive-guide.md (missing from optional-skills/, expanded from 33 to 160+ lines with all 12 event types and query templates) - Added ethical use guidelines and API rate limiting sections - Rewrote tests to match the v2 evidence store API (12 tests, all pass) Closes #384 * fix: use python3 and SKILL_DIR paths throughout oss-forensics skill - Replace all 'python' invocations with 'python3' for portability (Ubuntu doesn't ship 'python' by default) - Replace relative '../scripts/' and '../templates/' paths with SKILL_DIR/scripts/ and SKILL_DIR/templates/ convention - Add path convention note before Phase 0 explaining SKILL_DIR - Fix double --- separator (cosmetic) - Applies to SKILL.md, evidence-store.py docstring, recovery-techniques.md, and forensic-report.md template --------- Co-authored-by: zagiscoming <zagiscoming@users.noreply.github.com>
185 lines
5.9 KiB
Markdown
185 lines
5.9 KiB
Markdown
# GitHub Archive Query Guide (BigQuery)
|
|
|
|
GitHub Archive records every public event on GitHub as immutable JSON records. This data is accessible via Google BigQuery and is the most reliable source for forensic investigation — events cannot be deleted or modified after recording.
|
|
|
|
## Public Dataset
|
|
|
|
- **Project**: `githubarchive`
|
|
- **Tables**: `day.YYYYMMDD`, `month.YYYYMM`, `year.YYYY`
|
|
- **Cost**: $6.25 per TiB scanned. Always run dry runs first.
|
|
- **Access**: Requires a Google Cloud account with BigQuery enabled. Free tier includes 1 TiB/month of queries.
|
|
|
|
---
|
|
|
|
## The 12 GitHub Event Types
|
|
|
|
| Event Type | What It Records | Forensic Value |
|
|
|------------|-----------------|----------------|
|
|
| `PushEvent` | Commits pushed to a branch | Force-push detection, commit timeline, author attribution |
|
|
| `PullRequestEvent` | PR opened, closed, merged, reopened | Deleted PR recovery, review timeline |
|
|
| `IssuesEvent` | Issue opened, closed, reopened, labeled | Deleted issue recovery, social engineering traces |
|
|
| `IssueCommentEvent` | Comments on issues and PRs | Deleted comment recovery, communication patterns |
|
|
| `CreateEvent` | Branch, tag, or repository creation | Suspicious branch creation, tag timing |
|
|
| `DeleteEvent` | Branch or tag deletion | Evidence of cleanup after compromise |
|
|
| `MemberEvent` | Collaborator added or removed | Permission changes, access escalation |
|
|
| `PublicEvent` | Repository made public | Accidental exposure of private repos |
|
|
| `WatchEvent` | User stars a repository | Actor reconnaissance patterns |
|
|
| `ForkEvent` | Repository forked | Exfiltration of code before cleanup |
|
|
| `ReleaseEvent` | Release published, edited, deleted | Malicious release injection, deleted release recovery |
|
|
| `WorkflowRunEvent` | GitHub Actions workflow triggered | CI/CD abuse, unauthorized workflow runs |
|
|
|
|
---
|
|
|
|
## Query Templates
|
|
|
|
### Basic: All Events for a Repository
|
|
|
|
```sql
|
|
SELECT
|
|
created_at,
|
|
type,
|
|
actor.login,
|
|
repo.name,
|
|
payload
|
|
FROM
|
|
`githubarchive.day.20240101` -- Adjust date
|
|
WHERE
|
|
repo.name = 'owner/repo'
|
|
AND type IN ('PushEvent', 'DeleteEvent', 'MemberEvent')
|
|
ORDER BY
|
|
created_at ASC
|
|
```
|
|
|
|
### Force-Push Detection
|
|
|
|
Force-pushes produce PushEvents where commits are overwritten. Key indicators:
|
|
- `payload.distinct_size = 0` with `payload.size > 0` → commits were erased
|
|
- `payload.before` contains the SHA before the rewrite (recoverable)
|
|
|
|
```sql
|
|
SELECT
|
|
created_at,
|
|
actor.login,
|
|
JSON_EXTRACT_SCALAR(payload, '$.before') AS before_sha,
|
|
JSON_EXTRACT_SCALAR(payload, '$.head') AS after_sha,
|
|
JSON_EXTRACT_SCALAR(payload, '$.size') AS total_commits,
|
|
JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS distinct_commits,
|
|
JSON_EXTRACT_SCALAR(payload, '$.ref') AS branch_ref
|
|
FROM
|
|
`githubarchive.month.*`
|
|
WHERE
|
|
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
|
|
AND type = 'PushEvent'
|
|
AND repo.name = 'owner/repo'
|
|
AND CAST(JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS INT64) = 0
|
|
ORDER BY
|
|
created_at ASC
|
|
```
|
|
|
|
### Deleted Branch/Tag Detection
|
|
|
|
```sql
|
|
SELECT
|
|
created_at,
|
|
actor.login,
|
|
JSON_EXTRACT_SCALAR(payload, '$.ref') AS deleted_ref,
|
|
JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type
|
|
FROM
|
|
`githubarchive.month.*`
|
|
WHERE
|
|
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
|
|
AND type = 'DeleteEvent'
|
|
AND repo.name = 'owner/repo'
|
|
ORDER BY
|
|
created_at ASC
|
|
```
|
|
|
|
### Collaborator Permission Changes
|
|
|
|
```sql
|
|
SELECT
|
|
created_at,
|
|
actor.login,
|
|
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
|
|
JSON_EXTRACT_SCALAR(payload, '$.member.login') AS member
|
|
FROM
|
|
`githubarchive.month.*`
|
|
WHERE
|
|
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
|
|
AND type = 'MemberEvent'
|
|
AND repo.name = 'owner/repo'
|
|
ORDER BY
|
|
created_at ASC
|
|
```
|
|
|
|
### CI/CD Workflow Activity
|
|
|
|
```sql
|
|
SELECT
|
|
created_at,
|
|
actor.login,
|
|
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
|
|
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') AS workflow_name,
|
|
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') AS conclusion,
|
|
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') AS head_sha
|
|
FROM
|
|
`githubarchive.month.*`
|
|
WHERE
|
|
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
|
|
AND type = 'WorkflowRunEvent'
|
|
AND repo.name = 'owner/repo'
|
|
ORDER BY
|
|
created_at ASC
|
|
```
|
|
|
|
### Actor Activity Profiling
|
|
|
|
```sql
|
|
SELECT
|
|
type,
|
|
COUNT(*) AS event_count,
|
|
MIN(created_at) AS first_event,
|
|
MAX(created_at) AS last_event
|
|
FROM
|
|
`githubarchive.month.*`
|
|
WHERE
|
|
_TABLE_SUFFIX BETWEEN '202301' AND '202412'
|
|
AND actor.login = 'suspicious-username'
|
|
GROUP BY type
|
|
ORDER BY event_count DESC
|
|
```
|
|
|
|
---
|
|
|
|
## Cost Optimization (MANDATORY)
|
|
|
|
1. **Always dry run first**: Add `--dry_run` flag to `bq query` to see estimated bytes scanned before executing.
|
|
2. **Use `_TABLE_SUFFIX`**: Narrow the date range as much as possible. `day.*` tables are cheapest for narrow windows; `month.*` for broader sweeps.
|
|
3. **Select only needed columns**: Avoid `SELECT *`. The `payload` column is large — only select specific JSON paths.
|
|
4. **Add LIMIT**: Use `LIMIT 1000` during exploration. Remove only for final exhaustive queries.
|
|
5. **Column filtering in WHERE**: Filter on indexed columns (`type`, `repo.name`, `actor.login`) before payload extraction.
|
|
|
|
**Cost estimation**: A single month of GH Archive data is ~1-2 TiB uncompressed. Querying a specific repo + event type with `_TABLE_SUFFIX` typically scans 1-10 GiB ($0.006-$0.06).
|
|
|
|
---
|
|
|
|
## Accessing via Hermes
|
|
|
|
**Option A: BigQuery CLI** (if `gcloud` is installed)
|
|
```bash
|
|
bq query --use_legacy_sql=false --format=json "YOUR QUERY"
|
|
```
|
|
|
|
**Option B: Python** (via `execute_code`)
|
|
```python
|
|
from google.cloud import bigquery
|
|
client = bigquery.Client()
|
|
query = "YOUR QUERY"
|
|
results = client.query(query).result()
|
|
for row in results:
|
|
print(dict(row))
|
|
```
|
|
|
|
**Option C: No GCP credentials available**
|
|
If BigQuery is unavailable, document this limitation in the report. Use the other 4 investigators (Git, GitHub API, Wayback Machine, IOC Enrichment) — they cover most investigation needs without BigQuery.
|