feat: add OSS Security Forensics skill (Skills Hub) (#1482)

* feat: add OSS Security Forensics skill (Skills Hub)

Salvaged from PR #1066 by zagiscoming. Adds a 7-phase multi-agent
investigation framework for GitHub supply chain attack forensics.

Skill contents (optional-skills/security/oss-forensics/):
- SKILL.md: 420-line investigation framework with 8 anti-hallucination
  guardrails, 5 specialist investigators, ethical use guidelines,
  and API rate limiting guidance
- evidence-store.py: CLI evidence manager with add/list/verify/query/
  export/summary + SHA-256 integrity + chain of custody
- references/: evidence types, GH Archive BigQuery guide (expanded with
  12 event types and 6 query templates), recovery techniques (4 methods),
  investigation templates (5 attack patterns)
- templates/: forensic report template (151 lines), malicious package
  report template

Changes from original PR:
- Dropped unrelated core tool changes (delegate_tool.py role parameter,
  AGENTS.md, README.md modifications)
- Removed duplicate skills/security/oss-forensics/ placement
- Fixed github-archive-guide.md (missing from optional-skills/, expanded
  from 33 to 160+ lines with all 12 event types and query templates)
- Added ethical use guidelines and API rate limiting sections
- Rewrote tests to match the v2 evidence store API (12 tests, all pass)

Closes #384

* fix: use python3 and SKILL_DIR paths throughout oss-forensics skill

- Replace all 'python' invocations with 'python3' for portability
  (Ubuntu doesn't ship 'python' by default)
- Replace relative '../scripts/' and '../templates/' paths with
  SKILL_DIR/scripts/ and SKILL_DIR/templates/ convention
- Add path convention note before Phase 0 explaining SKILL_DIR
- Fix double --- separator (cosmetic)
- Applies to SKILL.md, evidence-store.py docstring,
  recovery-techniques.md, and forensic-report.md template

---------

Co-authored-by: zagiscoming <zagiscoming@users.noreply.github.com>
Authored by Teknium on 2026-03-15 21:59:53 -07:00, committed by GitHub.
parent 70e24d77a1 | commit c30505dddd
9 changed files with 1683 additions and 0 deletions

optional-skills/security/oss-forensics/SKILL.md

@@ -0,0 +1,422 @@
---
name: oss-forensics
description: |
Supply chain investigation, evidence recovery, and forensic analysis for GitHub repositories.
Covers deleted commit recovery, force-push detection, IOC extraction, multi-source evidence
collection, hypothesis formation/validation, and structured forensic reporting.
Inspired by RAPTOR's 1800+ line OSS Forensics system.
category: security
triggers:
- "investigate this repository"
- "investigate [owner/repo]"
- "check for supply chain compromise"
- "recover deleted commits"
- "forensic analysis of [owner/repo]"
- "was this repo compromised"
- "supply chain attack"
- "suspicious commit"
- "force push detected"
- "IOC extraction"
toolsets:
- terminal
- web
- file
- delegation
---
# OSS Security Forensics Skill
A 7-phase multi-agent investigation framework for researching open-source supply chain attacks.
Adapted from RAPTOR's forensics system. Covers GitHub Archive, Wayback Machine, GitHub API,
local git analysis, IOC extraction, evidence-backed hypothesis formation and validation,
and final forensic report generation.
---
## ⚠️ Anti-Hallucination Guardrails
Read these before every investigation step. Violating them invalidates the report.
1. **Evidence-First Rule**: Every claim in any report, hypothesis, or summary MUST cite at least one evidence ID (`EV-XXXX`). Assertions without citations are forbidden.
2. **STAY IN YOUR LANE**: Each sub-agent (investigator) has a single data source. Do NOT mix sources. The GH Archive investigator does not query the GitHub API, and vice versa. Role boundaries are hard.
3. **Fact vs. Hypothesis Separation**: Mark all unverified inferences with `[HYPOTHESIS]`. Only statements verified against original sources may be stated as facts.
4. **No Evidence Fabrication**: The hypothesis validator MUST mechanically check that every cited evidence ID actually exists in the evidence store before accepting a hypothesis.
5. **Proof-Required Disproval**: A hypothesis cannot be dismissed without a specific, evidence-backed counter-argument. "No evidence found" is not sufficient to disprove—it only makes a hypothesis inconclusive.
6. **SHA/URL Double-Verification**: Any commit SHA, URL, or external identifier cited as evidence must be independently confirmed from at least two sources before being marked as verified.
7. **Suspicious Code Rule**: Never run code found inside the investigated repository locally. Analyze statically only, or use `execute_code` in a sandboxed environment.
8. **Secret Redaction**: Any API keys, tokens, or credentials discovered during investigation must be redacted in the final report. Log them internally only.
---
## Example Scenarios
- **Scenario A: Dependency Confusion**: A malicious package `internal-lib-v2` is uploaded to NPM with a higher version than the internal one. The investigator must track when this package was first seen and whether any PushEvent in the target repo updated `package.json` to that version.
- **Scenario B: Maintainer Takeover**: A long-term contributor's account is used to push a backdoored `.github/workflows/build.yml`. The investigator looks for PushEvents from this user after a long period of inactivity, or at times inconsistent with the account's historical activity pattern (GH Archive records no IP or location data, so takeovers must be inferred from behavioral anomalies).
- **Scenario C: Force-Push Hide**: A developer accidentally commits a production secret, then force-pushes to "fix" it. The investigator uses `git fsck` and GH Archive to recover the original commit SHA and verify what was leaked.
---
> **Path convention**: Throughout this skill, `SKILL_DIR` refers to the root of this skill's
> installation directory (the folder containing this `SKILL.md`). When the skill is loaded,
> resolve `SKILL_DIR` to the actual path — e.g. `~/.hermes/skills/security/oss-forensics/`
> or the `optional-skills/` equivalent. All script and template references are relative to it.
## Phase 0: Initialization
1. Create investigation working directory:
```bash
mkdir investigation_$(echo "REPO_NAME" | tr '/' '_')
cd investigation_$(echo "REPO_NAME" | tr '/' '_')
```
2. Initialize the evidence store (the JSON file is only written on the first `add`; this run confirms the script is reachable and the store is readable):
```bash
python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json list
```
3. Copy the forensic report template:
```bash
cp SKILL_DIR/templates/forensic-report.md ./investigation-report.md
```
4. Create an `iocs.md` file to track Indicators of Compromise as they are discovered.
5. Record the investigation start time, target repository, and stated investigation goal (a sketch of steps 4-5 follows this list).
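A minimal sketch of steps 4 and 5 (the metadata file name `investigation-meta.txt` is illustrative):
```bash
# Step 4: seed the IOC tracker
printf '# Indicators of Compromise\n\n| Type | Value | Source |\n|------|-------|--------|\n' > iocs.md
# Step 5: record start time, target, and goal
{
  echo "start_time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "target_repo: OWNER/REPO"
  echo "goal: <stated investigation goal>"
} > investigation-meta.txt
```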
---
## Phase 1: Prompt Parsing and IOC Extraction
**Goal**: Extract all structured investigative targets from the user's request.
**Actions**:
- Parse the user prompt and extract:
- Target repository (`owner/repo`)
- Target actors (GitHub handles, email addresses)
- Time window of interest (commit date ranges, PR timestamps)
- Provided Indicators of Compromise: commit SHAs, file paths, package names, IP addresses, domains, API keys/tokens, malicious URLs
- Any linked vendor security reports or blog posts
**Tools**: Reasoning only, or `execute_code` for regex extraction from large text blocks (a rough extraction sketch follows at the end of this phase).
**Output**: Populate `iocs.md` with extracted IOCs. Each IOC must have:
- Type (from: COMMIT_SHA, FILE_PATH, API_KEY, SECRET, IP_ADDRESS, DOMAIN, PACKAGE_NAME, ACTOR_USERNAME, MALICIOUS_URL, OTHER)
- Value
- Source (user-provided, inferred)
**Reference**: See [evidence-types.md](./references/evidence-types.md) for IOC taxonomy.
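A rough first-pass extraction sketch (patterns are approximate and `prompt.txt`, the saved user request, is illustrative; review every match before adding it to `iocs.md`):
```bash
grep -oE '\b[0-9a-f]{40}\b' prompt.txt | sort -u                 # candidate commit SHAs
grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' prompt.txt | sort -u  # candidate IPv4 addresses
grep -oE 'https?://[^ )"]+' prompt.txt | sort -u                 # candidate URLs
grep -oE '\bghp_[A-Za-z0-9]{36}\b' prompt.txt                    # candidate GitHub PATs (redact!)
```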
---
## Phase 2: Parallel Evidence Collection
Spawn up to 5 specialist investigator sub-agents using `delegate_task` (batch mode, max 3 concurrent). Each investigator has a **single data source** and must not mix sources.
> **Orchestrator note**: Pass the IOC list from Phase 1 and the investigation time window in the `context` field of each delegated task.
---
### Investigator 1: Local Git Investigator
**ROLE BOUNDARY**: You query the LOCAL GIT REPOSITORY ONLY. Do not call any external APIs.
**Actions**:
```bash
# Clone repository
git clone https://github.com/OWNER/REPO.git target_repo && cd target_repo
# Full commit log with stats
git log --all --full-history --stat --format="%H|%ae|%an|%ai|%s" > ../git_log.txt
# Detect force-push evidence (orphaned/dangling commits).
# Caveat: a fresh clone fetches only reachable objects, so fsck/reflog mainly
# pay off on a clone made before the suspected rewrite; see the recovery
# techniques reference for server-side recovery.
git fsck --lost-found --unreachable 2>&1 | grep commit > ../dangling_commits.txt
# Check reflog for rewritten history
git reflog --all > ../reflog.txt
# List ALL branches including deleted remote refs
git branch -a -v > ../branches.txt
# Find suspicious large binary additions
git log --all --diff-filter=A --name-only --format="%H %ai" -- "*.so" "*.dll" "*.exe" "*.bin" > ../binary_additions.txt
# Check for GPG signature anomalies
git log --show-signature --format="%H %ai %aN" > ../signature_check.txt 2>&1
```
**Evidence to collect** (add via `python3 SKILL_DIR/scripts/evidence-store.py add`; an example follows this list):
- Each dangling commit SHA → type: `git`
- Force-push evidence (reflog showing history rewrite) → type: `git`
- Unsigned commits from verified contributors → type: `git`
- Suspicious binary file additions → type: `git`
**Reference**: See [recovery-techniques.md](./references/recovery-techniques.md) for accessing force-pushed commits.
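For example, recording one dangling commit (all values are placeholders):
```bash
python3 SKILL_DIR/scripts/evidence-store.py --store ../evidence.json add \
  --source "git fsck --lost-found --unreachable" \
  --content "dangling commit DANGLING_SHA" \
  --type git \
  --verification unverified \
  --notes "Found in local clone; cross-check against GH Archive before citing"
```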
---
### Investigator 2: GitHub API Investigator
**ROLE BOUNDARY**: You query the GITHUB REST API ONLY. Do not run git commands locally.
**Actions**:
```bash
# Commits (paginated)
curl -s "https://api.github.com/repos/OWNER/REPO/commits?per_page=100" > api_commits.json
# Pull Requests including closed/deleted
curl -s "https://api.github.com/repos/OWNER/REPO/pulls?state=all&per_page=100" > api_prs.json
# Issues
curl -s "https://api.github.com/repos/OWNER/REPO/issues?state=all&per_page=100" > api_issues.json
# Contributors and collaborator changes
curl -s "https://api.github.com/repos/OWNER/REPO/contributors" > api_contributors.json
# Repository events (first page of 100; the API exposes at most the last ~300 events across 3 pages)
curl -s "https://api.github.com/repos/OWNER/REPO/events?per_page=100" > api_events.json
# Check specific suspicious commit SHA details
curl -s "https://api.github.com/repos/OWNER/REPO/git/commits/SHA" > commit_detail.json
# Releases
curl -s "https://api.github.com/repos/OWNER/REPO/releases?per_page=100" > api_releases.json
# Check if a specific commit exists (force-pushed commits may 404 on commits/ but succeed on git/commits/)
curl -s "https://api.github.com/repos/OWNER/REPO/commits/SHA" | jq .sha
```
**Cross-reference targets** (the orchestrator checks these in Phase 3 and flags discrepancies as evidence; a sketch follows the list):
- PR exists in archive but missing from API → evidence of deletion
- Contributor in archive events but not in contributors list → evidence of permission revocation
- Commit in archive PushEvents but not in API commit list → evidence of force-push/deletion
**Reference**: See [evidence-types.md](./references/evidence-types.md) for GH event types.
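A sketch of the force-push cross-check, assuming the orchestrator has collected `archive_push_shas.txt` from the GH Archive investigator (file name illustrative):
```bash
# SHAs currently visible through the API
jq -r '.[].sha' api_commits.json | sort -u > api_shas.txt
# SHAs seen in archive PushEvents but absent from the API: force-push/deletion candidates
comm -23 <(sort -u archive_push_shas.txt) api_shas.txt
```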
---
### Investigator 3: Wayback Machine Investigator
**ROLE BOUNDARY**: You query the WAYBACK MACHINE CDX API ONLY. Do not use the GitHub API.
**Goal**: Recover deleted GitHub pages (READMEs, issues, PRs, releases, wiki pages).
**Actions**:
```bash
# Search for archived snapshots of the repo main page
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/OWNER/REPO&output=json&limit=100&from=YYYYMMDD&to=YYYYMMDD" > wayback_main.json
# Search for a specific deleted issue
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/OWNER/REPO/issues/NUM&output=json&limit=50" > wayback_issue_NUM.json
# Search for a specific deleted PR
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/OWNER/REPO/pull/NUM&output=json&limit=50" > wayback_pr_NUM.json
# Fetch the best snapshot of a page
# Use the Wayback Machine URL: https://web.archive.org/web/TIMESTAMP/ORIGINAL_URL
# Example: https://web.archive.org/web/20240101000000*/github.com/OWNER/REPO
# Advanced: Search for deleted releases/tags
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/OWNER/REPO/releases/tag/*&output=json" > wayback_tags.json
# Advanced: Search for historical wiki changes
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/OWNER/REPO/wiki/*&output=json" > wayback_wiki.json
```
**Evidence to collect**:
- Archived snapshots of deleted issues/PRs with their content
- Historical README versions showing changes
- Evidence of content present in archive but missing from current GitHub state
**Reference**: See [github-archive-guide.md](./references/github-archive-guide.md) for CDX API parameters.
---
### Investigator 4: GH Archive / BigQuery Investigator
**ROLE BOUNDARY**: You query GITHUB ARCHIVE via BIGQUERY ONLY. This is a tamper-proof record of all public GitHub events.
> **Prerequisites**: Requires Google Cloud credentials with BigQuery access (`gcloud auth application-default login`). If unavailable, skip this investigator and note it in the report.
**Cost Optimization Rules** (MANDATORY):
1. ALWAYS run a `--dry_run` before every query to estimate cost.
2. Use `_TABLE_SUFFIX` to filter by date range and minimize scanned data.
3. Only SELECT the columns you need.
4. Add a LIMIT unless aggregating.
```bash
# Template: safe BigQuery query for PushEvents to OWNER/REPO
bq query --use_legacy_sql=false --dry_run "
SELECT created_at, actor.login,
  JSON_EXTRACT(payload, '$.commits') AS commits,
  JSON_EXTRACT_SCALAR(payload, '$.before') AS before_sha,
  JSON_EXTRACT_SCALAR(payload, '$.head') AS head_sha,
  JSON_EXTRACT_SCALAR(payload, '$.size') AS size,
  JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS distinct_size
FROM \`githubarchive.month.*\`
WHERE _TABLE_SUFFIX BETWEEN 'YYYYMM' AND 'YYYYMM'
AND type = 'PushEvent'
AND repo.name = 'OWNER/REPO'
LIMIT 1000
"
# If cost is acceptable, re-run without --dry_run
# Detect force-pushes: zero-distinct_size PushEvents mean commits were force-erased
# payload.distinct_size = 0 AND payload.size > 0 → force push indicator
# Check for deleted branch events
bq query --use_legacy_sql=false "
SELECT created_at, actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.ref') AS deleted_ref,
  JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type
FROM \`githubarchive.month.*\`
WHERE _TABLE_SUFFIX BETWEEN 'YYYYMM' AND 'YYYYMM'
AND type = 'DeleteEvent'
AND repo.name = 'OWNER/REPO'
LIMIT 200
"
```
**Evidence to collect**:
- Force-push events (payload.size > 0, payload.distinct_size = 0)
- DeleteEvents for branches/tags
- WorkflowRunEvents for suspicious CI/CD automation
- PushEvents that precede a "gap" in the git log (evidence of rewrite)
**Reference**: See [github-archive-guide.md](./references/github-archive-guide.md) for all 12 event types and query patterns.
---
### Investigator 5: IOC Enrichment Investigator
**ROLE BOUNDARY**: You enrich EXISTING IOCs from Phase 1 using passive public sources ONLY. Do not execute any code from the target repository.
**Actions** (a sketch of these lookups follows the list):
- For each commit SHA: attempt recovery via direct GitHub URL (`github.com/OWNER/REPO/commit/SHA.patch`)
- For each domain/IP: check passive DNS, WHOIS records (via `web_extract` on public WHOIS services)
- For each package name: check npm/PyPI for matching malicious package reports
- For each actor username: check GitHub profile, contribution history, account age
- Recover force-pushed commits using the 4 methods in [recovery-techniques.md](./references/recovery-techniques.md)
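A passive-enrichment sketch (endpoints are real; OWNER/REPO, SHA, HANDLE, and PACKAGE are placeholders):
```bash
# Recover a commit patch directly, even if force-pushed off a branch
curl -sL "https://github.com/OWNER/REPO/commit/SHA.patch" -o SHA.patch
# Actor account age and profile
curl -s "https://api.github.com/users/HANDLE" | jq '{login, created_at, followers, public_repos}'
# npm package publication timeline
curl -s "https://registry.npmjs.org/PACKAGE" | jq '.time'
# PyPI package metadata
curl -s "https://pypi.org/pypi/PACKAGE/json" | jq '.info.author, .urls[0].upload_time'
```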
---
## Phase 3: Evidence Consolidation
After all investigators complete:
1. Run `python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json list` to see all collected evidence.
2. For each piece of evidence, verify the `content_sha256` hash matches the original source (the store's `verify` command automates the hash check; see the sketch below).
3. Group evidence by:
- **Timeline**: Sort all timestamped evidence chronologically
- **Actor**: Group by GitHub handle or email
- **IOC**: Link evidence to the IOC it relates to
4. Identify **discrepancies**: items present in one source but absent in another (key deletion indicators).
5. Flag evidence as `[VERIFIED]` (confirmed from 2+ independent sources) or `[UNVERIFIED]` (single source only).
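A sketch of steps 2 and 3 (the `jq` one-liner is illustrative):
```bash
# Built-in integrity check (step 2)
python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json verify
# Chronological sort of all timestamped evidence (step 3)
jq '[.evidence[] | select(.event_timestamp != null)] | sort_by(.event_timestamp)[] | {id, event_timestamp, type, source}' evidence.json
```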
---
## Phase 4: Hypothesis Formation
A hypothesis must:
- State a specific claim (e.g., "Actor X force-pushed to BRANCH on DATE to erase commit SHA")
- Cite at least 2 evidence IDs that support it (`EV-XXXX`, `EV-YYYY`)
- Identify what evidence would disprove it
- Be labeled `[HYPOTHESIS]` until validated
**Common hypothesis templates** (see [investigation-templates.md](./references/investigation-templates.md)):
- Maintainer Compromise: legitimate account used post-takeover to inject malicious code
- Dependency Confusion: package name squatting to intercept installs
- CI/CD Injection: malicious workflow changes to run code during builds
- Typosquatting: near-identical package name targeting misspellers
- Credential Leak: token/key accidentally committed then force-pushed to erase
For each hypothesis, spawn a `delegate_task` sub-agent to attempt to find disconfirming evidence before confirming.
---
## Phase 5: Hypothesis Validation
The validator sub-agent MUST mechanically check the following (a sketch of check 2 follows below):
1. For each hypothesis, extract all cited evidence IDs.
2. Verify each ID exists in `evidence.json` (hard failure if any ID is missing → hypothesis rejected as potentially fabricated).
3. Verify each `[VERIFIED]` piece of evidence was confirmed from 2+ sources.
4. Check logical consistency: does the timeline depicted by the evidence support the hypothesis?
5. Check for alternative explanations: could the same evidence pattern arise from a benign cause?
**Output**:
- `VALIDATED`: All evidence cited, verified, logically consistent, no plausible alternative explanation.
- `INCONCLUSIVE`: Evidence supports hypothesis but alternative explanations exist or evidence is insufficient.
- `REJECTED`: Missing evidence IDs, unverified evidence cited as fact, logical inconsistency detected.
Rejected hypotheses feed back into Phase 4 for refinement (max 3 iterations).
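A minimal sketch of check 2, assuming hypotheses live in a `hypotheses.md` working file (the file name is illustrative):
```bash
# Extract every cited evidence ID from the hypothesis write-up
grep -oE 'EV-[0-9]{4}' hypotheses.md | sort -u > cited_ids.txt
# Extract every ID that actually exists in the store
jq -r '.evidence[].id' evidence.json | sort -u > stored_ids.txt
# IDs cited but not stored: any output here is a hard failure (reject the hypothesis)
comm -23 cited_ids.txt stored_ids.txt
```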
---
## Phase 6: Final Report Generation
Populate `investigation-report.md` using the template in [forensic-report.md](./templates/forensic-report.md).
**Mandatory sections**:
- Executive Summary: one-paragraph verdict (Compromised / Clean / Inconclusive) with confidence level
- Timeline: chronological reconstruction of all significant events with evidence citations
- Validated Hypotheses: each with status and supporting evidence IDs
- Evidence Registry: table of all `EV-XXXX` entries with source, type, and verification status (generated via the store's `export` command; see below)
- IOC List: all extracted and enriched Indicators of Compromise
- Chain of Custody: how evidence was collected, from what sources, at what timestamps
- Recommendations: immediate mitigations if compromise detected; monitoring recommendations
**Report rules**:
- Every factual claim must have at least one `[EV-XXXX]` citation
- Executive Summary must state confidence level (High / Medium / Low)
- All secrets/credentials must be redacted to `[REDACTED]`
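As the report template itself notes, the Evidence Registry and Chain of Custody tables come straight from the store's export rather than being hand-built:
```bash
python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json export >> investigation-report.md
```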
---
## Phase 7: Completion
1. Run final evidence count: `python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json list`
2. Archive the full investigation directory.
3. If compromise is confirmed:
- List immediate mitigations (rotate credentials, pin dependency hashes, notify affected users)
- Identify affected versions/packages
- Note disclosure obligations (if a public package: coordinate with the package registry)
4. Present the final `investigation-report.md` to the user.
---
## Ethical Use Guidelines
This skill is designed for **defensive security investigation** — protecting open-source software from supply chain attacks. It must not be used for:
- **Harassment or stalking** of contributors or maintainers
- **Doxing** — correlating GitHub activity to real identities for malicious purposes
- **Competitive intelligence** — investigating proprietary or internal repositories without authorization
- **False accusations** — publishing investigation results without validated evidence (see anti-hallucination guardrails)
Investigations should be conducted with the principle of **minimal intrusion**: collect only the evidence necessary to validate or refute the hypothesis. When publishing results, follow responsible disclosure practices and coordinate with affected maintainers before public disclosure.
If the investigation reveals a genuine compromise, follow the coordinated vulnerability disclosure process:
1. Notify the repository maintainers privately first
2. Allow reasonable time for remediation (typically 90 days)
3. Coordinate with package registries (npm, PyPI, etc.) if published packages are affected
4. File a CVE if appropriate
---
## API Rate Limiting
GitHub REST API enforces rate limits that will interrupt large investigations if not managed.
**Authenticated requests**: 5,000/hour (requires `GITHUB_TOKEN` env var or `gh` CLI auth)
**Unauthenticated requests**: 60/hour (unusable for investigations)
**Best practices**:
- Always authenticate: `export GITHUB_TOKEN=ghp_...` or use `gh` CLI (auto-authenticates)
- Use conditional requests (`If-None-Match` / `If-Modified-Since` headers) to avoid consuming quota on unchanged data
- For paginated endpoints, fetch all pages in sequence — don't parallelize against the same endpoint
- Check the `X-RateLimit-Remaining` header; if it drops below 100, pause until the `X-RateLimit-Reset` timestamp (see the sketch below)
- BigQuery has its own quotas (the free tier covers 1 TiB of queries per month); always dry-run first
- Wayback Machine CDX API: no formal rate limit, but be courteous (1-2 req/sec max)
If rate-limited mid-investigation, record the partial results in the evidence store and note the limitation in the report.
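A quota-management sketch (the ETag value is a placeholder):
```bash
# Check remaining core quota (this endpoint does not count against the limit)
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit \
  | jq '.resources.core | {remaining, reset}'
# Conditional request: a 304 response does not consume quota
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H 'If-None-Match: "etag-from-previous-response"' \
  "https://api.github.com/repos/OWNER/REPO/commits?per_page=100"
```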
---
## Reference Materials
- [github-archive-guide.md](./references/github-archive-guide.md) — BigQuery queries, CDX API, 12 event types
- [evidence-types.md](./references/evidence-types.md) — IOC taxonomy, evidence source types, observation types
- [recovery-techniques.md](./references/recovery-techniques.md) — Recovering deleted commits, PRs, issues
- [investigation-templates.md](./references/investigation-templates.md) — Pre-built hypothesis templates per attack type
- [evidence-store.py](./scripts/evidence-store.py) — CLI tool for managing the evidence JSON store
- [forensic-report.md](./templates/forensic-report.md) — Structured report template

optional-skills/security/oss-forensics/references/evidence-types.md

@@ -0,0 +1,89 @@
# Evidence Types Reference
Taxonomy of all evidence types, IOC types, GitHub event types, and observation types
used in OSS forensic investigations.
---
## Evidence Source Types
| Type | Description | Example Sources |
|------|-------------|-----------------|
| `git` | Data from local git repository analysis | `git log`, `git fsck`, `git reflog`, `git blame` |
| `gh_api` | Data from GitHub REST API responses | `/repos/.../commits`, `/repos/.../pulls`, `/repos/.../events` |
| `gh_archive` | Data from GitHub Archive (BigQuery) | `githubarchive.month.*` BigQuery tables |
| `web_archive` | Archived web pages from Wayback Machine | CDX API results, `web.archive.org/web/...` snapshots |
| `ioc` | Indicator of Compromise from any source | Extracted from vendor reports, git history, network traces |
| `analysis` | Derived insight from cross-source correlation | "SHA present in archive but absent from API" |
| `vendor_report` | External security vendor or researcher report | CVE advisories, blog posts, NVD records |
| `manual` | Manually recorded observation by investigator | Notes on behavioral patterns, timeline gaps |
---
## IOC Types
| Type | Description | Example |
|------|-------------|---------|
| `COMMIT_SHA` | A git commit hash linked to malicious activity | `abc123def456...` |
| `FILE_PATH` | A suspicious file inside the repository | `src/utils/crypto.js`, `dist/index.min.js` |
| `API_KEY` | An API key accidentally committed | `AKIA...` (AWS), `ghp_...` (GitHub PAT) |
| `SECRET` | A generic secret / credential | Database password, private key blob |
| `IP_ADDRESS` | A C2 server or attacker IP | `192.0.2.1` |
| `DOMAIN` | A malicious or suspicious domain | `evil-cdn.io`, typosquatted package registry domain |
| `PACKAGE_NAME` | A malicious or squatted package name | `colo-rs` (typosquatting `color`), `lodash-utils` |
| `ACTOR_USERNAME` | A GitHub handle linked to the attack | `malicious-bot-account` |
| `MALICIOUS_URL` | A URL to a malicious resource | `https://evil.example.com/payload.sh` |
| `WORKFLOW_FILE` | A suspicious CI/CD workflow file | `.github/workflows/release.yml` |
| `BRANCH_NAME` | A suspicious branch | `refs/heads/temp-fix-do-not-merge` |
| `TAG_NAME` | A suspicious git tag | `v1.0.0-security-patch` |
| `RELEASE_NAME` | A suspicious release | Release with no associated tag or changelog |
| `OTHER` | Catch-all for unclassified IOCs | — |
---
## GitHub Archive Event Types (12 Types)
| Event Type | Forensic Relevance |
|------------|-------------------|
| `PushEvent` | Core: `payload.distinct_size=0` with `payload.size>0` → force push. `payload.before`/`payload.head` shows rewritten history. |
| `PullRequestEvent` | Detects deleted PRs, rapid open→close patterns, PRs from new accounts |
| `IssuesEvent` | Detects deleted issues, coordinated labeling, rapid closure of vulnerability reports |
| `IssueCommentEvent` | Deleted comments, rapid activity bursts |
| `WatchEvent` | Star-farming campaigns (coordinated starring from new accounts) |
| `ForkEvent` | Unusual fork patterns before malicious commit |
| `CreateEvent` | Branch/tag creation: signals new release or code injection point |
| `DeleteEvent` | Branch/tag deletion: critical — often used to hide traces |
| `ReleaseEvent` | Unauthorized releases, release artifacts modified post-publish |
| `MemberEvent` | Collaborator added/removed: maintainer compromise indicator |
| `PublicEvent` | Repository made public (sometimes to drop malicious code briefly) |
| `WorkflowRunEvent` | CI/CD pipeline executions: workflow injection, secret exfiltration |
---
## Evidence Verification States
| State | Meaning |
|-------|---------|
| `unverified` | Collected from a single source, not cross-referenced |
| `single_source` | The primary source has been confirmed directly (e.g., SHA resolves on GitHub), but no second source |
| `multi_source_verified` | Confirmed from 2+ independent sources (e.g., GH Archive AND GitHub API both show the same event) |
Only `multi_source_verified` evidence may be cited as fact in validated hypotheses.
`unverified` and `single_source` evidence must be labeled `[UNVERIFIED]` or `[SINGLE-SOURCE]`.
---
## Observation Types (Patterned after RAPTOR)
| Type | Description |
|------|-------------|
| `CommitObservation` | Specific commit SHA with metadata (author, date, files changed) |
| `ForceWashObservation` | Evidence that commits were force-erased from a branch |
| `DanglingCommitObservation` | SHA present in git object store but unreachable from any ref |
| `IssueObservation` | A GitHub issue (current or archived) with title, body, timestamp |
| `PRObservation` | A GitHub PR (current or archived) with diff summary, reviewers |
| `IOC` | A single Indicator of Compromise with context |
| `TimelineGap` | A period with unusual absence of expected activity |
| `ActorAnomalyObservation` | Behavioral anomaly for a specific GitHub actor |
| `WorkflowAnomalyObservation` | Suspicious CI/CD workflow change or unexpected run |
| `CrossSourceDiscrepancy` | Item present in one source but absent in another (strong deletion indicator) |

optional-skills/security/oss-forensics/references/github-archive-guide.md

@@ -0,0 +1,184 @@
# GitHub Archive Query Guide (BigQuery)
GitHub Archive records every public event on GitHub as immutable JSON records. This data is accessible via Google BigQuery and is the most reliable source for forensic investigation — events cannot be deleted or modified after recording.
## Public Dataset
- **Project**: `githubarchive`
- **Tables**: `day.YYYYMMDD`, `month.YYYYMM`, `year.YYYY`
- **Cost**: $6.25 per TiB scanned. Always run dry runs first.
- **Access**: Requires a Google Cloud account with BigQuery enabled. Free tier includes 1 TiB/month of queries.
---
## The 12 GitHub Event Types
| Event Type | What It Records | Forensic Value |
|------------|-----------------|----------------|
| `PushEvent` | Commits pushed to a branch | Force-push detection, commit timeline, author attribution |
| `PullRequestEvent` | PR opened, closed, merged, reopened | Deleted PR recovery, review timeline |
| `IssuesEvent` | Issue opened, closed, reopened, labeled | Deleted issue recovery, social engineering traces |
| `IssueCommentEvent` | Comments on issues and PRs | Deleted comment recovery, communication patterns |
| `CreateEvent` | Branch, tag, or repository creation | Suspicious branch creation, tag timing |
| `DeleteEvent` | Branch or tag deletion | Evidence of cleanup after compromise |
| `MemberEvent` | Collaborator added or removed | Permission changes, access escalation |
| `PublicEvent` | Repository made public | Accidental exposure of private repos |
| `WatchEvent` | User stars a repository | Actor reconnaissance patterns |
| `ForkEvent` | Repository forked | Exfiltration of code before cleanup |
| `ReleaseEvent` | Release published, edited, deleted | Malicious release injection, deleted release recovery |
| `WorkflowRunEvent` | GitHub Actions workflow triggered | CI/CD abuse, unauthorized workflow runs |
---
## Query Templates
### Basic: All Events for a Repository
```sql
SELECT
created_at,
type,
actor.login,
repo.name,
payload
FROM
`githubarchive.day.20240101` -- Adjust date
WHERE
repo.name = 'owner/repo'
AND type IN ('PushEvent', 'DeleteEvent', 'MemberEvent')
ORDER BY
created_at ASC
```
### Force-Push Detection
Force-pushes produce PushEvents where commits are overwritten. Key indicators:
- `payload.distinct_size = 0` with `payload.size > 0` → commits were erased
- `payload.before` contains the SHA before the rewrite (recoverable)
```sql
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.before') AS before_sha,
JSON_EXTRACT_SCALAR(payload, '$.head') AS after_sha,
JSON_EXTRACT_SCALAR(payload, '$.size') AS total_commits,
JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS distinct_commits,
JSON_EXTRACT_SCALAR(payload, '$.ref') AS branch_ref
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'PushEvent'
AND repo.name = 'owner/repo'
AND CAST(JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS INT64) = 0
ORDER BY
created_at ASC
```
### Deleted Branch/Tag Detection
```sql
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.ref') AS deleted_ref,
JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'DeleteEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
```
### Collaborator Permission Changes
```sql
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
JSON_EXTRACT_SCALAR(payload, '$.member.login') AS member
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'MemberEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
```
### CI/CD Workflow Activity
```sql
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') AS workflow_name,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') AS conclusion,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') AS head_sha
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'WorkflowRunEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
```
### Actor Activity Profiling
```sql
SELECT
type,
COUNT(*) AS event_count,
MIN(created_at) AS first_event,
MAX(created_at) AS last_event
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202301' AND '202412'
AND actor.login = 'suspicious-username'
GROUP BY type
ORDER BY event_count DESC
```
---
## Cost Optimization (MANDATORY)
1. **Always dry run first**: Add `--dry_run` flag to `bq query` to see estimated bytes scanned before executing.
2. **Use `_TABLE_SUFFIX`**: Narrow the date range as much as possible. `day.*` tables are cheapest for narrow windows; `month.*` for broader sweeps.
3. **Select only needed columns**: Avoid `SELECT *`. The `payload` column is large — only select specific JSON paths.
4. **Add LIMIT**: Use `LIMIT 1000` during exploration. Remove only for final exhaustive queries.
5. **Column filtering in WHERE**: Filter on indexed columns (`type`, `repo.name`, `actor.login`) before payload extraction.
**Cost estimation**: A single month of GH Archive data is ~1-2 TiB uncompressed. Querying a specific repo + event type with `_TABLE_SUFFIX` typically scans 1-10 GiB ($0.006-$0.06).
---
## Accessing via Hermes
**Option A: BigQuery CLI** (if `gcloud` is installed)
```bash
bq query --use_legacy_sql=false --format=json "YOUR QUERY"
```
**Option B: Python** (via `execute_code`)
```python
from google.cloud import bigquery
client = bigquery.Client()
query = "YOUR QUERY"
results = client.query(query).result()
for row in results:
print(dict(row))
```
**Option C: No GCP credentials available**
If BigQuery is unavailable, document this limitation in the report. Use the other 4 investigators (Git, GitHub API, Wayback Machine, IOC Enrichment) — they cover most investigation needs without BigQuery.

optional-skills/security/oss-forensics/references/investigation-templates.md

@@ -0,0 +1,131 @@
# Investigation Templates
Pre-built hypothesis and investigation templates for common supply chain attack scenarios.
Each template includes: attack pattern, key evidence to collect, and hypothesis starters.
---
## Template 1: Maintainer Account Compromise
**Pattern**: Attacker gains access to a legitimate maintainer account (phishing, credential stuffing)
and uses it to push malicious code, create backdoored releases, or exfiltrate CI secrets.
**Real-world examples**: XZ Utils (2024), Codecov (2021), event-stream (2018)
**Key Evidence to Collect**:
- [ ] Push events from maintainer account outside normal working hours/timezone
- [ ] Commits adding new dependencies, obfuscated code, or modified build scripts
- [ ] Release creation immediately after suspicious push (to maximize package distribution)
- [ ] MemberEvent adding unknown collaborators (attacker adding backup access)
- [ ] WorkflowRunEvent with unexpected secret access or exfiltration-like behavior
- [ ] Account login location changes (check social media, conference talks for corroboration)
**Hypothesis Starters**:
```
[HYPOTHESIS] Actor <HANDLE>'s account was compromised on or around <DATE>,
based on anomalous commit timing [EV-XXXX] and geographic access patterns [EV-YYYY].
```
```
[HYPOTHESIS] Release <VERSION> was published by the compromised account to push
malicious code to downstream users, evidenced by the malicious commit [EV-XXXX]
being added <N> hours before the release [EV-YYYY].
```
---
## Template 2: Malicious Dependency Injection
**Pattern**: A trusted package is modified to include malicious code in a dependency,
or a new malicious dependency is injected into an existing package.
**Key Evidence to Collect**:
- [ ] Diff of `package.json`/`requirements.txt`/`go.mod` before and after suspicious commit
- [ ] The new dependency's publication timestamp vs. the injection commit timestamp
- [ ] Whether the new dependency exists on npm/PyPI and who owns it
- [ ] Any obfuscation patterns in the injected dependency code
- [ ] Install-time scripts (`postinstall`, `setup.py`, etc.) that execute code on install
**Hypothesis Starters**:
```
[HYPOTHESIS] Commit <SHA> [EV-XXXX] introduced dependency <PACKAGE@VERSION>
which appears to be a malicious package published by actor <HANDLE> [EV-YYYY],
designed to execute <BEHAVIOR> during installation.
```
---
## Template 3: CI/CD Pipeline Injection
**Pattern**: Attacker modifies GitHub Actions workflows to steal secrets, exfiltrate code,
or inject malicious artifacts into the build output.
**Key Evidence to Collect**:
- [ ] Diff of all `.github/workflows/*.yml` files before/after suspicious period
- [ ] WorkflowRunEvents triggered by the modified workflows
- [ ] Any `curl`, `wget`, or network calls added to workflow steps
- [ ] New or modified `env:` sections referencing `secrets.*`
- [ ] Artifacts produced by modified workflow runs
**Hypothesis Starters**:
```
[HYPOTHESIS] Workflow file <FILE> was modified in commit <SHA> [EV-XXXX] to
exfiltrate repository secrets via <METHOD>, as evidenced by the added network
call pattern [EV-YYYY].
```
---
## Template 4: Typosquatting / Dependency Confusion
**Pattern**: Attacker registers a package with a name similar to a popular package
(or an internal package name) to intercept installs from users who mistype.
**Key Evidence to Collect**:
- [ ] Registration timestamp of the suspicious package on the registry
- [ ] Package content: does it contain malicious code or is it a stub?
- [ ] Download statistics for the suspicious package
- [ ] Names of internal packages that could be targeted (if private repo scope)
- [ ] Any references to the legitimate package in the malicious one's metadata
**Hypothesis Starters**:
```
[HYPOTHESIS] Package <MALICIOUS_NAME> was registered on <DATE> [EV-XXXX] to
typosquat on <LEGITIMATE_NAME>, targeting users who misspell the package name.
The package contains <BEHAVIOR> [EV-YYYY].
```
---
## Template 5: Force-Push History Rewrite (Evidence Erasure)
**Pattern**: After a malicious commit is detected (or before wider notice), the attacker
force-pushes to remove the malicious commit from branch history.
**Detection is key** — this template focuses on proving the erasure happened.
**Key Evidence to Collect**:
- [ ] GH Archive PushEvent with `distinct_size=0` (force push indicator) [EV-XXXX]
- [ ] The SHA of the commit BEFORE the force push (from GH Archive `payload.before`)
- [ ] Recovery of the erased commit via direct URL or `git fetch origin SHA`
- [ ] Wayback Machine snapshot of the commit page before erasure
- [ ] Timeline gap in git log (N commits visible in archive but M < N in current repo)
**Hypothesis Starters**:
```
[HYPOTHESIS] Actor <HANDLE> force-pushed branch <BRANCH> on <DATE> [EV-XXXX]
to erase commit <SHA> [EV-YYYY], which contained <MALICIOUS_CONTENT>.
The erased commit was recovered via <METHOD> [EV-ZZZZ].
```
---
## Cross-Cutting Investigation Checklist
Apply to every investigation regardless of template:
- [ ] Check all contributors for newly created accounts (< 30 days old at time of malicious activity; a lookup sketch follows this list)
- [ ] Check if any maintainer account changed email in the period (sign of account takeover)
- [ ] Verify GPG signatures on suspicious commits match known maintainer keys
- [ ] Check if the repository changed ownership or transferred orgs near the incident
- [ ] Look for "cleanup" commits immediately after the malicious commit (cover-up pattern)
- [ ] Check related packages/repos by the same author for similar patterns
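A sketch of the account-age check (endpoints real, OWNER/REPO placeholders; authenticate first to avoid the 60/hour unauthenticated cap):
```bash
# List each contributor's account creation date; judge the age threshold manually
for u in $(curl -s "https://api.github.com/repos/OWNER/REPO/contributors?per_page=100" | jq -r '.[].login'); do
  curl -s "https://api.github.com/users/$u" | jq -r '[.login, .created_at] | @tsv'
done
```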

optional-skills/security/oss-forensics/references/recovery-techniques.md

@@ -0,0 +1,164 @@
# Deleted Content Recovery Techniques
## Key Insight: GitHub Never Fully Deletes Force-Pushed Commits
Force-pushed commits are removed from the branch history but REMAIN on GitHub's servers until garbage collection runs (which can take weeks to months). This is the foundation of deleted commit recovery.
---
## Method 1: Direct GitHub URL (Fastest — No Auth Required)
If you have a commit SHA, access it directly even if it was force-pushed off a branch:
```bash
# View commit metadata
curl -s "https://github.com/OWNER/REPO/commit/SHA"
# Download as patch (includes full diff)
curl -s "https://github.com/OWNER/REPO/commit/SHA.patch" > recovered_commit.patch
# Download as diff
curl -s "https://github.com/OWNER/REPO/commit/SHA.diff" > recovered_commit.diff
# Example (Istio credential leak - real incident):
curl -s "https://github.com/istio/istio/commit/FORCE_PUSHED_SHA.patch"
```
**When this works**: SHA is known (from GH Archive, Wayback Machine, or `git fsck`)
**When this fails**: GitHub has already garbage-collected the object (rare; typically 30-90 days post-force-push)
---
## Method 2: GitHub REST API
```bash
# Works for commits force-pushed off branches but still on server
# Note: /commits/SHA may 404, but /git/commits/SHA often succeeds for orphaned commits
curl -s "https://api.github.com/repos/OWNER/REPO/git/commits/SHA" | jq .
# Get the tree (file listing) of a force-pushed commit
curl -s "https://api.github.com/repos/OWNER/REPO/git/trees/SHA?recursive=1" | jq .
# Get a specific file from a force-pushed commit
curl -s "https://api.github.com/repos/OWNER/REPO/contents/PATH?ref=SHA" | jq .content | base64 -d
```
---
## Method 3: Git Fetch by SHA (Local — Requires Clone)
```bash
# Fetch an orphaned commit directly by SHA into local repo
cd target_repo
git fetch origin SHA
git log FETCH_HEAD -1 # view the commit
git diff FETCH_HEAD~1 FETCH_HEAD # view the diff
# If the SHA was recently force-pushed it will still be fetchable
# This stops working once GitHub GC runs
```
---
## Method 4: Dangling Commits via git fsck
```bash
cd target_repo
# Find all unreachable objects (includes force-pushed commits)
git fsck --unreachable --no-reflogs 2>&1 | grep "unreachable commit" | awk '{print $3}' > dangling_shas.txt
# For each dangling commit, get its metadata
while read sha; do
echo "=== $sha ===" >> dangling_details.txt
git show --stat "$sha" >> dangling_details.txt 2>&1
done < dangling_shas.txt
# Note: dangling objects exist only in the LOCAL clone; a fresh clone fetches
# only reachable objects, so this method works best on a clone made before the
# force push. GitHub's server-side copies remain accessible via Methods 1-3 until GC runs.
```
---
## Recovering Deleted GitHub Issues and PRs
### Via Wayback Machine CDX API
```bash
# Find all archived snapshots of a specific issue
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/OWNER/REPO/issues/NUMBER&output=json&limit=50&fl=timestamp,statuscode,original" | python3 -m json.tool
# Fetch the best snapshot
# Use the timestamp from the CDX result:
# https://web.archive.org/web/TIMESTAMP/https://github.com/OWNER/REPO/issues/NUMBER
curl -s "https://web.archive.org/web/TIMESTAMP/https://github.com/OWNER/REPO/issues/NUMBER" > issue_NUMBER_archived.html
# Find all snapshots of the repo in a date range
curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/OWNER/REPO*&output=json&from=20240101&to=20240201&limit=200&fl=timestamp,urlkey,statuscode" | python3 -m json.tool
```
### Via GitHub API (Limited — Only Non-Deleted Content)
```bash
# Closed issues (not deleted) are retrievable
curl -s "https://api.github.com/repos/OWNER/REPO/issues?state=closed&per_page=100" | jq '.[].number'
# Note: DELETED issues/PRs do NOT appear in the API. Use Wayback Machine or GH Archive for those.
```
### Via GitHub Archive (For Event History — Not Content)
```sql
-- Find all IssueEvents for a repo in a date range
SELECT created_at, actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.issue.number') AS issue_number,
  JSON_EXTRACT_SCALAR(payload, '$.issue.title') AS issue_title
FROM `githubarchive.day.*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240201'
AND type = 'IssuesEvent'
AND repo.name = 'OWNER/REPO'
ORDER BY created_at
```
---
## Recovering Deleted Files from a Known Commit
```bash
# If you have the commit SHA (even force-pushed):
git show SHA:path/to/file.py > recovered_file.py
# Or via API (base64 encoded content):
curl -s "https://api.github.com/repos/OWNER/REPO/contents/path/to/file.py?ref=SHA" | python3 -c "
import sys, json, base64
d = json.load(sys.stdin)
print(base64.b64decode(d['content']).decode())
"
```
---
## Evidence Recording
After recovering any deleted content, immediately record it:
```bash
python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json add \
--source "git fetch origin FORCE_PUSHED_SHA" \
--content "Recovered commit: FORCE_PUSHED_SHA | Author: attacker@example.com | Date: 2024-01-15 | Added file: malicious.sh" \
--type git \
--actor "attacker-handle" \
--url "https://github.com/OWNER/REPO/commit/FORCE_PUSHED_SHA.patch" \
--timestamp "2024-01-15T00:00:00Z" \
--verification single_source \
--notes "Commit force-pushed off main branch on 2024-01-16. Recovered via direct fetch."
```
---
## Recovery Failure Modes
| Failure | Cause | Workaround |
|---------|-------|------------|
| `git fetch origin SHA` returns "not our ref" | GitHub GC already ran | Try Method 1/2, search Wayback Machine |
| `github.com/OWNER/REPO/commit/SHA` returns 404 | GC ran or SHA is wrong | Verify SHA via GH Archive; try partial SHA search |
| Wayback Machine has no snapshots | Page was never crawled by IA | Check `commoncrawl.org`, check Google Cache |
| BigQuery shows event but no content | GH Archive stores event metadata, not file contents | Recovery only reveals the event occurred, not the content |

optional-skills/security/oss-forensics/scripts/evidence-store.py

@@ -0,0 +1,313 @@
#!/usr/bin/env python3
"""
OSS Forensics Evidence Store Manager
Manages a JSON-based evidence store for forensic investigations.
Commands:
add - Add a piece of evidence
list - List all evidence (optionally filter by type or actor)
verify - Re-check SHA-256 hashes for integrity
query - Search evidence by keyword
export - Export evidence as a Markdown table
summary - Print investigation statistics
Usage example:
python3 evidence-store.py --store evidence.json add \
--source "git fsck output" --content "dangling commit abc123" \
--type git --actor "malicious-user" --url "https://github.com/owner/repo/commit/abc123"
python3 evidence-store.py --store evidence.json list --type git
python3 evidence-store.py --store evidence.json verify
python3 evidence-store.py --store evidence.json export > evidence-table.md
"""
import json
import argparse
import os
import datetime
import hashlib
import sys
EVIDENCE_TYPES = [
"git", # Local git repository data (commits, reflog, fsck)
"gh_api", # GitHub REST API responses
"gh_archive", # GitHub Archive / BigQuery query results
"web_archive", # Wayback Machine snapshots
"ioc", # Indicator of Compromise (SHA, domain, IP, package name, etc.)
"analysis", # Derived analysis / cross-source correlation result
"manual", # Manually noted observation
"vendor_report", # External security vendor report excerpt
]
VERIFICATION_STATES = ["unverified", "single_source", "multi_source_verified"]
IOC_TYPES = [
"COMMIT_SHA", "FILE_PATH", "API_KEY", "SECRET", "IP_ADDRESS",
"DOMAIN", "PACKAGE_NAME", "ACTOR_USERNAME", "MALICIOUS_URL",
"WORKFLOW_FILE", "BRANCH_NAME", "TAG_NAME", "RELEASE_NAME", "OTHER",
]
def _now_iso():
    # isoformat() on an aware UTC datetime ends in "+00:00"; normalize to "Z"
    return datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")
def _sha256(content: str) -> str:
return hashlib.sha256(content.encode("utf-8")).hexdigest()
class EvidenceStore:
def __init__(self, filepath: str):
self.filepath = filepath
self.data = {
"metadata": {
"version": "2.0",
"created_at": _now_iso(),
"last_updated": _now_iso(),
"investigation": "",
"target_repo": "",
},
"evidence": [],
"chain_of_custody": [],
}
if os.path.exists(filepath):
try:
with open(filepath, "r", encoding="utf-8") as f:
self.data = json.load(f)
except (json.JSONDecodeError, IOError) as e:
print(f"Error loading evidence store '{filepath}': {e}", file=sys.stderr)
print("Hint: The file might be corrupted. Check for manual edits or syntax errors.", file=sys.stderr)
sys.exit(1)
def _save(self):
self.data["metadata"]["last_updated"] = _now_iso()
with open(self.filepath, "w", encoding="utf-8") as f:
json.dump(self.data, f, indent=2, ensure_ascii=False)
def _next_id(self) -> str:
return f"EV-{len(self.data['evidence']) + 1:04d}"
def add(
self,
source: str,
content: str,
evidence_type: str,
actor: str = None,
url: str = None,
timestamp: str = None,
ioc_type: str = None,
verification: str = "unverified",
notes: str = None,
) -> str:
evidence_id = self._next_id()
entry = {
"id": evidence_id,
"type": evidence_type,
"source": source,
"content": content,
"content_sha256": _sha256(content),
"actor": actor,
"url": url,
"event_timestamp": timestamp,
"collected_at": _now_iso(),
"ioc_type": ioc_type,
"verification": verification,
"notes": notes,
}
self.data["evidence"].append(entry)
self.data["chain_of_custody"].append({
"action": "add",
"evidence_id": evidence_id,
"timestamp": _now_iso(),
"source": source,
})
self._save()
return evidence_id
def list_evidence(self, filter_type: str = None, filter_actor: str = None):
results = self.data["evidence"]
if filter_type:
results = [e for e in results if e.get("type") == filter_type]
if filter_actor:
results = [e for e in results if e.get("actor") == filter_actor]
return results
def verify_integrity(self):
"""Re-compute SHA-256 for all entries and report mismatches."""
issues = []
for entry in self.data["evidence"]:
expected = _sha256(entry["content"])
stored = entry.get("content_sha256", "")
if expected != stored:
issues.append({
"id": entry["id"],
"stored_sha256": stored,
"computed_sha256": expected,
})
return issues
def query(self, keyword: str):
"""Search for keyword in content, source, actor, or url."""
keyword_lower = keyword.lower()
return [
e for e in self.data["evidence"]
if keyword_lower in (e.get("content", "") or "").lower()
or keyword_lower in (e.get("source", "") or "").lower()
or keyword_lower in (e.get("actor", "") or "").lower()
or keyword_lower in (e.get("url", "") or "").lower()
]
def export_markdown(self) -> str:
lines = [
"# Evidence Registry",
"",
f"**Store**: `{self.filepath}`",
f"**Last Updated**: {self.data['metadata'].get('last_updated', 'N/A')}",
f"**Total Evidence Items**: {len(self.data['evidence'])}",
"",
"| ID | Type | Source | Actor | Verification | Event Timestamp | URL |",
"|----|------|--------|-------|--------------|-----------------|-----|",
]
for e in self.data["evidence"]:
url = e.get("url") or ""
url_display = f"[link]({url})" if url else ""
lines.append(
f"| {e['id']} | {e.get('type','')} | {e.get('source','')} "
f"| {e.get('actor') or ''} | {e.get('verification','')} "
f"| {e.get('event_timestamp') or ''} | {url_display} |"
)
lines.append("")
lines.append("## Chain of Custody")
lines.append("")
lines.append("| Evidence ID | Action | Timestamp | Source |")
lines.append("|-------------|--------|-----------|--------|")
for c in self.data["chain_of_custody"]:
lines.append(
f"| {c.get('evidence_id','')} | {c.get('action','')} "
f"| {c.get('timestamp','')} | {c.get('source','')} |"
)
return "\n".join(lines)
def summary(self) -> dict:
by_type = {}
by_verification = {}
actors = set()
for e in self.data["evidence"]:
t = e.get("type", "unknown")
by_type[t] = by_type.get(t, 0) + 1
v = e.get("verification", "unverified")
by_verification[v] = by_verification.get(v, 0) + 1
if e.get("actor"):
actors.add(e["actor"])
return {
"total": len(self.data["evidence"]),
"by_type": by_type,
"by_verification": by_verification,
"unique_actors": sorted(actors),
}
def main():
parser = argparse.ArgumentParser(
description="OSS Forensics Evidence Store Manager v2.0",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--store", default="evidence.json", help="Path to evidence JSON file (default: evidence.json)")
subparsers = parser.add_subparsers(dest="command", metavar="COMMAND")
# --- add ---
add_p = subparsers.add_parser("add", help="Add a new evidence entry")
add_p.add_argument("--source", required=True, help="Where this evidence came from (e.g. 'git fsck', 'GH API /commits')")
add_p.add_argument("--content", required=True, help="The evidence content (commit SHA, API response excerpt, etc.)")
add_p.add_argument("--type", required=True, choices=EVIDENCE_TYPES, dest="evidence_type", help="Evidence type")
add_p.add_argument("--actor", help="GitHub handle or email of associated actor")
add_p.add_argument("--url", help="URL to original source")
add_p.add_argument("--timestamp", help="When the event occurred (ISO 8601)")
add_p.add_argument("--ioc-type", choices=IOC_TYPES, help="IOC subtype (for --type ioc)")
add_p.add_argument("--verification", choices=VERIFICATION_STATES, default="unverified")
add_p.add_argument("--notes", help="Additional investigator notes")
add_p.add_argument("--quiet", action="store_true", help="Suppress success message")
# --- list ---
list_p = subparsers.add_parser("list", help="List all evidence entries")
list_p.add_argument("--type", dest="filter_type", choices=EVIDENCE_TYPES, help="Filter by type")
list_p.add_argument("--actor", dest="filter_actor", help="Filter by actor")
# --- verify ---
subparsers.add_parser("verify", help="Verify SHA-256 integrity of all evidence content")
# --- query ---
query_p = subparsers.add_parser("query", help="Search evidence by keyword")
query_p.add_argument("keyword", help="Keyword to search for")
# --- export ---
subparsers.add_parser("export", help="Export evidence as a Markdown table (stdout)")
# --- summary ---
subparsers.add_parser("summary", help="Print investigation statistics")
args = parser.parse_args()
if not args.command:
parser.print_help()
sys.exit(0)
store = EvidenceStore(args.store)
if args.command == "add":
eid = store.add(
source=args.source,
content=args.content,
evidence_type=args.evidence_type,
actor=args.actor,
url=args.url,
timestamp=args.timestamp,
ioc_type=args.ioc_type,
verification=args.verification,
notes=args.notes,
)
if not getattr(args, "quiet", False):
print(f"✓ Added evidence: {eid}")
elif args.command == "list":
items = store.list_evidence(
filter_type=getattr(args, "filter_type", None),
filter_actor=getattr(args, "filter_actor", None),
)
if not items:
print("No evidence found.")
for e in items:
actor_str = f" | actor: {e['actor']}" if e.get("actor") else ""
url_str = f" | {e['url']}" if e.get("url") else ""
print(f"[{e['id']}] {e['type']:12s} | {e['verification']:20s} | {e['source']}{actor_str}{url_str}")
elif args.command == "verify":
issues = store.verify_integrity()
if not issues:
print(f"✓ All {len(store.data['evidence'])} evidence entries passed SHA-256 integrity check.")
else:
print(f"{len(issues)} integrity issue(s) detected:")
for i in issues:
print(f" [{i['id']}] stored={i['stored_sha256'][:16]}... computed={i['computed_sha256'][:16]}...")
sys.exit(1)
elif args.command == "query":
results = store.query(args.keyword)
print(f"Found {len(results)} result(s) for '{args.keyword}':")
for e in results:
print(f" [{e['id']}] {e['type']} | {e['source']} | {e['content'][:80]}")
elif args.command == "export":
print(store.export_markdown())
elif args.command == "summary":
s = store.summary()
print(f"Total evidence items : {s['total']}")
print(f"By type : {json.dumps(s['by_type'], indent=2)}")
print(f"By verification : {json.dumps(s['by_verification'], indent=2)}")
print(f"Unique actors : {s['unique_actors']}")
if __name__ == "__main__":
main()

optional-skills/security/oss-forensics/templates/forensic-report.md

@@ -0,0 +1,151 @@
# Forensic Investigation Report

> **Instructions**: Fill in all sections. Every factual claim must cite at least one `[EV-XXXX]` evidence ID.
> Remove placeholder text and instruction notes before finalizing. Replace all secrets with `[REDACTED]`.

---

## Executive Summary

**Target Repository**: `OWNER/REPO`
**Investigation Period**: YYYY-MM-DD to YYYY-MM-DD
**Verdict**: <!-- Compromised / Clean / Inconclusive -->
**Confidence Level**: <!-- High / Medium / Low -->
**Report Date**: YYYY-MM-DD
**Investigator**: <!-- Agent session ID or analyst name -->

<!-- One paragraph: what was investigated, what was found, what is recommended. -->

---

## Timeline of Events

> All timestamps in UTC. Each event must cite at least one evidence ID.

| Timestamp (UTC) | Event | Evidence IDs | Source |
|-----------------|-------|--------------|--------|
| YYYY-MM-DDTHH:MM:SSZ | _Describe event_ | [EV-XXXX] | git / gh_api / gh_archive / web_archive |
| | | | |

---

## Validated Hypotheses

### Hypothesis 1: <!-- Short title -->

**Status**: <!-- VALIDATED / INCONCLUSIVE / REJECTED -->
**Claim**: _Full statement of the hypothesis._

**Supporting Evidence**:

- [EV-XXXX]: _What this evidence shows_
- [EV-YYYY]: _What this evidence shows_

**Counter-Evidence Considered**: _What might disprove this, and why it was ruled out or not._
**Confidence**: <!-- High / Medium / Low, and why -->

---

## Indicators of Compromise (IOC List)

| Type | Value | Status | Evidence |
|------|-------|--------|----------|
| COMMIT_SHA | `abc123...` | Confirmed malicious | [EV-XXXX] |
| ACTOR_USERNAME | `handle` | Suspected compromised | [EV-YYYY] |
| FILE_PATH | `src/evil.js` | Confirmed malicious | [EV-ZZZZ] |
| DOMAIN | `evil-cdn.io` | Confirmed C2 | [EV-WWWW] |

---

## Affected Versions

| Version / Tag | Published | Contains Malicious Code | Evidence |
|---------------|-----------|-------------------------|----------|
| `v1.2.3` | YYYY-MM-DD | Yes / No / Unknown | [EV-XXXX] |

---

## Evidence Registry

> Generated by: `python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json export`

<!-- Paste the Markdown table output from the evidence-store.py export command here -->

| ID | Type | Source | Actor | Verification | Event Timestamp | URL |
|----|------|--------|-------|--------------|-----------------|-----|
| EV-0001 | | | | | | |

---

## Chain of Custody

> Generated by: `python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json export`

<!-- Paste the chain of custody section from the export output here -->

| Evidence ID | Action | Timestamp | Source |
|-------------|--------|-----------|--------|
| EV-0001 | add | | |

---

## Technical Findings

### Git History Analysis

_Summarize findings from local git analysis: dangling commits, reflog anomalies, unsigned commits, binary additions, etc._

### GitHub API Analysis

_Summarize findings from the GitHub REST API: deleted PRs/issues, contributor changes, release anomalies, etc._

### GitHub Archive Analysis

_Summarize findings from BigQuery: force-push events, delete events, workflow anomalies, member changes, etc._
_Note: If BigQuery was unavailable, state this explicitly._

### Wayback Machine Analysis

_Summarize findings from archive.org: recovered deleted pages, historical content differences, etc._

### IOC Enrichment

_Summarize enrichment results: WHOIS data for domains, recovered commit content, actor account analysis, etc._

---

## Recommendations

### Immediate Actions (If Compromise Confirmed)

- [ ] Rotate all GitHub tokens, API keys, and credentials that may have been exposed
- [ ] Pin dependency versions to hashes in all affected packages
- [ ] Publish a security advisory / CVE if applicable
- [ ] Notify downstream users and package registries (npm, PyPI, etc.)
- [ ] Revoke access for the compromised account and re-secure it with hardware 2FA
- [ ] Audit all CI/CD workflow files for unauthorized modifications
- [ ] Review all releases published during the compromise window

### Monitoring Recommendations

- [ ] Enable branch protection on `main`/`master` (require code review, disallow force-push)
- [ ] Require signed commits (GPG/SSH)
- [ ] Set up GitHub audit log streaming for future monitoring
- [ ] Pin critical dependencies to known-good SHAs in lock files

---

## Limitations and Caveats

- _List any data sources that were unavailable (e.g., no BigQuery access)_
- _Note any evidence that is single-source only (not independently verified)_
- _Note any hypotheses that could not be confirmed or denied_

---

## References

- Evidence store: `evidence.json` (verify SHA-256 integrity with `python3 SKILL_DIR/scripts/evidence-store.py --store evidence.json verify`)
- Related issues: <!-- Link to GitHub issues, CVEs, security advisories -->
- RAPTOR framework: https://github.com/gadievron/raptor
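As a hedged illustration of the `verify` step referenced above: integrity checking compares each item's stored SHA-256 against its current content, so any after-the-fact edit to the evidence file is flagged. Method names follow the test suite in this commit; `evidence_store` is assumed loaded via importlib as in the earlier sketch.

```python
# Assumes `evidence_store` was loaded via importlib as in the earlier sketch.
store = evidence_store.EvidenceStore("evidence.json")
store.add(source="git_fsck", content="dangling commit abc123", evidence_type="git")

assert store.verify_integrity() == []              # all content hashes match

store.data["evidence"][0]["content"] = "tampered"  # simulate post-hoc editing
issues = store.verify_integrity()
print(issues[0]["id"])                             # "EV-0001" is flagged
```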

View File

@@ -0,0 +1,43 @@
# Malicious Package Investigation Report

---

## 📦 Package Metadata

- **Package Name**:
- **Registry**: [NPM / PyPI / RubyGems / etc.]
- **Affected Versions**:
- **Malicious Version(s)**:
- **Downloads at Time of Detection**:
- **Package URL**:

---

## 🚩 Indicators of Compromise (IOCs)

- **Malicious URL(s)**:
- **Exfiltrated Data Types**: [Environment variables, ~/.ssh/id_rsa, /etc/shadow, etc.]
- **Exfiltration Method**: [DNS tunneling, HTTP POST to C2, etc.]
- **C2 IP/Domain**:

---

## 🛠️ Analysis Summary

- **Primary Mechanism**: [Typosquatting / Dependency Confusion / Maintainer Takeover]
- **Behavior Description**:
  - [Example: Installs a postinstall script that exfiltrates environment variables.]
  - [Example: Patches `setup.py` to download a secondary payload.]

---

## 🔍 Evidence Registry

| Evidence ID | Type | Source | Description |
|-------------|------|--------|-------------|
| EV-XXXX | ioc | NPM | Package install script snapshot |
| EV-YYYY | web | Wayback | Historical version comparison |

---

## 🛡️ Recommended Mitigations

1. [ ] Unpublish/report the package to the registry.
2. [ ] Audit `package-lock.json` or `requirements.txt` across all projects.
3. [ ] Rotate secrets exfiltrated via environment variables.
4. [ ] Pin specific hashes (SHASUM) for mission-critical dependencies.
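A short sketch of how the Evidence Registry rows above would be captured, with `evidence_store` loaded as in the first sketch. The `source` and `evidence_type` strings mirror the table and are illustrative, not a fixed vocabulary.

```python
store = evidence_store.EvidenceStore("evidence.json")

# One add() call per registry row; notes are free-form context.
store.add(source="NPM", content="package install script snapshot",
          evidence_type="ioc", notes="postinstall exfiltrates environment variables")
store.add(source="Wayback", content="historical version comparison",
          evidence_type="web")

print(store.export_markdown())  # paste the resulting tables into the registry above
```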

View File

@@ -0,0 +1,186 @@
import os
import json
import pytest
from pathlib import Path
import importlib.util

# Load the hyphenated script name dynamically
repo_root = Path(__file__).parent.parent
script_path = repo_root / "optional-skills" / "security" / "oss-forensics" / "scripts" / "evidence-store.py"
spec = importlib.util.spec_from_file_location("evidence_store", str(script_path))
evidence_store = importlib.util.module_from_spec(spec)
spec.loader.exec_module(evidence_store)
EvidenceStore = evidence_store.EvidenceStore


def test_evidence_store_init(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    assert store.filepath == str(store_file)
    assert len(store.data["evidence"]) == 0
    assert "metadata" in store.data
    assert store.data["metadata"]["version"] == "2.0"
    assert "chain_of_custody" in store.data


def test_evidence_store_add(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    eid = store.add(
        source="test_source",
        content="test_content",
        evidence_type="git",
        actor="test_actor",
        notes="test_notes",
    )
    assert eid == "EV-0001"
    assert len(store.data["evidence"]) == 1
    assert store.data["evidence"][0]["content"] == "test_content"
    assert store.data["evidence"][0]["id"] == "EV-0001"
    assert store.data["evidence"][0]["actor"] == "test_actor"
    assert store.data["evidence"][0]["notes"] == "test_notes"
    # Verify SHA-256 was computed
    assert store.data["evidence"][0]["content_sha256"] is not None
    assert len(store.data["evidence"][0]["content_sha256"]) == 64


def test_evidence_store_add_persists(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    store.add(source="s1", content="c1", evidence_type="git")
    # Reload from disk
    store2 = EvidenceStore(str(store_file))
    assert len(store2.data["evidence"]) == 1
    assert store2.data["evidence"][0]["id"] == "EV-0001"


def test_evidence_store_sequential_ids(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    eid1 = store.add(source="s1", content="c1", evidence_type="git")
    eid2 = store.add(source="s2", content="c2", evidence_type="gh_api")
    eid3 = store.add(source="s3", content="c3", evidence_type="ioc")
    assert eid1 == "EV-0001"
    assert eid2 == "EV-0002"
    assert eid3 == "EV-0003"


def test_evidence_store_list(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    store.add(source="s1", content="c1", evidence_type="git", actor="a1")
    store.add(source="s2", content="c2", evidence_type="gh_api", actor="a2")
    all_evidence = store.list_evidence()
    assert len(all_evidence) == 2
    git_evidence = store.list_evidence(filter_type="git")
    assert len(git_evidence) == 1
    assert git_evidence[0]["actor"] == "a1"
    actor_evidence = store.list_evidence(filter_actor="a2")
    assert len(actor_evidence) == 1
    assert actor_evidence[0]["type"] == "gh_api"


def test_evidence_store_verify_integrity(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    store.add(source="s1", content="c1", evidence_type="git")
    assert len(store.verify_integrity()) == 0
    # Manually corrupt the content to trigger a hash mismatch
    store.data["evidence"][0]["content"] = "corrupted_content"
    issues = store.verify_integrity()
    assert len(issues) == 1
    assert issues[0]["id"] == "EV-0001"


def test_evidence_store_query(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    store.add(source="github_api", content="malicious activity detected", evidence_type="gh_api")
    store.add(source="manual", content="clean observation", evidence_type="manual")
    results = store.query("malicious")
    assert len(results) == 1
    assert results[0]["source"] == "github_api"
    # Query should be case-insensitive
    results = store.query("MALICIOUS")
    assert len(results) == 1


def test_evidence_store_query_searches_multiple_fields(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    store.add(source="git_fsck", content="dangling commit abc123", evidence_type="git", actor="attacker")
    store.add(source="manual", content="clean", evidence_type="manual")
    # Search by source
    assert len(store.query("fsck")) == 1
    # Search by actor
    assert len(store.query("attacker")) == 1
    # Search returns nothing for non-matching
    assert len(store.query("nonexistent")) == 0


def test_evidence_store_chain_of_custody(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    store.add(source="s1", content="c1", evidence_type="git")
    store.add(source="s2", content="c2", evidence_type="gh_api")
    chain = store.data["chain_of_custody"]
    assert len(chain) == 2
    assert chain[0]["evidence_id"] == "EV-0001"
    assert chain[0]["action"] == "add"
    assert chain[1]["evidence_id"] == "EV-0002"


def test_evidence_store_export_markdown(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    store.add(source="git_log", content="suspicious commit", evidence_type="git", actor="actor1")
    md = store.export_markdown()
    assert "# Evidence Registry" in md
    assert "EV-0001" in md
    assert "Chain of Custody" in md
    assert "actor1" in md


def test_evidence_store_summary(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store = EvidenceStore(str(store_file))
    store.add(source="s1", content="c1", evidence_type="git", actor="a1")
    store.add(source="s2", content="c2", evidence_type="git", actor="a2")
    store.add(source="s3", content="c3", evidence_type="gh_api", actor="a1")
    s = store.summary()
    assert s["total"] == 3
    assert s["by_type"]["git"] == 2
    assert s["by_type"]["gh_api"] == 1
    assert "a1" in s["unique_actors"]
    assert "a2" in s["unique_actors"]


def test_evidence_store_corrupted_file(tmp_path):
    store_file = tmp_path / "test_evidence.json"
    store_file.write_text("NOT VALID JSON {{{")
    with pytest.raises(SystemExit):
        EvidenceStore(str(store_file))
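
The suite loads the hyphenated script through `importlib` rather than a normal import. Assuming the file sits in the repository's test directory (its exact path is not shown in this hunk), `python3 -m pytest -q` from the repo root should collect and run all 12 tests.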