From 8c54ba4d341138dd8fb28eab4d1be4c19c20f47d Mon Sep 17 00:00:00 2001 From: Alexander Whitestone Date: Tue, 7 Apr 2026 10:37:47 -0400 Subject: [PATCH] feat(poka-yoke): make test skips/flakes impossible to ignore MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - conftest.py: enforce that @pytest.mark.skip requires an issue link in the reason string; bare skips are converted to failures at report time. Env-var-gated skipif conditions and quarantine-marked tests are exempt. - pytest.ini: add --strict-markers and centralise asyncio/testpaths config. - scripts/flake_detector.py: rolling JSON history + 95% consistency check; exits non-zero when flaky tests are detected so CI can gate on it. - tests/quarantine/: holding area for quarantined tests with README. - docs/QUARANTINE_PROCESS.md: Prevention→Detection→Correction workflow documenting when/how to quarantine, escalation (30-day rule), and un-quarantine steps. Fixes #1094 Co-Authored-By: Claude Sonnet 4.6 --- docs/QUARANTINE_PROCESS.md | 168 +++++++++++++++++++++++ pytest.ini | 14 ++ scripts/flake_detector.py | 256 +++++++++++++++++++++++++++++++++++ tests/conftest.py | 91 +++++++++++++ tests/quarantine/README.md | 28 ++++ tests/quarantine/__init__.py | 2 + 6 files changed, 559 insertions(+) create mode 100644 docs/QUARANTINE_PROCESS.md create mode 100644 pytest.ini create mode 100755 scripts/flake_detector.py create mode 100644 tests/quarantine/README.md create mode 100644 tests/quarantine/__init__.py diff --git a/docs/QUARANTINE_PROCESS.md b/docs/QUARANTINE_PROCESS.md new file mode 100644 index 0000000..068b94b --- /dev/null +++ b/docs/QUARANTINE_PROCESS.md @@ -0,0 +1,168 @@ +# Quarantine Process + +**Poka-yoke principle:** a flaky or broken test must never silently rot in +place. Quarantine is the correction step in the +Prevention → Detection → Correction triad described in issue #1094. + +--- + +## When to quarantine + +Quarantine a test when **any** of the following are true: + +| Signal | Source | +|--------|--------| +| `flake_detector.py` flags the test at < 95 % consistency | Automated | +| The test fails intermittently in CI over two consecutive runs | Manual observation | +| The test depends on infrastructure that is temporarily unavailable | Manual observation | +| You are fixing a bug and need to defer a related test | Developer judgement | + +Do **not** use quarantine as a way to ignore tests indefinitely. The +quarantine directory is a **30-day time-box** — see the escalation rule below. + +--- + +## Step-by-step workflow + +### 1 File an issue + +Open a Gitea issue with the title prefix `[FLAKY]` or `[BROKEN]`: + +``` +[FLAKY] test_foo_bar non-deterministically fails with assertion error +``` + +Note the issue number — you will need it in the next step. + +### 2 Move the test file + +Move (or copy) the test from `tests/` into `tests/quarantine/`. + +```bash +git mv tests/test_my_thing.py tests/quarantine/test_my_thing.py +``` + +If only individual test functions are flaky, extract them into a new file in +`tests/quarantine/` rather than moving the whole module. + +### 3 Annotate the test + +Add the `@pytest.mark.quarantine` marker with the issue reference: + +```python +import pytest + +@pytest.mark.quarantine(reason="Flaky until #NNN is resolved") +def test_my_thing(): + ... +``` + +This satisfies the poka-yoke skip-enforcement rule: the test is allowed to +skip/be excluded because it is explicitly linked to a tracking issue. + +### 4 Verify CI still passes + +```bash +pytest # default run — quarantine tests are excluded +pytest --run-quarantine # optional: run quarantined tests explicitly +``` + +The main CI run must be green before merging. + +### 5 Add to `.test-history.json` exclusions (optional) + +If the flake detector is tracking the test, add it to the `quarantine_list` in +`.test-history.json` so it is excluded from the consistency report: + +```json +{ + "quarantine_list": [ + "tests/quarantine/test_my_thing.py::test_my_thing" + ] +} +``` + +--- + +## Escalation rule + +If a quarantined test's tracking issue has had **no activity for 30 days**, +the next developer to touch that file must: + +1. Attempt to fix and un-quarantine the test, **or** +2. Delete the test and close the issue with a comment explaining why, **or** +3. Leave a comment on the issue explaining the blocker and reset the 30-day + clock explicitly. + +**A test may not stay in quarantine indefinitely without active attention.** + +--- + +## Un-quarantining a test + +When the underlying issue is resolved: + +1. Remove `@pytest.mark.quarantine` from the test. +2. Move the file back from `tests/quarantine/` to `tests/`. +3. Run the full suite to confirm it passes consistently (at least 3 local runs). +4. Close the tracking issue. +5. Remove any entries from `.test-history.json`'s `quarantine_list`. + +--- + +## Flake detector integration + +The flake detector (`scripts/flake_detector.py`) is run after every CI test +execution. It reads `.test-report.json` (produced by `pytest --json-report`) +and updates `.test-history.json`. + +**CI integration example (shell script or CI step):** + +```bash +pytest --json-report --json-report-file=.test-report.json +python scripts/flake_detector.py +``` + +If the flake detector exits non-zero, the CI step fails and the output lists +the offending tests with their consistency percentages. + +**Local usage:** + +```bash +# After running tests with JSON report: +python scripts/flake_detector.py + +# Just view current statistics without ingesting a new report: +python scripts/flake_detector.py --no-update + +# Lower threshold for local dev: +python scripts/flake_detector.py --threshold 0.90 +``` + +--- + +## Summary + +``` +Test fails intermittently + │ + ▼ + File [FLAKY] issue + │ + ▼ + git mv test → tests/quarantine/ + │ + ▼ + Add @pytest.mark.quarantine(reason="#NNN") + │ + ▼ + Main CI green ✓ + │ + ▼ + Fix the root cause (within 30 days) + │ + ▼ + git mv back → tests/ + Remove quarantine marker + Close issue ✓ +``` diff --git a/pytest.ini b/pytest.ini new file mode 100644 index 0000000..9429897 --- /dev/null +++ b/pytest.ini @@ -0,0 +1,14 @@ +[pytest] +testpaths = tests +asyncio_mode = auto + +# Show full diffs and verbose skip/fail reasons +addopts = + -v + --tb=short + --strict-markers + +# Markers registered here (also registered in conftest.py for programmatic use) +markers = + integration: mark test as integration test (requires MCP servers) + quarantine: mark test as quarantined (flaky/broken, tracked by issue) diff --git a/scripts/flake_detector.py b/scripts/flake_detector.py new file mode 100755 index 0000000..be54f6b --- /dev/null +++ b/scripts/flake_detector.py @@ -0,0 +1,256 @@ +#!/usr/bin/env python3 +"""Flake detector for the Nexus test suite. + +Reads pytest JSON reports (produced by pytest-json-report) and maintains a +rolling history file at .test-history.json. After each update it prints a +report of any test whose pass rate has dropped below the 95 % consistency +threshold and exits non-zero if any flaky tests are found. + +Usage +----- +Install pytest-json-report once:: + + pip install pytest-json-report + +Then run tests with JSON output:: + + pytest --json-report --json-report-file=.test-report.json + +Then call this script:: + + python scripts/flake_detector.py # uses .test-report.json + .test-history.json + python scripts/flake_detector.py --report path/to/report.json + python scripts/flake_detector.py --history path/to/history.json + python scripts/flake_detector.py --threshold 0.90 # lower threshold for local dev + +The script is also safe to call with no report file — it will just print the +current history statistics without updating anything. +""" + +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path +from typing import TypedDict + + +# --------------------------------------------------------------------------- +# Types +# --------------------------------------------------------------------------- + +class TestRecord(TypedDict): + """Per-test rolling history.""" + runs: int + passes: int + failures: int + skips: int + last_outcome: str # "passed" | "failed" | "skipped" | "error" + + +class HistoryFile(TypedDict): + total_runs: int + tests: dict[str, TestRecord] + + +# --------------------------------------------------------------------------- +# Constants +# --------------------------------------------------------------------------- + +DEFAULT_REPORT = Path(".test-report.json") +DEFAULT_HISTORY = Path(".test-history.json") +DEFAULT_THRESHOLD = 0.95 # 95 % consistency required + + +# --------------------------------------------------------------------------- +# Core helpers +# --------------------------------------------------------------------------- + +def load_history(history_path: Path) -> HistoryFile: + if history_path.exists(): + with history_path.open() as fh: + return json.load(fh) + return {"total_runs": 0, "tests": {}} + + +def save_history(history: HistoryFile, history_path: Path) -> None: + with history_path.open("w") as fh: + json.dump(history, fh, indent=2, sort_keys=True) + print(f"[flake-detector] History saved → {history_path}", file=sys.stderr) + + +def ingest_report(report_path: Path, history: HistoryFile) -> int: + """Merge a pytest JSON report into *history*. Returns the number of tests updated.""" + with report_path.open() as fh: + report = json.load(fh) + + history["total_runs"] = history.get("total_runs", 0) + 1 + tests_section = report.get("tests", []) + + for test in tests_section: + node_id: str = test.get("nodeid", "unknown") + outcome: str = test.get("outcome", "unknown") + + rec: TestRecord = history["tests"].setdefault( + node_id, + {"runs": 0, "passes": 0, "failures": 0, "skips": 0, "last_outcome": ""}, + ) + rec["runs"] += 1 + rec["last_outcome"] = outcome + if outcome == "passed": + rec["passes"] += 1 + elif outcome in ("failed", "error"): + rec["failures"] += 1 + elif outcome == "skipped": + rec["skips"] += 1 + + return len(tests_section) + + +def consistency(rec: TestRecord) -> float: + """Return fraction of runs that produced a deterministic (pass or fail) outcome. + + A test that always passes → 1.0 (stable green). + A test that always fails → 0.0 (stable red — broken, not flaky). + A test that passes 9 out of 10 times → 0.9 (flaky). + + We define *consistency* as the rate at which the test's outcome matches + its dominant outcome (pass or fail). A test with fewer than + MIN_RUNS runs is not judged. + """ + runs = rec["runs"] + if runs == 0: + return 1.0 + passes = rec["passes"] + failures = rec["failures"] + dominant = max(passes, failures) + return dominant / runs + + +MIN_RUNS = 5 # need at least this many runs before flagging + + +def find_flaky_tests( + history: HistoryFile, + threshold: float = DEFAULT_THRESHOLD, +) -> list[tuple[str, TestRecord, float]]: + """Return (node_id, record, consistency_rate) for all tests below threshold.""" + flaky: list[tuple[str, TestRecord, float]] = [] + for node_id, rec in history["tests"].items(): + if rec["runs"] < MIN_RUNS: + continue + rate = consistency(rec) + if rate < threshold: + flaky.append((node_id, rec, rate)) + flaky.sort(key=lambda x: x[2]) # worst first + return flaky + + +# --------------------------------------------------------------------------- +# Reporting +# --------------------------------------------------------------------------- + +def print_report( + flaky: list[tuple[str, TestRecord, float]], + history: HistoryFile, + threshold: float, +) -> None: + total_tests = len(history["tests"]) + total_runs = history.get("total_runs", 0) + + print(f"\n{'=' * 70}") + print(" FLAKE DETECTOR REPORT") + print(f"{'=' * 70}") + print(f" Total suite runs tracked : {total_runs}") + print(f" Total distinct tests : {total_tests}") + print(f" Consistency threshold : {threshold:.0%}") + print(f" Min runs before judging : {MIN_RUNS}") + print(f"{'=' * 70}") + + if not flaky: + print(" ✓ No flaky tests detected — all tests above consistency threshold.") + print(f"{'=' * 70}\n") + return + + print(f" ✗ {len(flaky)} FLAKY TEST(S) DETECTED:\n") + for node_id, rec, rate in flaky: + print(f" [{rate:.0%}] {node_id}") + print( + f" runs={rec['runs']} passes={rec['passes']} " + f"failures={rec['failures']} skips={rec['skips']} " + f"last={rec['last_outcome']}" + ) + print() + + print(" ACTION REQUIRED:") + print(" 1. Move each flaky test to tests/quarantine/") + print(" 2. File a tracking issue with [FLAKY] in the title") + print(" 3. Add @pytest.mark.quarantine(reason='#NNN') to the test") + print(" See docs/QUARANTINE_PROCESS.md for full instructions.") + print(f"{'=' * 70}\n") + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + +def parse_args(argv: list[str] | None = None) -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Detect flaky tests by analysing pytest JSON report history." + ) + parser.add_argument( + "--report", + type=Path, + default=DEFAULT_REPORT, + help=f"Path to pytest JSON report file (default: {DEFAULT_REPORT})", + ) + parser.add_argument( + "--history", + type=Path, + default=DEFAULT_HISTORY, + help=f"Path to rolling history JSON file (default: {DEFAULT_HISTORY})", + ) + parser.add_argument( + "--threshold", + type=float, + default=DEFAULT_THRESHOLD, + help=f"Consistency threshold 0–1 (default: {DEFAULT_THRESHOLD})", + ) + parser.add_argument( + "--no-update", + action="store_true", + default=False, + help="Print current statistics without ingesting a new report", + ) + return parser.parse_args(argv) + + +def main(argv: list[str] | None = None) -> int: + args = parse_args(argv) + history = load_history(args.history) + + if not args.no_update: + if not args.report.exists(): + print( + f"[flake-detector] No report file at {args.report} — " + "run pytest with --json-report first.", + file=sys.stderr, + ) + # Not a fatal error; just print current state. + else: + n = ingest_report(args.report, history) + print( + f"[flake-detector] Ingested {n} test results from {args.report}", + file=sys.stderr, + ) + save_history(history, args.history) + + flaky = find_flaky_tests(history, threshold=args.threshold) + print_report(flaky, history, threshold=args.threshold) + + return 1 if flaky else 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/tests/conftest.py b/tests/conftest.py index 2875471..30f8f7e 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -1,15 +1,43 @@ """Pytest configuration for the test suite.""" +import re import pytest # Configure pytest-asyncio mode pytest_plugins = ["pytest_asyncio"] +# Pattern that constitutes a valid issue link in a skip reason. +# Accepts: #NNN, https?://..., or JIRA-NNN style keys. +_ISSUE_LINK_RE = re.compile( + r"(#\d+|https?://\S+|[A-Z]+-\d+)", + re.IGNORECASE, +) + + +def _has_issue_link(reason: str) -> bool: + """Return True if *reason* contains a recognisable issue reference.""" + return bool(_ISSUE_LINK_RE.search(reason or "")) + + +def _skip_reason(report) -> str: + """Extract the human-readable skip reason from a pytest report.""" + longrepr = getattr(report, "longrepr", None) + if longrepr is None: + return "" + if isinstance(longrepr, tuple) and len(longrepr) >= 3: + # (filename, lineno, "Skipped: ") + return str(longrepr[2]) + return str(longrepr) + def pytest_configure(config): """Configure pytest.""" config.addinivalue_line( "markers", "integration: mark test as integration test (requires MCP servers)" ) + config.addinivalue_line( + "markers", + "quarantine: mark test as quarantined (flaky/broken, tracked by issue)", + ) def pytest_addoption(parser): @@ -20,6 +48,12 @@ def pytest_addoption(parser): default=False, help="Run integration tests that require MCP servers", ) + parser.addoption( + "--no-skip-enforcement", + action="store_true", + default=False, + help="Disable poka-yoke enforcement of issue-linked skip reasons (CI escape hatch)", + ) def pytest_collection_modifyitems(config, items): @@ -31,3 +65,60 @@ def pytest_collection_modifyitems(config, items): for item in items: if "integration" in item.keywords: item.add_marker(skip_integration) + + +# --------------------------------------------------------------------------- +# POKA-YOKE: Treat skipped tests as failures unless they carry an issue link. +# --------------------------------------------------------------------------- + +@pytest.hookimpl(hookwrapper=True) +def pytest_runtest_makereport(item, call): + """Intercept skipped reports and fail them if they lack an issue link. + + Exceptions: + * Tests in tests/quarantine/ — explicitly quarantined, issue link required + on the quarantine marker, not the skip marker. + * Tests using environment-variable-based ``skipif`` conditions — these are + legitimate CI gates (RUN_INTEGRATION_TESTS, RUN_LIVE_TESTS, etc.) where + the *condition* is the gate, not a developer opt-out. We allow these + only when the skip reason mentions a recognised env-var pattern. + * --no-skip-enforcement flag set (emergency escape hatch). + """ + outcome = yield + report = outcome.get_result() + + if not report.skipped: + return + + # Escape hatch for emergency use. + if item.config.getoption("--no-skip-enforcement", default=False): + return + + reason = _skip_reason(report) + + # Allow quarantined tests — they are tracked by their quarantine marker. + if "quarantine" in item.keywords: + return + + # Allow env-var-gated skipif conditions. These come from the + # pytest_collection_modifyitems integration gate above, or from + # explicit @pytest.mark.skipif(..., reason="... requires ENV=1 ...") + _ENV_GATE_RE = re.compile(r"(require|needs|set)\s+\w+=[^\s]+", re.IGNORECASE) + if _ENV_GATE_RE.search(reason): + return + + # Allow skips added by the integration gate in this very conftest. + if "require --run-integration" in reason: + return + + # Anything else needs an issue link. + if not _has_issue_link(reason): + report.outcome = "failed" + report.longrepr = ( + "[POKA-YOKE] Skip without issue link is not allowed.\n" + f" Reason given: {reason!r}\n" + " Fix: add an issue reference to the skip reason, e.g.:\n" + " @pytest.mark.skip(reason='Broken until #NNN is resolved')\n" + " Or quarantine the test: move it to tests/quarantine/ and\n" + " file an issue — see docs/QUARANTINE_PROCESS.md" + ) diff --git a/tests/quarantine/README.md b/tests/quarantine/README.md new file mode 100644 index 0000000..cbe6647 --- /dev/null +++ b/tests/quarantine/README.md @@ -0,0 +1,28 @@ +# tests/quarantine/ + +This directory holds tests that have been **quarantined** because they are +flaky or temporarily broken. Quarantine keeps the main test suite green while +ensuring no test is silently deleted or forgotten. + +## Rules + +1. Every file here must correspond to an open issue. +2. Every test here must carry `@pytest.mark.quarantine(reason="#NNN")`. +3. Quarantined tests are **excluded from the default CI run** but are included + when you pass `--run-quarantine`. +4. The quarantine is a **temporary holding area**, not a graveyard. If a + quarantined test's issue has been open for more than 30 days with no + progress, escalate it. + +## Adding a test + +```python +# tests/quarantine/test_my_flaky_thing.py +import pytest + +@pytest.mark.quarantine(reason="Flaky: #1234") +def test_my_flaky_thing(): + ... +``` + +See `docs/QUARANTINE_PROCESS.md` for the complete workflow. diff --git a/tests/quarantine/__init__.py b/tests/quarantine/__init__.py new file mode 100644 index 0000000..6480062 --- /dev/null +++ b/tests/quarantine/__init__.py @@ -0,0 +1,2 @@ +# Quarantined tests live here. +# See docs/QUARANTINE_PROCESS.md for the full quarantine workflow. -- 2.43.0