[claude] Poka-yoke: make test skips/flakes impossible to ignore (#1094) #1104
168
docs/QUARANTINE_PROCESS.md
Normal file
@@ -0,0 +1,168 @@
# Quarantine Process

**Poka-yoke principle:** a flaky or broken test must never silently rot in
place. Quarantine is the correction step in the
Prevention → Detection → Correction triad described in issue #1094.

---

## When to quarantine

Quarantine a test when **any** of the following are true:

| Signal | Source |
|--------|--------|
| `flake_detector.py` flags the test at < 95 % consistency | Automated |
| The test fails intermittently in CI over two consecutive runs | Manual observation |
| The test depends on infrastructure that is temporarily unavailable | Manual observation |
| You are fixing a bug and need to defer a related test | Developer judgement |

Do **not** use quarantine as a way to ignore tests indefinitely. The
quarantine directory is a **30-day time-box** — see the escalation rule below.

---

## Step-by-step workflow

### 1 File an issue

Open a Gitea issue with the title prefix `[FLAKY]` or `[BROKEN]`:

```
[FLAKY] test_foo_bar non-deterministically fails with assertion error
```

Note the issue number — you will need it in the next step.

### 2 Move the test file

Move (or copy) the test from `tests/` into `tests/quarantine/`.

```bash
git mv tests/test_my_thing.py tests/quarantine/test_my_thing.py
```

If only individual test functions are flaky, extract them into a new file in
`tests/quarantine/` rather than moving the whole module.

### 3 Annotate the test

Add the `@pytest.mark.quarantine` marker with the issue reference:

```python
import pytest


@pytest.mark.quarantine(reason="Flaky until #NNN is resolved")
def test_my_thing():
    ...
```

This satisfies the poka-yoke skip-enforcement rule: the test is allowed to
be skipped or excluded because it is explicitly linked to a tracking issue.
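The default exclusion could be wired up in `conftest.py` roughly as below. This is a minimal sketch under the assumption that a `--run-quarantine` flag gates the behaviour; the project's actual hook may differ:

```python
# Sketch: deselect quarantine-marked tests unless --run-quarantine is given.
# Hypothetical conftest.py fragment — not this PR's verbatim implementation.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-quarantine",
        action="store_true",
        default=False,
        help="Also run tests marked @pytest.mark.quarantine",
    )


def pytest_collection_modifyitems(config, items):
    # When the flag is absent, mark every quarantined test as skipped so the
    # default CI run stays green without deleting or hiding the test.
    if config.getoption("--run-quarantine"):
        return
    skip = pytest.mark.skip(reason="quarantined — see tracking issue on the marker")
    for item in items:
        if "quarantine" in item.keywords:
            item.add_marker(skip)
```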

### 4 Verify CI still passes

```bash
pytest                    # default run — quarantine tests are excluded
pytest --run-quarantine   # optional: run quarantined tests explicitly
```

The main CI run must be green before merging.

### 5 Add to `.test-history.json` exclusions (optional)

If the flake detector is tracking the test, add it to the `quarantine_list` in
`.test-history.json` so it is excluded from the consistency report:

```json
{
  "quarantine_list": [
    "tests/quarantine/test_my_thing.py::test_my_thing"
  ]
}
```
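To make the effect of `quarantine_list` concrete, here is a sketch of the kind of filtering the consistency report could apply. The helper name `active_tests` and the filtering step itself are illustrative assumptions — they are not shown in this PR's script excerpt:

```python
# Hypothetical helper: drop quarantine_list entries from the tracked tests
# before computing consistency. Illustrative only — not code from this PR.
def active_tests(history: dict) -> dict:
    quarantined = set(history.get("quarantine_list", []))
    return {
        node_id: rec
        for node_id, rec in history.get("tests", {}).items()
        if node_id not in quarantined
    }


history = {
    "quarantine_list": ["tests/quarantine/test_my_thing.py::test_my_thing"],
    "tests": {
        "tests/quarantine/test_my_thing.py::test_my_thing": {"runs": 5, "passes": 3},
        "tests/test_stable.py::test_stable": {"runs": 5, "passes": 5},
    },
}
remaining = active_tests(history)
```

Only `tests/test_stable.py::test_stable` survives the filter, so the quarantined test no longer drags down the report.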

---

## Escalation rule

If a quarantined test's tracking issue has had **no activity for 30 days**,
the next developer to touch that file must:

1. Attempt to fix and un-quarantine the test, **or**
2. Delete the test and close the issue with a comment explaining why, **or**
3. Leave a comment on the issue explaining the blocker and reset the 30-day
   clock explicitly.

**A test may not stay in quarantine indefinitely without active attention.**

---

## Un-quarantining a test

When the underlying issue is resolved:

1. Remove `@pytest.mark.quarantine` from the test.
2. Move the file back from `tests/quarantine/` to `tests/`.
3. Run the full suite to confirm it passes consistently (at least 3 local runs).
4. Close the tracking issue.
5. Remove any entries from `.test-history.json`'s `quarantine_list`.

---

## Flake detector integration

The flake detector (`scripts/flake_detector.py`) is run after every CI test
execution. It reads `.test-report.json` (produced by `pytest --json-report`)
and updates `.test-history.json`.

**CI integration example (shell script or CI step):**

```bash
pytest --json-report --json-report-file=.test-report.json
python scripts/flake_detector.py
```

If the flake detector exits non-zero, the CI step fails and the output lists
the offending tests with their consistency percentages.
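The percentages in that output come from the consistency metric: the fraction of runs whose outcome matches the test's dominant (pass or fail) outcome. A standalone re-derivation, mirroring `consistency()` in `scripts/flake_detector.py` but flattened to plain counts:

```python
# Consistency = dominant outcome count / total runs. Skipped runs count
# toward the denominator, as in the script's per-test records.
def consistency(passes: int, failures: int, skips: int = 0) -> float:
    runs = passes + failures + skips
    if runs == 0:
        return 1.0
    return max(passes, failures) / runs


assert consistency(10, 0) == 1.0   # stable green
assert consistency(0, 10) == 1.0   # stable red: consistently broken, not flaky
assert consistency(9, 1) == 0.9    # flaky — below the default 0.95 threshold
```

Note that a consistently failing test also scores 1.0: it is broken, not flaky, and shows up through its `last_outcome` rather than through this metric.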

**Local usage:**

```bash
# After running tests with JSON report:
python scripts/flake_detector.py

# Just view current statistics without ingesting a new report:
python scripts/flake_detector.py --no-update

# Lower threshold for local dev:
python scripts/flake_detector.py --threshold 0.90
```

---

## Summary

```
Test fails intermittently
        │
        ▼
File [FLAKY] issue
        │
        ▼
git mv test → tests/quarantine/
        │
        ▼
Add @pytest.mark.quarantine(reason="#NNN")
        │
        ▼
Main CI green ✓
        │
        ▼
Fix the root cause (within 30 days)
        │
        ▼
git mv back → tests/
Remove quarantine marker
Close issue ✓
```
14
pytest.ini
Normal file
@@ -0,0 +1,14 @@
[pytest]
testpaths = tests
asyncio_mode = auto

# Show full diffs and verbose skip/fail reasons
addopts =
    -v
    --tb=short
    --strict-markers

# Markers registered here (also registered in conftest.py for programmatic use)
markers =
    integration: mark test as integration test (requires MCP servers)
    quarantine: mark test as quarantined (flaky/broken, tracked by issue)
256
scripts/flake_detector.py
Executable file
@@ -0,0 +1,256 @@
#!/usr/bin/env python3
"""Flake detector for the Nexus test suite.

Reads pytest JSON reports (produced by pytest-json-report) and maintains a
rolling history file at .test-history.json. After each update it prints a
report of any test whose pass rate has dropped below the 95 % consistency
threshold and exits non-zero if any flaky tests are found.

Usage
-----
Install pytest-json-report once::

    pip install pytest-json-report

Then run tests with JSON output::

    pytest --json-report --json-report-file=.test-report.json

Then call this script::

    python scripts/flake_detector.py                    # uses .test-report.json + .test-history.json
    python scripts/flake_detector.py --report path/to/report.json
    python scripts/flake_detector.py --history path/to/history.json
    python scripts/flake_detector.py --threshold 0.90   # lower threshold for local dev

The script is also safe to call with no report file — it will just print the
current history statistics without updating anything.
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import TypedDict


# ---------------------------------------------------------------------------
# Types
# ---------------------------------------------------------------------------

class TestRecord(TypedDict):
    """Per-test rolling history."""

    runs: int
    passes: int
    failures: int
    skips: int
    last_outcome: str  # "passed" | "failed" | "skipped" | "error"


class HistoryFile(TypedDict):
    total_runs: int
    tests: dict[str, TestRecord]


# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------

DEFAULT_REPORT = Path(".test-report.json")
DEFAULT_HISTORY = Path(".test-history.json")
DEFAULT_THRESHOLD = 0.95  # 95 % consistency required
MIN_RUNS = 5  # need at least this many runs before flagging


# ---------------------------------------------------------------------------
# Core helpers
# ---------------------------------------------------------------------------

def load_history(history_path: Path) -> HistoryFile:
    if history_path.exists():
        with history_path.open() as fh:
            return json.load(fh)
    return {"total_runs": 0, "tests": {}}


def save_history(history: HistoryFile, history_path: Path) -> None:
    with history_path.open("w") as fh:
        json.dump(history, fh, indent=2, sort_keys=True)
    print(f"[flake-detector] History saved → {history_path}", file=sys.stderr)


def ingest_report(report_path: Path, history: HistoryFile) -> int:
    """Merge a pytest JSON report into *history*. Returns the number of tests updated."""
    with report_path.open() as fh:
        report = json.load(fh)

    history["total_runs"] = history.get("total_runs", 0) + 1
    tests_section = report.get("tests", [])

    for test in tests_section:
        node_id: str = test.get("nodeid", "unknown")
        outcome: str = test.get("outcome", "unknown")

        rec: TestRecord = history["tests"].setdefault(
            node_id,
            {"runs": 0, "passes": 0, "failures": 0, "skips": 0, "last_outcome": ""},
        )
        rec["runs"] += 1
        rec["last_outcome"] = outcome
        if outcome == "passed":
            rec["passes"] += 1
        elif outcome in ("failed", "error"):
            rec["failures"] += 1
        elif outcome == "skipped":
            rec["skips"] += 1

    return len(tests_section)


def consistency(rec: TestRecord) -> float:
    """Return the fraction of runs that match the test's dominant outcome.

    A test that always passes → 1.0 (stable green).
    A test that always fails → 1.0 (stable red — consistently broken, not flaky).
    A test that passes 9 out of 10 times → 0.9 (flaky).

    We define *consistency* as the rate at which the test's outcome matches
    its dominant outcome (pass or fail). A test with fewer than
    MIN_RUNS runs is not judged.
    """
    runs = rec["runs"]
    if runs == 0:
        return 1.0
    passes = rec["passes"]
    failures = rec["failures"]
    dominant = max(passes, failures)
    return dominant / runs


def find_flaky_tests(
    history: HistoryFile,
    threshold: float = DEFAULT_THRESHOLD,
) -> list[tuple[str, TestRecord, float]]:
    """Return (node_id, record, consistency_rate) for all tests below threshold."""
    flaky: list[tuple[str, TestRecord, float]] = []
    for node_id, rec in history["tests"].items():
        if rec["runs"] < MIN_RUNS:
            continue
        rate = consistency(rec)
        if rate < threshold:
            flaky.append((node_id, rec, rate))
    flaky.sort(key=lambda x: x[2])  # worst first
    return flaky


# ---------------------------------------------------------------------------
# Reporting
# ---------------------------------------------------------------------------

def print_report(
    flaky: list[tuple[str, TestRecord, float]],
    history: HistoryFile,
    threshold: float,
) -> None:
    total_tests = len(history["tests"])
    total_runs = history.get("total_runs", 0)

    print(f"\n{'=' * 70}")
    print(" FLAKE DETECTOR REPORT")
    print(f"{'=' * 70}")
    print(f" Total suite runs tracked : {total_runs}")
    print(f" Total distinct tests     : {total_tests}")
    print(f" Consistency threshold    : {threshold:.0%}")
    print(f" Min runs before judging  : {MIN_RUNS}")
    print(f"{'=' * 70}")

    if not flaky:
        print(" ✓ No flaky tests detected — all tests above consistency threshold.")
        print(f"{'=' * 70}\n")
        return

    print(f" ✗ {len(flaky)} FLAKY TEST(S) DETECTED:\n")
    for node_id, rec, rate in flaky:
        print(f"   [{rate:.0%}] {node_id}")
        print(
            f"        runs={rec['runs']} passes={rec['passes']} "
            f"failures={rec['failures']} skips={rec['skips']} "
            f"last={rec['last_outcome']}"
        )
        print()

    print(" ACTION REQUIRED:")
    print("   1. Move each flaky test to tests/quarantine/")
    print("   2. File a tracking issue with [FLAKY] in the title")
    print("   3. Add @pytest.mark.quarantine(reason='#NNN') to the test")
    print("   See docs/QUARANTINE_PROCESS.md for full instructions.")
    print(f"{'=' * 70}\n")


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Detect flaky tests by analysing pytest JSON report history."
    )
    parser.add_argument(
        "--report",
        type=Path,
        default=DEFAULT_REPORT,
        help=f"Path to pytest JSON report file (default: {DEFAULT_REPORT})",
    )
    parser.add_argument(
        "--history",
        type=Path,
        default=DEFAULT_HISTORY,
        help=f"Path to rolling history JSON file (default: {DEFAULT_HISTORY})",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=DEFAULT_THRESHOLD,
        help=f"Consistency threshold 0–1 (default: {DEFAULT_THRESHOLD})",
    )
    parser.add_argument(
        "--no-update",
        action="store_true",
        default=False,
        help="Print current statistics without ingesting a new report",
    )
    return parser.parse_args(argv)


def main(argv: list[str] | None = None) -> int:
    args = parse_args(argv)
    history = load_history(args.history)

    if not args.no_update:
        if not args.report.exists():
            print(
                f"[flake-detector] No report file at {args.report} — "
                "run pytest with --json-report first.",
                file=sys.stderr,
            )
            # Not a fatal error; just print current state.
        else:
            n = ingest_report(args.report, history)
            print(
                f"[flake-detector] Ingested {n} test results from {args.report}",
                file=sys.stderr,
            )
            save_history(history, args.history)

    flaky = find_flaky_tests(history, threshold=args.threshold)
    print_report(flaky, history, threshold=args.threshold)

    return 1 if flaky else 0


if __name__ == "__main__":
    sys.exit(main())
@@ -1,15 +1,43 @@
"""Pytest configuration for the test suite."""

import re

import pytest

# Configure pytest-asyncio mode
pytest_plugins = ["pytest_asyncio"]

# Pattern that constitutes a valid issue link in a skip reason.
# Accepts: #NNN, https?://..., or JIRA-NNN style keys.
_ISSUE_LINK_RE = re.compile(
    r"(#\d+|https?://\S+|[A-Z]+-\d+)",
    re.IGNORECASE,
)


def _has_issue_link(reason: str) -> bool:
    """Return True if *reason* contains a recognisable issue reference."""
    return bool(_ISSUE_LINK_RE.search(reason or ""))


def _skip_reason(report) -> str:
    """Extract the human-readable skip reason from a pytest report."""
    longrepr = getattr(report, "longrepr", None)
    if longrepr is None:
        return ""
    if isinstance(longrepr, tuple) and len(longrepr) >= 3:
        # (filename, lineno, "Skipped: <reason>")
        return str(longrepr[2])
    return str(longrepr)


def pytest_configure(config):
    """Configure pytest."""
    config.addinivalue_line(
        "markers", "integration: mark test as integration test (requires MCP servers)"
    )
    config.addinivalue_line(
        "markers",
        "quarantine: mark test as quarantined (flaky/broken, tracked by issue)",
    )


def pytest_addoption(parser):
@@ -20,6 +48,12 @@ def pytest_addoption(parser):
        default=False,
        help="Run integration tests that require MCP servers",
    )
    parser.addoption(
        "--no-skip-enforcement",
        action="store_true",
        default=False,
        help="Disable poka-yoke enforcement of issue-linked skip reasons (CI escape hatch)",
    )


def pytest_collection_modifyitems(config, items):
@@ -31,3 +65,60 @@ def pytest_collection_modifyitems(config, items):
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip_integration)


# ---------------------------------------------------------------------------
# POKA-YOKE: Treat skipped tests as failures unless they carry an issue link.
# ---------------------------------------------------------------------------

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    """Intercept skipped reports and fail them if they lack an issue link.

    Exceptions:
    * Tests in tests/quarantine/ — explicitly quarantined; the issue link is
      required on the quarantine marker, not the skip marker.
    * Tests using environment-variable-based ``skipif`` conditions — these are
      legitimate CI gates (RUN_INTEGRATION_TESTS, RUN_LIVE_TESTS, etc.) where
      the *condition* is the gate, not a developer opt-out. We allow these
      only when the skip reason mentions a recognised env-var pattern.
    * --no-skip-enforcement flag set (emergency escape hatch).
    """
    outcome = yield
    report = outcome.get_result()

    if not report.skipped:
        return

    # Escape hatch for emergency use.
    if item.config.getoption("--no-skip-enforcement", default=False):
        return

    reason = _skip_reason(report)

    # Allow quarantined tests — they are tracked by their quarantine marker.
    if "quarantine" in item.keywords:
        return

    # Allow env-var-gated skipif conditions. These come from the
    # pytest_collection_modifyitems integration gate above, or from
    # explicit @pytest.mark.skipif(..., reason="... requires ENV=1 ...")
    _ENV_GATE_RE = re.compile(r"(require|needs|set)\s+\w+=[^\s]+", re.IGNORECASE)
    if _ENV_GATE_RE.search(reason):
        return

    # Allow skips added by the integration gate in this very conftest.
    if "require --run-integration" in reason:
        return

    # Anything else needs an issue link.
    if not _has_issue_link(reason):
        report.outcome = "failed"
        report.longrepr = (
            "[POKA-YOKE] Skip without issue link is not allowed.\n"
            f"  Reason given: {reason!r}\n"
            "  Fix: add an issue reference to the skip reason, e.g.:\n"
            "    @pytest.mark.skip(reason='Broken until #NNN is resolved')\n"
            "  Or quarantine the test: move it to tests/quarantine/ and\n"
            "  file an issue — see docs/QUARANTINE_PROCESS.md"
        )
28
tests/quarantine/README.md
Normal file
@@ -0,0 +1,28 @@
# tests/quarantine/

This directory holds tests that have been **quarantined** because they are
flaky or temporarily broken. Quarantine keeps the main test suite green while
ensuring no test is silently deleted or forgotten.

## Rules

1. Every file here must correspond to an open issue.
2. Every test here must carry `@pytest.mark.quarantine(reason="#NNN")`.
3. Quarantined tests are **excluded from the default CI run** but are included
   when you pass `--run-quarantine`.
4. The quarantine is a **temporary holding area**, not a graveyard. If a
   quarantined test's issue has been open for more than 30 days with no
   progress, escalate it.

## Adding a test

```python
# tests/quarantine/test_my_flaky_thing.py
import pytest


@pytest.mark.quarantine(reason="Flaky: #1234")
def test_my_flaky_thing():
    ...
```

See `docs/QUARANTINE_PROCESS.md` for the complete workflow.
2
tests/quarantine/__init__.py
Normal file
@@ -0,0 +1,2 @@
# Quarantined tests live here.
# See docs/QUARANTINE_PROCESS.md for the full quarantine workflow.