[claude] Poka-yoke: make test skips/flakes impossible to ignore (#1094) #1104

Merged
claude merged 1 commit from claude/issue-1094 into main 2026-04-07 14:38:50 +00:00
6 changed files with 559 additions and 0 deletions

docs/QUARANTINE_PROCESS.md Normal file

@@ -0,0 +1,168 @@
# Quarantine Process
**Poka-yoke principle:** a flaky or broken test must never silently rot in
place. Quarantine is the correction step in the
Prevention → Detection → Correction triad described in issue #1094.
---
## When to quarantine
Quarantine a test when **any** of the following are true:
| Signal | Source |
|--------|--------|
| `flake_detector.py` flags the test at < 95 % consistency | Automated |
| The test fails intermittently in CI over two consecutive runs | Manual observation |
| The test depends on infrastructure that is temporarily unavailable | Manual observation |
| You are fixing a bug and need to defer a related test | Developer judgement |
Do **not** use quarantine as a way to ignore tests indefinitely. The
quarantine directory is a **30-day time-box** — see the escalation rule below.
---
## Step-by-step workflow
### 1 File an issue
Open a Gitea issue with the title prefix `[FLAKY]` or `[BROKEN]`:
```
[FLAKY] test_foo_bar non-deterministically fails with assertion error
```
Note the issue number — you will need it in the next step.
### 2 Move the test file
Move the test file from `tests/` into `tests/quarantine/`.
```bash
git mv tests/test_my_thing.py tests/quarantine/test_my_thing.py
```
If only individual test functions are flaky, extract them into a new file in
`tests/quarantine/` rather than moving the whole module.
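For example, with hypothetical names, the extracted file would contain only the flaky case (annotated as described in the next step), while the stable tests stay behind in the original module:
```python
# tests/quarantine/test_my_thing_flaky.py  (hypothetical file and test names)
# Only the flaky case moves here; the stable tests remain in tests/test_my_thing.py.
import pytest

@pytest.mark.quarantine(reason="Flaky until #NNN is resolved")
def test_my_thing_flaky_case():
    ...  # body copied unchanged from the original module
```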
### 3 Annotate the test
Add the `@pytest.mark.quarantine` marker with the issue reference:
```python
import pytest
@pytest.mark.quarantine(reason="Flaky until #NNN is resolved")
def test_my_thing():
...
```
This satisfies the poka-yoke skip-enforcement rule: the test may be skipped or
excluded from the default run because it is explicitly linked to a tracking issue.
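The same rule applies to ordinary skips outside quarantine: a skip reason without an issue reference is rejected by the `conftest.py` enforcement hook, while one that contains a reference is allowed. A sketch with hypothetical test names:
```python
import pytest

# Rejected: no issue reference in the reason, so the poka-yoke hook
# converts this skip into a failure.
@pytest.mark.skip(reason="flaky, will fix later")
def test_without_issue_link():
    ...

# Accepted: the reason contains an issue reference (#1234), so the skip stands.
@pytest.mark.skip(reason="Broken until #1234 is resolved")
def test_with_issue_link():
    ...
```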
### 4 Verify CI still passes
```bash
pytest # default run — quarantine tests are excluded
pytest --run-quarantine # optional: run quarantined tests explicitly
```
The main CI run must be green before merging.
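The exclusion itself is wired in `conftest.py`. A minimal sketch of one way to gate a `quarantine` marker behind `--run-quarantine` (illustrative only; the project's actual hooks may differ):
```python
# conftest.py sketch: keep quarantined tests out of the default run.
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--run-quarantine",
        action="store_true",
        default=False,
        help="Also run tests marked with @pytest.mark.quarantine",
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-quarantine"):
        return  # explicit opt-in: run quarantined tests too
    skip_quarantine = pytest.mark.skip(
        reason="quarantined: run with --run-quarantine (tracked via the quarantine marker)"
    )
    for item in items:
        if "quarantine" in item.keywords:
            item.add_marker(skip_quarantine)
```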
### 5 Add to `.test-history.json` exclusions (optional)
If the flake detector is tracking the test, add it to the `quarantine_list` in
`.test-history.json` so it is excluded from the consistency report:
```json
{
"quarantine_list": [
"tests/quarantine/test_my_thing.py::test_my_thing"
]
}
```
---
## Escalation rule
If a quarantined test's tracking issue has had **no activity for 30 days**,
the next developer to touch that file must:
1. Attempt to fix and un-quarantine the test, **or**
2. Delete the test and close the issue with a comment explaining why, **or**
3. Leave a comment on the issue explaining the blocker and reset the 30-day
clock explicitly.
**A test may not stay in quarantine indefinitely without active attention.**
---
## Un-quarantining a test
When the underlying issue is resolved:
1. Remove `@pytest.mark.quarantine` from the test.
2. Move the file back from `tests/quarantine/` to `tests/`.
3. Run the full suite to confirm it passes consistently (at least 3 local runs).
4. Close the tracking issue.
5. Remove any entries from `.test-history.json`'s `quarantine_list`.
---
## Flake detector integration
The flake detector (`scripts/flake_detector.py`) is run after every CI test
execution. It reads `.test-report.json` (produced by `pytest --json-report`)
and updates `.test-history.json`.
**CI integration example (shell script or CI step):**
```bash
pytest --json-report --json-report-file=.test-report.json
python scripts/flake_detector.py
```
If the flake detector exits non-zero, the CI step fails and the output lists
the offending tests with their consistency percentages.
**Local usage:**
```bash
# After running tests with JSON report:
python scripts/flake_detector.py
# Just view current statistics without ingesting a new report:
python scripts/flake_detector.py --no-update
# Lower threshold for local dev:
python scripts/flake_detector.py --threshold 0.90
```
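The history file itself is plain JSON, so it can also be inspected directly. A short sketch, assuming the record layout written by `flake_detector.py` (a top-level `tests` mapping with per-test `runs`, `passes`, `failures`, and `skips` counters):
```python
import json
from pathlib import Path

# Print the raw pass rate for every test tracked in .test-history.json.
history = json.loads(Path(".test-history.json").read_text())
for node_id, rec in sorted(history["tests"].items()):
    runs = rec["runs"]
    pass_rate = rec["passes"] / runs if runs else 0.0
    print(f"{pass_rate:6.1%}  {runs:3d} runs  {node_id}")
```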
---
## Summary
```
Test fails intermittently
   ↓
File [FLAKY] issue
   ↓
git mv test → tests/quarantine/
   ↓
Add @pytest.mark.quarantine(reason="#NNN")
   ↓
Main CI green ✓
   ↓
Fix the root cause (within 30 days)
   ↓
git mv back → tests/
   ↓
Remove quarantine marker
   ↓
Close issue ✓
```

pytest.ini Normal file

@@ -0,0 +1,14 @@
[pytest]
testpaths = tests
asyncio_mode = auto
# Show full diffs and verbose skip/fail reasons
addopts =
-v
--tb=short
--strict-markers
# Markers registered here (also registered in conftest.py for programmatic use)
markers =
integration: mark test as integration test (requires MCP servers)
quarantine: mark test as quarantined (flaky/broken, tracked by issue)

scripts/flake_detector.py Executable file

@@ -0,0 +1,256 @@
#!/usr/bin/env python3
"""Flake detector for the Nexus test suite.
Reads pytest JSON reports (produced by pytest-json-report) and maintains a
rolling history file at .test-history.json. After each update it prints a
report of any test whose pass rate has dropped below the 95 % consistency
threshold and exits non-zero if any flaky tests are found.
Usage
-----
Install pytest-json-report once::
pip install pytest-json-report
Then run tests with JSON output::
pytest --json-report --json-report-file=.test-report.json
Then call this script::
python scripts/flake_detector.py # uses .test-report.json + .test-history.json
python scripts/flake_detector.py --report path/to/report.json
python scripts/flake_detector.py --history path/to/history.json
python scripts/flake_detector.py --threshold 0.90 # lower threshold for local dev
The script is also safe to call with no report file — it will just print the
current history statistics without updating anything.
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
from typing import TypedDict
# ---------------------------------------------------------------------------
# Types
# ---------------------------------------------------------------------------
class TestRecord(TypedDict):
"""Per-test rolling history."""
runs: int
passes: int
failures: int
skips: int
last_outcome: str # "passed" | "failed" | "skipped" | "error"
class HistoryFile(TypedDict):
total_runs: int
tests: dict[str, TestRecord]
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
DEFAULT_REPORT = Path(".test-report.json")
DEFAULT_HISTORY = Path(".test-history.json")
DEFAULT_THRESHOLD = 0.95 # 95 % consistency required
# ---------------------------------------------------------------------------
# Core helpers
# ---------------------------------------------------------------------------
def load_history(history_path: Path) -> HistoryFile:
if history_path.exists():
with history_path.open() as fh:
return json.load(fh)
return {"total_runs": 0, "tests": {}}
def save_history(history: HistoryFile, history_path: Path) -> None:
with history_path.open("w") as fh:
json.dump(history, fh, indent=2, sort_keys=True)
print(f"[flake-detector] History saved → {history_path}", file=sys.stderr)
def ingest_report(report_path: Path, history: HistoryFile) -> int:
"""Merge a pytest JSON report into *history*. Returns the number of tests updated."""
with report_path.open() as fh:
report = json.load(fh)
history["total_runs"] = history.get("total_runs", 0) + 1
tests_section = report.get("tests", [])
for test in tests_section:
node_id: str = test.get("nodeid", "unknown")
outcome: str = test.get("outcome", "unknown")
rec: TestRecord = history["tests"].setdefault(
node_id,
{"runs": 0, "passes": 0, "failures": 0, "skips": 0, "last_outcome": ""},
)
rec["runs"] += 1
rec["last_outcome"] = outcome
if outcome == "passed":
rec["passes"] += 1
elif outcome in ("failed", "error"):
rec["failures"] += 1
elif outcome == "skipped":
rec["skips"] += 1
return len(tests_section)
def consistency(rec: TestRecord) -> float:
"""Return fraction of runs that produced a deterministic (pass or fail) outcome.
A test that always passes → 1.0 (stable green).
A test that always fails → 0.0 (stable red — broken, not flaky).
A test that passes 9 out of 10 times → 0.9 (flaky).
We define *consistency* as the rate at which the test's outcome matches
its dominant outcome (pass or fail). A test with fewer than
MIN_RUNS runs is not judged.
"""
runs = rec["runs"]
if runs == 0:
return 1.0
passes = rec["passes"]
failures = rec["failures"]
dominant = max(passes, failures)
return dominant / runs
MIN_RUNS = 5 # need at least this many runs before flagging
def find_flaky_tests(
history: HistoryFile,
threshold: float = DEFAULT_THRESHOLD,
) -> list[tuple[str, TestRecord, float]]:
"""Return (node_id, record, consistency_rate) for all tests below threshold."""
flaky: list[tuple[str, TestRecord, float]] = []
for node_id, rec in history["tests"].items():
if rec["runs"] < MIN_RUNS:
continue
rate = consistency(rec)
if rate < threshold:
flaky.append((node_id, rec, rate))
flaky.sort(key=lambda x: x[2]) # worst first
return flaky
# ---------------------------------------------------------------------------
# Reporting
# ---------------------------------------------------------------------------
def print_report(
flaky: list[tuple[str, TestRecord, float]],
history: HistoryFile,
threshold: float,
) -> None:
total_tests = len(history["tests"])
total_runs = history.get("total_runs", 0)
print(f"\n{'=' * 70}")
print(" FLAKE DETECTOR REPORT")
print(f"{'=' * 70}")
print(f" Total suite runs tracked : {total_runs}")
print(f" Total distinct tests : {total_tests}")
print(f" Consistency threshold : {threshold:.0%}")
print(f" Min runs before judging : {MIN_RUNS}")
print(f"{'=' * 70}")
if not flaky:
print(" ✓ No flaky tests detected — all tests above consistency threshold.")
print(f"{'=' * 70}\n")
return
print(f"{len(flaky)} FLAKY TEST(S) DETECTED:\n")
for node_id, rec, rate in flaky:
print(f" [{rate:.0%}] {node_id}")
print(
f" runs={rec['runs']} passes={rec['passes']} "
f"failures={rec['failures']} skips={rec['skips']} "
f"last={rec['last_outcome']}"
)
print()
print(" ACTION REQUIRED:")
print(" 1. Move each flaky test to tests/quarantine/")
print(" 2. File a tracking issue with [FLAKY] in the title")
print(" 3. Add @pytest.mark.quarantine(reason='#NNN') to the test")
print(" See docs/QUARANTINE_PROCESS.md for full instructions.")
print(f"{'=' * 70}\n")
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Detect flaky tests by analysing pytest JSON report history."
)
parser.add_argument(
"--report",
type=Path,
default=DEFAULT_REPORT,
help=f"Path to pytest JSON report file (default: {DEFAULT_REPORT})",
)
parser.add_argument(
"--history",
type=Path,
default=DEFAULT_HISTORY,
help=f"Path to rolling history JSON file (default: {DEFAULT_HISTORY})",
)
parser.add_argument(
"--threshold",
type=float,
default=DEFAULT_THRESHOLD,
help=f"Consistency threshold 01 (default: {DEFAULT_THRESHOLD})",
)
parser.add_argument(
"--no-update",
action="store_true",
default=False,
help="Print current statistics without ingesting a new report",
)
return parser.parse_args(argv)
def main(argv: list[str] | None = None) -> int:
args = parse_args(argv)
history = load_history(args.history)
if not args.no_update:
if not args.report.exists():
print(
f"[flake-detector] No report file at {args.report}"
"run pytest with --json-report first.",
file=sys.stderr,
)
# Not a fatal error; just print current state.
else:
n = ingest_report(args.report, history)
print(
f"[flake-detector] Ingested {n} test results from {args.report}",
file=sys.stderr,
)
save_history(history, args.history)
flaky = find_flaky_tests(history, threshold=args.threshold)
print_report(flaky, history, threshold=args.threshold)
return 1 if flaky else 0
if __name__ == "__main__":
sys.exit(main())

conftest.py

@@ -1,15 +1,43 @@
"""Pytest configuration for the test suite."""
import re
import pytest
# Load the pytest-asyncio plugin (asyncio_mode is configured in pytest.ini)
pytest_plugins = ["pytest_asyncio"]
# Pattern that constitutes a valid issue link in a skip reason.
# Accepts: #NNN, https?://..., or JIRA-NNN style keys.
_ISSUE_LINK_RE = re.compile(
r"(#\d+|https?://\S+|[A-Z]+-\d+)",
re.IGNORECASE,
)
def _has_issue_link(reason: str) -> bool:
"""Return True if *reason* contains a recognisable issue reference."""
return bool(_ISSUE_LINK_RE.search(reason or ""))
def _skip_reason(report) -> str:
"""Extract the human-readable skip reason from a pytest report."""
longrepr = getattr(report, "longrepr", None)
if longrepr is None:
return ""
if isinstance(longrepr, tuple) and len(longrepr) >= 3:
# (filename, lineno, "Skipped: <reason>")
return str(longrepr[2])
return str(longrepr)
def pytest_configure(config):
"""Configure pytest."""
config.addinivalue_line(
"markers", "integration: mark test as integration test (requires MCP servers)"
)
config.addinivalue_line(
"markers",
"quarantine: mark test as quarantined (flaky/broken, tracked by issue)",
)
def pytest_addoption(parser):
@@ -20,6 +48,12 @@ def pytest_addoption(parser):
default=False,
help="Run integration tests that require MCP servers",
)
parser.addoption(
"--no-skip-enforcement",
action="store_true",
default=False,
help="Disable poka-yoke enforcement of issue-linked skip reasons (CI escape hatch)",
)
def pytest_collection_modifyitems(config, items):
@@ -31,3 +65,60 @@ def pytest_collection_modifyitems(config, items):
for item in items:
if "integration" in item.keywords:
item.add_marker(skip_integration)
# ---------------------------------------------------------------------------
# POKA-YOKE: Treat skipped tests as failures unless they carry an issue link.
# ---------------------------------------------------------------------------
@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
"""Intercept skipped reports and fail them if they lack an issue link.
Exceptions:
* Tests in tests/quarantine/ — explicitly quarantined, issue link required
on the quarantine marker, not the skip marker.
* Tests using environment-variable-based ``skipif`` conditions — these are
legitimate CI gates (RUN_INTEGRATION_TESTS, RUN_LIVE_TESTS, etc.) where
the *condition* is the gate, not a developer opt-out. We allow these
only when the skip reason mentions a recognised env-var pattern.
* --no-skip-enforcement flag set (emergency escape hatch).
"""
outcome = yield
report = outcome.get_result()
if not report.skipped:
return
# Escape hatch for emergency use.
if item.config.getoption("--no-skip-enforcement", default=False):
return
reason = _skip_reason(report)
# Allow quarantined tests — they are tracked by their quarantine marker.
if "quarantine" in item.keywords:
return
# Allow env-var-gated skipif conditions. These come from the
# pytest_collection_modifyitems integration gate above, or from
# explicit @pytest.mark.skipif(..., reason="... requires ENV=1 ...")
_ENV_GATE_RE = re.compile(r"(require|needs|set)\s+\w+=[^\s]+", re.IGNORECASE)
if _ENV_GATE_RE.search(reason):
return
# Allow skips added by the integration gate in this very conftest.
if "require --run-integration" in reason:
return
# Anything else needs an issue link.
if not _has_issue_link(reason):
report.outcome = "failed"
report.longrepr = (
"[POKA-YOKE] Skip without issue link is not allowed.\n"
f" Reason given: {reason!r}\n"
" Fix: add an issue reference to the skip reason, e.g.:\n"
" @pytest.mark.skip(reason='Broken until #NNN is resolved')\n"
" Or quarantine the test: move it to tests/quarantine/ and\n"
" file an issue — see docs/QUARANTINE_PROCESS.md"
)

tests/quarantine/README.md

@@ -0,0 +1,28 @@
# tests/quarantine/
This directory holds tests that have been **quarantined** because they are
flaky or temporarily broken. Quarantine keeps the main test suite green while
ensuring no test is silently deleted or forgotten.
## Rules
1. Every file here must correspond to an open issue.
2. Every test here must carry `@pytest.mark.quarantine(reason="#NNN")`.
3. Quarantined tests are **excluded from the default CI run** but are included
when you pass `--run-quarantine`.
4. The quarantine is a **temporary holding area**, not a graveyard. If a
quarantined test's issue has been open for more than 30 days with no
progress, escalate it.
## Adding a test
```python
# tests/quarantine/test_my_flaky_thing.py
import pytest
@pytest.mark.quarantine(reason="Flaky: #1234")
def test_my_flaky_thing():
...
```
See `docs/QUARANTINE_PROCESS.md` for the complete workflow.


@@ -0,0 +1,2 @@
# Quarantined tests live here.
# See docs/QUARANTINE_PROCESS.md for the full quarantine workflow.