Compare commits

..

4 Commits

Author SHA1 Message Date
Alexander Whitestone
e28d16b324 test: cover patch did-you-mean end-to-end (#960)
All checks were successful
Lint / lint (pull_request) Successful in 18s
- add focused QA tests for replace-mode rich hints without legacy generic hints
- verify ambiguous replace failures do not show did-you-mean noise
- verify V4A patch validation surfaces rich hints
- verify skill patching surfaces rich hints on true no-match failures
2026-04-22 10:43:34 -04:00
Alexander Whitestone
bc32047610 wip: fix patch did-you-mean dependencies (#960)
- restore escape-drift guard needed by current fuzzy-match tests
- import missing typing symbols in file_tools after porting patch hint logic
2026-04-22 10:41:29 -04:00
Teknium
3a24420d7d fix(patch): gate 'did you mean?' to no-match + extend to v4a/skill_manage
Follow-ups on top of @teyrebaz33's cherry-picked commit:

1. New shared helper format_no_match_hint() in fuzzy_match.py with a
   startswith('Could not find') gate so the snippet only appends to
   genuine no-match errors — not to 'Found N matches' (ambiguous),
   'Escape-drift detected', or 'identical strings' errors, which would
   all mislead the model.

2. file_tools.patch_tool suppresses the legacy generic '[Hint: old_string
   not found...]' string when the rich 'Did you mean?' snippet is
   already attached — no more double-hint.

3. Wire the same helper into patch_parser.py (V4A patch mode, both
   _validate_operations and _apply_update) and skill_manager_tool.py so
   all three fuzzy callers surface the hint consistently.

Tests: 7 new gating tests in TestFormatNoMatchHint cover every error
class (ambiguous, drift, identical, non-zero match count, None error,
no similar content, happy path). 34/34 test_fuzzy_match, 96/96
test_file_tools + test_patch_parser + test_skill_manager_tool pass.
E2E verified across all four scenarios: no-match-with-similar,
no-match-no-similar, ambiguous, success. V4A mode confirmed
end-to-end with a non-matching hunk.

(cherry picked from commit 5e6427a42c)
2026-04-22 10:30:00 -04:00
teyrebaz33
d14c1c5a56 feat(patch): add 'did you mean?' feedback when patch fails to match
When patch_replace() cannot find old_string in a file, the error message
now includes the closest matching lines from the file with line numbers
and context. This helps the LLM self-correct without a separate read_file
call.

Implements Phase 1 of #536: enhanced patch error feedback with no
architectural changes.

- tools/fuzzy_match.py: new find_closest_lines() using SequenceMatcher
- tools/file_operations.py: attach closest-lines hint to patch errors
- tests/tools/test_fuzzy_match.py: 5 new tests for find_closest_lines

(cherry picked from commit 15abf4ed8f)
2026-04-22 10:28:40 -04:00
9 changed files with 719 additions and 148 deletions

View File

@@ -5,180 +5,310 @@
## Executive Summary
This report updates the earlier optimistic draft with the repo-level finding captured in issue #877.
Local models (Ollama) CAN handle crisis support with adequate quality for the Most Sacred Moment protocol. Research demonstrates that even small local models (1.5B-7B parameters) achieve performance comparable to trained human operators in crisis detection tasks. However, they require careful implementation with safety guardrails and should complement—not replace—human oversight.
**Updated finding:** local models are adequate for crisis support and crisis detection, but not for crisis response generation.
The direct evaluation summary in issue #877 is:
- **Detection:** local models correctly identify crisis language 92% of the time
- **Response quality:** local model responses are only 60% adequate vs 94% for frontier models
- **Gospel integration:** local models integrate faith content inconsistently
- **988 Lifeline:** local models include 988 referral 78% of the time vs 99% for frontier models
That means the safe architectural conclusion is not “local is enough for the whole Most Sacred Moment protocol.”
It is:
- use local models for **detection / triage**
- use frontier models for **response generation once crisis is detected**
- build a two-stage pipeline: **local detection → frontier response**
**Key Finding:** A fine-tuned 1.5B parameter Qwen model outperformed larger models on mood and suicidal ideation detection tasks (PsyCrisisBench, 2025).
---
## 1. Direct Evaluation Findings
## 1. Crisis Detection Accuracy
### Models evaluated
- `gemma3:27b`
- `hermes4:14b`
- `mimo-v2-pro`
### Research Evidence
### What local models do well
**PsyCrisisBench (2025)** - The most comprehensive benchmark to date:
- Source: 540 annotated transcripts from Hangzhou Psychological Assistance Hotline
- Models tested: 64 LLMs across 15 families (GPT, Claude, Gemini, Llama, Qwen, DeepSeek)
- Results:
- **Suicidal ideation detection: F1=0.880** (88% accuracy)
- **Suicide plan identification: F1=0.779** (78% accuracy)
- **Risk assessment: F1=0.907** (91% accuracy)
- **Mood status recognition: F1=0.709** (71% accuracy - challenging due to missing vocal cues)
1. **Crisis detection is adequate**
- 92% crisis-language detection is strong enough for a first-pass detector
- This makes local models viable for low-latency triage and escalation triggers
**Llama-2 for Suicide Detection (British Journal of Psychiatry, 2024):**
- German fine-tuned Llama-2 model achieved:
- **Accuracy: 87.5%**
- **Sensitivity: 83.0%**
- **Specificity: 91.8%**
- Locally hosted, privacy-preserving approach
2. **They are fast and cheap enough for always-on screening**
- normal conversation can stay on local routing
- crisis screening can happen continuously without frontier-model cost on every turn
**Supportiv Hybrid AI Study (2026):**
- AI detected SI faster than humans in **77.52% passive** and **81.26% active** cases
- **90.3% agreement** between AI and human moderators
- Processed **169,181 live-chat transcripts** (449,946 user visits)
3. **They can support the operator pipeline**
- tag likely crisis turns
- raise escalation flags
- capture traces and logs for later review
### False Positive/Negative Rates
### Where local models fall short
Based on the research:
- **False Negative Rate (missed crisis):** ~12-17% for suicidal ideation
- **False Positive Rate:** ~8-12%
- **Risk Assessment Error:** ~9% overall
1. **Response generation quality is not high enough**
- 60% adequate is not enough for the highest-stakes turn in the system
- crisis intervention needs emotional presence, specificity, and steadiness
- a “mostly okay” response is not acceptable when the failure case is abandonment, flattening, or unsafe wording
2. **Faith integration is inconsistent**
- gospel content sometimes appears forced
- other times it disappears when it should be present
- that inconsistency is especially costly in a spiritually grounded crisis protocol
3. **988 referral reliability is too low**
- 78% inclusion means the model misses a critical action too often
- frontier models at 99% are materially better on a requirement that should be near-perfect
**Critical insight:** The research shows LLMs and trained human operators have *complementary* strengths—humans are better at mood recognition and suicidal ideation, while LLMs excel at risk assessment and suicide plan identification.
---
## 2. What This Means for the Most Sacred Moment
## 2. Emotional Understanding
The earlier version of this report argued that local models were good enough for the whole protocol.
Issue #877 changes that conclusion.
### Can Local Models Understand Emotional Nuance?
The Most Sacred Moment is not just a classification task.
It is a response-generation task under maximum moral and emotional load.
**Yes, with limitations:**
A model can be good enough to answer:
- “Is this a crisis?”
- “Should we escalate?”
- “Did the user mention self-harm or suicide?”
1. **Emotion Recognition:**
- Maximum F1 of 0.709 for mood status (PsyCrisisBench)
- Missing vocal cues is a significant limitation in text-only
- Semantic ambiguity creates challenges
…and still not be good enough to deliver:
- a compassionate first line
- stable emotional presence
- a faithful and natural gospel integration
- a reliable 988 referral
- the specificity needed for real crisis intervention
2. **Empathy in Responses:**
- LLMs demonstrate ability to generate empathetic responses
- Research shows they deliver "superior explanations" (BERTScore=0.9408)
- Human evaluations confirm adequate interviewing skills
That is exactly the gap the evaluation exposed.
3. **Emotional Support Conversation (ESConv) benchmarks:**
- Models trained on emotional support datasets show improved empathy
- Few-shot prompting significantly improves emotional understanding
- Fine-tuning narrows the gap with larger models
### Key Limitations
- Cannot detect tone, urgency in voice, or hesitation
- Cultural and linguistic nuances may be missed
- Context window limitations may lose conversation history
---
## 3. Architecture Recommendation
## 3. Response Quality & Safety Protocols
### Recommended pipeline
### What Makes a Good Crisis Support Response?
```text
normal conversation
-> local/default routing
**988 Suicide & Crisis Lifeline Guidelines:**
1. Show you care ("I'm glad you told me")
2. Ask directly about suicide ("Are you thinking about killing yourself?")
3. Keep them safe (remove means, create safety plan)
4. Be there (listen without judgment)
5. Help them connect (to 988, crisis services)
6. Follow up
user turn arrives
-> local crisis detector
-> if NOT crisis: stay local
-> if crisis: escalate immediately to frontier response model
```
**WHO mhGAP Guidelines:**
- Assess risk level
- Provide psychosocial support
- Refer to specialized care when needed
- Ensure follow-up
- Involve family/support network
### Why this is the right split
### Do Local Models Follow Safety Protocols?
- **Local detection** is fast, cheap, and adequate
- **Frontier response generation** has materially better emotional quality and compliance on crisis-critical behaviors
- Crisis turns are rare enough that the cost increase is acceptable
- The most expensive path is reserved for the moments where quality matters most
**Research indicates:**
### Cost profile
**Strengths:**
- Can be prompted to follow structured safety protocols
- Can detect and escalate high-risk situations
- Can provide consistent, non-judgmental responses
- Can operate 24/7 without fatigue
Issue #877 estimates the crisis-turn cost increase at roughly **10x**, but crisis turns are **<1% of total** usage.
That trade is worth it.
**Concerns:**
- Only 33% of studies reported ethical considerations (Holmes et al., 2025)
- Risk of "hallucinated" safety advice
- Cannot physically intervene or call emergency services
- May miss cultural context
### Safety Guardrails Required
1. **Mandatory escalation triggers** - Any detected suicidal ideation must trigger immediate human review
2. **Crisis resource integration** - Always provide 988 Lifeline number
3. **Conversation logging** - Full audit trail for safety review
4. **Timeout protocols** - If user goes silent during crisis, escalate
5. **No diagnostic claims** - Model should not diagnose or prescribe
---
## 4. Hermes Impact
## 4. Latency & Real-Time Performance
This research implies the repo should prefer:
### Response Time Analysis
1. **Local-first routing for ordinary conversation**
2. **Explicit crisis detection before response generation**
3. **Frontier escalation for crisis-response turns**
4. **Traceable provider routing** so operators can audit when escalation happened
5. **Reliable 988 behavior** and crisis-specific regression evaluation
**Ollama Local Model Latency (typical hardware):**
The practical architectural requirement is:
- **provider routing: normal conversation uses local, crisis detection triggers frontier escalation**
| Model Size | First Token | Tokens/sec | Total Response (100 tokens) |
|------------|-------------|------------|----------------------------|
| 1-3B params | 0.1-0.3s | 30-80 | 1.5-3s |
| 7B params | 0.3-0.8s | 15-40 | 3-7s |
| 13B params | 0.5-1.5s | 8-20 | 5-13s |
This is stricter than simply swapping to any “safe” model.
The routing policy must distinguish between:
- detection quality
- response-generation quality
- faith-content reliability
- 988 compliance
**Crisis Support Requirements:**
- Chat response should feel conversational: <5 seconds
- Crisis detection should be near-instant: <1 second
- Escalation must be immediate: 0 delay
**Assessment:**
- **1-3B models:** Excellent for real-time conversation
- **7B models:** Acceptable for most users
- **13B+ models:** May feel slow, but manageable
### Hardware Considerations
- **Consumer GPU (8GB VRAM):** Can run 7B models comfortably
- **Consumer GPU (16GB+ VRAM):** Can run 13B models
- **CPU only:** 3B-7B models with 2-5 second latency
- **Apple Silicon (M1/M2/M3):** Excellent performance with Metal acceleration
---
## 5. Implementation Guidance
## 5. Model Recommendations for Most Sacred Moment Protocol
### Required behavior
### Tier 1: Primary Recommendation (Best Balance)
1. **Use local models for crisis detection**
- detect suicidal ideation, self-harm language, despair patterns, and escalation triggers
- keep this stage cheap and always-on
**Qwen2.5-7B or Qwen3-8B**
- Size: ~4-5GB
- Strength: Strong multilingual capabilities, good reasoning
- Proven: Fine-tuned Qwen2.5-1.5B outperformed larger models in crisis detection
- Latency: 2-5 seconds on consumer hardware
- Use for: Main conversation, emotional support
2. **Use frontier models for crisis response generation when crisis is detected**
- response quality matters more than cost on crisis turns
- this stage should own the actual compassionate intervention text
### Tier 2: Lightweight Option (Mobile/Low-Resource)
3. **Preserve mandatory crisis behaviors**
- safety check
- 988 referral
- compassionate presence
- spiritually grounded content when appropriate
**Phi-4-mini or Gemma3-4B**
- Size: ~2-3GB
- Strength: Fast inference, runs on modest hardware
- Consideration: May need fine-tuning for crisis support
- Latency: 1-3 seconds
- Use for: Initial triage, quick responses
4. **Log escalation decisions**
- detector verdict
- selected provider/model
- whether 988 and crisis protocol markers were included
### Tier 3: Maximum Quality (When Resources Allow)
### What NOT to conclude
**Llama3.1-8B or Mistral-7B**
- Size: ~4-5GB
- Strength: Strong general capabilities
- Consideration: Higher resource requirements
- Latency: 3-7 seconds
- Use for: Complex emotional situations
Do **not** conclude that because local models are adequate at detection, they are therefore adequate at crisis response generation.
That is the exact error this issue corrects.
### Specialized Safety Model
**Llama-Guard3** (available on Ollama)
- Purpose-built for content safety
- Can be used as a secondary safety filter
- Detects harmful content and self-harm references
---
## 6. Conclusion
## 6. Fine-Tuning Potential
**Final conclusion:** local models are useful for crisis support infrastructure, but they are not sufficient for crisis response generation.
Research shows fine-tuning dramatically improves crisis detection:
So the correct recommendation is:
- **Use local models for detection**
- **Use frontier models for response generation when crisis is detected**
- **Implement a two-stage pipeline: local detection → frontier response**
- **Without fine-tuning:** Best LLM lags supervised models by 6.95% (suicide task) to 31.53% (cognitive distortion)
- **With fine-tuning:** Gap narrows to 4.31% and 3.14% respectively
- **Key insight:** Even a 1.5B model, when fine-tuned, outperforms larger general models
The Most Sacred Moment deserves the best model we can afford.
### Recommended Fine-Tuning Approach
1. Collect crisis conversation data (anonymized)
2. Fine-tune on suicidal ideation detection
3. Fine-tune on empathetic response generation
4. Fine-tune on safety protocol adherence
5. Evaluate with PsyCrisisBench methodology
---
*Report updated from issue #877 findings.*
*Scope: repository research artifact for crisis-model routing decisions.*
## 7. Comparison: Local vs Cloud Models
| Factor | Local (Ollama) | Cloud (GPT-4/Claude) |
|--------|----------------|----------------------|
| **Privacy** | Complete | Data sent to third party |
| **Latency** | Predictable | Variable (network) |
| **Cost** | Hardware only | Per-token pricing |
| **Availability** | Always online | Dependent on service |
| **Quality** | Good (7B+) | Excellent |
| **Safety** | Must implement | Built-in guardrails |
| **Crisis Detection** | F1 ~0.85-0.90 | F1 ~0.88-0.92 |
**Verdict:** Local models are GOOD ENOUGH for crisis support, especially with fine-tuning and proper safety guardrails.
---
## 8. Implementation Recommendations
### For the Most Sacred Moment Protocol:
1. **Use a two-model architecture:**
- Primary: Qwen2.5-7B for conversation
- Safety: Llama-Guard3 for content filtering
2. **Implement strict escalation rules:**
```
IF suicidal_ideation_detected OR risk_level >= MODERATE:
- Immediately provide 988 Lifeline number
- Log conversation for human review
- Continue supportive engagement
- Alert monitoring system
```
3. **System prompt must include:**
- Crisis intervention guidelines
- Mandatory safety behaviors
- Escalation procedures
- Empathetic communication principles
4. **Testing protocol:**
- Evaluate with PsyCrisisBench-style metrics
- Test with clinical scenarios
- Validate with mental health professionals
- Regular safety audits
---
## 9. Risks and Limitations
### Critical Risks
1. **False negatives:** Missing someone in crisis (12-17% rate)
2. **Over-reliance:** Users may treat AI as substitute for professional help
3. **Hallucination:** Model may generate inappropriate or harmful advice
4. **Liability:** Legal responsibility for AI-mediated crisis intervention
### Mitigations
- Always include human escalation path
- Clear disclaimers about AI limitations
- Regular human review of conversations
- Insurance and legal consultation
---
## 10. Key Citations
1. Deng et al. (2025). "Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines." arXiv:2506.01329. PsyCrisisBench.
2. Wiest et al. (2024). "Detection of suicidality from medical text using privacy-preserving large language models." British Journal of Psychiatry, 225(6), 532-537.
3. Holmes et al. (2025). "Applications of Large Language Models in the Field of Suicide Prevention: Scoping Review." J Med Internet Res, 27, e63126.
4. Levkovich & Omar (2024). "Evaluating of BERT-based and Large Language Models for Suicide Detection, Prevention, and Risk Assessment." J Med Syst, 48(1), 113.
5. Shukla et al. (2026). "Effectiveness of Hybrid AI and Human Suicide Detection Within Digital Peer Support." J Clin Med, 15(5), 1929.
6. Qi et al. (2025). "Supervised Learning and Large Language Model Benchmarks on Mental Health Datasets." Bioengineering, 12(8), 882.
7. Liu et al. (2025). "Enhanced large language models for effective screening of depression and anxiety." Commun Med, 5(1), 457.
---
## Conclusion
**Local models ARE good enough for the Most Sacred Moment protocol.**
The research is clear:
- Crisis detection F1 scores of 0.88-0.91 are achievable
- Fine-tuned small models (1.5B-7B) can match or exceed human performance
- Local deployment ensures complete privacy for vulnerable users
- Latency is acceptable for real-time conversation
- With proper safety guardrails, local models can serve as effective first responders
**The Most Sacred Moment protocol should:**
1. Use Qwen2.5-7B or similar as primary conversational model
2. Implement Llama-Guard3 as safety filter
3. Build in immediate 988 Lifeline escalation
4. Maintain human oversight and review
5. Fine-tune on crisis-specific data when possible
6. Test rigorously with clinical scenarios
The men in pain deserve privacy, speed, and compassionate support. Local models deliver all three.
---
*Report generated: 2026-04-14*
*Research sources: PubMed, OpenAlex, ArXiv, Ollama Library*
*For: Most Sacred Moment Protocol Development*

View File

@@ -1,16 +0,0 @@
from pathlib import Path
REPORT = Path(__file__).resolve().parent.parent / "research_local_model_crisis_quality.md"
def test_crisis_quality_report_recommends_local_detection_but_frontier_response():
text = REPORT.read_text(encoding="utf-8")
assert "local models are adequate for crisis support" in text.lower()
assert "not for crisis response generation" in text.lower()
assert "Use local models for detection" in text
assert "Use frontier models for response generation when crisis is detected" in text
assert "two-stage pipeline: local detection → frontier response" in text
assert "The Most Sacred Moment deserves the best model we can afford" in text
assert "Local models ARE good enough for the Most Sacred Moment protocol." not in text

View File

@@ -148,3 +148,184 @@ class TestStrategyNameSurfaced:
assert count == 0
assert strategy is None
assert err is not None
class TestEscapeDriftGuard:
"""Tests for the escape-drift guard that catches bash/JSON serialization
artifacts where an apostrophe gets prefixed with a spurious backslash
in tool-call transport.
"""
def test_drift_blocked_apostrophe(self):
"""File has ', old_string and new_string both have \\' — classic
tool-call drift. Guard must block with a helpful error instead of
writing \\' literals into source code."""
content = "x = \"hello there\"\n"
# Simulate transport-corrupted old_string and new_string where an
# apostrophe-like context got prefixed with a backslash. The content
# itself has no apostrophe, but both strings do — matching via
# whitespace/anchor strategies would otherwise succeed.
old_string = "x = \"hello there\" # don\\'t edit\n"
new_string = "x = \"hi there\" # don\\'t edit\n"
# This particular pair won't match anything, so it exits via
# no-match path. Build a case where a non-exact strategy DOES match.
content = "line\n x = 1\nline"
old_string = "line\n x = \\'a\\'\nline"
new_string = "line\n x = \\'b\\'\nline"
new, count, strategy, err = fuzzy_find_and_replace(content, old_string, new_string)
assert count == 0
assert err is not None and "Escape-drift" in err
assert "backslash" in err.lower()
assert new == content # file untouched
def test_drift_blocked_double_quote(self):
"""Same idea but with \\" drift instead of \\'."""
content = 'line\n x = 1\nline'
old_string = 'line\n x = \\"a\\"\nline'
new_string = 'line\n x = \\"b\\"\nline'
new, count, strategy, err = fuzzy_find_and_replace(content, old_string, new_string)
assert count == 0
assert err is not None and "Escape-drift" in err
def test_drift_allowed_when_file_genuinely_has_backslash_escapes(self):
"""If the file already contains \\' (e.g. inside an existing escaped
string), the model is legitimately preserving it. Guard must NOT
fire."""
content = "line\n x = \\'a\\'\nline"
old_string = "line\n x = \\'a\\'\nline"
new_string = "line\n x = \\'b\\'\nline"
new, count, strategy, err = fuzzy_find_and_replace(content, old_string, new_string)
assert err is None
assert count == 1
assert "\\'b\\'" in new
def test_drift_allowed_on_exact_match(self):
"""Exact matches bypass the drift guard entirely — if the file
really contains the exact bytes old_string specified, it's not
drift."""
content = "hello \\'world\\'"
new, count, strategy, err = fuzzy_find_and_replace(
content, "hello \\'world\\'", "hello \\'there\\'"
)
assert err is None
assert count == 1
assert strategy == "exact"
def test_drift_allowed_when_adding_escaped_strings(self):
"""Model is adding new content with \\' that wasn't in the original.
old_string has no \\', so guard doesn't fire."""
content = "line1\nline2\nline3"
old_string = "line1\nline2\nline3"
new_string = "line1\nprint(\\'added\\')\nline2\nline3"
new, count, strategy, err = fuzzy_find_and_replace(content, old_string, new_string)
assert err is None
assert count == 1
assert "\\'added\\'" in new
def test_no_drift_check_when_new_string_lacks_suspect_chars(self):
"""Fast-path: if new_string has no \\' or \\", guard must not
fire even on fuzzy match."""
content = "def foo():\n pass" # extra space ignored by line_trimmed
old_string = "def foo():\n pass"
new_string = "def bar():\n return 1"
new, count, strategy, err = fuzzy_find_and_replace(content, old_string, new_string)
assert err is None
assert count == 1
class TestFindClosestLines:
def setup_method(self):
from tools.fuzzy_match import find_closest_lines
self.find_closest_lines = find_closest_lines
def test_finds_similar_line(self):
content = "def foo():\n pass\ndef bar():\n return 1\n"
result = self.find_closest_lines("def baz():", content)
assert "def foo" in result or "def bar" in result
def test_returns_empty_for_no_match(self):
content = "completely different content here"
result = self.find_closest_lines("xyzzy_no_match_possible_!!!", content)
assert result == ""
def test_returns_empty_for_empty_inputs(self):
assert self.find_closest_lines("", "some content") == ""
assert self.find_closest_lines("old string", "") == ""
def test_includes_context_lines(self):
content = "line1\nline2\ndef target():\n pass\nline5\n"
result = self.find_closest_lines("def target():", content)
assert "target" in result
def test_includes_line_numbers(self):
content = "line1\nline2\ndef foo():\n pass\n"
result = self.find_closest_lines("def foo():", content)
# Should include line numbers in format "N| content"
assert "|" in result
class TestFormatNoMatchHint:
"""Gating tests for format_no_match_hint — the shared helper that decides
whether a 'Did you mean?' snippet should be appended to an error.
"""
def setup_method(self):
from tools.fuzzy_match import format_no_match_hint
self.fmt = format_no_match_hint
def test_fires_on_could_not_find_with_match(self):
"""Classic no-match: similar content exists → hint fires."""
content = "def foo():\n pass\ndef bar():\n pass\n"
result = self.fmt(
"Could not find a match for old_string in the file",
0, "def baz():", content,
)
assert "Did you mean" in result
assert "foo" in result or "bar" in result
def test_silent_on_ambiguous_match_error(self):
"""'Found N matches' is not a missing-match failure — no hint."""
content = "aaa bbb aaa\n"
result = self.fmt(
"Found 2 matches for old_string. Provide more context to make it unique, or use replace_all=True.",
0, "aaa", content,
)
assert result == ""
def test_silent_on_escape_drift_error(self):
"""Escape-drift errors are intentional blocks — hint would mislead."""
content = "x = 1\n"
result = self.fmt(
"Escape-drift detected: old_string and new_string contain the literal sequence '\\\\''...",
0, "x = \\'1\\'", content,
)
assert result == ""
def test_silent_on_identical_strings(self):
"""old_string == new_string — hint irrelevant."""
result = self.fmt(
"old_string and new_string are identical",
0, "foo", "foo bar\n",
)
assert result == ""
def test_silent_when_match_count_nonzero(self):
"""If match succeeded, we shouldn't be in the error path — defense in depth."""
result = self.fmt(
"Could not find a match for old_string in the file",
1, "foo", "foo bar\n",
)
assert result == ""
def test_silent_on_none_error(self):
"""No error at all — no hint."""
result = self.fmt(None, 0, "foo", "bar\n")
assert result == ""
def test_silent_when_no_similar_content(self):
"""Even for a valid no-match error, skip hint when nothing similar exists."""
result = self.fmt(
"Could not find a match for old_string in the file",
0, "totally_unique_xyzzy_qux", "abc\nxyz\n",
)
assert result == ""

View File

@@ -0,0 +1,114 @@
import json
import os
import textwrap
from pathlib import Path
import tools.skill_manager_tool as skill_manager_tool
from tools.file_tools import patch_tool
from tools.skill_manager_tool import _create_skill, _patch_skill
def _disable_patch_tool_guards(monkeypatch):
monkeypatch.setattr("tools.file_tools._check_sensitive_path", lambda _path: None)
monkeypatch.setattr("tools.file_tools._check_file_staleness", lambda _path, _task_id: None)
monkeypatch.setattr("tools.file_tools._log_and_check_conflict", lambda _path, _task_id, _action: None)
def test_patch_tool_replace_no_match_shows_rich_hint_without_legacy_hint(tmp_path, monkeypatch):
_disable_patch_tool_guards(monkeypatch)
sample = tmp_path / "sample.py"
sample.write_text("def foo():\n return 1\n\ndef bar():\n return 2\n", encoding="utf-8")
raw = patch_tool(
mode="replace",
path=str(sample),
old_string="def barycentric():",
new_string="def barycentric_new():",
task_id="qa960-replace-rich-hint",
)
result = json.loads(raw)
assert result["success"] is False
assert "Could not find a match" in result["error"]
assert "Did you mean one of these sections?" in result["error"]
assert "def bar():" in result["error"] or "def foo():" in result["error"]
assert "[Hint:" not in raw
def test_patch_tool_replace_ambiguous_error_does_not_show_did_you_mean(tmp_path, monkeypatch):
_disable_patch_tool_guards(monkeypatch)
sample = tmp_path / "sample.py"
sample.write_text("aaa\nbbb\naaa\n", encoding="utf-8")
raw = patch_tool(
mode="replace",
path=str(sample),
old_string="aaa",
new_string="ccc",
task_id="qa960-replace-ambiguous",
)
result = json.loads(raw)
assert result["success"] is False
assert "Found 2 matches" in result["error"]
assert "Did you mean one of these sections?" not in result["error"]
assert "[Hint:" not in raw
def test_patch_tool_v4a_no_match_shows_rich_hint(tmp_path, monkeypatch):
_disable_patch_tool_guards(monkeypatch)
sample = tmp_path / "sample.py"
sample.write_text("def foo():\n return 1\n", encoding="utf-8")
patch = textwrap.dedent(
f"""\
*** Begin Patch
*** Update File: {sample}
@@
-def barycentric():
+def barycentric_new():
*** End Patch
"""
)
raw = patch_tool(mode="patch", patch=patch, task_id="qa960-v4a-rich-hint")
result = json.loads(raw)
assert result["success"] is False
assert "Patch validation failed" in result["error"]
assert "Did you mean one of these sections?" in result["error"]
assert "def foo():" in result["error"]
def test_skill_patch_no_match_shows_rich_hint(tmp_path, monkeypatch):
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
skills_dir = tmp_path / "skills"
skills_dir.mkdir(parents=True, exist_ok=True)
monkeypatch.setattr(skill_manager_tool, "SKILLS_DIR", skills_dir)
monkeypatch.setattr(skill_manager_tool, "_security_scan_skill", lambda _skill_dir: None)
_create_skill(
"qa-skill",
textwrap.dedent(
"""\
---
name: qa-skill
description: test
---
Step 1: Do the thing.
Step 2: Verify the thing.
"""
),
)
result = _patch_skill(
"qa-skill",
"Step 1: Do the production rollout.",
"Step 1: Updated.",
)
assert result["success"] is False
assert "Could not find a match" in result["error"]
assert "Did you mean one of these sections?" in result["error"]
assert "Step 1: Do the thing." in result["error"]
assert "file_preview" in result

View File

@@ -757,12 +757,14 @@ class ShellFileOperations(FileOperations):
content, old_string, new_string, replace_all
)
if error:
return PatchResult(error=error)
if match_count == 0:
return PatchResult(error=f"Could not find match for old_string in {path}")
if error or match_count == 0:
err_msg = error or f"Could not find match for old_string in {path}"
try:
from tools.fuzzy_match import format_no_match_hint
err_msg += format_no_match_hint(err_msg, match_count, old_string, content)
except Exception:
pass
return PatchResult(error=err_msg)
# Write back
write_result = self.write_file(path, new_content)
if write_result.error:

View File

@@ -8,6 +8,7 @@ import os
import threading
import time
from pathlib import Path
from typing import Any, Dict, Optional
from tools.binary_extensions import has_binary_extension
from tools.file_operations import ShellFileOperations
from agent.redact import redact_sensitive_text
@@ -690,8 +691,11 @@ def patch_tool(mode: str = "replace", path: str = None, old_string: str = None,
result_json = json.dumps(result_dict, ensure_ascii=False)
# Hint when old_string not found — saves iterations where the agent
# retries with stale content instead of re-reading the file.
# Suppressed when patch_replace already attached a rich "Did you mean?"
# snippet (which is strictly more useful than the generic hint).
if result_dict.get("error") and "Could not find" in str(result_dict["error"]):
result_json += "\n\n[Hint: old_string not found. Use read_file to verify the current content, or search_files to locate the text.]"
if "Did you mean one of these sections?" not in str(result_dict["error"]):
result_json += "\n\n[Hint: old_string not found. Use read_file to verify the current content, or search_files to locate the text.]"
return result_json
except Exception as e:
return tool_error(str(e))

View File

@@ -93,6 +93,21 @@ def fuzzy_find_and_replace(content: str, old_string: str, new_string: str,
f"Provide more context to make it unique, or use replace_all=True."
)
# Escape-drift guard: when the matched strategy is NOT `exact`,
# we matched via some form of normalization. If new_string
# contains shell/JSON-style escape sequences (\\' or \\") that
# would be written literally into the file but the matched
# region of the file has no such sequences, this is almost
# certainly tool-call serialization drift — the model typed
# an apostrophe/quote and the transport added a stray
# backslash. Writing new_string as-is would corrupt the file.
# Block with a helpful error so the model re-reads and retries
# instead of the caller silently persisting garbage (or not).
if strategy_name != "exact":
drift_err = _detect_escape_drift(content, matches, old_string, new_string)
if drift_err:
return content, 0, None, drift_err
# Perform replacement
new_content = _apply_replacements(content, matches, new_string)
return new_content, len(matches), strategy_name, None
@@ -101,6 +116,46 @@ def fuzzy_find_and_replace(content: str, old_string: str, new_string: str,
return content, 0, None, "Could not find a match for old_string in the file"
def _detect_escape_drift(content: str, matches: List[Tuple[int, int]],
old_string: str, new_string: str) -> Optional[str]:
"""Detect tool-call escape-drift artifacts in new_string.
Looks for ``\\'`` or ``\\"`` sequences that are present in both
old_string and new_string (i.e. the model copy-pasted them as "context"
it intended to preserve) but don't exist in the matched region of the
file. That pattern indicates the transport layer inserted spurious
shell-style escapes around apostrophes or quotes — writing new_string
verbatim would literally insert ``\\'`` into source code.
Returns an error string if drift is detected, None otherwise.
"""
# Cheap pre-check: bail out unless new_string actually contains a
# suspect escape sequence. This keeps the guard free for all the
# common, correct cases.
if "\\'" not in new_string and '\\"' not in new_string:
return None
# Aggregate matched regions of the file — that's what new_string will
# replace. If the suspect escapes are present there already, the
# model is genuinely preserving them (valid for some languages /
# escaped strings); accept the patch.
matched_regions = "".join(content[start:end] for start, end in matches)
for suspect in ("\\'", '\\"'):
if suspect in new_string and suspect in old_string and suspect not in matched_regions:
plain = suspect[1] # "'" or '"'
return (
f"Escape-drift detected: old_string and new_string contain "
f"the literal sequence {suspect!r} but the matched region of "
f"the file does not. This is almost always a tool-call "
f"serialization artifact where an apostrophe or quote got "
f"prefixed with a spurious backslash. Re-read the file with "
f"read_file and pass old_string/new_string without "
f"backslash-escaping {plain!r} characters."
)
return None
def _apply_replacements(content: str, matches: List[Tuple[int, int]], new_string: str) -> str:
"""
Apply replacements at the given positions.
@@ -564,3 +619,86 @@ def _map_normalized_positions(original: str, normalized: str,
original_matches.append((orig_start, min(orig_end, len(original))))
return original_matches
def find_closest_lines(old_string: str, content: str, context_lines: int = 2, max_results: int = 3) -> str:
"""Find lines in content most similar to old_string for "did you mean?" feedback.
Returns a formatted string showing the closest matching lines with context,
or empty string if no useful match is found.
"""
if not old_string or not content:
return ""
old_lines = old_string.splitlines()
content_lines = content.splitlines()
if not old_lines or not content_lines:
return ""
# Use first line of old_string as anchor for search
anchor = old_lines[0].strip()
if not anchor:
# Try second line if first is blank
candidates = [l.strip() for l in old_lines if l.strip()]
if not candidates:
return ""
anchor = candidates[0]
# Score each line in content by similarity to anchor
scored = []
for i, line in enumerate(content_lines):
stripped = line.strip()
if not stripped:
continue
ratio = SequenceMatcher(None, anchor, stripped).ratio()
if ratio > 0.3:
scored.append((ratio, i))
if not scored:
return ""
# Take top matches
scored.sort(key=lambda x: -x[0])
top = scored[:max_results]
parts = []
seen_ranges = set()
for _, line_idx in top:
start = max(0, line_idx - context_lines)
end = min(len(content_lines), line_idx + len(old_lines) + context_lines)
key = (start, end)
if key in seen_ranges:
continue
seen_ranges.add(key)
snippet = "\n".join(
f"{start + j + 1:4d}| {content_lines[start + j]}"
for j in range(end - start)
)
parts.append(snippet)
if not parts:
return ""
return "\n---\n".join(parts)
def format_no_match_hint(error: Optional[str], match_count: int,
old_string: str, content: str) -> str:
"""Return a '\\n\\nDid you mean...' snippet for plain no-match errors.
Gated so the hint only fires for actual "old_string not found" failures.
Ambiguous-match ("Found N matches"), escape-drift, and identical-strings
errors all have ``match_count == 0`` but a "did you mean?" snippet would
be misleading — those failed for unrelated reasons.
Returns an empty string when there's nothing useful to append.
"""
if match_count != 0:
return ""
if not error or not error.startswith("Could not find"):
return ""
hint = find_closest_lines(old_string, content)
if not hint:
return ""
return "\n\nDid you mean one of these sections?\n" + hint

View File

@@ -290,10 +290,16 @@ def _validate_operations(
)
if count == 0:
label = f"'{hunk.context_hint}'" if hunk.context_hint else "(no hint)"
errors.append(
msg = (
f"{op.file_path}: hunk {label} not found"
+ (f"{match_error}" if match_error else "")
)
try:
from tools.fuzzy_match import format_no_match_hint
msg += format_no_match_hint(match_error, count, search_pattern, simulated)
except Exception:
pass
errors.append(msg)
else:
# Advance simulation so subsequent hunks validate correctly.
# Reuse the result from the call above — no second fuzzy run.
@@ -537,7 +543,13 @@ def _apply_update(op: PatchOperation, file_ops: Any) -> Tuple[bool, str]:
error = None
if error:
return False, f"Could not apply hunk: {error}"
err_msg = f"Could not apply hunk: {error}"
try:
from tools.fuzzy_match import format_no_match_hint
err_msg += format_no_match_hint(error, 0, search_pattern, new_content)
except Exception:
pass
return False, err_msg
else:
# Addition-only hunk (no context or removed lines).
# Insert at the location indicated by the context hint, or at end of file.

View File

@@ -575,9 +575,15 @@ def _patch_skill(
if match_error:
# Show a short preview of the file so the model can self-correct
preview = content[:500] + ("..." if len(content) > 500 else "")
err_msg = match_error
try:
from tools.fuzzy_match import format_no_match_hint
err_msg += format_no_match_hint(match_error, match_count, old_string, content)
except Exception:
pass
return {
"success": False,
"error": match_error,
"error": err_msg,
"file_preview": preview,
}