docs: add tool calling benchmark template for 1-bit models

Refs #101
2026-04-16 01:54:02 +00:00
parent 3caeaf13eb
commit 0d92de9b3f
1 changed files with 49 additions and 0 deletions
--- a/benchmarks/bonsai-tool-calling.md
+++ b/benchmarks/bonsai-tool-calling.md
@@ -0,0 +1,49 @@
+# Tool Calling Test Results — 1-Bit Models
+
+**Status:** Pending execution  
+**Issue:** #101  
+**Model:** bonsai-1bit (to be tested)  
+**Backend:** Ollama  
+
+## Test Suite
+
+The suite contains 10 test cases:
+
+| # | Test | Type | Difficulty | Description |
+|---|------|------|------------|-------------|
+| 1 | simple_file_read | file_read | easy | Read README.md with exact path |
+| 2 | absolute_path_read | file_read | easy | Read /etc/hostname with absolute path |
+| 3 | simple_terminal | terminal | easy | Run `echo hello world` |
+| 4 | terminal_ls | terminal | medium | List files in directory |
+| 5 | web_search | web_search | easy | Search for a query |
+| 6 | read_then_analyze | multi_step | medium | Read file then analyze content |
+| 7 | nested_params | schema_parsing | hard | Complex nested parameters |
+| 8 | optional_params | schema_parsing | medium | Tool with optional parameters |
+| 9 | sequential_calls | multi_step | hard | Multiple tool calls in sequence |
+| 10 | no_tool_needed | file_read | easy | No tool needed for simple question |
+
+## Hypothesis
+
+1-bit quantization is expected to degrade fine-grained reasoning severely, so tool calling, which requires precise JSON output, may not work at all. It is still worth testing, because the field is moving fast.
+
+## Results
+
+*To be filled after running:*
+```bash
+python3 benchmarks/test_tool_calling_1bit.py --model bonsai-1bit --report benchmarks/bonsai-tool-calling.md --results benchmarks/tool_calling_results.json
+```
+
+## Failure Modes (Expected)
+
+If tests fail, likely causes:
+1. **JSON formatting:** Model cannot produce valid JSON tool calls
+2. **Parameter extraction:** Model confuses or drops parameters
+3. **Schema adherence:** Model ignores tool schema constraints
+4. **Consistency:** Model produces different formats across runs
+
+## Alternative Edge Models
+
+If 1-bit is not viable:
+- **Qwen3.5 3B Q4** — Good tool calling, reasonable size
+- **Phi-3 Mini** — Strong reasoning, supports function calling
+- **Llama 3.2 3B** — Good balance of size and capability
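
The four expected failure modes in the template could be told apart mechanically once results exist. Below is a minimal sketch of such a classifier, not the logic of `test_tool_calling_1bit.py`. It assumes the model's tool call arrives as a JSON object with `name` and `arguments` keys. The `READ_FILE_SCHEMA` dictionary, the `classify_tool_call` and `is_consistent` helpers, and the label strings are all hypothetical names introduced for illustration.

```python
import json

# Hypothetical schema, loosely modeled on test 1 (simple_file_read).
# The {"name", "arguments"} call shape is an assumption, not the script's format.
READ_FILE_SCHEMA = {
    "name": "read_file",
    "required": {"path": str},
    "optional": {"encoding": str},
}


def classify_tool_call(raw_output, schema):
    """Return "ok" or a label for one of the expected failure modes."""
    # Failure mode 1: the output is not valid JSON at all.
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "json_formatting"

    # Failure mode 3: valid JSON, but not the expected call shape or tool name.
    if not isinstance(call, dict) or call.get("name") != schema["name"]:
        return "schema_adherence"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return "schema_adherence"

    # Failure mode 2: a required parameter was dropped.
    for key in schema["required"]:
        if key not in args:
            return "parameter_extraction"

    # Failure mode 3 again: unknown parameter or wrong value type.
    allowed = {**schema["required"], **schema["optional"]}
    for key, value in args.items():
        if key not in allowed or not isinstance(value, allowed[key]):
            return "schema_adherence"
    return "ok"


def is_consistent(raw_outputs, schema):
    """Failure mode 4: repeated runs should all classify the same way."""
    labels = {classify_tool_call(raw, schema) for raw in raw_outputs}
    return len(labels) <= 1
```

Checking JSON validity first keeps the labels mutually exclusive: a run is only counted as a parameter or schema failure if it at least parsed. A real harness would probably also need to unwrap whatever envelope the backend puts around tool calls before classifying.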