diff --git a/benchmarks/bonsai-tool-calling.md b/benchmarks/bonsai-tool-calling.md
new file mode 100644
index 00000000..528b4be7
--- /dev/null
+++ b/benchmarks/bonsai-tool-calling.md
@@ -0,0 +1,49 @@
+# Tool Calling Test Results — 1-Bit Models
+
+**Status:** Pending execution
+**Issue:** #101
+**Model:** bonsai-1bit (to be tested)
+**Backend:** Ollama
+
+## Test Suite
+
+10 test cases covering:
+
+| # | Test | Type | Difficulty | Description |
+|---|------|------|------------|-------------|
+| 1 | simple_file_read | file_read | easy | Read README.md with exact path |
+| 2 | absolute_path_read | file_read | easy | Read /etc/hostname with absolute path |
+| 3 | simple_terminal | terminal | easy | Run `echo hello world` |
+| 4 | terminal_ls | terminal | medium | List files in directory |
+| 5 | web_search | web_search | easy | Search for a query |
+| 6 | read_then_analyze | multi_step | medium | Read file then analyze content |
+| 7 | nested_params | schema_parsing | hard | Complex nested parameters |
+| 8 | optional_params | schema_parsing | medium | Tool with optional parameters |
+| 9 | sequential_calls | multi_step | hard | Multiple tool calls in sequence |
+| 10 | no_tool_needed | file_read | easy | No tool needed for simple question |
+
+## Hypothesis
+
+1-bit quantization destroys fine-grained reasoning. Tool calling (precise JSON output) may be impossible. But worth testing — the field is moving fast.
+
+## Results
+
+*To be filled after running:*
+```bash
+python3 benchmarks/test_tool_calling_1bit.py --model bonsai-1bit --report benchmarks/bonsai-tool-calling.md --results benchmarks/tool_calling_results.json
+```
+
+## Failure Modes (Expected)
+
+If tests fail, likely causes:
+1. **JSON formatting:** Model cannot produce valid JSON tool calls
+2. **Parameter extraction:** Model confuses or drops parameters
+3. **Schema adherence:** Model ignores tool schema constraints
+4. **Consistency:** Model produces different formats across runs
+
+## Alternative Edge Models
+
+If 1-bit is not viable:
+- **Qwen3.5 3B Q4** — Good tool calling, reasonable size
+- **Phi-3 Mini** — Strong reasoning, supports function calling
+- **Llama 3.2 3B** — Good balance of size and capability
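
The four expected failure modes listed in the patch could be told apart mechanically once results come in. The following is a minimal sketch, not the logic of `test_tool_calling_1bit.py`: the `{"name": ..., "arguments": ...}` call envelope, the `READ_FILE_SCHEMA` definition, and the function names are all assumptions made for illustration.

```python
import json

# Hypothetical tool schema for illustration; not taken from the benchmark script.
READ_FILE_SCHEMA = {
    "name": "read_file",
    "required": {"path": str},
    "optional": {"max_bytes": int},
}


def classify_tool_call(raw_output: str, schema: dict) -> str:
    """Return "ok" or the first failure mode the raw model output hits."""
    # Failure mode 1: the output is not valid JSON, or not a JSON object.
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "json_formatting"
    if not isinstance(call, dict):
        return "json_formatting"

    # Failure mode 3: wrong tool name, or arguments that are not an object.
    args = call.get("arguments")
    if call.get("name") != schema["name"] or not isinstance(args, dict):
        return "schema_adherence"

    # Failure mode 2: a required parameter was dropped.
    for key in schema["required"]:
        if key not in args:
            return "parameter_extraction"

    # Failure mode 3 again: unknown parameters or values of the wrong type.
    allowed = {**schema["required"], **schema["optional"]}
    for key, value in args.items():
        if key not in allowed or not isinstance(value, allowed[key]):
            return "schema_adherence"
    return "ok"


def is_consistent(raw_outputs: list, schema: dict) -> bool:
    """Rough proxy for failure mode 4: every run lands in the same class."""
    return len({classify_tool_call(out, schema) for out in raw_outputs}) == 1
```

For example, `classify_tool_call('read_file(README.md)', READ_FILE_SCHEMA)` returns `"json_formatting"`, while the same call with a well-formed envelope and a `path` string returns `"ok"`. Checking in this order reports only the earliest failure, which keeps per-test results to a single label; a fuller report might collect every violation instead.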