49
benchmarks/bonsai-tool-calling.md
Normal file
@@ -0,0 +1,49 @@
# Tool Calling Test Results — 1-Bit Models

**Status:** Pending execution
**Issue:** #101
**Model:** bonsai-1bit (to be tested)
**Backend:** Ollama

## Test Suite

The suite contains 10 test cases across five types (file_read, terminal, web_search, multi_step, schema_parsing):

| # | Test | Type | Difficulty | Description |
|---|------|------|------------|-------------|
| 1 | simple_file_read | file_read | easy | Read README.md with exact path |
| 2 | absolute_path_read | file_read | easy | Read /etc/hostname with absolute path |
| 3 | simple_terminal | terminal | easy | Run `echo hello world` |
| 4 | terminal_ls | terminal | medium | List files in directory |
| 5 | web_search | web_search | easy | Search for a query |
| 6 | read_then_analyze | multi_step | medium | Read file then analyze content |
| 7 | nested_params | schema_parsing | hard | Complex nested parameters |
| 8 | optional_params | schema_parsing | medium | Tool with optional parameters |
| 9 | sequential_calls | multi_step | hard | Multiple tool calls in sequence |
| 10 | no_tool_needed | file_read | easy | No tool needed for simple question |
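
To make the table concrete, here is a minimal sketch of how one row could be represented and checked. It is not taken from `test_tool_calling_1bit.py`: the `ToolCallTestCase` class, the `check_tool_call` helper, the `read_file` tool name, and the `path` argument are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToolCallTestCase:
    """One row of the test table (hypothetical structure, not the real script's)."""
    name: str
    type: str
    difficulty: str
    prompt: str
    expected_tool: Optional[str]  # None means the model should NOT call a tool
    expected_args: Dict[str, Any] = field(default_factory=dict)


def check_tool_call(case: ToolCallTestCase, call: Optional[Dict[str, Any]]) -> bool:
    """Return True if the parsed tool call matches what the test case expects."""
    if case.expected_tool is None:
        # Tests such as no_tool_needed pass only when no call was made.
        return call is None
    if call is None or call.get("name") != case.expected_tool:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False
    # Every expected argument must be present with exactly the expected value.
    return all(args.get(key) == value for key, value in case.expected_args.items())


# Assumed encodings of tests 1 and 10; tool and argument names are illustrative.
simple_file_read = ToolCallTestCase(
    name="simple_file_read",
    type="file_read",
    difficulty="easy",
    prompt="Read README.md",
    expected_tool="read_file",
    expected_args={"path": "README.md"},
)
no_tool_needed = ToolCallTestCase(
    name="no_tool_needed",
    type="file_read",
    difficulty="easy",
    prompt="What is 2 + 2?",
    expected_tool=None,
)
```

Exact-match comparison on arguments is the strictest reasonable choice for the "exact path" tests; a real harness might normalize paths or allow extra optional arguments.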

## Hypothesis

1-bit quantization is expected to severely degrade fine-grained reasoning, so tool calling, which requires precise, schema-conformant JSON output, may not work at all. It is still worth testing, because the field is moving fast.

## Results

*To be filled after running:*
```bash
python3 benchmarks/test_tool_calling_1bit.py --model bonsai-1bit --report benchmarks/bonsai-tool-calling.md --results benchmarks/tool_calling_results.json
```
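
Once the run has produced `benchmarks/tool_calling_results.json`, a pass rate per difficulty level would be the natural first summary. The sketch below assumes the file is a JSON list of records with `difficulty` and `passed` fields; that schema is a guess, not something this report specifies, so the field names would need adjusting to whatever the script actually writes.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_results(path):
    """Load the results file; ASSUMES it holds a JSON list of record objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def pass_rate_by_difficulty(records):
    """Map each difficulty level to a (passed, total) tuple."""
    counts = defaultdict(lambda: [0, 0])
    for record in records:
        bucket = counts[record["difficulty"]]  # assumed field name
        if record["passed"]:                   # assumed field name
            bucket[0] += 1
        bucket[1] += 1
    return {level: (passed, total) for level, (passed, total) in counts.items()}


if __name__ == "__main__":
    results_path = "benchmarks/tool_calling_results.json"
    for level, (passed, total) in pass_rate_by_difficulty(load_results(results_path)).items():
        print(f"{level}: {passed}/{total} passed")
```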

## Failure Modes (Expected)

If tests fail, the likely causes are:
1. **JSON formatting:** The model cannot produce valid JSON tool calls
2. **Parameter extraction:** The model confuses or drops parameters
3. **Schema adherence:** The model ignores tool schema constraints
4. **Consistency:** The model produces different formats across runs
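
The four failure modes above can be told apart mechanically. The following is a rough sketch under stated assumptions: the model's raw output is a JSON string of the form `{"name": ..., "arguments": {...}}`, and the tool schema is reduced to sets of required and optional parameter names. `READ_FILE_SCHEMA`, `classify_failure`, and `is_consistent` are hypothetical names, not part of the test script.

```python
import json

# Hypothetical, simplified schema for one tool: its name and parameter names.
READ_FILE_SCHEMA = {"name": "read_file", "required": {"path"}, "optional": {"encoding"}}


def classify_failure(raw_output, schema):
    """Return a failure-mode label for one raw output string, or None if it passes."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "json_formatting"  # mode 1: output is not parseable JSON
    if not isinstance(call, dict) or not isinstance(call.get("arguments"), dict):
        return "json_formatting"  # valid JSON, but not shaped like a tool call
    if call.get("name") != schema["name"]:
        return "schema_adherence"  # mode 3: wrong or unknown tool name
    given = set(call["arguments"])
    if not schema["required"] <= given:
        return "parameter_extraction"  # mode 2: a required parameter was dropped
    if given - schema["required"] - schema["optional"]:
        return "schema_adherence"  # mode 3: parameter not defined in the schema
    return None


def is_consistent(raw_outputs, schema):
    """Mode 4 check: True if repeated runs all receive the same classification."""
    return len({classify_failure(output, schema) for output in raw_outputs}) <= 1
```

This only checks parameter names, not value types or nesting; validating tests such as `nested_params` would need a fuller JSON Schema check.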

## Alternative Edge Models

If 1-bit quantization proves not viable, candidate alternatives are:
- **Qwen3.5 3B Q4**: Good tool calling, reasonable size
- **Phi-3 Mini**: Strong reasoning, supports function calling
- **Llama 3.2 3B**: Good balance of size and capability