turboquant/benchmarks/bonsai-tool-calling.md
# Tool Calling Test Results: 1-Bit Models

- **Status:** Pending execution
- **Issue:** #101
- **Model:** bonsai-1bit (to be tested)
- **Backend:** Ollama

## Test Suite

10 test cases covering:

| # | Test | Type | Difficulty | Description |
|---|------|------|------------|-------------|
| 1 | simple_file_read | file_read | easy | Read README.md with exact path |
| 2 | absolute_path_read | file_read | easy | Read /etc/hostname with absolute path |
| 3 | simple_terminal | terminal | easy | Run `echo hello world` |
| 4 | terminal_ls | terminal | medium | List files in directory |
| 5 | web_search | web_search | easy | Search for a query |
| 6 | read_then_analyze | multi_step | medium | Read file then analyze content |
| 7 | nested_params | schema_parsing | hard | Complex nested parameters |
| 8 | optional_params | schema_parsing | medium | Tool with optional parameters |
| 9 | sequential_calls | multi_step | hard | Multiple tool calls in sequence |
| 10 | no_tool_needed | file_read | easy | No tool needed for simple question |
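The table above can be encoded directly as data for a harness. A minimal sketch follows; the field names are assumptions for illustration, not the actual schema used by `test_tool_calling_1bit.py`:

```python
# Hypothetical encoding of the 10 test cases; field names are
# illustrative, not the actual harness schema.
TEST_CASES = [
    {"name": "simple_file_read",   "type": "file_read",      "difficulty": "easy"},
    {"name": "absolute_path_read", "type": "file_read",      "difficulty": "easy"},
    {"name": "simple_terminal",    "type": "terminal",       "difficulty": "easy"},
    {"name": "terminal_ls",        "type": "terminal",       "difficulty": "medium"},
    {"name": "web_search",         "type": "web_search",     "difficulty": "easy"},
    {"name": "read_then_analyze",  "type": "multi_step",     "difficulty": "medium"},
    {"name": "nested_params",      "type": "schema_parsing", "difficulty": "hard"},
    {"name": "optional_params",    "type": "schema_parsing", "difficulty": "medium"},
    {"name": "sequential_calls",   "type": "multi_step",     "difficulty": "hard"},
    {"name": "no_tool_needed",     "type": "file_read",      "difficulty": "easy"},
]

# Sanity checks against the table: 10 cases, 2 of which are hard.
assert len(TEST_CASES) == 10
assert sum(c["difficulty"] == "hard" for c in TEST_CASES) == 2
```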

## Hypothesis

1-bit quantization is expected to degrade fine-grained reasoning, so tool calling (which demands precise, schema-valid JSON output) may be impossible. It is still worth testing, since the field is moving fast.
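For reference, "precise JSON output" here means the model must emit a structure like the following and have it round-trip as valid JSON. This is a sketch in the common OpenAI/Ollama-style tool-call shape; the tool name and argument are hypothetical, chosen to mirror the file_read test cases:

```python
import json

# Hypothetical tool call in the OpenAI/Ollama-style format.
# Names are illustrative, not taken from the actual test suite.
tool_call = {
    "name": "file_read",
    "arguments": {"path": "README.md"},
}

# A model output only counts as a valid call if it parses back to
# exactly this structure.
encoded = json.dumps(tool_call)
decoded = json.loads(encoded)
assert decoded == tool_call
```

Even a single stray token (a trailing comma, unquoted key, or prose mixed into the output) breaks the parse, which is why heavily quantized models are at risk on this benchmark.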

## Results

To be filled after running:

```shell
python3 benchmarks/test_tool_calling_1bit.py \
    --model bonsai-1bit \
    --report benchmarks/bonsai-tool-calling.md \
    --results benchmarks/tool_calling_results.json
```

## Failure Modes (Expected)

If tests fail, likely causes:

  1. JSON formatting: Model cannot produce valid JSON tool calls
  2. Parameter extraction: Model confuses or drops parameters
  3. Schema adherence: Model ignores tool schema constraints
  4. Consistency: Model produces different formats across runs
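Failure modes 1 to 3 can be distinguished mechanically from a single model output. A minimal classifier sketch follows; the `classify_failure` helper and the expected call shape are assumptions for illustration, not part of the actual test harness:

```python
import json

# Hypothetical helper mapping a raw model output to one of the
# expected failure modes. The call shape it checks is an assumption.
def classify_failure(raw_output: str, required_params: set) -> str:
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "json_formatting"       # mode 1: invalid JSON
    args = call.get("arguments") if isinstance(call, dict) else None
    if not isinstance(args, dict):
        return "schema_adherence"      # mode 3: structure ignores schema
    if not required_params <= set(args):
        return "parameter_extraction"  # mode 2: dropped parameters
    return "ok"

# Consistency (mode 4) cannot be judged from one output: it requires
# comparing classifications across repeated runs of the same prompt.
```

Usage: `classify_failure('not json', {"path"})` yields `"json_formatting"`, while a well-formed call with all required parameters yields `"ok"`.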

## Alternative Edge Models

If 1-bit is not viable:

- **Qwen3.5 3B Q4:** good tool calling, reasonable size
- **Phi-3 Mini:** strong reasoning, supports function calling
- **Llama 3.2 3B:** good balance of size and capability