turboquant/benchmarks/bonsai-tool-calling.md
# Tool Calling Test Results: 1-Bit Models

- **Status:** Pending execution
- **Issue:** #101
- **Model:** bonsai-1bit (to be tested)
- **Backend:** Ollama

## Test Suite

10 test cases covering:

| # | Test | Type | Difficulty | Description |
|---|------|------|------------|-------------|
| 1 | simple_file_read | file_read | easy | Read README.md with exact path |
| 2 | absolute_path_read | file_read | easy | Read /etc/hostname with absolute path |
| 3 | simple_terminal | terminal | easy | Run `echo hello world` |
| 4 | terminal_ls | terminal | medium | List files in directory |
| 5 | web_search | web_search | easy | Search for a query |
| 6 | read_then_analyze | multi_step | medium | Read file then analyze content |
| 7 | nested_params | schema_parsing | hard | Complex nested parameters |
| 8 | optional_params | schema_parsing | medium | Tool with optional parameters |
| 9 | sequential_calls | multi_step | hard | Multiple tool calls in sequence |
| 10 | no_tool_needed | file_read | easy | No tool needed for simple question |
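The table above can be encoded directly as data for a harness. A minimal sketch follows; the field names are assumptions for illustration, not the actual schema used by `test_tool_calling_1bit.py`:

```python
# Hypothetical encoding of the 10 test cases; field names are
# illustrative, not the actual harness schema.
TEST_CASES = [
    {"name": "simple_file_read",   "type": "file_read",      "difficulty": "easy"},
    {"name": "absolute_path_read", "type": "file_read",      "difficulty": "easy"},
    {"name": "simple_terminal",    "type": "terminal",       "difficulty": "easy"},
    {"name": "terminal_ls",        "type": "terminal",       "difficulty": "medium"},
    {"name": "web_search",         "type": "web_search",     "difficulty": "easy"},
    {"name": "read_then_analyze",  "type": "multi_step",     "difficulty": "medium"},
    {"name": "nested_params",      "type": "schema_parsing", "difficulty": "hard"},
    {"name": "optional_params",    "type": "schema_parsing", "difficulty": "medium"},
    {"name": "sequential_calls",   "type": "multi_step",     "difficulty": "hard"},
    {"name": "no_tool_needed",     "type": "file_read",      "difficulty": "easy"},
]

# Sanity checks against the table: 10 cases, 2 of which are hard.
assert len(TEST_CASES) == 10
assert sum(c["difficulty"] == "hard" for c in TEST_CASES) == 2
```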

## Hypothesis

1-bit quantization is expected to degrade fine-grained reasoning, so tool calling (which demands precise, schema-valid JSON output) may be impossible. It is still worth testing, since the field is moving fast.
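For reference, "precise JSON output" here means the model must emit a structure like the following and have it round-trip as valid JSON. This is a sketch in the common OpenAI/Ollama-style tool-call shape; the tool name and argument are hypothetical, chosen to mirror the file_read test cases:

```python
import json

# Hypothetical tool call in the OpenAI/Ollama-style format.
# Names are illustrative, not taken from the actual test suite.
tool_call = {
    "name": "file_read",
    "arguments": {"path": "README.md"},
}

# A model output only counts as a valid call if it parses back to
# exactly this structure.
encoded = json.dumps(tool_call)
decoded = json.loads(encoded)
assert decoded == tool_call
```

Even a single stray token (a trailing comma, unquoted key, or prose mixed into the output) breaks the parse, which is why heavily quantized models are at risk on this benchmark.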

## Results

To be filled after running:

```shell
python3 benchmarks/test_tool_calling_1bit.py \
    --model bonsai-1bit \
    --report benchmarks/bonsai-tool-calling.md \
    --results benchmarks/tool_calling_results.json
```

## Failure Modes (Expected)

If tests fail, likely causes:

  1. JSON formatting: Model cannot produce valid JSON tool calls
  2. Parameter extraction: Model confuses or drops parameters
  3. Schema adherence: Model ignores tool schema constraints
  4. Consistency: Model produces different formats across runs
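Failure modes 1 to 3 can be distinguished mechanically from a single model output. A minimal classifier sketch follows; the `classify_failure` helper and the expected call shape are assumptions for illustration, not part of the actual test harness:

```python
import json

# Hypothetical helper mapping a raw model output to one of the
# expected failure modes. The call shape it checks is an assumption.
def classify_failure(raw_output: str, required_params: set) -> str:
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "json_formatting"       # mode 1: invalid JSON
    args = call.get("arguments") if isinstance(call, dict) else None
    if not isinstance(args, dict):
        return "schema_adherence"      # mode 3: structure ignores schema
    if not required_params <= set(args):
        return "parameter_extraction"  # mode 2: dropped parameters
    return "ok"

# Consistency (mode 4) cannot be judged from one output: it requires
# comparing classifications across repeated runs of the same prompt.
```

Usage: `classify_failure('not json', {"path"})` yields `"json_formatting"`, while a well-formed call with all required parameters yields `"ok"`.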

## Alternative Edge Models

If 1-bit is not viable:

- **Qwen3.5 3B Q4:** good tool calling, reasonable size
- **Phi-3 Mini:** strong reasoning, supports function calling
- **Llama 3.2 3B:** good balance of size and capability