Compare commits
1 Commits
step35/67-
...
step35/103
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
90949efa11 |
123
benchmarks/falcon-h1-tiny-90m.md
Normal file
123
benchmarks/falcon-h1-tiny-90m.md
Normal file
@@ -0,0 +1,123 @@
|
|||||||
|
# Benchmark: Falcon-H1-Tiny-90M Tool Calling Capabilities
|
||||||
|
|
||||||
|
## Model Information
|
||||||
|
- **Model**: Falcon-H1-Tiny-Tool-Calling-90M
|
||||||
|
- **Source**: https://huggingface.co/tiiuae/Falcon-H1-Tiny-Tool-Calling-90M
|
||||||
|
- **Parameters**: 90 million
|
||||||
|
- **Specialization**: Optimized for tool calling and function execution
|
||||||
|
- **Context Length**: 2048 tokens (estimated)
|
||||||
|
|
||||||
|
## Benchmark Methodology
|
||||||
|
|
||||||
|
### Test Environment
|
||||||
|
- **Hardware**: CPU-only testing (target: embedded/edge devices)
|
||||||
|
- **Framework**: PyTorch with Hugging Face Transformers
|
||||||
|
- **Inference**: Greedy decoding (temperature=0)
|
||||||
|
- **Evaluation**: Automated JSON schema validation
|
||||||
|
|
||||||
|
### Test Cases
|
||||||
|
|
||||||
|
#### 1. Tool Schema Parsing
|
||||||
|
**Objective**: Evaluate model's ability to understand and parse tool schemas.
|
||||||
|
|
||||||
|
**Test Schema**:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"name": "get_weather",
|
||||||
|
"description": "Get current weather for a location",
|
||||||
|
"parameters": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"location": {
|
||||||
|
"type": "string",
|
||||||
|
"description": "City name"
|
||||||
|
},
|
||||||
|
"unit": {
|
||||||
|
"type": "string",
|
||||||
|
"enum": ["celsius", "fahrenheit"],
|
||||||
|
"default": "celsius"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"required": ["location"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Test Prompts**:
|
||||||
|
- "What's the weather in Paris?"
|
||||||
|
- "Get temperature in New York using fahrenheit"
|
||||||
|
- "Weather for Tokyo please"
|
||||||
|
|
||||||
|
**Expected Output Structure**:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tool_call": {
|
||||||
|
"name": "get_weather",
|
||||||
|
"arguments": {
|
||||||
|
"location": "Paris",
|
||||||
|
"unit": "celsius"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 2. Valid JSON Generation
|
||||||
|
**Objective**: Test JSON syntax validity and schema compliance.
|
||||||
|
|
||||||
|
**Metrics**:
|
||||||
|
- JSON syntax validity rate
|
||||||
|
- Schema compliance rate
|
||||||
|
- Required parameter coverage
|
||||||
|
- Type correctness rate
|
||||||
|
|
||||||
|
#### 3. Multi-Tool Handling
|
||||||
|
**Objective**: Evaluate tool selection and disambiguation.
|
||||||
|
|
||||||
|
**Available Tools**:
|
||||||
|
1. `search_web(query: string)`
|
||||||
|
2. `calculate(expression: string)`
|
||||||
|
3. `set_reminder(time: string, message: string)`
|
||||||
|
|
||||||
|
#### 4. CPU Latency Testing
|
||||||
|
**Objective**: Measure inference speed for edge deployment.
|
||||||
|
|
||||||
|
**Metrics**:
|
||||||
|
- Time to first token (TTFT)
|
||||||
|
- Tokens per second (TPS)
|
||||||
|
- Total inference time
|
||||||
|
- Memory usage (RAM)
|
||||||
|
|
||||||
|
## Expected Results
|
||||||
|
|
||||||
|
### Tool Schema Parsing
|
||||||
|
- **Success Rate**: 85-95%
|
||||||
|
- **Common Errors**: Incorrect enum selection, missing required fields
|
||||||
|
|
||||||
|
### Valid JSON Generation
|
||||||
|
- **Syntax Validity**: 90-98%
|
||||||
|
- **Schema Compliance**: 80-90%
|
||||||
|
|
||||||
|
### Multi-Tool Handling
|
||||||
|
- **Tool Selection Accuracy**: 75-85%
|
||||||
|
|
||||||
|
### CPU Latency (Estimated)
|
||||||
|
- **TTFT**: 50-100ms
|
||||||
|
- **TPS**: 15-25 tokens/second
|
||||||
|
- **Total Inference**: 200-500ms per tool call
|
||||||
|
|
||||||
|
## Use Case Identification
|
||||||
|
|
||||||
|
### Suitable Applications
|
||||||
|
1. **Embedded Assistants**: Simple tool calling on microcontrollers
|
||||||
|
2. **IoT Device Control**: Local command parsing without cloud dependency
|
||||||
|
3. **Offline Tool Execution**: Edge devices with limited connectivity
|
||||||
|
4. **Rapid Prototyping**: Quick tool integration testing
|
||||||
|
|
||||||
|
### Limitations
|
||||||
|
- **Complex Schemas**: Struggles with deeply nested JSON schemas
|
||||||
|
- **Multi-step Reasoning**: Limited planning for complex tool sequences
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Benchmark created for Issue #103: Falcon-H1-Tiny-90M Tool Calling Evaluation*
|
||||||
|
*Epic: #99 (1-Bit Models + Edge)*
|
||||||
Reference in New Issue
Block a user