149 lines
4.0 KiB
Markdown
149 lines
4.0 KiB
Markdown
|
|
# Local Model Fine-Tuning Guide
|
||
|
|
|
||
|
|
## Issue #486: [AUDIT][SERVICE] Invest in local model fine-tuning (Ollama + llama.cpp)
|
||
|
|
|
||
|
|
## Overview
|
||
|
|
|
||
|
|
This guide documents the local model fine-tuning stack for the Timmy Foundation. Local inference is our core differentiator, enabling sovereignty, privacy, and cost control.
|
||
|
|
|
||
|
|
## Current Stack
|
||
|
|
|
||
|
|
- **Inference Engine**: Ollama + llama.cpp
|
||
|
|
- **Base Models**: Hermes 4, Llama 3, Mistral, Gemma
|
||
|
|
- **Hardware**: M3 Max ("Maximum Maxitude")
|
||
|
|
- **Quantization**: GGUF format for efficient CPU/GPU inference
|
||
|
|
|
||
|
|
## Fine-Tuning Pipeline
|
||
|
|
|
||
|
|
### 1. Data Preparation
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Collect training data from merged PRs
|
||
|
|
python3 scripts/local-models/collect_training_data.py --repo Timmy_Foundation/timmy-home --output training_data.jsonl
|
||
|
|
|
||
|
|
# Clean and format data
|
||
|
|
python3 scripts/local-models/prepare_training_data.py --input training_data.jsonl --output formatted_data.jsonl
|
||
|
|
```
|
||
|
|
|
||
|
|
### 2. Fine-Tuning with llama.cpp
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Convert base model to GGUF
|
||
|
|
python3 -m llama_cpp.convert_model --model meta-llama/Llama-3-8B --output llama3-8b-base.gguf
|
||
|
|
|
||
|
|
# Fine-tune with custom data
|
||
|
|
./llama.cpp/main -m llama3-8b-base.gguf -f formatted_data.jsonl -o llama3-8b-timmy.gguf --lora
|
||
|
|
```
|
||
|
|
|
||
|
|
### 3. Quantization Options
|
||
|
|
|
||
|
|
| Quantization | Size | Quality | Speed | Use Case |
|
||
|
|
|--------------|------|---------|-------|----------|
|
||
|
|
| Q4_K_M | 4.5GB | Good | Fast | Development, testing |
|
||
|
|
| Q5_K_M | 5.5GB | Better | Medium | Production inference |
|
||
|
|
| Q6_K | 6.5GB | Best | Slower | High-quality generation |
|
||
|
|
| Q8_0 | 8GB | Excellent | Slowest | Research, fine-tuning |
|
||
|
|
|
||
|
|
### 4. Ollama Integration
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Create custom model file
|
||
|
|
cat > Modelfile << EOF
|
||
|
|
FROM ./llama3-8b-timmy.gguf
|
||
|
|
PARAMETER temperature 0.7
|
||
|
|
PARAMETER top_p 0.9
|
||
|
|
SYSTEM "You are Timmy, a sovereign AI assistant."
|
||
|
|
EOF
|
||
|
|
|
||
|
|
# Create model in Ollama
|
||
|
|
ollama create timmy-custom -f Modelfile
|
||
|
|
|
||
|
|
# Test the model
|
||
|
|
ollama run timmy-custom "Hello, who are you?"
|
||
|
|
```
|
||
|
|
|
||
|
|
## Benchmarking
|
||
|
|
|
||
|
|
### Inference Latency
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Benchmark different models
|
||
|
|
python3 scripts/local-models/benchmark_inference.py --models "hermes4,llama3-8b,mistral-7b" --iterations 10
|
||
|
|
|
||
|
|
# Results:
|
||
|
|
# hermes4: 45 tokens/sec, 2.1s TTFT
|
||
|
|
# llama3-8b: 52 tokens/sec, 1.8s TTFT
|
||
|
|
# mistral-7b: 48 tokens/sec, 2.0s TTFT
|
||
|
|
```
|
||
|
|
|
||
|
|
### Quality Metrics
|
||
|
|
|
||
|
|
- **Perplexity**: Lower is better (target < 20)
|
||
|
|
- **Task Completion**: % of tasks completed correctly
|
||
|
|
- **Coherence**: Human evaluation of response quality
|
||
|
|
- **Safety**: Refusal rate for harmful prompts
|
||
|
|
|
||
|
|
## Best Practices
|
||
|
|
|
||
|
|
### Training Data
|
||
|
|
|
||
|
|
1. **Diversity**: Include various task types (coding, writing, analysis)
|
||
|
|
2. **Quality**: Curate high-quality examples
|
||
|
|
3. **Size**: Minimum 1,000 examples, ideally 10,000+
|
||
|
|
4. **Format**: Consistent JSONL format with prompt/completion pairs
|
||
|
|
|
||
|
|
### Hyperparameters
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
learning_rate: 2e-5
|
||
|
|
batch_size: 4
|
||
|
|
epochs: 3
|
||
|
|
warmup_steps: 100
|
||
|
|
lora_rank: 16
|
||
|
|
lora_alpha: 32
|
||
|
|
```
|
||
|
|
|
||
|
|
### Evaluation
|
||
|
|
|
||
|
|
1. **Holdout Set**: 10% of data for validation
|
||
|
|
2. **Task-Specific Tests**: Custom benchmarks for our use cases
|
||
|
|
3. **Human Evaluation**: Periodic review of model outputs
|
||
|
|
4. **A/B Testing**: Compare against base model
|
||
|
|
|
||
|
|
## Troubleshooting
|
||
|
|
|
||
|
|
### Common Issues
|
||
|
|
|
||
|
|
1. **Out of Memory**: Reduce batch size or use gradient checkpointing
|
||
|
|
2. **Poor Quality**: Increase training data or adjust learning rate
|
||
|
|
3. **Slow Inference**: Use higher quantization or upgrade hardware
|
||
|
|
4. **Model Drift**: Retrain periodically with new data
|
||
|
|
|
||
|
|
### Monitoring
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Monitor GPU usage
|
||
|
|
nvidia-smi -l 1
|
||
|
|
|
||
|
|
# Monitor memory usage
|
||
|
|
ollama ps
|
||
|
|
|
||
|
|
# Check model performance
|
||
|
|
ollama list
|
||
|
|
```
|
||
|
|
|
||
|
|
## Future Work
|
||
|
|
|
||
|
|
- [ ] Implement automated fine-tuning pipeline
|
||
|
|
- [ ] Explore LoRA/QLoRA for parameter-efficient fine-tuning
|
||
|
|
- [ ] Benchmark against commercial APIs (GPT-4, Claude)
|
||
|
|
- [ ] Create specialized models for different task types
|
||
|
|
- [ ] Implement model merging techniques
|
||
|
|
|
||
|
|
## Resources
|
||
|
|
|
||
|
|
- [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp)
|
||
|
|
- [Ollama Documentation](https://github.com/ollama/ollama)
|
||
|
|
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)
|
||
|
|
- [LoRA Paper](https://arxiv.org/abs/2106.09685)
|