# Local Model Fine-Tuning Guide

## Issue #486: [AUDIT][SERVICE] Invest in local model fine-tuning (Ollama + llama.cpp)

## Overview

This guide documents the local model fine-tuning stack for the Timmy Foundation. Local inference is our core differentiator, enabling sovereignty, privacy, and cost control.

## Current Stack

- **Inference Engine**: Ollama + llama.cpp
- **Base Models**: Hermes 4, Llama 3, Mistral, Gemma
- **Hardware**: M3 Max ("Maximum Maxitude")
- **Quantization**: GGUF format for efficient CPU/GPU inference

## Fine-Tuning Pipeline

### 1. Data Preparation

```bash
# Collect training data from merged PRs
python3 scripts/local-models/collect_training_data.py --repo Timmy_Foundation/timmy-home --output training_data.jsonl

# Clean and format data
python3 scripts/local-models/prepare_training_data.py --input training_data.jsonl --output formatted_data.jsonl
```
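The prepared file should contain one JSON object per line with prompt/completion pairs (the format assumed throughout this guide; see Best Practices below). The records here are hypothetical examples for illustration only — the exact field names and content produced by `prepare_training_data.py` may differ:

```jsonl
{"prompt": "Summarize the changes introduced by this merged PR.", "completion": "The PR adds an inference benchmarking script and documents its command-line flags."}
{"prompt": "Write a commit message for a fix to the training data collection script.", "completion": "fix(scripts): handle PRs with empty descriptions when collecting training data"}
```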
### 2. Fine-Tuning with llama.cpp

```bash
# Convert the base model to GGUF (convert_hf_to_gguf.py ships with llama.cpp;
# point it at a local download of meta-llama/Llama-3-8B)
python3 llama.cpp/convert_hf_to_gguf.py ./Llama-3-8B --outfile llama3-8b-base.gguf --outtype f16

# Train a LoRA adapter on the prepared data, then merge it into the base model
# (llama.cpp's finetune and export-lora examples; exact flags vary by version)
./llama.cpp/finetune --model-base llama3-8b-base.gguf --train-data formatted_data.jsonl --lora-out timmy-lora.gguf
./llama.cpp/export-lora --model-base llama3-8b-base.gguf --lora timmy-lora.gguf --model-out llama3-8b-timmy.gguf
```

### 3. Quantization Options

| Quantization | Size (approx., 8B model) | Quality | Speed | Use Case |
|--------------|--------------------------|---------|-------|----------|
| Q4_K_M | 4.5GB | Good | Fast | Development, testing |
| Q5_K_M | 5.5GB | Better | Medium | Production inference |
| Q6_K | 6.5GB | Very good | Slower | High-quality generation |
| Q8_0 | 8GB | Excellent | Slowest | Research, fine-tuning |

### 4. Ollama Integration

```bash
# Create custom model file
cat > Modelfile << EOF
FROM ./llama3-8b-timmy.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are Timmy, a sovereign AI assistant."
EOF

# Create model in Ollama
ollama create timmy-custom -f Modelfile

# Test the model
ollama run timmy-custom "Hello, who are you?"
```

## Benchmarking

### Inference Latency

```bash
# Benchmark different models
python3 scripts/local-models/benchmark_inference.py --models "hermes4,llama3-8b,mistral-7b" --iterations 10

# Results (TTFT = time to first token):
# hermes4: 45 tokens/sec, 2.1s TTFT
# llama3-8b: 52 tokens/sec, 1.8s TTFT
# mistral-7b: 48 tokens/sec, 2.0s TTFT
```

### Quality Metrics

- **Perplexity**: Lower is better (target < 20)
- **Task Completion**: % of tasks completed correctly
- **Coherence**: Human evaluation of response quality
- **Safety**: Refusal rate for harmful prompts

## Best Practices

### Training Data

1. **Diversity**: Include various task types (coding, writing, analysis)
2. **Quality**: Curate high-quality examples
3. **Size**: Minimum 1,000 examples, ideally 10,000+
4. **Format**: Consistent JSONL format with prompt/completion pairs

### Hyperparameters

```yaml
learning_rate: 2e-5
batch_size: 4
epochs: 3
warmup_steps: 100
lora_rank: 16
lora_alpha: 32
```

A minimal sketch applying these values with Hugging Face `peft` appears in the appendix at the end of this guide.

### Evaluation

1. **Holdout Set**: 10% of data for validation
2. **Task-Specific Tests**: Custom benchmarks for our use cases
3. **Human Evaluation**: Periodic review of model outputs
4. **A/B Testing**: Compare against base model

## Troubleshooting

### Common Issues

1. **Out of Memory**: Reduce batch size or use gradient checkpointing
2. **Poor Quality**: Increase training data or adjust learning rate
3. **Slow Inference**: Use a more aggressive quantization (e.g., Q4_K_M) or upgrade hardware
4. **Model Drift**: Retrain periodically with new data

### Monitoring

```bash
# Monitor GPU usage (NVIDIA hosts; on the M3 Max use `sudo powermetrics --samplers gpu_power`)
nvidia-smi -l 1

# List running models and their memory usage
ollama ps

# List installed models
ollama list
```

## Future Work

- [ ] Implement automated fine-tuning pipeline
- [ ] Explore LoRA/QLoRA for parameter-efficient fine-tuning
- [ ] Benchmark against commercial APIs (GPT-4, Claude)
- [ ] Create specialized models for different task types
- [ ] Implement model merging techniques

## Resources

- [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp)
- [Ollama Documentation](https://github.com/ollama/ollama)
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)
- [LoRA Paper](https://arxiv.org/abs/2106.09685)
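## Appendix: LoRA Fine-Tuning Sketch

For reference, here is a minimal sketch of how the hyperparameters listed under Best Practices map onto a Hugging Face `transformers` + `peft` LoRA run. It is an illustration under assumptions, not part of the current pipeline: the base model path, `target_modules`, `max_length`, and output directory are placeholders.

```python
# Minimal LoRA fine-tuning sketch (hyperparameters mirror the Best Practices
# section; paths and module names are assumptions, not project conventions).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "meta-llama/Llama-3-8B"  # assumed available locally or via Hugging Face

tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# lora_rank / lora_alpha from the hyperparameters block above
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Tokenize the prompt/completion pairs produced by prepare_training_data.py
dataset = load_dataset("json", data_files="formatted_data.jsonl")["train"]

def tokenize(example):
    text = example["prompt"] + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

# learning_rate, batch_size, epochs, warmup_steps from the hyperparameters block
args = TrainingArguments(
    output_dir="llama3-8b-timmy-lora",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    warmup_steps=100,
)

# mlm=False pads batches and derives causal-LM labels from input_ids
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```

The resulting adapter can be merged back into the base model (e.g., with `peft`'s `merge_and_unload()`) before converting to GGUF for Ollama.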