Compare commits

2 Commits: burn/17-17 ... dispatch/1

| Author | SHA1 | Date |
|--------|------------|------|
|        | 8affe79489 |      |
|        | 319f57780d |      |
3  .gitignore  vendored  Normal file
@@ -0,0 +1,3 @@
build/
*.pyc
__pycache__/
36  CMakeLists.txt  Normal file
@@ -0,0 +1,36 @@
cmake_minimum_required(VERSION 3.16)

project(turboquant LANGUAGES CXX)

option(TURBOQUANT_BUILD_TESTS "Build standalone TurboQuant validation tests" ON)

add_library(turboquant STATIC
    llama-turbo.cpp
)

target_include_directories(turboquant PUBLIC
    ${CMAKE_CURRENT_SOURCE_DIR}
)

target_compile_features(turboquant PUBLIC cxx_std_17)

if(MSVC)
    target_compile_options(turboquant PRIVATE /W4)
else()
    target_compile_options(turboquant PRIVATE -Wall -Wextra -Wpedantic)
endif()

if(TURBOQUANT_BUILD_TESTS)
    include(CTest)

    add_executable(turboquant_roundtrip_test
        tests/roundtrip_test.cpp
    )
    target_link_libraries(turboquant_roundtrip_test PRIVATE turboquant)
    target_compile_features(turboquant_roundtrip_test PRIVATE cxx_std_17)

    add_test(
        NAME turboquant_roundtrip
        COMMAND turboquant_roundtrip_test
    )
endif()
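For reference, a typical out-of-source build and test run against this CMakeLists.txt might look like the following. This is a sketch, not part of the diff; paths and build type are assumptions, and tests build by default because `TURBOQUANT_BUILD_TESTS` is `ON`:

```shell
# Configure an out-of-source build in ./build (build type is an assumption).
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

# Build the turboquant static library and the roundtrip test executable.
cmake --build build

# Run the registered turboquant_roundtrip test through CTest.
# (cd into the build tree for compatibility with CMake 3.16.)
(cd build && ctest --output-on-failure)
```

Passing `-DTURBOQUANT_BUILD_TESTS=OFF` at configure time would skip the test targets entirely.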
@@ -1,154 +0,0 @@
# TurboQuant Initiative Review & Contributor Feedback

**Issue:** #17
**Date:** 2026-04-14
**Reviewer:** Timmy (burn worker)

---

## Executive Summary

The TurboQuant initiative is **on track** with strong Phase 1 results. The 73% KV memory savings with minimal overhead is production-quality. However, the repository activity concern is valid — we need to accelerate from documentation to integration.

## Review Points

### 1. Repository Activity (3 commits)

**Current State:**
- 1 commit in main branch (long-session quality test)
- Implementation files exist but are not yet integrated into llama.cpp

**Recommendation:**
- Create a dedicated integration branch for llama.cpp
- Commit incrementally: shaders first, then CPU reference, then benchmarks
- Target: 10+ commits in the next sprint to demonstrate momentum

### 2. Metal Shaders Integration

**Current State:**
- `ggml-metal-turbo.metal` exists with production-quality kernels
- Full flash attention for turbo2/3/4
- WHT rotation kernels implemented
- Lloyd-Max codebooks hardcoded

**Gap:** Shaders are standalone, not integrated into the main llama.cpp fork.

**Action Items:**
1. Create integration PR to the `TheTom/llama-cpp-turboquant` feature branch
2. Add shader registration in `ggml-metal.m`
3. Update the CMake build to include the new files
4. Add CI validation for shader compilation

### 3. QJL Residual Correction Accuracy

**Current State:**
- QJL infrastructure exists in the Metal shaders
- `TURBO4_USE_4BIT=1` by default (QJL disabled)
- 4-bit PolarQuant delivers 73% savings without QJL

**Assessment:** QJL is **not needed** for current compression targets. The 4-bit PolarQuant already meets quality requirements.

**Oversight Needed:**
- If compression targets drop below 3 bits/channel, QJL becomes necessary
- The current Metal QJL implementation is infrastructure-only (no active kernels)
- Recommend: document QJL as "ready but disabled" and gate it on future need

### 4. Phase 1→2 Transition

**Current State:**
- Phase 1 complete (PolarQuant MVP)
- Phase 2 partially complete (Ollama deferred, llama-server available)
- 12/16 issues resolved

**Blockers:**
- Ollama integration requires a multi-day effort (34 custom patches)
- qwen3.5:27b model not downloaded
- PPL testing needs the wikitext corpus

**Recommendation:**
- Focus on llama-server deployment (immediate value)
- Defer Ollama to Phase 4 / upstream watch
- Download qwen3.5:27b and run production validation

---

## Contributor Feedback

### For @manus (Frequent Updates)

**Current:** PROJECT_STATUS.md is comprehensive but only updated at phase completion.

**Recommendation:**
- Weekly progress updates in issue comments
- Benchmark results as they happen (not batched)
- Blocker escalation within 24 hours

### For @Timmy (Spec Alignment)

**Current:** Build spec v2.2 is well-aligned with the implementation.

**Verification:**
- ✅ WHT rotation matches spec
- ✅ Lloyd-Max codebook matches spec
- ✅ No per-vector normalization (spec requirement)
- ⚠️ CPU turbo4 reference incompatible with Metal (documented)

**Recommendation:** The spec is stable. Focus on implementation velocity.

### For @Rockachopa (QJL Oversight)

**Current:** QJL is disabled by default. No accuracy risk at 4-bit compression.

**Oversight Framework:**
1. Gate QJL enablement on quality metrics (PPL delta ≤ 0.5)
2. Run A/B tests: turbo4 vs turbo4+QJL when QJL kernels are active
3. Monitor for accuracy regression in long sessions (>32K context)

**Recommendation:** The current approach is correct. QJL oversight can remain passive until needed.

---

## Action Items

### Immediate (This Week)
1. [ ] Create llama.cpp integration branch
2. [ ] Commit Metal shaders with registration
3. [ ] Download qwen3.5:27b model
4. [ ] Deploy llama-server for production testing

### Short Term (Next Sprint)
5. [ ] Run PPL test with wikitext corpus
6. [ ] Complete 10-prompt quality matrix
7. [ ] Weekly progress updates in issue comments
8. [ ] John quality sign-off

### Medium Term (Phase 3)
9. [ ] Ollama integration assessment (if upstream doesn't update)
10. [ ] QJL activation if compression needs exceed 4-bit

---

## Risk Assessment

| Risk | Status | Mitigation |
|------|--------|------------|
| Low repo activity | ⚠️ Active | Accelerate commits, weekly updates |
| Metal integration complexity | ✅ Low | Shaders exist, just need registration |
| QJL accuracy | ✅ Low | Disabled by default, gated on metrics |
| Ollama blockage | ⚠️ Active | Use llama-server instead |
| PPL regression | ⏸️ Untested | Download corpus, test in production |

---

## Recommendation

**PROCEED WITH CONFIDENCE.** The technical foundation is solid. The 73% KV savings is production-ready. Focus on:
1. Integration velocity (more commits)
2. Production deployment (llama-server)
3. Quality validation (PPL + prompt matrix)

The transition from spec to implementation is achievable in the next sprint.

---

*Review generated by burn worker for issue #17*
104  tests/roundtrip_test.cpp  Normal file
@@ -0,0 +1,104 @@
#include "llama-turbo.h"

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

namespace {

constexpr int kDim = 128;
constexpr float kCosineThreshold = 0.99f;
constexpr float kZeroTolerance = 1.0e-6f;

[[nodiscard]] bool all_finite(const std::vector<float> & values) {
    for (float value : values) {
        if (!std::isfinite(value)) {
            return false;
        }
    }
    return true;
}

[[nodiscard]] float max_abs(const std::vector<float> & values) {
    float best = 0.0f;
    for (float value : values) {
        best = std::max(best, std::fabs(value));
    }
    return best;
}

[[nodiscard]] float cosine_similarity(const std::vector<float> & lhs, const std::vector<float> & rhs) {
    float dot = 0.0f;
    float lhs_norm = 0.0f;
    float rhs_norm = 0.0f;
    for (int i = 0; i < kDim; ++i) {
        dot += lhs[i] * rhs[i];
        lhs_norm += lhs[i] * lhs[i];
        rhs_norm += rhs[i] * rhs[i];
    }

    const float denom = std::sqrt(lhs_norm) * std::sqrt(rhs_norm);
    return denom == 0.0f ? 1.0f : dot / denom;
}

[[nodiscard]] std::vector<float> roundtrip(const std::vector<float> & input, float & norm_out) {
    // 4-bit codes pack two channels per byte, hence kDim / 2 bytes.
    std::vector<uint8_t> packed(kDim / 2, 0);
    norm_out = -1.0f;
    polar_quant_encode_turbo4(input.data(), packed.data(), &norm_out, kDim);

    std::vector<float> decoded(kDim, 0.0f);
    polar_quant_decode_turbo4(packed.data(), decoded.data(), norm_out, kDim);
    return decoded;
}

void require(bool condition, const std::string & message) {
    if (!condition) {
        throw std::runtime_error(message);
    }
}

void test_zero_vector_roundtrip() {
    std::vector<float> zeros(kDim, 0.0f);
    float norm = -1.0f;
    const auto decoded = roundtrip(zeros, norm);

    require(norm == 0.0f, "zero vector should encode with zero norm");
    require(all_finite(decoded), "zero vector decode produced non-finite values");
    require(max_abs(decoded) <= kZeroTolerance, "zero vector decode should remain near zero");
}

void test_gaussian_roundtrip_quality() {
    std::mt19937 rng(12345);
    std::normal_distribution<float> dist(0.0f, 1.0f);

    std::vector<float> input(kDim, 0.0f);
    for (float & value : input) {
        value = dist(rng);
    }

    float norm = -1.0f;
    const auto decoded = roundtrip(input, norm);

    require(norm > 0.0f, "random vector should encode with positive norm");
    require(all_finite(decoded), "random vector decode produced non-finite values");

    const float cosine = cosine_similarity(input, decoded);
    require(cosine >= kCosineThreshold, "roundtrip cosine similarity below threshold");
}

} // namespace

int main() {
    try {
        test_zero_vector_roundtrip();
        test_gaussian_roundtrip_quality();
        std::cout << "PASS: turboquant standalone roundtrip tests\n";
        return 0;
    } catch (const std::exception & exc) {
        std::cerr << "FAIL: " << exc.what() << '\n';
        return 1;
    }
}