# WASM Inference Module
Run quantized models directly in the browser via WebAssembly.
## Why
- Crisis detection works offline
- No server round-trip
- Privacy: messages never leave the browser
## Architecture

```
Browser Page
└─ Web Worker (inference-worker.js)
   └─ WASM Module (llama-turbo-wasm.wasm)
      └─ llama-turbo-wasm.c (simplified inference engine)
```

The Web Worker keeps the UI thread responsive; WASM provides near-native speed.
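The layering above implies a dispatch loop inside `inference-worker.js`. The following is a minimal sketch of that dispatch logic only, assuming the `cmd`-based messages shown in the API section; the response shapes (`type`, `bytes`, etc.) are assumptions, not part of the documented protocol:

```javascript
// Sketch of the dispatch logic inside inference-worker.js (illustrative only;
// the real worker calls into the compiled llama-turbo-wasm module).
let model = null; // hypothetical handle to the loaded weights

function handleMessage(msg, post) {
  switch (msg.cmd) {
    case 'init':
      // Real code would instantiate the WASM module here.
      post({type: 'ready'});
      break;
    case 'load':
      // msg.data is a transferred ArrayBuffer holding the model weights.
      model = new Uint8Array(msg.data);
      post({type: 'loaded', bytes: model.length});
      break;
    case 'generate':
      // Real code would run token-by-token inference in WASM.
      post({type: 'token', text: '...'});
      break;
    default:
      post({type: 'error', error: `unknown cmd: ${msg.cmd}`});
  }
}

// In the actual worker this would be wired up as:
// self.onmessage = (e) => handleMessage(e.data, (m) => self.postMessage(m));
```

Keeping the dispatch pure (message in, responses out through `post`) makes it testable outside a browser.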
## Target Models
| Model | Size (Q2_K) | Vocab | Dim | Layers | Status |
|---|---|---|---|---|---|
| Falcon-H1-Tiny-90M | ~45MB | 32000 | 256 | 22 | Target |
| Bonsai-1.7B | ~400MB | 32000 | 2048 | 24 | Stretch |
## Build

```sh
source /path/to/emsdk/emsdk_env.sh
bash wasm/build.sh
```
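For orientation, `wasm/build.sh` presumably wraps an `emcc` invocation along these lines. This is a sketch, not the actual script: the output name is taken from the Architecture section, but every flag here is an assumption about a typical Emscripten build:

```shell
# Hypothetical build command; check wasm/build.sh for the real flags.
emcc wasm/llama-turbo-wasm.c \
  -O3 \
  -s WASM=1 \
  -s ALLOW_MEMORY_GROWTH=1 \   # model sizes vary, so let the heap grow
  -s MODULARIZE=1 \            # export a factory the worker can instantiate
  -o wasm/llama-turbo-wasm.js  # emits both the .js glue and the .wasm binary
```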
## Run

Serve the wasm/ directory over HTTP (browsers block `fetch()` of the .wasm binary from file:// URLs):

```sh
cd wasm && python3 -m http.server 8080
# Open http://localhost:8080
```
## API (from JavaScript)

```js
const worker = new Worker('inference-worker.js');
worker.onmessage = (e) => console.log(e.data);

// Init
worker.postMessage({cmd: 'init'});

// Load model (ArrayBuffer, transferred to the worker to avoid a copy)
const resp = await fetch('model.bin');
const buf = await resp.arrayBuffer();
worker.postMessage({cmd: 'load', data: buf}, [buf]);

// Generate
worker.postMessage({cmd: 'generate', prompt: 'Hello', maxTokens: 64, temperature: 0.7});

// Benchmark
worker.postMessage({cmd: 'benchmark', runs: 100});
```
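Since `postMessage` is fire-and-forget, callers usually wrap each command in a promise that resolves on the matching reply. A sketch under the assumption that worker responses carry a `type` field (that field name is an assumption, not part of the documented API):

```javascript
// Send a command and resolve with the first response of the expected type.
// `transfer` lets callers hand off ArrayBuffers (e.g. the model weights).
function request(worker, message, expectType, transfer = []) {
  return new Promise((resolve, reject) => {
    const onMsg = (e) => {
      if (e.data && e.data.type === expectType) {
        worker.removeEventListener('message', onMsg);
        resolve(e.data);
      } else if (e.data && e.data.type === 'error') {
        worker.removeEventListener('message', onMsg);
        reject(new Error(e.data.error));
      }
      // Other message types (e.g. streamed tokens) pass through untouched.
    };
    worker.addEventListener('message', onMsg);
    worker.postMessage(message, transfer);
  });
}

// Usage (assumed response types):
// await request(worker, {cmd: 'init'}, 'ready');
// await request(worker, {cmd: 'load', data: buf}, 'loaded', [buf]);
```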
## Browser Memory Limits
| Browser | WASM Memory | 90M OK? | 1.7B OK? |
|---|---|---|---|
| Chrome | 4GB | Yes | Yes |
| Firefox | 2GB | Yes | Yes |
| Safari | 1GB | Yes | Borderline |
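The table's ceilings can also be probed at runtime before attempting a model load. A minimal sketch that checks whether a `WebAssembly.Memory` reservation of a given size succeeds; actual headroom also depends on fragmentation and other open tabs, so treat this as a hint, not a guarantee:

```javascript
// Try to reserve `mb` megabytes of WASM linear memory (64 KiB pages).
// Returns false if the engine refuses the reservation.
function canAllocateWasmMemory(mb) {
  const pages = Math.ceil((mb * 1024 * 1024) / 65536);
  try {
    new WebAssembly.Memory({initial: pages});
    return true;
  } catch (e) {
    return false; // RangeError: size exceeds what the engine allows
  }
}
```

A demo page could call this with the model's expected footprint before kicking off the download, and fall back to the smaller model when it fails.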
## Viability Assessment

Benchmark results are shown on the demo page once a model is loaded.
Closes #104