# WASM Inference Module
Run quantized models directly in the browser via WebAssembly.
## Why
- Crisis detection works offline
- No server round-trip
- Privacy: messages never leave the browser
## Architecture
```
Browser Page
└─ Web Worker (inference-worker.js)
   └─ WASM Module (llama-turbo-wasm.wasm)
      └─ llama-turbo-wasm.c (simplified inference engine)
```
The Web Worker keeps the UI thread responsive while inference runs; WASM provides near-native speed.
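For orientation, here is a minimal sketch of the worker-side dispatch this layering implies. The glue file name, the `createLlamaTurbo` factory, and the `_tq_*` exports are assumptions for illustration; see `inference-worker.js` for the real bindings.

```js
// Hypothetical sketch of inference-worker.js -- names are assumptions, not the real API.
importScripts('llama-turbo-wasm.js'); // Emscripten-generated glue next to the .wasm

let Module = null;

self.onmessage = async (e) => {
  const msg = e.data;
  switch (msg.cmd) {
    case 'init':
      Module = await createLlamaTurbo(); // guessed MODULARIZE factory name
      self.postMessage({cmd: 'init', ok: true});
      break;
    case 'load': {
      // Copy the transferred ArrayBuffer into WASM linear memory
      // (assumes _malloc is in EXPORTED_FUNCTIONS).
      const bytes = new Uint8Array(msg.data);
      const ptr = Module._malloc(bytes.length);
      Module.HEAPU8.set(bytes, ptr);
      Module._tq_load_model(ptr, bytes.length); // hypothetical C export
      self.postMessage({cmd: 'load', ok: true});
      break;
    }
    // 'generate' and 'benchmark' dispatch the same way.
  }
};
```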
## Target Models
| Model | Size (Q2_K) | Vocab | Dim | Layers | Status |
|-------|------------|-------|-----|--------|--------|
| Falcon-H1-Tiny-90M | ~45MB | 32000 | 256 | 22 | Target |
| Bonsai-1.7B | ~400MB | 32000 | 2048 | 24 | Stretch |
## Build
```bash
source /path/to/emsdk/emsdk_env.sh
bash wasm/build.sh
```
## Run
Serve the `wasm/` directory:
```bash
cd wasm && python3 -m http.server 8080
# Open http://localhost:8080
```
## API (from JavaScript)
```js
const worker = new Worker('inference-worker.js');
worker.onmessage = (e) => console.log(e.data);
// Init
worker.postMessage({cmd: 'init'});
// Load model: transfer the ArrayBuffer (zero-copy; buf is detached and unusable afterwards)
const resp = await fetch('model.bin');
const buf = await resp.arrayBuffer();
worker.postMessage({cmd: 'load', data: buf}, [buf]);
// Generate
worker.postMessage({cmd: 'generate', prompt: 'Hello', maxTokens: 64, temperature: 0.7});
// Benchmark
worker.postMessage({cmd: 'benchmark', runs: 100});
```
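The worker answers each command with a `postMessage` of its own, so callers often want a request/response wrapper instead of the fire-and-forget calls above. A small sketch, assuming each reply echoes the `cmd` it answers (adjust to whatever shape `inference-worker.js` actually sends):

```js
// Hypothetical helper -- assumes every worker reply carries the originating cmd.
function request(worker, msg, transfer = []) {
  return new Promise((resolve) => {
    const onReply = (e) => {
      if (e.data.cmd === msg.cmd) {
        worker.removeEventListener('message', onReply);
        resolve(e.data);
      }
    };
    worker.addEventListener('message', onReply);
    worker.postMessage(msg, transfer);
  });
}

// Sequence init -> load -> generate without nesting callbacks.
await request(worker, {cmd: 'init'});
await request(worker, {cmd: 'load', data: buf}, [buf]);
const out = await request(worker, {cmd: 'generate', prompt: 'Hello', maxTokens: 64});
console.log(out);
```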
## Browser Memory Limits
| Browser | WASM Memory | 90M OK? | 1.7B OK? |
|---------|------------|---------|----------|
| Chrome | 4GB | Yes | Yes |
| Firefox | 2GB | Yes | Yes |
| Safari | 1GB | Yes | Borderline |
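These ceilings vary by browser version and platform, so it can be worth probing before downloading a large model file. A best-effort sketch using `WebAssembly.Memory` (WASM pages are 64 KiB; the 600 MiB threshold below is illustrative):

```js
// Ask the browser to reserve `mib` MiB of WASM memory. Some browsers only
// fail later, when the memory actually grows, so treat true as optimistic.
function canReserveWasmMemory(mib) {
  const pages = Math.ceil((mib * 1024 * 1024) / 65536); // 64 KiB per page
  try {
    new WebAssembly.Memory({ initial: 1, maximum: pages });
    return true;
  } catch (err) {
    return false;
  }
}

if (!canReserveWasmMemory(600)) {
  console.warn('Likely not enough memory for Bonsai-1.7B; use the 90M model.');
}
```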
## Viability Assessment
Load a model in the demo page and run the built-in benchmark; the reported results show whether in-browser inference is fast enough on your hardware.