# WASM Inference Module

Run quantized models directly in the browser via WebAssembly.

## Why

- Crisis detection works offline
- No server round-trip
- Privacy: messages never leave the browser

## Architecture

```
Browser Page
└─ Web Worker (inference-worker.js)
   └─ WASM Module (llama-turbo-wasm.wasm)
      └─ llama-turbo-wasm.c (simplified inference engine)
```

Running inference in a Web Worker keeps the UI thread responsive, and compiling the engine to WASM gives near-native speed.
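
The worker layer in the diagram can be pictured as a small command router. The sketch below is a hypothetical illustration, not the contents of `inference-worker.js`. The `engine` object and its `init`/`load`/`generate`/`benchmark` methods are assumed wrappers around the compiled module, and the `{ ok, error }` reply shape is invented for this example. Only the `cmd` names and their fields come from the API section of this README.

```javascript
// Hypothetical sketch of how inference-worker.js could route commands.
// ASSUMPTION: `engine` wraps the compiled WASM module and exposes async
// init/load/generate/benchmark methods; these names are illustrative only.
function createDispatcher(engine, post) {
  // Run commands strictly one after another, so a `load` posted right after
  // `init` never starts before initialization has finished.
  let queue = Promise.resolve();

  async function handle(msg) {
    try {
      switch (msg.cmd) {
        case 'init':
          await engine.init();
          post({ cmd: 'init', ok: true });
          break;
        case 'load':
          // msg.data is the ArrayBuffer transferred from the page.
          await engine.load(new Uint8Array(msg.data));
          post({ cmd: 'load', ok: true });
          break;
        case 'generate': {
          const text = await engine.generate(msg.prompt, msg.maxTokens, msg.temperature);
          post({ cmd: 'generate', ok: true, text });
          break;
        }
        case 'benchmark': {
          const result = await engine.benchmark(msg.runs);
          post({ cmd: 'benchmark', ok: true, result });
          break;
        }
        default:
          throw new Error(`unknown command: ${msg.cmd}`);
      }
    } catch (err) {
      // Report failures to the page instead of failing silently in the worker.
      post({ cmd: msg.cmd, ok: false, error: err instanceof Error ? err.message : String(err) });
    }
  }

  // Each call returns a promise that settles once this command has been handled.
  return (msg) => (queue = queue.then(() => handle(msg)));
}
```

Inside a real worker this would be wired up as `const dispatch = createDispatcher(engine, (m) => self.postMessage(m)); self.onmessage = (e) => dispatch(e.data);`. The promise queue matters because the page posts `init`, `load`, and `generate` back to back without waiting for replies.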

## Target Models

| Model | Size (Q2_K) | Vocab | Model dim | Layers | Status |
|-------|-------------|-------|-----------|--------|--------|
| Falcon-H1-Tiny-90M | ~45MB | 32000 | 256 | 22 | Target |
| Bonsai-1.7B | ~400MB | 32000 | 2048 | 24 | Stretch |

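## Build

```bash
# Activate the Emscripten toolchain (puts emcc on PATH); adjust the path to your emsdk checkout
source /path/to/emsdk/emsdk_env.sh
# Build the WASM module (run from the repository root)
bash wasm/build.sh
```
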
## Run

Serve the `wasm/` directory over HTTP (loading the worker and model from a `file://` URL generally fails):

```bash
cd wasm && python3 -m http.server 8080
# Open http://localhost:8080
```

## API (from JavaScript)

```js
const worker = new Worker('inference-worker.js');
worker.onmessage = (e) => console.log(e.data);

// Init
worker.postMessage({cmd: 'init'});

// Load model (ArrayBuffer). Top-level `await` only works in a module script
// (<script type="module">); in a classic script, wrap these lines in an async function.
const resp = await fetch('model.bin');
if (!resp.ok) throw new Error(`Model download failed: HTTP ${resp.status}`);
const buf = await resp.arrayBuffer();
// Listing `buf` as a transferable moves it to the worker without copying;
// it is detached (byteLength 0) on this thread afterwards.
worker.postMessage({cmd: 'load', data: buf}, [buf]);

// Generate
worker.postMessage({cmd: 'generate', prompt: 'Hello', maxTokens: 64, temperature: 0.7});

// Benchmark
worker.postMessage({cmd: 'benchmark', runs: 100});
```
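
The snippet above logs every reply without tying it to the command that produced it. A minimal promise-based wrapper is sketched below under two assumptions this README does not state: the worker answers each command with exactly one message, and a failed command replies with `{ ok: false, error }`. `InferenceClient` and its methods are hypothetical names. Because messages between a page and a single worker are delivered in order, a FIFO queue of pending promises is enough to match replies to requests.

```javascript
// Hypothetical promise wrapper around the worker protocol in this README.
// ASSUMPTIONS: one reply per command, in send order; failures reply
// with { ok: false, error }.
class InferenceClient {
  constructor(worker) {
    this.worker = worker;
    this.pending = []; // FIFO of { resolve, reject } awaiting replies
    worker.onmessage = (e) => {
      const next = this.pending.shift();
      if (!next) return; // unsolicited message; ignored in this sketch
      if (e.data && e.data.ok === false) next.reject(new Error(e.data.error));
      else next.resolve(e.data);
    };
  }

  send(msg, transfer = []) {
    return new Promise((resolve, reject) => {
      const entry = { resolve, reject };
      this.pending.push(entry);
      try {
        this.worker.postMessage(msg, transfer);
      } catch (err) {
        // postMessage can throw synchronously (e.g. uncloneable data); free the slot.
        this.pending.splice(this.pending.indexOf(entry), 1);
        reject(err);
      }
    });
  }

  init() { return this.send({ cmd: 'init' }); }
  // Transfers `buf` to the worker; it is unusable on this thread afterwards.
  load(buf) { return this.send({ cmd: 'load', data: buf }, [buf]); }
  generate(prompt, maxTokens = 64, temperature = 0.7) {
    return this.send({ cmd: 'generate', prompt, maxTokens, temperature });
  }
  benchmark(runs = 100) { return this.send({ cmd: 'benchmark', runs }); }
}
```

For example, inside an async function: `const client = new InferenceClient(new Worker('inference-worker.js')); await client.init();`. If the worker streams tokens as separate messages during `generate`, this one-reply-per-command scheme breaks and per-request IDs would be needed instead.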

## Browser Memory Limits

| Browser | WASM memory limit | Falcon-H1-Tiny-90M fits? | Bonsai-1.7B fits? |
|---------|-------------------|--------------------------|-------------------|
| Chrome | 4GB | Yes | Yes |
| Firefox | 2GB | Yes | Yes |
| Safari | 1GB | Yes | Borderline |
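
These limits are easiest to check in WebAssembly's own unit: linear memory grows in 64 KiB pages, `WebAssembly.Memory` takes its `initial` and `maximum` sizes in pages, and a 32-bit module tops out at 65,536 pages (4 GiB). The sketch below converts bytes to pages and checks a model against a limit. `fitsInWasmMemory` and its `overheadBytes` parameter (a stand-in for KV cache, activations, and scratch buffers) are hypothetical; the overhead has to be measured for each model.

```javascript
// WebAssembly linear memory is allocated in 64 KiB pages.
const WASM_PAGE_BYTES = 64 * 1024;

// Number of whole pages needed to hold `bytes`.
function pagesFor(bytes) {
  return Math.ceil(bytes / WASM_PAGE_BYTES);
}

// Hypothetical fit check: model weights plus an assumed runtime overhead,
// compared against a per-browser WASM memory limit (all in bytes).
function fitsInWasmMemory(modelBytes, overheadBytes, limitBytes) {
  const needed = pagesFor(modelBytes + overheadBytes);
  const available = Math.floor(limitBytes / WASM_PAGE_BYTES);
  return { needed, available, fits: needed <= available };
}
```

The C engine can only read weights that live in linear memory, so the model is copied into the WASM heap. Until the transferred `ArrayBuffer` is released, it is a second copy that counts toward the tab's overall memory, though not toward the WASM limit itself.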

## Viability Assessment

See the benchmark results on the demo page after loading a model.
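
The benchmark's output format is not documented here, so the following is only one hedged way to turn raw timings into a viability number. It assumes each entry is the wall-clock time of one run in milliseconds (for example measured with `performance.now()`) and that every run produces the same number of tokens; `summarizeRuns` is a hypothetical helper.

```javascript
// Hypothetical summary of benchmark timings. ASSUMPTION: each entry is the
// duration of one run in milliseconds, and every run generates `tokensPerRun` tokens.
function summarizeRuns(durationsMs, tokensPerRun = 1) {
  if (durationsMs.length === 0) throw new Error('no benchmark runs');
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const n = sorted.length;
  const mid = Math.floor(n / 2);
  // Median is robust to one-off outliers such as JIT warm-up or GC pauses.
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  // Nearest-rank 95th percentile: a proxy for worst-case latency.
  const p95 = sorted[Math.ceil(0.95 * n) - 1];
  return { median, mean, p95, tokensPerSec: (tokensPerRun * 1000) / median };
}
```

Basing throughput on the median rather than the mean keeps a single slow first run from dragging the figure down, while `p95` still exposes tail latency, which matters more than averages for an interactive crisis-detection check.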

Closes #104