
# WASM Inference Module

Run quantized models directly in the browser via WebAssembly.

## Why

- Crisis detection works offline
- No server round-trip
- Privacy: messages never leave the browser

## Architecture

```
Browser Page
  └─ Web Worker (inference-worker.js)
       └─ WASM Module (llama-turbo-wasm.wasm)
            └─ llama-turbo-wasm.c (simplified inference engine)
```

The Web Worker keeps the UI thread responsive; WASM provides near-native speed.
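For orientation, here is a minimal sketch of what `inference-worker.js` might look like. The loader name (`createLlamaModule`) and the exported C symbol (`_load_model`) are assumptions for illustration, not the module's confirmed API:

```js
// Hypothetical worker sketch — loader and export names are assumptions.
importScripts('llama-turbo-wasm.js'); // Emscripten-generated glue code

let Module = null;

self.onmessage = async (e) => {
  const msg = e.data;
  switch (msg.cmd) {
    case 'init':
      Module = await createLlamaModule(); // assumed MODULARIZE'd factory
      self.postMessage({ cmd: 'init', ok: true });
      break;
    case 'load': {
      // Copy the transferred ArrayBuffer into the WASM heap.
      const bytes = new Uint8Array(msg.data);
      const ptr = Module._malloc(bytes.length);
      Module.HEAPU8.set(bytes, ptr);
      const ok = Module._load_model(ptr, bytes.length); // assumed C export
      self.postMessage({ cmd: 'load', ok: !!ok });
      break;
    }
    // 'generate' and 'benchmark' would dispatch similarly.
  }
};
```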

## Target Models

| Model | Size (Q2_K) | Vocab | Dim | Layers | Status |
|---|---|---|---|---|---|
| Falcon-H1-Tiny-90M | ~45MB | 32000 | 256 | 22 | Target |
| Bonsai-1.7B | ~400MB | 32000 | 2048 | 24 | Stretch |

## Build

```bash
source /path/to/emsdk/emsdk_env.sh
bash wasm/build.sh
```
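For reference, this is the kind of `emcc` invocation `build.sh` likely wraps. It is a sketch only: the source file name comes from the architecture diagram above, and every flag here is an assumption — the authoritative version is `wasm/build.sh` itself.

```bash
# Hypothetical core emcc call — flags are assumptions, see wasm/build.sh.
emcc llama-turbo-wasm.c -O3 \
  -s WASM=1 \
  -s MODULARIZE=1 \
  -s ALLOW_MEMORY_GROWTH=1 \
  -o llama-turbo-wasm.js
```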

## Run

Serve the wasm/ directory:

```bash
cd wasm && python3 -m http.server 8080
# Open http://localhost:8080
```
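Note: if the loader uses `WebAssembly.instantiateStreaming`, the server must send `.wasm` files with the `application/wasm` MIME type; recent versions of Python's `http.server` do this out of the box.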

## API (from JavaScript)

```js
const worker = new Worker('inference-worker.js');
worker.onmessage = (e) => console.log(e.data);

// Init
worker.postMessage({cmd: 'init'});

// Load model (ArrayBuffer)
const resp = await fetch('model.bin');
const buf = await resp.arrayBuffer();
worker.postMessage({cmd: 'load', data: buf}, [buf]);

// Generate
worker.postMessage({cmd: 'generate', prompt: 'Hello', maxTokens: 64, temperature: 0.7});

// Benchmark
worker.postMessage({cmd: 'benchmark', runs: 100});
```
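Passing `[buf]` as the transfer list moves the ArrayBuffer to the worker instead of copying it; `buf` is detached (zero-length) on the main thread afterwards. This matters for the ~400MB stretch model, where a copy would double peak memory.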

## Browser Memory Limits

| Browser | WASM Memory | 90M OK? | 1.7B OK? |
|---|---|---|---|
| Chrome | 4GB | Yes | Yes |
| Firefox | 2GB | Yes | Yes |
| Safari | 1GB | Yes | Borderline |
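These caps vary by browser version and platform. A quick, self-contained way to probe the actual per-allocation limit in a given browser is a binary search over `WebAssembly.Memory`, which is allocated in 64 KiB pages:

```js
// Binary-search the largest WebAssembly.Memory this browser will grant.
// Pages are 64 KiB each; 65536 pages is the 4 GiB wasm32 ceiling.
function maxWasmPages() {
  let lo = 0, hi = 65536;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    try {
      new WebAssembly.Memory({ initial: mid });
      lo = mid; // allocation succeeded, try larger
    } catch {
      hi = mid - 1; // too big, try smaller
    }
  }
  return lo;
}

console.log(`~${(maxWasmPages() * 64) / 1024} MiB of WASM memory available`);
```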

## Viability Assessment

See the benchmark results on the demo page after loading a model.
