TurboQuant WASM: Compress Vector Indexes 6x and Search Directly in the Browser
Embedding indexes are memory hogs. One million 384-dimensional float32 vectors weigh in at 1.5 GB. On mobile devices, that means minutes of download time and a significant chunk of RAM. TurboQuant WASM shrinks them to ~240 MB (a 6x compression) and lets you search directly on the compressed data without ever decompressing it first.
Built on the Google Research paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate” (ICLR 2026), this package wraps a Zig → WASM build with relaxed SIMD and optional WebGPU acceleration into a ~12 kB gzipped npm module that runs entirely in the browser or Node.js. No server. No training step. No dataset-dependent configuration.
What It Does
🗜️ 6x Compression, No Training
Unlike Product Quantization (PQ/OPQ) methods that require a training pass over your dataset, TurboQuant is online: initialize with dim and seed, then encode any vector immediately. Each compressed vector is self-contained, making it ideal for streaming data, LLM KV caches, and real-time indexing where you cannot pause to build a codebook.
| Scenario | Raw Float32 | TurboQuant | Savings |
|---|---|---|---|
| 1M × 384-dim vectors | 1.5 GB | ~240 MB | ~6x |
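The numbers in the table follow from simple arithmetic. A quick sketch (plain TypeScript, no package needed; the gap between the ~216 MB payload and the quoted ~240 MB is presumably per-vector metadata, which is an assumption here, not a documented figure):

```typescript
// Back-of-the-envelope sizes for 1M × 384-dim vectors.
const numVectors = 1_000_000;
const dim = 384;

// Raw float32: 4 bytes per dimension.
const rawBytes = numVectors * dim * 4; // 1.536 GB

// TurboQuant: ~4.5 bits per dimension. Per-vector metadata would
// add a little on top (exact overhead is internal to the library).
const quantBits = 4.5;
const payloadBytes = numVectors * Math.ceil((dim * quantBits) / 8); // 216 MB

console.log((rawBytes / 1e9).toFixed(2), "GB raw");
console.log((payloadBytes / 1e6).toFixed(0), "MB compressed payload");
console.log((rawBytes / payloadBytes).toFixed(1) + "x before metadata");
```

With a few dozen bytes of metadata per vector, the total lands near the ~240 MB (~6x) figure above.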
⚡ Search Without Decompression
TurboQuant preserves inner products well enough for approximate search. You can run dot() on a single compressed vector, or dotBatch() across an entire index. The batch call automatically detects WebGPU and dispatches a compute shader that scores compressed vectors directly on the GPU. No decompression step, no float32 round-trip.
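`dotBatch()` consumes one flat `Uint8Array` with a fixed stride per vector. A CPU-side sketch of how such a buffer can be assembled and walked; the `scoreFn` callback is a stand-in for `tq.dot`, and neither helper is part of the package:

```typescript
// Concatenate per-vector codes into the flat layout dotBatch() expects.
function concatCodes(codes: Uint8Array[], bytesPerVector: number): Uint8Array {
  const out = new Uint8Array(codes.length * bytesPerVector);
  codes.forEach((code, i) => out.set(code, i * bytesPerVector));
  return out;
}

// Reference loop over that layout: score every stored vector with a
// caller-supplied function (stand-in for tq.dot on real compressed codes).
function scoreAll(
  query: Float32Array,
  concat: Uint8Array,
  bytesPerVector: number,
  scoreFn: (q: Float32Array, code: Uint8Array) => number,
): Float32Array {
  const n = concat.length / bytesPerVector;
  const scores = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    // subarray() is a view, so no bytes are copied per vector.
    const code = concat.subarray(i * bytesPerVector, (i + 1) * bytesPerVector);
    scores[i] = scoreFn(query, code);
  }
  return scores;
}
```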
🔮 Two Substrates, One Algorithm
The library ships two implementations of the same TurboQuant math:
- **WASM SIMD**: the `turboquant-wasm` npm package. CPU path for vector search, image similarity, and 3D Gaussian Splatting compression.
- **WGSL Compute Shaders**: a GPU-native reimplementation for workloads that need real-time throughput. The live Prompt → Diagram demo and the in-browser Gemma 4 E2B LLM both run the algorithm on the GPU so the KV cache stays compressed during inference.
Quick Start
```typescript
import { TurboQuant } from "turboquant-wasm";

const tq = await TurboQuant.init({ dim: 1024, seed: 42 });

// Compress a vector (~4.5 bits/dim)
const compressed = tq.encode(myFloat32Array); // Uint8Array

// Decode back when you need the original
const decoded = tq.decode(compressed); // Float32Array

// Fast dot product without decoding
const score = tq.dot(queryVector, compressed);

// Batch search across an index
const scores = await tq.dotBatch(
  queryVector,
  allCompressed, // concatenated Uint8Array
  bytesPerVector,
);

tq.destroy();
```

`dotBatch()` prefers WebGPU when available (Chrome/Edge 113+) and falls back transparently to WASM SIMD on devices without GPU support.
API
```typescript
class TurboQuant {
  static async init(config: { dim: number; seed: number }): Promise<TurboQuant>;
  encode(vector: Float32Array): Uint8Array;
  decode(compressed: Uint8Array): Float32Array;
  dot(query: Float32Array, compressed: Uint8Array): number;
  dotBatch(
    query: Float32Array,
    compressedConcat: Uint8Array,
    bytesPerVector: number,
  ): Promise<Float32Array>;
  rotateQuery(query: Float32Array): Float32Array;
  destroy(): void;
}
```

- `encode` / `decode`: single-vector compression and reconstruction.
- `dot`: scalar dot product between a float32 query and one compressed vector.
- `dotBatch`: scores a query against many compressed vectors. Auto-detects WebGPU.
- `rotateQuery`: pre-rotates a query for faster repeated batch scoring.
- `destroy`: releases WASM memory.
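The surface above composes naturally into a small in-memory index. A sketch, typed against a structural interface that mirrors the documented methods; the `CompressedIndex` wrapper itself is hypothetical and not part of the package:

```typescript
// Structural type mirroring the documented TurboQuant methods we need.
interface Quantizer {
  encode(vector: Float32Array): Uint8Array;
  dot(query: Float32Array, compressed: Uint8Array): number;
}

// Hypothetical index: stores only compressed codes, searches via dot().
class CompressedIndex {
  private codes: Uint8Array[] = [];
  constructor(private tq: Quantizer) {}

  add(vector: Float32Array): void {
    this.codes.push(this.tq.encode(vector));
  }

  // Position of the best-scoring stored vector, or -1 if empty.
  nearest(query: Float32Array): number {
    let best = -1;
    let bestScore = -Infinity;
    this.codes.forEach((code, i) => {
      const s = this.tq.dot(query, code);
      if (s > bestScore) {
        bestScore = s;
        best = i;
      }
    });
    return best;
  }
}
```

In real use `tq` would be the object returned by `TurboQuant.init(...)`, which satisfies the interface structurally.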
Browser Requirements
The WASM binary uses relaxed SIMD instructions. Supported runtimes:
| Runtime | Minimum Version |
|---|---|
| Chrome / Edge | 114+ |
| Firefox | 128+ |
| Safari | 18+ |
| Node.js | 20+ |
WebGPU batch scoring requires Chrome/Edge 113+.
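If you want to know up front which path a given runtime will take, a generic probe like the following works; this is illustrative only, since the library does its own detection internally:

```typescript
// Generic capability probe: which scoring path will this runtime
// likely take? (Illustrative; not the package's internal logic.)
function detectPath(): "webgpu" | "wasm-simd" {
  const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;
  return hasWebGPU ? "webgpu" : "wasm-simd";
}

console.log(detectPath());
```

Note that `navigator.gpu` merely signals API presence; a definitive check would also await `navigator.gpu.requestAdapter()` and verify it returns a non-null adapter.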
When to Use TurboQuant (and When Not To)
| | TurboQuant | PQ / OPQ (FAISS, ScaNN) |
|---|---|---|
| Compression | ~4.5 bits/dim (~6x) | ~1–2 bits/dim (16–32x) |
| Query speed | Slower (float decode per pair) | Faster (integer codebook lookup) |
| Training | None; encode any vector immediately | Required; must train on dataset |
| Streaming data | Yes; each vector is self-contained | Degrades if distribution shifts |
| Deployment | npm install + 3 lines of code | Dataset-dependent configuration |
| Size | ~12 kB gzipped | Usually much larger |
Use TurboQuant when vectors arrive continuously (LLM KV cache, real-time indexing), you cannot afford a training step, you need simple browser or edge deployment, or you want a dependency-free npm package.
Use PQ/OPQ when you have a static dataset, can train offline, and need the absolute fastest queries with maximum compression.
Live Demos
- **Vector Search & Image Similarity**: upload an image and find similar vectors in a TurboQuant-compressed index.
- **3D Gaussian Splatting Compression**: compress 3DGS scene data and render with preserved quality.
- **Prompt → Diagram (WGSL)**: a GPU-native demo that runs the same TurboQuant math in compute shaders.
- **Gemma 4 E2B in-browser LLM**: the KV cache is kept TurboQuant-compressed during inference, all client-side.
Quality Guarantees
- Bit-identical output with the reference Zig implementation for the same input + seed.
- MSE decreases as dimension increases (verified on unit vectors).
- Dot product preservation: mean absolute error < 1.0 for unit vectors at dim=128.
- Golden-value tests confirm correctness across encode, decode, and scoring paths.
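The MSE and dot-product checks above use standard formulas that you can reproduce yourself. Generic metric helpers (not code from the package's test suite):

```typescript
// Mean squared error between an original vector and its reconstruction.
function mse(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum / a.length;
}

// Absolute error between the exact dot product and a compressed-domain
// score such as the value returned by tq.dot().
function dotAbsError(
  query: Float32Array,
  original: Float32Array,
  approxScore: number,
): number {
  let exact = 0;
  for (let i = 0; i < query.length; i++) exact += query[i] * original[i];
  return Math.abs(exact - approxScore);
}
```

Feeding `tq.decode(tq.encode(v))` into `mse` and `tq.dot(q, tq.encode(v))` into `dotAbsError` would reproduce the two quality checks listed above.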
Installation
```sh
npm install turboquant-wasm
```

No additional build tools or native dependencies are required at install time. The WASM binary is embedded in the package.
Building from source (if you want to hack on the Zig implementation):
```sh
# Run Zig tests
zig test -target aarch64-macos src/turboquant.zig

# Full npm build (Zig → wasm-opt → base64 embed → bundle + tsc)
bun run build

# WASM only
bun run build:zig
```

Requires Zig 0.15.2 and Bun.
Links
- 🔗 GitHub: teamchong/turboquant-wasm
- 🔗 Paper: arXiv:2504.19874 TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Google Research, ICLR 2026)
- 🔗 npm: turboquant-wasm
- 🔗 Original implementation: botirk38/turboquant
- 🔗 Live demos: teamchong.github.io/turboquant-wasm
Why This Tool Rocks
- **Tiny footprint**: ~12 kB gzipped. Smaller than most image assets.
- **No training**: encode vectors as they arrive. Perfect for streaming and LLM caches.
- **Browser-native**: runs in Chrome, Firefox, Safari, and Node.js with no server round-trips.
- **GPU-accelerated**: WebGPU batch scoring when available; WASM SIMD fallback when not.
- **Near-optimal distortion**: backed by peer-reviewed Google Research with proven quality bounds.
- **Open source**: MIT licensed, with bit-identical verification against the reference Zig code.
- **Dual substrate**: the same algorithm in WASM for CPU and WGSL for GPU, so you can choose the right hardware path for your workload.
Crepi il lupo! ("May the wolf croak!") 🐺