TurboQuant WASM: Compress Vector Indexes 6x and Search Directly in the Browser
Embedding indexes are memory hogs. One million 384-dimensional float32 vectors weigh in at 1.5 GB. On mobile devices, that means minutes of download time and a significant chunk of RAM. TurboQuant WASM shrinks them to ~240 MB (a 6x compression) and lets you search directly on the compressed data without ever decompressing it first.
Built on the Google Research paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate” (ICLR 2026), this package wraps a Zig → WASM build with relaxed SIMD and optional WebGPU acceleration into a ~12 kB gzipped npm module that runs entirely in the browser or Node.js. No server. No training step. No dataset-dependent configuration.
What It Does
🗜️ 6x Compression, No Training
Unlike Product Quantization (PQ/OPQ) methods that require a training pass over your dataset, TurboQuant is online: initialize with dim and seed, then encode any vector immediately. Each compressed vector is self-contained, making it ideal for streaming data, LLM KV caches, and real-time indexing where you cannot pause to build a codebook.
| Scenario | Raw Float32 | TurboQuant | Savings |
|---|---|---|---|
| 1M × 384-dim vectors | 1.5 GB | ~240 MB | ~6x |
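The numbers in the table follow from simple arithmetic. A quick sketch (plain TypeScript, no package needed; the gap between the ~216 MB payload and the quoted ~240 MB is presumably per-vector metadata, which is an assumption here, not a documented figure):

```typescript
// Back-of-the-envelope sizes for 1M × 384-dim vectors.
const numVectors = 1_000_000;
const dim = 384;

// Raw float32: 4 bytes per dimension.
const rawBytes = numVectors * dim * 4; // 1.536 GB

// TurboQuant: ~4.5 bits per dimension. Per-vector metadata would
// add a little on top (exact overhead is internal to the library).
const quantBits = 4.5;
const payloadBytes = numVectors * Math.ceil((dim * quantBits) / 8); // 216 MB

console.log((rawBytes / 1e9).toFixed(2), "GB raw");
console.log((payloadBytes / 1e6).toFixed(0), "MB compressed payload");
console.log((rawBytes / payloadBytes).toFixed(1) + "x before metadata");
```

With a few dozen bytes of metadata per vector, the total lands near the ~240 MB (~6x) figure above.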
⚡ Search Without Decompression
TurboQuant preserves inner products well enough for approximate search. You can run dot() on a single compressed vector, or dotBatch() across an entire index. The batch call automatically detects WebGPU and dispatches a compute shader that scores compressed vectors directly on the GPU. No decompression step, no float32 round-trip.
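`dotBatch()` consumes one flat `Uint8Array` with a fixed stride per vector. A CPU-side sketch of how such a buffer can be assembled and walked; the `scoreFn` callback is a stand-in for `tq.dot`, and neither helper is part of the package:

```typescript
// Concatenate per-vector codes into the flat layout dotBatch() expects.
function concatCodes(codes: Uint8Array[], bytesPerVector: number): Uint8Array {
  const out = new Uint8Array(codes.length * bytesPerVector);
  codes.forEach((code, i) => out.set(code, i * bytesPerVector));
  return out;
}

// Reference loop over that layout: score every stored vector with a
// caller-supplied function (stand-in for tq.dot on real compressed codes).
function scoreAll(
  query: Float32Array,
  concat: Uint8Array,
  bytesPerVector: number,
  scoreFn: (q: Float32Array, code: Uint8Array) => number,
): Float32Array {
  const n = concat.length / bytesPerVector;
  const scores = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    // subarray() is a view, so no bytes are copied per vector.
    const code = concat.subarray(i * bytesPerVector, (i + 1) * bytesPerVector);
    scores[i] = scoreFn(query, code);
  }
  return scores;
}
```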
🔮 Two Substrates, One Algorithm
The library ships two implementations of the same TurboQuant math:
- **WASM SIMD**: the `turboquant-wasm` npm package. CPU path for vector search, image similarity, and 3D Gaussian Splatting compression.
- **WGSL Compute Shaders**: a GPU-native reimplementation for workloads that need real-time throughput. The live Prompt → Diagram demo and the in-browser Gemma 4 E2B LLM both run the algorithm on the GPU so the KV cache stays compressed during inference.
Quick Start
```typescript
import { TurboQuant } from "turboquant-wasm";

const tq = await TurboQuant.init({ dim: 1024, seed: 42 });

// Compress a vector (~4.5 bits/dim)
const compressed = tq.encode(myFloat32Array); // Uint8Array

// Decode back when you need the original
const decoded = tq.decode(compressed); // Float32Array

// Fast dot product without decoding
const score = tq.dot(queryVector, compressed);

// Batch search across an index
const scores = await tq.dotBatch(
  queryVector,
  allCompressed, // concatenated Uint8Array
  bytesPerVector,
);

tq.destroy();
```

`dotBatch()` prefers WebGPU when available (Chrome/Edge 113+) and falls back transparently to WASM SIMD on devices without GPU support.
API
```typescript
class TurboQuant {
  static async init(config: { dim: number; seed: number }): Promise<TurboQuant>;
  encode(vector: Float32Array): Uint8Array;
  decode(compressed: Uint8Array): Float32Array;
  dot(query: Float32Array, compressed: Uint8Array): number;
  dotBatch(
    query: Float32Array,
    compressedConcat: Uint8Array,
    bytesPerVector: number,
  ): Promise<Float32Array>;
  rotateQuery(query: Float32Array): Float32Array;
  destroy(): void;
}
```

- `encode` / `decode`: single-vector compression and reconstruction.
- `dot`: scalar dot product between a float32 query and one compressed vector.
- `dotBatch`: scores a query against many compressed vectors. Auto-detects WebGPU.
- `rotateQuery`: pre-rotates a query for faster repeated batch scoring.
- `destroy`: releases WASM memory.
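The surface above composes naturally into a small in-memory index. A sketch, typed against a structural interface that mirrors the documented methods; the `CompressedIndex` wrapper itself is hypothetical and not part of the package:

```typescript
// Structural type mirroring the documented TurboQuant methods we need.
interface Quantizer {
  encode(vector: Float32Array): Uint8Array;
  dot(query: Float32Array, compressed: Uint8Array): number;
}

// Hypothetical index: stores only compressed codes, searches via dot().
class CompressedIndex {
  private codes: Uint8Array[] = [];
  constructor(private tq: Quantizer) {}

  add(vector: Float32Array): void {
    this.codes.push(this.tq.encode(vector));
  }

  // Position of the best-scoring stored vector, or -1 if empty.
  nearest(query: Float32Array): number {
    let best = -1;
    let bestScore = -Infinity;
    this.codes.forEach((code, i) => {
      const s = this.tq.dot(query, code);
      if (s > bestScore) {
        bestScore = s;
        best = i;
      }
    });
    return best;
  }
}
```

In real use `tq` would be the object returned by `TurboQuant.init(...)`, which satisfies the interface structurally.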
Browser Requirements
The WASM binary uses relaxed SIMD instructions. Supported runtimes:
| Runtime | Minimum Version |
|---|---|
| Chrome / Edge | 114+ |
| Firefox | 128+ |
| Safari | 18+ |
| Node.js | 20+ |
WebGPU batch scoring requires Chrome/Edge 113+.
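If you want to know up front which path a given runtime will take, a generic probe like the following works; this is illustrative only, since the library does its own detection internally:

```typescript
// Generic capability probe: which scoring path will this runtime
// likely take? (Illustrative; not the package's internal logic.)
function detectPath(): "webgpu" | "wasm-simd" {
  const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;
  return hasWebGPU ? "webgpu" : "wasm-simd";
}

console.log(detectPath());
```

Note that `navigator.gpu` merely signals API presence; a definitive check would also await `navigator.gpu.requestAdapter()` and verify it returns a non-null adapter.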
When to Use TurboQuant (and When Not To)
| | TurboQuant | PQ / OPQ (FAISS, ScaNN) |
|---|---|---|
| Compression | ~4.5 bits/dim (~6x) | ~1–2 bits/dim (16–32x) |
| Query speed | Slower (float decode per pair) | Faster (integer codebook lookup) |
| Training | None; encode any vector immediately | Required; must train on dataset |
| Streaming data | Yes; each vector is self-contained | Degrades if distribution shifts |
| Deployment | npm install + 3 lines of code | Dataset-dependent configuration |
| Size | ~12 kB gzipped | Usually much larger |
Use TurboQuant when vectors arrive continuously (LLM KV cache, real-time indexing), you cannot afford a training step, you need simple browser or edge deployment, or you want a dependency-free npm package.
Use PQ/OPQ when you have a static dataset, can train offline, and need the absolute fastest queries with maximum compression.
Live Demos
- **Vector Search & Image Similarity**: upload an image and find similar vectors in a TurboQuant-compressed index.
- **3D Gaussian Splatting Compression**: compress 3DGS scene data and render with preserved quality.
- **Prompt → Diagram (WGSL)**: a GPU-native demo that runs the same TurboQuant math in compute shaders.
- **Gemma 4 E2B in-browser LLM**: the KV cache is kept TurboQuant-compressed during inference, all client-side.
Quality Guarantees
- Bit-identical output with the reference Zig implementation for the same input + seed.
- MSE decreases as dimension increases (verified on unit vectors).
- Dot product preservation: mean absolute error < 1.0 for unit vectors at dim=128.
- Golden-value tests confirm correctness across encode, decode, and scoring paths.
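The MSE and dot-product checks above use standard formulas that you can reproduce yourself. Generic metric helpers (not code from the package's test suite):

```typescript
// Mean squared error between an original vector and its reconstruction.
function mse(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum / a.length;
}

// Absolute error between the exact dot product and a compressed-domain
// score such as the value returned by tq.dot().
function dotAbsError(
  query: Float32Array,
  original: Float32Array,
  approxScore: number,
): number {
  let exact = 0;
  for (let i = 0; i < query.length; i++) exact += query[i] * original[i];
  return Math.abs(exact - approxScore);
}
```

Feeding `tq.decode(tq.encode(v))` into `mse` and `tq.dot(q, tq.encode(v))` into `dotAbsError` would reproduce the two quality checks listed above.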
Installation
```sh
npm install turboquant-wasm
```

No additional build tools or native dependencies are required at install time. The WASM binary is embedded in the package.
Building from source (if you want to hack on the Zig implementation):
```sh
# Run Zig tests
zig test -target aarch64-macos src/turboquant.zig

# Full npm build (Zig → wasm-opt → base64 embed → bundle + tsc)
bun run build

# WASM only
bun run build:zig
```

Requires Zig 0.15.2 and Bun.
Links
- 🔗 GitHub: teamchong/turboquant-wasm
- 🔗 Paper: arXiv:2504.19874 TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Google Research, ICLR 2026)
- 🔗 npm: turboquant-wasm
- 🔗 Original implementation: botirk38/turboquant
- 🔗 Live demos: teamchong.github.io/turboquant-wasm
Why This Tool Rocks
- **Tiny footprint**: ~12 kB gzipped. Smaller than most image assets.
- **No training**: encode vectors as they arrive. Perfect for streaming and LLM caches.
- **Browser-native**: runs in Chrome, Firefox, Safari, and Node.js with no server round-trips.
- **GPU-accelerated**: WebGPU batch scoring when available; WASM SIMD fallback when not.
- **Near-optimal distortion**: backed by peer-reviewed Google Research with proven quality bounds.
- **Open source**: MIT licensed, with bit-identical verification against the reference Zig code.
- **Dual substrate**: the same algorithm in WASM for CPU and WGSL for GPU, so you can choose the right hardware path for your workload.
Crepi il lupo! ("May the wolf croak!") 🐺