Technical Deep Dive · 2025-08-05 · 6 min read

Understanding 1.58-Bit Ternary Quantization in TinyLM

A plain-language look at ternary weights — why representing each weight as just -1, 0, or +1 makes tiny models smaller and faster on phone CPUs.

ternary quantization · 1.58-bit · TinyLM · model compression · edge AI

The Problem With Big Weights

A standard model stores each weight as a 16- or 32-bit floating-point number. For a model with millions of weights, that adds up fast, both in file size and in the number of multiplications the CPU must perform. To run on a phone in a browser, we need something far leaner.

What Ternary Means

Ternary quantization restricts every weight to one of three values: -1, 0, or +1. That is where the "1.58-bit" name comes from: a choice among three equally likely states carries log2(3) ≈ 1.58 bits of information. Instead of storing a precise number, each weight stores only a sign or a zero.
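
The 1.58 figure is easy to check yourself. The snippet below is a small illustration, not part of TinyLM, and the variable names are made up for this example. It also shows why real storage tends to land slightly above the theoretical number: five ternary values have 243 combinations, which fits in one byte.

```python
import math

# Information in one choice among n equally likely states is log2(n) bits.
bits_per_ternary_weight = math.log2(3)
print(f"{bits_per_ternary_weight:.3f}")  # 1.585

# Five ternary weights have 3**5 = 243 combinations, which fits in one
# byte (256 values). That byte-aligned packing costs 8 / 5 bits per weight.
combinations_in_five_weights = 3 ** 5
bits_per_weight_packed = 8 / 5
print(combinations_in_five_weights, bits_per_weight_packed)  # 243 1.6
```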

Why It Is Faster

When weights are only -1, 0, or +1, multiplication mostly disappears from the inner loop. Multiplying an input by +1 leaves it unchanged, by -1 negates it, and by 0 drops it entirely. So a layer that normally needs one multiplication per weight collapses into additions and subtractions, which are among the cheapest operations a CPU can do. On a phone with no GPU, that is a huge win.
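
To make that concrete, here is a minimal sketch of a dot product over ternary weights that never multiplies. The function name `ternary_dot` is hypothetical, and a real engine would process many values per instruction rather than loop one at a time, but the arithmetic is the same.

```python
def ternary_dot(weights, inputs):
    """Dot product where every weight is -1, 0, or +1, with no multiplications."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    total = 0
    for w, x in zip(weights, inputs):
        if w == 1:
            total += x  # times +1: add the input as-is
        elif w == -1:
            total -= x  # times -1: subtract the input
        elif w != 0:
            raise ValueError(f"not a ternary weight: {w}")
        # w == 0: the input contributes nothing, so it is skipped
    return total


weights = [1, 0, -1, 1]
inputs = [3, 7, 2, 5]
print(ternary_dot(weights, inputs))  # 3 - 2 + 5 = 6
```

The result matches the ordinary multiply-and-sum dot product; only the work needed to get there changes.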

Why It Is Smaller

Fewer bits per weight means a smaller file. meeny 2.0 packs 6.2 million parameters using ternary weights, and eeny 2.0 fits in about 1.76 MB. Small files download quickly, fit comfortably in IndexedDB, and leave plenty of headroom in browser memory.
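
How do three-valued weights actually end up in a file? One simple option is to treat five weights as a five-digit base-3 number and store it in a single byte. The sketch below assumes that scheme purely for illustration. This article does not describe TinyLM's real on-disk format, and `pack_ternary` and `unpack_ternary` are hypothetical names. The size estimate at the end also assumes every one of the 6.2 million parameters is ternary and ignores scales and metadata.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, five weights per byte."""
    packed = bytearray()
    for start in range(0, len(weights), 5):
        group = weights[start:start + 5]
        value = 0
        # Read the group as a base-3 number, mapping -1/0/+1 to digits 0/1/2.
        for w in reversed(group):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value = value * 3 + (w + 1)
        packed.append(value)  # at most 3**5 - 1 = 242, so it fits in a byte
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    weights = []
    for byte in packed:
        for _ in range(5):
            weights.append(byte % 3 - 1)  # lowest base-3 digit first
            byte //= 3
    return weights[:count]  # drop padding from a partial final group


original = [1, 0, -1, 1, 0, -1, 1]
packed = pack_ternary(original)
assert unpack_ternary(packed, len(original)) == original

# Back-of-the-envelope sizes for 6.2 million weights (decimal megabytes).
PARAM_COUNT = 6_200_000
packed_bytes = (PARAM_COUNT + 4) // 5  # five weights per byte, rounded up
fp16_bytes = PARAM_COUNT * 2           # two bytes per 16-bit float
print(packed_bytes / 1e6, fp16_bytes / 1e6)  # 1.24 12.4
```

Under those assumptions the ternary weights take roughly a tenth of the space of 16-bit floats.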

The Honest Tradeoff

Ternary quantization is lossy. Crushing a precise weight down to three values throws away information, and that can hurt accuracy if done naively. The trick is to train the model with ternary weights from the start (quantization-aware training), or to convert a trained model carefully, so the network learns to work within the constraint rather than being damaged by it. Done well, the quality cost is modest for the size and speed you gain.
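
As an illustration of what the conversion step can look like, here is a sketch of one common recipe: divide each weight by the mean absolute value of the tensor, round to the nearest integer, and clip to the range -1 to +1. This "absmean" scheme is an assumption chosen for the example. The article does not say which method TinyLM uses, and `quantize_ternary` is a hypothetical name.

```python
def quantize_ternary(weights, eps=1e-8):
    """Map float weights to -1/0/+1 plus one shared float scale (absmean)."""
    if not weights:
        raise ValueError("weights must not be empty")
    # The scale is the average magnitude; eps guards an all-zero tensor.
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


weights = [0.9, -0.05, -1.4, 0.3, 0.02]
ternary, scale = quantize_ternary(weights)
print(ternary)          # [1, 0, -1, 1, 0]
print(round(scale, 3))  # 0.534

# What the network effectively "sees" after quantization.
approx = [t * scale for t in ternary]
```

Notice how -1.4 and 0.9 both end up with the same magnitude, and the small weights vanish entirely. That is the information loss in action, and it is why a network trained with this rounding in the loop tends to cope better than one converted after the fact.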

Where It Shows Up in TinyLM

In TinyLM, ternary weights are paired with int8 SIMD math in the Rust-to-WASM engine. The combination is what lets a multi-million-parameter model feel instant in a browser. You can feel the responsiveness firsthand at https://ai.sprapp.com — the model answers without the lag you would expect from its parameter count.
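
The pairing with int8 can be sketched in a few lines. The idea below is an assumption about how such a pipeline typically works, not a description of TinyLM's engine: inputs are scaled into the int8 range, the ternary weights select which integers to add or subtract, and a single floating-point rescale at the end restores the original units. The function names and the absmax scaling are hypothetical choices for this example.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of floats to integers in [-127, 127]."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0:
        return [0] * len(values), 1.0
    scale = peak / 127
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def ternary_int8_dot(ternary_weights, weight_scale, activations):
    """Integer-only inner loop, followed by one float rescale."""
    quantized, act_scale = quantize_int8(activations)
    acc = 0  # a real engine would use a wider integer accumulator here
    for w, x in zip(ternary_weights, quantized):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
    # The only floating-point multiplications happen once, after the loop.
    return acc * weight_scale * act_scale


activations = [0.5, -1.0, 0.25, 2.0]
ternary_weights = [1, -1, 0, 1]
result = ternary_int8_dot(ternary_weights, 1.0, activations)
exact = 0.5 + 1.0 + 2.0  # the same sum in full precision
print(result, exact)
```

The result lands close to the exact 3.5 but not exactly on it, since the inputs were rounded to 8-bit integers. In a SIMD engine the inner loop would handle many int8 values per instruction, which is where small integers pay off.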

When Ternary Is the Right Choice

Ternary is ideal when you are tightly constrained on size and compute and your task is well-defined. It is less appropriate when you need the last few points of accuracy on a hard, open-ended task. TinyLM uses ternary precisely because its mission — tiny, offline, on-device — lives in the constrained regime where the tradeoff pays off.

The Bigger Picture

Quantization is one of the main levers that makes on-device AI practical. By being aggressive but careful with 1.58-bit weights, TinyLM pushes capable models into places they could never fit before: a web page, an old phone, an offline clinic. The math behind it is simple, and that simplicity is exactly what makes it run anywhere.

Written by SPRAPP Research