Inside the Semantic JSON Codec: How smoltext Reads Structure
A technical look at why treating JSON as structure rather than bytes unlocks compression that general algorithms cannot reach.
Bytes Versus Structure
A general compressor treats input as a stream of bytes and looks for repeated byte sequences. JSON, though, has a known grammar. The smoltext semantic JSON codec parses that grammar and encodes structure and data along separate paths — which is where its edge comes from.
Separating Shape from Value
Consider two events with identical shape but different values. A byte-level compressor sees them as mostly different. The semantic codec sees the same skeleton — same keys, same nesting, same value types — and encodes that skeleton once, cheaply, leaving only the differing values to carry information.
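To make the shape/value split concrete, here is a minimal sketch in Python. The `skeleton` and `leaves` helpers are hypothetical names invented for illustration; the article does not describe how smoltext represents a skeleton internally.

```python
import json

def skeleton(value):
    """Return the shape of a parsed JSON value: keys, nesting, and type names, no data."""
    if isinstance(value, dict):
        return {key: skeleton(child) for key, child in value.items()}
    if isinstance(value, list):
        return [skeleton(child) for child in value]
    return type(value).__name__

def leaves(value):
    """Collect the scalar values in document order; this is the part that differs."""
    if isinstance(value, dict):
        return [x for child in value.values() for x in leaves(child)]
    if isinstance(value, list):
        return [x for child in value for x in leaves(child)]
    return [value]

event_a = json.loads('{"id": 1, "user": {"name": "ana", "active": true}}')
event_b = json.loads('{"id": 2, "user": {"name": "raj", "active": false}}')

# Both events reduce to the same skeleton, so it only needs to be described once.
same_shape = skeleton(event_a) == skeleton(event_b)
```

In this sketch the two events share one skeleton, and only the short value lists returned by `leaves` differ between them.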
Key Tokenization
JSON keys are often the most repetitive part of structured data. The codec maps frequent keys to short codes from a trained codebook. A key like "created_at" that appears on every record collapses to a one- or two-byte token. Because the codebook is shared and lives server-side, it is never transmitted with the data, so tokenization adds no per-message overhead.
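The idea can be sketched with a tiny lookup table. Everything here is an assumption made for illustration: the `CODEBOOK` contents, the `ESCAPE` byte for unknown keys, and the one-byte token width are not taken from smoltext's actual format.

```python
# Hypothetical codebook; the real one is trained and is not described in this article.
CODEBOOK = {"created_at": 0x01, "user_id": 0x02, "status": 0x03}
REVERSE = {token: key for key, token in CODEBOOK.items()}
ESCAPE = 0x00  # assumed marker for a key that is not in the codebook

def encode_key(key: str) -> bytes:
    """Emit a one-byte token for a known key, or ESCAPE + length + UTF-8 bytes."""
    token = CODEBOOK.get(key)
    if token is not None:
        return bytes([token])
    raw = key.encode("utf-8")
    if len(raw) > 255:
        raise ValueError("key too long for this sketch's one-byte length field")
    return bytes([ESCAPE, len(raw)]) + raw

def decode_key(data: bytes) -> str:
    """Invert encode_key for a buffer that starts at a key token."""
    if data[0] == ESCAPE:
        length = data[1]
        return data[2:2 + length].decode("utf-8")
    return REVERSE[data[0]]
```

Serialized as text, `"created_at":` occupies 13 bytes; in this sketch it becomes a single byte, while an unknown key costs two bytes more than its raw name.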
Value Typing
The codec types values rather than storing them as ASCII:
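- Integers become varints, so 1718553600 drops from ten bytes of ASCII digits to four or five, depending on the varint scheme (five under the common 7-bits-per-byte encoding).
- Booleans become single bits.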
- Known enumerated strings (status codes, finish reasons) become codebook references.
- Floats use a compact numeric encoding.
This typing is invisible to a byte-level compressor, which only sees digits and quotes.
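As a rough illustration of the first two bullets, the sketch below uses the widely used LEB128-style varint (7 payload bits per byte) and simple bit packing. These are stand-ins: the article does not say which varint layout or bit order smoltext uses. Under this particular scheme the 31-bit timestamp 1718553600 takes five bytes; reaching four would need a different layout, such as a fixed-width 32-bit field.

```python
def encode_varint(n: int) -> bytes:
    """LEB128-style unsigned varint: 7 data bits per byte, high bit means 'more follows'."""
    if n < 0:
        raise ValueError("this sketch handles unsigned integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode one varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")

def pack_bools(flags) -> bytes:
    """Pack booleans one per bit, least significant bit first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

timestamp = 1718553600
encoded = encode_varint(timestamp)
```

Small integers benefit the most: 300 fits in two bytes, and eight booleans fit in one.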
Punctuation Elimination
Structural punctuation — braces, brackets, colons, commas, quotes — is pure overhead in serialized JSON. Because the codec reconstructs structure from the grammar, it does not need to store most punctuation literally. On small objects, punctuation can be a surprising fraction of the bytes.
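To get a feel for the numbers, the following sketch counts structural characters in a compactly serialized object. The `structural_chars` helper is hypothetical and only measures the overhead; it says nothing about how smoltext itself encodes structure.

```python
import json

def structural_chars(value) -> int:
    """Count braces, brackets, colons, commas, and quotes in the compact serialization."""
    if isinstance(value, dict):
        count = 2 + max(len(value) - 1, 0)        # braces plus separating commas
        for child in value.values():
            count += 3 + structural_chars(child)  # two key quotes and one colon
        return count
    if isinstance(value, list):
        return 2 + max(len(value) - 1, 0) + sum(structural_chars(c) for c in value)
    if isinstance(value, str):
        return 2                                  # the surrounding quotes
    return 0                                      # numbers, booleans, null

record = {"id": 7, "ok": True, "tag": "a"}
compact = json.dumps(record, separators=(",", ":"))
fraction = structural_chars(record) / len(compact)
```

For this 28-byte record, 15 bytes are structural punctuation, a little over half of the payload.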
Then the Entropy Layer
After structural encoding, dictionary deflate runs over what remains, referencing common substrings from the shared dictionary. The result of both layers is the compact output you retrieve from https://api.smoltext.sprapp.com/v1/compress.
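A preset-dictionary DEFLATE pass can be sketched with Python's standard `zlib` module, which accepts a `zdict` argument. The dictionary contents below are invented for illustration, and whether smoltext uses zlib's dictionary mechanism or something else is an assumption.

```python
import zlib

# Hypothetical shared dictionary: a run of substrings expected to be common.
# It is only a byte string and does not need to be valid JSON.
SHARED_DICT = b'{"status":"ok","created_at":,"user_id":,"finish_reason":"stop"}'

def deflate_with_dict(data: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Compress data, letting matches point back into the preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()

def inflate_with_dict(blob: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Decompress; the same dictionary must be supplied on this side."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()
```

Both sides must hold the same dictionary bytes. A short payload that repeats dictionary substrings can then be encoded mostly as back-references instead of literals.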
The Limits
The codec assumes valid, parseable JSON. Send it malformed input and it falls back to treating the data as an opaque string, where savings are smaller. It also gains the most when records share schema; wildly heterogeneous JSON benefits less from the codebook.
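The fallback behavior described above might look like the following on the input side. The `prepare` function and its "structured"/"opaque" labels are hypothetical names for this sketch, not part of any documented smoltext interface.

```python
import json

def prepare(payload: str):
    """Parse the payload if it is valid JSON; otherwise pass it through as an opaque string."""
    try:
        return ("structured", json.loads(payload))
    except json.JSONDecodeError:
        # Malformed input: no grammar to exploit, so treat it as plain text.
        return ("opaque", payload)
```

A payload with a missing closing brace, for example, would take the opaque path and see only the byte-level savings.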
Lossless Guarantee
Every transformation is reversible. Varint encoding, key tokenization, and punctuation reconstruction all decode to the exact original bytes. smoltext never approximates your data.
Why This Matters for Small Payloads
On large data, byte-level modeling eventually finds most of the redundancy a semantic codec would. On small data it cannot, because there is not enough volume for repeated byte sequences to appear. Structure awareness lets smoltext extract savings from a 60-byte object that a general compressor would often expand. That is the whole reason the semantic codec exists.
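The fixed cost of a general-purpose container is easy to measure with the standard `zlib` module, used here only as a representative byte-level compressor. The `deflate_overhead` helper and the sample payload are made up for this sketch.

```python
import zlib

def deflate_overhead(payload: bytes) -> int:
    """Bytes added (positive) or saved (negative) by plain zlib compression."""
    return len(zlib.compress(payload, 9)) - len(payload)

# Sample small record; try your own payloads to see where the break-even point falls.
small = b'{"id":48213,"status":"ok","ms":17,"region":"eu-west"}'
delta = deflate_overhead(small)
```

The zlib header and checksum alone mean that even an empty input produces a non-empty output, so a short payload with few repeated substrings has little room to come out ahead.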