How to Benchmark Small-String Compressors Honestly
Naive benchmarks lie about small-payload compression. Here is how to measure ratios that reflect what you will actually pay for.
Benchmarks Are Easy to Get Wrong
Most compression benchmarks use large corpora — entire books, big log files, the Silesia corpus. Those tell you nothing about how a compressor behaves on your 80-byte event. To evaluate a small-string compressor like smoltext, you need a benchmark that mirrors small-payload reality.
Mistake 1: Concatenating Records
A common error is to concatenate thousands of records into one big blob and compress that. This lets a general compressor amortize its overhead across all records and find cross-record redundancy — which it never gets to do when records are stored and read individually. The blob ratio massively overstates real-world savings.
The fix: compress records individually, the way they are actually stored, then aggregate the sizes.
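To make the gap concrete, here is a minimal sketch that compresses the same records both ways. It uses gzip from the Python standard library as the general compressor, and the `records` list is a made-up placeholder that you would replace with real data.

```python
import gzip

# Hypothetical placeholder records; substitute a real production sample.
records = [
    f'{{"user_id": {i}, "event": "page_view", "path": "/items/{i % 7}"}}'.encode()
    for i in range(1000)
]

original = sum(len(r) for r in records)

# Misleading: one blob lets the compressor exploit cross-record redundancy.
blob_size = len(gzip.compress(b"\n".join(records)))

# Honest: compress each record the way it is stored, then sum the sizes.
per_record_size = sum(len(gzip.compress(r)) for r in records)

blob_ratio = blob_size / original
per_record_ratio = per_record_size / original
print(f"blob ratio:       {blob_ratio:.2f}")
print(f"per-record ratio: {per_record_ratio:.2f}")
```

The per-record ratio is the one that corresponds to what individually stored records cost.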
Mistake 2: Ignoring Framing Overhead
When measuring gzip on small records, count the full framed output, including headers and trailers. A gzip member carries at least 18 bytes of framing (a 10-byte header and an 8-byte trailer), which is a large share of an 80-byte record. Skipping the framing makes general compressors look better than they perform in practice. smoltext's per-message overhead is near zero because its tables are server-side; a fair benchmark must include the framing the alternatives actually carry.
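One way to see the framing cost is to compare the gzip output with a bare DEFLATE stream of the same record. This is a sketch using Python's `gzip` and `zlib` modules; the sample record is hypothetical.

```python
import gzip
import zlib

# Hypothetical sample record; substitute one of your own.
record = b'{"user_id": 42, "event": "page_view", "path": "/items/3"}'

def raw_deflate(data: bytes) -> bytes:
    # wbits=-15 asks zlib for a bare DEFLATE stream with no header or trailer.
    compressor = zlib.compressobj(level=9, wbits=-15)
    return compressor.compress(data) + compressor.flush()

framed = gzip.compress(record, compresslevel=9)    # what actually gets stored
unframed = raw_deflate(record)                     # payload without gzip framing
framing_overhead = len(framed) - len(unframed)

print(f"original: {len(record)} B")
print(f"framed:   {len(framed)} B")
print(f"unframed: {len(unframed)} B")
print(f"framing:  {framing_overhead} B")
```

The framed size is the number that belongs in the benchmark, because that is what ends up on disk or on the wire.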
Mistake 3: Unrepresentative Data
Synthetic data with artificial repetition inflates every compressor's ratio. Use a real sample of your production records — actual events, actual log lines, actual KV values — so the benchmark reflects your real entropy and schema.
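If the production data is too large to load at once, a uniform random sample can be drawn in a single pass. The sketch below uses reservoir sampling; the function name and the generated input are illustrative assumptions, not part of any tool mentioned here.

```python
import random
from typing import Iterable, List

def reservoir_sample(lines: Iterable[bytes], k: int, seed: int = 0) -> List[bytes]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[bytes] = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = line         # replace with probability k / (i + 1)
    return sample

# Stand-in stream; in practice iterate over a real log or export file.
stream = (b"record-%d" % i for i in range(10_000))
sample = reservoir_sample(stream, k=100)
print(len(sample))
```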
A Fair Methodology
1. Collect a few thousand real records.
2. Compress each record individually with each candidate (smoltext, gzip, zstd), keeping the raw size as the baseline.
3. For each candidate, sum the compressed sizes.
4. Compute the aggregate ratio = total compressed / total original (lower is better).
5. Report the aggregate, not a cherry-picked best case.
For smoltext, run step 2 against https://api.smoltext.sprapp.com/v1/compress, which returns exact sizes per record.
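The methodology can be sketched as a small harness. Each candidate is modeled as a function from a record to its stored size in bytes, which is an assumption chosen so that a service returning sizes can plug in alongside local libraries. Only standard-library compressors are wired up here; the commented-out `smoltext_size` and `zstd_size` entries are hypothetical wrappers you would write yourself, and the records are placeholders.

```python
import gzip
import zlib
from typing import Callable, Dict, List

# A candidate maps one record to its stored size in bytes.
SizeFn = Callable[[bytes], int]

def aggregate_ratios(records: List[bytes], candidates: Dict[str, SizeFn]) -> Dict[str, float]:
    """Return total compressed / total original for each candidate."""
    total_original = sum(len(r) for r in records)
    if total_original == 0:
        raise ValueError("need at least one non-empty record")
    return {
        name: sum(size_of(r) for r in records) / total_original
        for name, size_of in candidates.items()
    }

candidates: Dict[str, SizeFn] = {
    "raw": len,
    "gzip": lambda r: len(gzip.compress(r, compresslevel=9)),
    "zlib": lambda r: len(zlib.compress(r, 9)),
    # "smoltext": smoltext_size,  # hypothetical wrapper around the compress endpoint
    # "zstd": zstd_size,          # hypothetical wrapper around a zstd binding
}

# Placeholder records for illustration only; use real production data.
records = [b'{"event":"click","id":%d}' % i for i in range(1000)]

ratios = aggregate_ratios(records, candidates)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {ratio:.3f}")
```

Summing sizes before dividing weights each record by its length, which matches what a storage bill reflects.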
Report the Distribution
A single average hides important behavior. Some records compress wonderfully; high-entropy ones barely move. Report the distribution (median, worst case, and the fraction of records whose ratio falls below a chosen threshold) so you understand the spread, not just the mean.
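A minimal sketch of such a report, using per-record ratios (compressed size / original size). The function name, the 0.8 threshold, and the three sample records are illustrative assumptions; gzip stands in for whichever candidate you are measuring.

```python
import gzip
import random
import statistics
from typing import Callable, Dict, List

def ratio_distribution(records: List[bytes], size_of: Callable[[bytes], int],
                       threshold: float = 0.8) -> Dict[str, float]:
    """Summarize per-record ratios; lower means better compression."""
    ratios = [size_of(r) / len(r) for r in records if r]  # skip empty records
    return {
        "median": statistics.median(ratios),
        "best": min(ratios),
        "worst": max(ratios),
        "fraction_below_threshold": sum(x < threshold for x in ratios) / len(ratios),
    }

rng = random.Random(0)
records = [
    b"status=ok;" * 20,                                   # highly repetitive
    b'{"event":"click","id":12345,"path":"/items/3"}',    # short structured record
    bytes(rng.getrandbits(8) for _ in range(64)),         # high-entropy payload
]

stats = ratio_distribution(records, lambda r: len(gzip.compress(r)))
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

A worst-case ratio above 1.0 means some records grow when compressed, which an average alone would not show.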
Include the Crossover
A genuinely useful benchmark shows where smoltext wins and where it loses. Include some larger payloads so the crossover point is visible: small structured records favor smoltext, large ones favor zstd. A benchmark that hides the crossover is marketing, not measurement.
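One way to make a crossover visible is to bucket records by original size and compute the aggregate ratio per bucket for each candidate. The sketch below shows the bucketing only. Because neither smoltext nor zstd ships with Python, gzip versus raw storage (ratio 1.0) stands in as the comparison, and the bucket edges and generated records are assumptions.

```python
import gzip
from typing import Callable, Dict, List, Sequence

def ratio_by_size_bucket(records: List[bytes], size_of: Callable[[bytes], int],
                         edges: Sequence[int] = (64, 256, 1024, 4096)) -> Dict[str, float]:
    """Aggregate ratio (compressed / original) per original-size bucket."""
    totals: Dict[str, List[int]] = {}
    for r in records:
        if not r:
            continue  # empty records have no meaningful ratio
        label = next((f"<= {e} B" for e in edges if len(r) <= e), f"> {edges[-1]} B")
        bucket = totals.setdefault(label, [0, 0])  # [original, compressed]
        bucket[0] += len(r)
        bucket[1] += size_of(r)
    return {label: comp / orig for label, (orig, comp) in totals.items()}

# Placeholder payloads of increasing size; use real records in practice.
records = [
    (b'{"event":"click","id":%d},' % i) * n
    for n in (1, 4, 16, 64, 256)
    for i in range(20)
]

by_bucket = ratio_by_size_bucket(records, lambda r: len(gzip.compress(r)))
for label, ratio in by_bucket.items():
    print(f"{label:>10}: {ratio:.3f}")
```

Running the same function once per candidate and lining up the buckets shows the size at which one candidate overtakes another.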
Account for Latency
If latency matters, measure end-to-end time including the round trip, and measure it again with batching. The compression itself is sub-millisecond; the network is the variable.
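A rough timing harness might look like the following. The `fake_send_one` and `fake_send_batch` functions are stand-ins that sleep to simulate a round trip; they are not a real client, and whether and how a batch call is exposed is an assumption to check against the service you are testing.

```python
import time
from typing import Callable, List

def seconds_per_record(records: List[bytes], send_one: Callable[[bytes], object]) -> float:
    """End-to-end time per record when each record is its own request."""
    start = time.perf_counter()
    for r in records:
        send_one(r)
    return (time.perf_counter() - start) / len(records)

def seconds_per_record_batched(records: List[bytes],
                               send_batch: Callable[[List[bytes]], object],
                               batch_size: int = 10) -> float:
    """End-to-end time per record when records are sent in batches."""
    start = time.perf_counter()
    for i in range(0, len(records), batch_size):
        send_batch(records[i:i + batch_size])
    return (time.perf_counter() - start) / len(records)

# Stand-ins that simulate one network round trip per call (hypothetical).
def fake_send_one(record: bytes) -> None:
    time.sleep(0.001)

def fake_send_batch(batch: List[bytes]) -> None:
    time.sleep(0.001)

records = [b'{"event":"click","id":%d}' % i for i in range(50)]
single = seconds_per_record(records, fake_send_one)
batched = seconds_per_record_batched(records, fake_send_batch)
print(f"single:  {single * 1000:.3f} ms/record")
print(f"batched: {batched * 1000:.3f} ms/record")
```

Replacing the stand-ins with real client calls, and running from the network location where the workload will run, gives numbers that reflect the actual round trip.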
The Payoff
An honest benchmark gives you a defensible number to bring to a cost decision. It tells you the real ratio on your real data, where the boundaries are, and whether short-string compression is worth integrating for your specific workload.