Compression Benchmark Deep Dive: Every Algorithm, Every Dataset, Every Number
A data-first companion to the compression benchmark — sweet-spot picks per algorithm, an algorithm × category heatmap, the full per-dataset table, and a sortable preset leaderboard built from 3,420 measured runs.
The first post gave the narrative read. This one is the spreadsheet view: every algorithm next to its sweet spot, the heatmap that maps codecs to data categories, the full table of all 36 datasets, and a preset leaderboard you can sort however you want.
If you came here for the practical picks, the Algorithm sweet spots card grid below is the answer. If you came to argue with the data, scroll to the per-dataset table and the leaderboard.
3,420 measured runs · 36 real datasets · 6 algorithms · 95 distinct codec presets · 27.3 GiB of source material covering codebases, single source files, documents, databases, binaries, images, audio, and video.
At a glance
The headline answer to "which codec should I use" is anti-climactic: it depends on the data and what you're optimizing for. The more interesting answer comes out of the data:
- Brotli wins 18 / 36 best-ratio matchups — far more than its reputation suggests, especially on small structured text and dense web-style assets.
- LZMA2 takes 10 / 36 best-ratio wins, mostly on heavy text-like archives where you can afford the time penalty.
- zstd and Brotli tie at 11 / 36 for best balanced score, the metric that actually matters for systems that decompress as much as they compress.
- The single highest ratio in the entire benchmark is brotli 11 on the PDF sample at 42.86x, which tells you more about that PDF than about Brotli.
- The single best balanced score is zstd fast-3 on a DOCX file at 0.994: basically perfect on that workload because the DOCX was already a ZIP, so the fastest preset that doesn't hurt the ratio wins.
The Deep Dive
How to read the data
The five visualizations above answer five different questions. They're meant to be read together, but each one stands alone.
The sweet-spot cards
For each codec, the card shows four presets: the strongest one, the most balanced one, the fastest one, and the smallest level that still captures 95% of the codec's peak weighted ratio. That last one is the only honest answer to "what should I actually use" because it tells you when going higher stops being worth it.
A few patterns stand out:
- zstd's strongest preset is 22, but its 95% sweet spot is much earlier in the level ladder. Most of zstd's value lives in the low-to-mid levels.
- Brotli's 95% sweet spot is high on the ladder. Brotli compresses more with extra time, but its top levels become very expensive on large directories.
- LZ4's strongest preset is barely stronger than its fastest one. That's the entire LZ4 thesis: speed first, ratio second.
- bzip2 has only 9 sensible levels and its strongest one isn't dramatically stronger than its mid-levels, so there's not much to tune.
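The "95% of peak" pick on each card can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the function name `sweet_spot` and the ratios in `ladder` are made up for the example, and it assumes the presets are listed from the lowest level to the highest.

```python
def sweet_spot(ladder, fraction=0.95):
    """Return the first preset whose ratio reaches `fraction` of the peak.

    `ladder` is a list of (preset_name, weighted_ratio) pairs ordered from
    the lowest level to the highest.
    """
    if not ladder:
        raise ValueError("ladder is empty")
    peak = max(ratio for _, ratio in ladder)
    for preset, ratio in ladder:
        if ratio >= fraction * peak:
            return preset


# Hypothetical ratios for one codec, NOT numbers from the benchmark.
ladder = [
    ("1", 2.80),
    ("3", 3.10),
    ("6", 3.30),
    ("9", 3.38),
    ("12", 3.42),
    ("19", 3.50),
    ("22", 3.52),
]

print("95% sweet spot:", sweet_spot(ladder))
```

With these toy numbers the peak is 3.52, so the threshold is about 3.34 and the first preset to clear it sits well below the top of the ladder. Every level above it buys the remaining 5% at most.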
The category heatmap
The heatmap is the picture I refer to most. Each cell is "this codec's best result on this category." Bright cells mean a codec absolutely loved that data shape. Dark cells mean it wasted CPU.
The shape of the heatmap is brutal:
- The top-left quadrant (codebases, source files, documents) is bright across most codecs.
- The bottom rows (audio and video) are nearly flat regardless of codec.
- The "binary" row is wildly mixed because the category itself is wildly mixed: random noise, VM disks, model weights, and package caches all live there.
Switch the metric to Compression throughput and a different story shows up: LZ4 dominates almost everything, which makes the speed-vs-ratio trade-off concrete instead of theoretical.
The efficiency frontier
Each curve traces a single algorithm across its full level ladder. The x-axis is total compression time on a log scale.
The bend in every curve is the same point I mentioned in the first post: the place where extra time stops buying meaningful ratio. The interesting observation is how early that bend is:
- For zstd, the curve is mostly flat after about level 3-7.
- For Brotli, the bend is later — usually around level 7-9 — but the time cost climbs steeply after that.
- For LZMA2, the curve climbs almost the whole way, which is why it wins so many ratio contests but takes so long.
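To reproduce the shape of one of these curves on your own data, a level sweep is enough. The sketch below uses `zlib` from the Python standard library as a stand-in codec, since zstd and Brotli bindings are third-party packages. The sample data is synthetic, and `bend_level` with its 1% threshold is a hypothetical heuristic for "extra time stops buying meaningful ratio", not the method used for the charts.

```python
import time
import zlib


def sweep_levels(data, levels=range(1, 10)):
    """Return one (level, seconds, ratio) row per zlib level."""
    rows = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        rows.append((level, elapsed, len(data) / len(compressed)))
    return rows


def bend_level(rows, min_gain=0.01):
    """First level whose successor improves the ratio by less than min_gain."""
    for (level, _, ratio), (_, _, next_ratio) in zip(rows, rows[1:]):
        if (next_ratio - ratio) / ratio < min_gain:
            return level
    return rows[-1][0]


# Synthetic log-like sample; swap in the bytes of a real file to test your own data.
sample = b"".join(
    b"2024-01-01 12:00:%02d INFO request id=%d status=200\n" % (i % 60, i)
    for i in range(20000)
)

rows = sweep_levels(sample)
for level, seconds, ratio in rows:
    print(f"level {level}: {seconds * 1000:.1f} ms, {ratio:.2f}x")
print("bend at level", bend_level(rows))
```

Plotting `seconds` on a log x-axis against `ratio` gives a curve of the same kind as the frontier chart, one codec at a time.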
Switch the y-axis to "Space saved %" and you can see the diminishing returns directly: going from 60% saved to 70% saved is much easier than going from 70% to 75%.
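The two y-axis metrics are the same information on different scales. A small conversion helper makes that concrete, assuming the usual definitions (ratio = original size / compressed size, space saved = 1 - compressed / original); the function names are just for this example.

```python
def space_saved(ratio):
    """Fraction of bytes removed for a given ratio (original / compressed)."""
    return 1.0 - 1.0 / ratio


def ratio_for(saved):
    """Compression ratio needed to reach a given space-saved fraction."""
    return 1.0 / (1.0 - saved)


for saved in (0.60, 0.70, 0.75):
    print(f"{saved:.0%} saved needs a ratio of {ratio_for(saved):.2f}x")

print(f"42.86x corresponds to {space_saved(42.86):.2%} saved")
```

Under these definitions 60% saved is 2.5x, 70% is about 3.33x, and 75% is 4x. The ratio axis stretches quickly at the high end, which is why a very large ratio such as 42.86x moves the space-saved number much less than the headline suggests.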
Throughput and wins-by-family
Two complementary views:
- Fastest preset throughput shows the absolute speed ceiling per family. LZ4 sits at the top; bzip2 sits at the bottom. Decompression is consistently faster than compression for every codec.
- Wins by family stacks "best ratio" wins and "best balanced" wins per algorithm. The shape is the headline: Brotli dominates raw ratio, zstd and Brotli tie on balanced score, and LZ4 quietly picks up balanced wins on data that nobody can compress well anyway.
The full per-dataset table
This is the data view I would copy out into a spreadsheet if I were planning real storage policy. Every row is one dataset; every cell is the actual measurement. Sort by size if you want to focus on the biggest workloads, sort by ratio to see which inputs are easy targets, sort by saved % to spot the surprises.
A few specific datasets are worth highlighting:
- The PDF document is the highest-ratio result in the benchmark by a wide margin. That number tells you nothing about other PDFs — it tells you that this PDF sample contained large, repetitive, compressible streams.
- The journal log saves over 76% at the top result. Logs and codebases are the easiest, most predictable wins in the entire dataset.
- The Qwen3.5 LLM model directory is a 9.3 GiB binary tree where bzip2 took the best ratio. The same dataset has lz4 1 as the best balanced pick because the absolute time cost of getting that ratio is enormous.
- The random binary blob and the HEVC/MP4/MOV samples have ratios stuck at ~1.000x or even below. That's the benchmark saying "do not even try."
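The "~1.000x or even below" behaviour is easy to reproduce. The sketch below is not part of the benchmark: it uses the three codecs that ship with the Python standard library (`zlib`, `bz2`, `lzma`) at their default settings, a seeded pseudo-random blob as the incompressible input, and a repeated sentence as the easy input.

```python
import bz2
import lzma
import random
import zlib

# Standard-library codecs at default settings, used here as stand-ins.
CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def ratios(data):
    """Compression ratio (original / compressed) per codec for one input."""
    return {name: len(data) / len(fn(data)) for name, fn in CODECS.items()}


size = 1 << 20  # 1 MiB
# Seeded so the run is repeatable; the bytes are still statistically random.
noise = random.Random(0).getrandbits(8 * size).to_bytes(size, "little")
text = b"the quick brown fox jumps over the lazy dog\n" * 20000

noise_ratios = ratios(noise)
text_ratios = ratios(text)

for name in CODECS:
    print(f"{name}: noise {noise_ratios[name]:.3f}x, text {text_ratios[name]:.1f}x")
```

On the random blob every codec lands slightly under 1.0x, because container headers and block framing add bytes that the compressor cannot win back. Already-encoded media behaves much the same way.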
The preset leaderboard
Ninety-five presets, sortable. The default sort is by weighted ratio, so you see who took the absolute crown. Click "Throughput" to find the speed kings. Click "Balanced" to find the actual everyday picks.
The most interesting filter is "Wins" — how many of the 36 datasets a single preset took outright as best ratio. Almost every preset is at zero or one. A handful — brotli 11, lzma2 9e, lzma2 8e — concentrate the wins. That's a useful piece of intuition: the absolute leaderboard is dominated by a few brutal-but-slow presets, and almost everything else is a trade-off.
Practical takeaways from the data
If I had to compress these tables and charts down to one practical answer per workload, this is it:
| Workload | Use this | Why |
|---|---|---|
| Day-to-day developer compression | zstd 3 | Right at the bend in zstd's efficiency curve. Strong ratio, very fast, ubiquitous. |
| Long-term backup of source code, logs, configs | zstd 19 or lzma2 7 | Top-quartile ratio without extreme time cost. LZMA2 if you genuinely never decompress. |
| Web text assets served behind a CDN | brotli 11 | Pre-compressed once, served forever. Decompression is not the bottleneck. |
| Real-time pipelines, fast local packaging | lz4 1 | Decompression speed is the entire point. |
| Already compressed media (mp4, hevc, mp3, aac, png) | Skip it | The data shows you'll burn CPU and the output can end up larger than the input. |
| Mixed-shape archives | zstd 1–zstd 6 | Any of these will be within striking distance of optimal on every category that matters. |
Every aggregate number in this post is computed across all 36 datasets, weighted by source size. That means a single very large dataset (the LLM model, the VM image, the random binary blob) can dominate weighted averages. The per-dataset table is the right place to go when you want the per-input answer rather than the global one.
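To see how much one large input can pull a size-weighted aggregate, here is a small sketch. It assumes "weighted by source size" means a size-weighted mean of per-dataset ratios; the exact formula behind the charts is not spelled out here, and the three datasets below are invented for illustration.

```python
def weighted_ratio(rows):
    """Size-weighted mean of per-dataset ratios; rows are (size, ratio) pairs."""
    total_size = sum(size for size, _ in rows)
    return sum(size * ratio for size, ratio in rows) / total_size


def plain_mean(rows):
    """Unweighted mean of the same ratios, for comparison."""
    return sum(ratio for _, ratio in rows) / len(rows)


MIB = 1 << 20
# Hypothetical inputs: two compressible datasets and one huge incompressible blob.
toy = [
    (10 * MIB, 8.0),    # small log file
    (50 * MIB, 4.0),    # codebase
    (9000 * MIB, 1.0),  # large binary blob
]

print(f"weighted: {weighted_ratio(toy):.3f}x")
print(f"unweighted: {plain_mean(toy):.3f}x")
```

The blob holds almost all the bytes, so the weighted figure sits close to its 1.0x even though the other two inputs compress well. That is the effect to keep in mind when reading any global number in this post.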
What's next
I want to extend the corpus in two directions:
- More categories — game saves, sqlite databases under load, container image layers, parquet files. The "binary" bucket is currently doing too much work.
- Memory and CPU footprint: right now the benchmark only measures time and size. Peak resident memory and core utilization would change the picture for some codecs significantly, especially LZMA2.
If there's a workload you'd want me to add, point me at a representative file and I'll fold it into the next round.
The companion narrative post is here if you want the prose read: Compression on Real Files: 3,420 Benchmark Runs.