The guy in the videos sounds really convinced. Is this a big thing, or more of a “cautious optimism” thing, as with most scientific innovation?
It’ll have its uses, I’m sure, but it’s not like training these specialized compressors is free, or even always advantageous.
The goal of OpenZL is to generate specialized compressors, optimized for the user’s input. To this end, you are going to train a new profile and use it to compress the data. The approach makes more sense when compressing a large flow of similar data, because the cost of training is then amortized across all future compressions.
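For a sense of the shape of that workflow: you pay for training once, on representative samples, and every later compression reuses the trained profile. The command names and flags below are placeholders made up for illustration, not the actual OpenZL CLI, so check their quickstart for the real invocations.
# hypothetical commands, only to show the train-once / compress-many pattern
$ openzl train samples/*.csv -o reads.profile                          # slow, paid once
$ openzl compress --profile reads.profile reads.csv -o reads.csv.zl    # fast, paid per file
$ openzl decompress reads.csv.zl -o reads.out.csv                      # the .zl frame is self-describing, as I understand it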
I strongly suspect that it’s a bunch of “machine learning” hooey. If your compression is capable at all, it should be able to spend a few bits on categorizing whatever “format” type stuff he’s talking about, and then do pretty much as well as the specialized compressor. I won’t say it will never be useful for data whose patterns and regularity aren’t obvious unless you spell them out for the compressor (2D images, for example, where the same positions on consecutive lines are similar but widely separated in the bytestream), but my guess is that this is a bunch of hype and garbage.
Just out of curiosity, I downloaded it and did the quickstart to test my assumption. Results I got:
$ ls -lh reads*
-rw-r--r-- 1 billy users 27M Oct 19 15:14 reads.csv
-rw-r--r-- 1 billy users 4.2M Oct 19 15:15 reads.csv.bz2
-rw-r--r-- 1 billy users 6.7M Oct 19 15:16 reads.csv.zl
So yeah I think at least at first look, for general-purpose compression it’s trash. IDK. I also tried exactly what it sounds like their use case is, compressing PyTorch models, and it’s kinda cool maybe (and certainly faster than bzip2 for those models) but at best it seems like a one-trick pony.
$ ls -lh optimizer*
-rw-r--r-- 1 billy users 76M Oct 19 15:26 optimizer.bin
-rw-r--r-- 1 billy users 56M Oct 19 15:27 optimizer.bin.bz2
-rw-r--r-- 1 billy users 53M Oct 19 15:26 optimizer.bin.zl
I feel like building Huffman trees on top of a general-purpose prediction of what comes next, one that learns what the next bits might be from what has come before (including traversing different formats, or even just skipping backwards in the data by specified amounts), might be a better way than whatever this is doing. But doing way worse than bzip2 on simple textual data, even when we give it the “format hint” that it’s looking for, is a sign of problems to me.
This is from the same people that made zstd, the current state of the art for generic compression by almost any metric. They know what they are doing. Of course this is not better at generic compression because that’s not what it’s for.
Edit: I would assume the video is not great and can recommend the official article: https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
the current state of the art for generic compression by almost any metric
$ ls -lh optimizer*
-rw-r--r-- 1 billy users 76M Oct 19 15:51 optimizer.bin
-rw-r--r-- 1 billy users 56M Oct 19 15:51 optimizer.bin.bz2
-rw-r--r-- 1 billy users 60M Oct 19 15:51 optimizer.bin.zstd
I mean apparently not.
(Lempel-Ziv is not the best compression that’s currently known by a wide margin. It’s very fast and it’s nicely elegant but I would expect almost any modern “next gen compression” to be based on Huffman trees at the very core, or else specialized lossy compression. Maybe I am wrong, I’m not super up to speed on this stuff, but zstd is not state of the art, that much I definitely know.)
Of course this is not better at generic compression because that’s not what it’s for.
They specifically offered csv as an example of a thing it can handle; that’s why I chose that as one of the tests.
Zstd by default uses a level that’s like 10x faster than the default of bz2. Also, bz2 is unusably slow in decompression if you have files >100MB.
Yes, Lempel-Ziv is incredibly fast in compression. That’s because it’s a sort of elegant hack from the 1970s that more or less gets lucky in terms of how it can be made to work to compress files. It’s very nice. You said “by almost any metric,” though, not “by compression speed and literally nothing else.” There is a reason web pages default to using gzip instead of zstd for example.
Absolutely no idea what you’re on about with >100 MB. I’ve used bzip2 for all my hard disk backups for about 20 years now, and I think I broke the 100 MB barrier for local storage at some point during that time.
My point is that you are comparing the wrong thing: if you make zstd as slow as bz2 by increasing the level, you will get the same or better compression ratio on most content. You’re just comparing whose defaults you like more. Zstd is on the Pareto front almost everywhere; you can tune it to be (almost) the fastest, or tune it to be almost the highest compression ratio, with a single number, all while having decompression speeds topping the alternatives.
Additionally it has features nothing else has, like --adapt mode and dictionary compression.
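For anyone who wants to check that on their own data, the relevant zstd knobs are easy to try. A rough sketch with placeholder file names, not a tuned benchmark:
# default is level 3; raise it to trade compression speed for ratio
$ zstd -19 -k data.bin -o data.bin.zst
# --adapt varies the level on the fly based on how fast the output is being consumed
$ cat data.bin | zstd --adapt > data.bin.adapt.zst
# dictionary mode: train once on a directory of small, similar samples, then reuse the dictionary
$ zstd --train samples/* -o samples.dict
$ zstd -D samples.dict -k new_sample.bin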
What are you basing this all on?
$ time (cat optimizer.bin | bzip2 > optimizer.bin.bz2)
real    0m4.352s
user    0m4.244s
sys     0m0.135s
$ time (cat optimizer.bin | zstd -19 > optimizer.bin.zst)
real    0m12.786s
user    0m28.457s
sys     0m0.237s
$ ls -lh optimizer.bin*
-rw-r--r-- 1 billy users 76M Oct 20 17:54 optimizer.bin
-rw-r--r-- 1 billy users 56M Oct 20 17:55 optimizer.bin.bz2
-rw-r--r-- 1 billy users 59M Oct 20 17:56 optimizer.bin.zst
$ time (cat stocks-part-2022-08.tar | bzip2 > stocks-part-2022-08.tar.bz2)
real    0m3.845s
user    0m3.788s
sys     0m0.103s
$ time (cat stocks-part-2022-08.tar | zstd -19 > stocks-part-2022-08.zst)
real    0m34.917s
user    1m12.811s
sys     0m0.211s
$ ls -lh stocks-part-2022-08.*
-rw-r--r-- 1 billy users 73M Oct 20 17:57 stocks-part-2022-08.tar
-rw-r--r-- 1 billy users 26M Oct 20 17:58 stocks-part-2022-08.tar.bz2
-rw-r--r-- 1 billy users 27M Oct 20 17:59 stocks-part-2022-08.zst
Are you looking at https://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html or something? I would expect Lempel-Ziv to perform phenomenally on genomic data because of how many widely separated repeated sequences the data will have… for that specific domain I could see zstd being a clear winner (super fast obviously and also happens to have the best compression, although check the not-starting-at-0 Y axis to put that in context).
I have literally never heard of someone claiming zstd was the best overall general purpose compression. Where are you getting this?
I have literally never heard of someone claiming zstd was the best overall general purpose compression. Where are you getting this?
You must be living in a different bubble than me then, because I see zstd used everywhere, from my Linux package manager and my Linux kernel boot image, to my browser getting served zstd content-encoding by default, to large dataset compression (100GB+)… everything, basically. On the other hand, it’s been a long time since I’ve seen bz2 anywhere, I guess because of its terrible decompression speed: it decompresses slower than an average internet connection, making it the bottleneck and a bad idea for anything sent (multiple times) over the internet. (Easy to check; see the sketch below.)
That might also be why I rarely see it included in compression benchmarks.
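The decompression-speed point is easy to measure. A rough sketch with a placeholder archive, single-threaded; numbers will vary by machine:
# time decompression only, writing to /dev/null so disk speed does not get in the way
$ time bzip2 -dc backup.tar.bz2 > /dev/null
$ time zstd -dc backup.tar.zst > /dev/null
# divide the uncompressed size by the "real" time and compare against your download speed in MB/s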
I stand corrected on compression ratio vs compression speed; I was probably thinking of decompression speed, as you said, which zstd optimizes heavily for and which I do think is more important for most use cases. Also, try -22 --ultra as well as --long=31 (for data > 128MB); see the sketch after the links below. I was making an assumption in my previous comment based on comparisons I do often, but I guess I never use bz2.
Random sources showing zstd performance on different datasets:
https://linuxreviews.org/Comparison_of_Compression_Algorithms
https://www.redpill-linpro.com/techblog/2024/12/18/compression-tool-test.html
https://insanity.industries/post/pareto-optimal-compression/
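A minimal sketch of that invocation, assuming a placeholder bigfile.tar larger than 128MB; note that the large window also has to be re-enabled when decompressing:
# level 22 requires --ultra; --long=31 raises the match window to 2 GiB (2^31 bytes)
$ zstd --ultra -22 --long=31 -k bigfile.tar -o bigfile.tar.zst
# by default the decompressor refuses windows larger than 128 MiB, so pass --long=31 there too
$ zstd -d --long=31 bigfile.tar.zst -o bigfile.check.tar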