OpenZL Explained - Changing Data Compression Forever

ruffsl@programming.dev · edit-2 3 months ago

PhilipTheBucket@piefed.social · 3 months ago

“the current state of the art for generic compression by almost any metric”

$ ls -lh optimizer*
-rw-r--r-- 1 billy users 76M Oct 19 15:51 optimizer.bin
-rw-r--r-- 1 billy users 56M Oct 19 15:51 optimizer.bin.bz2
-rw-r--r-- 1 billy users 60M Oct 19 15:51 optimizer.bin.zstd

I mean apparently not.

(Lempel-Ziv is not the best compression that’s currently known by a wide margin. It’s very fast and it’s nicely elegant but I would expect almost any modern “next gen compression” to be based on Huffman trees at the very core, or else specialized lossy compression. Maybe I am wrong, I’m not super up to speed on this stuff, but zstd is not state of the art, that much I definitely know.)

Of course this is not better at generic compression because that’s not what it’s for.

They specifically offered csv as an example of a thing it can handle, that’s why I chose that as one of the tests.

phiresky@lemmy.world · 3 months ago

Zstd by default uses a level that’s like 10x faster than the default of bz2. Also Bz2 is unusably slow in decompression if you have files >100MB.

PhilipTheBucket@piefed.social · 3 months ago

Yes, Lempel-Ziv is incredibly fast in compression. That’s because it’s a sort of elegant hack from the 1970s that more or less gets lucky in terms of how it can be made to work to compress files. It’s very nice. You said “by almost any metric,” though, not “by compression speed and literally nothing else.” There is a reason web pages default to using gzip instead of zstd for example.

Absolutely no idea what you’re on about with >100 MB. I’ve used bzip2 for all my hard disk backups for about 20 years now, and I think I broke the 100 MB barrier for local storage at some point during that time.

phiresky@lemmy.world · edit-2 3 months ago

My point is that you are comparing the wrong thing: if you make zstd as slow as bz2 by increasing the level, you will get the same or a better compression ratio on most content. You’re just comparing who has defaults you like more. Zstd is on the Pareto front almost everywhere. With a single number you can tune it to be (almost) the fastest or to have almost the highest compression ratio, all while having decompression speeds topping alternatives.

Additionally it has features nothing else has, like --adapt mode and dictionary compression.

PhilipTheBucket@piefed.social · 3 months ago

What are you basing this all on?

$ time (cat optimizer.bin | bzip2 > optimizer.bin.bz2)

real	0m4.352s
user	0m4.244s
sys	0m0.135s

$ time (cat optimizer.bin | zstd -19 > optimizer.bin.zst)

real	0m12.786s
user	0m28.457s
sys	0m0.237s

$ ls -lh optimizer.bin*
-rw-r--r-- 1 billy users 76M Oct 20 17:54 optimizer.bin
-rw-r--r-- 1 billy users 56M Oct 20 17:55 optimizer.bin.bz2
-rw-r--r-- 1 billy users 59M Oct 20 17:56 optimizer.bin.zst

$ time (cat stocks-part-2022-08.tar | bzip2 > stocks-part-2022-08.tar.bz2)

real	0m3.845s
user	0m3.788s
sys	0m0.103s

$ time (cat stocks-part-2022-08.tar | zstd -19 > stocks-part-2022-08.zst)

real	0m34.917s
user	1m12.811s
sys	0m0.211s

$ ls -lh stocks-part-2022-08.*
-rw-r--r-- 1 billy users 73M Oct 20 17:57 stocks-part-2022-08.tar
-rw-r--r-- 1 billy users 26M Oct 20 17:58 stocks-part-2022-08.tar.bz2
-rw-r--r-- 1 billy users 27M Oct 20 17:59 stocks-part-2022-08.zst

Are you looking at https://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html or something? I would expect Lempel-Ziv to perform phenomenally on genomic data because of how many widely separated repeated sequences the data will have… for that specific domain I could see zstd being a clear winner (super fast obviously and also happens to have the best compression, although check the not-starting-at-0 Y axis to put that in context).

I have literally never heard of someone claiming zstd was the best overall general purpose compression. Where are you getting this?
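For anyone who wants to repeat this kind of test with decompression time included, here is a minimal sketch using only Python's standard library. It compares bz2, zlib and xz rather than zstd, because zstd bindings are not in the standard library on most Python versions. The codec list, the levels and the synthetic sample data are all assumptions for illustration, not the files or settings used above.

```python
import bz2
import lzma
import time
import zlib

# name -> (compress function, decompress function).
# The levels are arbitrary picks for illustration, not tuned settings.
CODECS = {
    "zlib-6": (lambda d: zlib.compress(d, 6), zlib.decompress),
    "bz2-9": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "xz-6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}


def benchmark(data: bytes) -> dict:
    """Return {name: (ratio, compress_seconds, decompress_seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = decompress(packed)
        decompress_s = time.perf_counter() - start

        # Make sure the codec round-trips before trusting its numbers.
        assert unpacked == data
        results[name] = (len(data) / len(packed), compress_s, decompress_s)
    return results


if __name__ == "__main__":
    # Hypothetical synthetic input; read a real file's bytes here to compare.
    sample = b"2022-08-01,ACME,101.25,1000\n" * 50_000
    for name, (ratio, c_s, d_s) in benchmark(sample).items():
        print(f"{name:8s} ratio={ratio:8.1f} "
              f"compress={c_s:.3f}s decompress={d_s:.3f}s")
```

Highly repetitive synthetic data flatters every codec, so the ratios only mean something once the sample is replaced with real content.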

phiresky@lemmy.world · edit-2 3 months ago

“I have literally never heard of someone claiming zstd was the best overall general purpose compression. Where are you getting this?”

You must be living in a different bubble than me then, because I see zstd used everywhere, from my Linux package manager, my Linux kernel boot image, to my browser getting served zstd content-encoding by default, to large dataset compression (100GB+)… everything basically. On the other hand it’s been a long time since I’ve seen bz2 anywhere, I guess because of its terrible decompression speed. It decompresses slower than an average internet connection, making it the bottleneck and a bad idea for anything sent (multiple times) over the internet.

That might also be why I rarely see it included in compression benchmarks.

I stand corrected on the compression ratio vs compression speed, I was probably thinking of decompression speed as you said, which zstd optimizes heavily for and which I do think is more important for most use cases. Also, try -22 --ultra as well as --long=31 (for data > 128MB). I was making an assumption in my previous comment based on comparisons I do often but I guess I never use bz2.

Random sources showing zstd performance on different datasets

https://linuxreviews.org/Comparison_of_Compression_Algorithms

https://www.redpill-linpro.com/techblog/2024/12/18/compression-tool-test.html

https://insanity.industries/post/pareto-optimal-compression/

PhilipTheBucket@piefed.social · edit-2 3 months ago

“You must be living in a different bubble than me then, because I see zstd used everywhere, from my Linux package manager, my Linux kernel boot image, to my browser getting served zstd content-encoding by default”

Clearly a different bubble lol.

What distro are you using that uses zstd? Both kernel images and packages seem like a textbook case where compressed size is more important than speed of compression… which would mean not zstd. And of course I checked, it looks like NixOS uses bz2 for kernel images (which is obviously right to me) and gzip (!) for packages? Maybe? I’m not totally up to speed on it yet, but it sort of looks that way.

I mean I see the benchmarks, zstd looks nice. I checked this:

https://tools.paulcalvano.com/compression-tester/

… on lemmy.world, and it said that lemmy.world wasn’t offering zstd as an option. In its estimate, Brotli is way better than gzip, and sort of equivalent with zstd, with zstd often being slightly faster in compression. I get the idea, it sounds cool, but it sort of sounds like something that Facebook is pushing that’s of dubious usefulness unless you really have a need for much faster compression (which, to be fair, is a lot of important use cases).

Yeah, I think of bz2 as sort of maximal compression at the cost of slower speed, gzip as the standard if you just want “compression” in general and don’t care that much, and then a little menagerie of higher performance options if you care enough to optimize. The only thing that struck me as weird about what you were saying was claiming it’s better in every metric (instead of it just being a good project that focuses on high speed and okay compression) and a global standard (instead of being something new-ish that is useful in some specific scenarios). And then when I tried both zstd and this other new Facebook thing and they were both worse (on compression) than bz2 which has been around for ages I became a lot more skeptical…

FizzyOrange@programming.dev · 3 months ago

He’s right, zstd is incredibly popular, quite widely used and also generally believed to be the best compression algorithm overall.

PhilipTheBucket@piefed.social · 3 months ago

Sure. I’m saying I tested it against bz2, looked up some rough details of how it works, and got a sense of what the strengths and weaknesses are, and you are wrong that it is simply “the best.” I actually do think it’s plausibly “the best” for applications where speed of compression is paramount and you still need decent compression, which is probably a lot of them. Having learned that, I’ve completed what I wanted to get out of this conversation.

phiresky@lemmy.world · 3 months ago

“compressed size is more important than speed of compression”

Yes, but decompression speed is even more important, no? My internet connection gets 40MByte/s and my ssd 500+MB/s, so if my decompressor runs at <40MB/s it’s slowing down my updates / boot time and it would be better to use a worse compression.
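The arithmetic behind that can be sketched roughly as follows. This assumes the download and the decompression overlap (streaming), so the slower stage sets the total time, and that decompressor speed is measured in uncompressed output MB/s. The function name, ratios and speeds are hypothetical numbers picked for illustration, not measurements.

```python
def install_seconds(original_mb: float, ratio: float,
                    link_mb_s: float, decompress_mb_s: float) -> float:
    """Rough time to fetch and unpack a package when both stages overlap."""
    # Only the compressed bytes travel over the link.
    download_s = (original_mb / ratio) / link_mb_s
    # The decompressor has to produce every uncompressed byte.
    decompress_s = original_mb / decompress_mb_s
    # With streaming, the slower of the two stages is the bottleneck.
    return max(download_s, decompress_s)


if __name__ == "__main__":
    # Hypothetical codecs on a 40 MB/s link and a 1000 MB package.
    slow_but_small = install_seconds(1000, ratio=3.0, link_mb_s=40, decompress_mb_s=30)
    fast_but_bigger = install_seconds(1000, ratio=2.8, link_mb_s=40, decompress_mb_s=1000)
    print(f"slow decompressor, better ratio: {slow_but_small:.1f}s")
    print(f"fast decompressor, worse ratio:  {fast_but_bigger:.1f}s")
```

Under these made-up numbers the better ratio saves well under a second of download time, while the slow decompressor adds roughly 25 seconds.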

Arch - since 2021 for kernel images https://archlinux.org/news/moving-to-zstandard-images-by-default-on-mkinitcpio/ and since 2019 for packages https://lists.archlinux.org/pipermail/arch-dev-public/2019-December/029739.html

brotli is mainly good because it basically has a huge dictionary that includes common http headers and html structures so those don’t need to be part of the compressed file. I would assume without testing that zstd would more clearly win against brotli if you’d train a similar dictionary for it or just include a random WARC file into --patch-from.
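To illustrate why a shared dictionary helps so much on small web payloads, here is a minimal sketch using zlib's preset-dictionary support from the Python standard library. This is neither brotli's built-in dictionary nor a trained zstd dictionary, only the same idea in a codec that ships with Python. The dictionary and page contents are hypothetical.

```python
import zlib

# Hypothetical shared dictionary: boilerplate that both sides already have.
SHARED_DICT = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title>'
    b'</head><body><div class="content"></div></body></html>'
)


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Deflate data, letting matches point into the preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Inflate data that was compressed against the same dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# A small page that shares most of its bytes with the dictionary.
page = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>Hi</title>'
    b'</head><body><div class="content">Hello</div></body></html>'
)

plain = zlib.compress(page, 9)
with_dict = compress_with_dict(page, SHARED_DICT)

# The receiver needs the identical dictionary to get the page back.
assert decompress_with_dict(with_dict, SHARED_DICT) == page
print(f"original={len(page)} plain={len(plain)} with_dict={len(with_dict)}")
```

The gain shrinks as payloads get larger, since a big file carries enough of its own repetition for the compressor to find.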

Cloudflare started supporting zstd and is using it as the default since 2024 https://blog.cloudflare.com/new-standards/ citing compression speed as the main reason (since it does this on the fly). It’s been in chrome since 2021 https://chromestatus.com/feature/6186023867908096

The RFC mentions dictionaries but they are not currently used:

Actually this is already considered in RFC-8878 [0]. The RFC reserves zstd frame dictionary ids in the ranges: <= 32767 and >= (1 << 31) for a public IANA dictionary registry, but there are no such dictionaries published for public use yet. [0]: https://datatracker.ietf.org/doc/html/rfc8878#iana_dict
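As a small sketch of what those reserved ranges mean in practice, the check below tests whether a dictionary ID falls into either of them. It follows the two ranges exactly as quoted above and does not special-case any value. The function and constant names are made up for illustration and are not part of any zstd API.

```python
LOW_RESERVED_MAX = 32767        # IDs <= 32767 are reserved (as quoted above)
HIGH_RESERVED_MIN = 1 << 31     # IDs >= 2**31 are reserved (as quoted above)
MAX_DICT_ID = (1 << 32) - 1     # the frame header stores at most 4 bytes of ID


def is_reserved_dictionary_id(dict_id: int) -> bool:
    """True if dict_id lies in a range set aside for a public registry."""
    if not 0 <= dict_id <= MAX_DICT_ID:
        raise ValueError("dictionary ID must fit in an unsigned 32-bit field")
    return dict_id <= LOW_RESERVED_MAX or dict_id >= HIGH_RESERVED_MIN


if __name__ == "__main__":
    for candidate in (13, 32767, 32768, (1 << 31) - 1, 1 << 31):
        print(candidate, is_reserved_dictionary_id(candidate))
```

So anyone shipping private dictionaries on the public web would want IDs between 32768 and 2**31 - 1 to stay clear of a future registry.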

And there is a proposed standard for how zstd dictionaries could be served from a domain https://datatracker.ietf.org/doc/rfc9842/

“it’s better in every metric”

Let me revise that statement to - it’s better in every metric (compression speed, compressed size, feature set, most importantly decompression speed) compared to all other compressors I’m aware of, apart from xz and bz2 and potentially other non-lz compressors in the best compression ratio aspect. And I’m not sure whether it beats lzo/lz4 in the very fast levels (negative numbers on zstd).

“that struck me as weird about what you were saying”

What struck me as weird was that you were kind of calling it AI hype crap, when they are developing this for their own use and publishing it (not to make money). I’m kind of assuming this based on how much work they put into open sourcing the zstd format and how deeply it is now used in a lot of FOSS that does not care at all for Facebook. The format they are introducing uses explicitly structured data formats to guide a compressor. That structure can be generated from a struct or class definition, and yes, potentially much more easily by an LLM, but I don’t think that is hooey. So I assumed you had no idea what you were talking about.

PhilipTheBucket@piefed.social · 3 months ago

“Let me revise that statement to - it’s better in every metric (compression speed, compressed size, feature set, most importantly decompression speed) compared to all other compressors I’m aware of, apart from xz and bz2 and potentially other non-lz compressors in the best compression ratio aspect.”

Your Cloudflare post literally says “a new compression algorithm that we have found compresses data 42% faster than Brotli while maintaining almost the same compression levels.” Yes, I get that in some circumstances where compression speed is important, this might be very useful. I don’t see the point in talking further in circles anymore, thank you for the information.
