Zstd: Arch optimizations?

Created on 6 Sep 2016 · 13Comments · Source: facebook/zstd

I'm running some tests on Android and I'm seeing decompression speeds slightly slower than zlib. I do realize that most likely zlib has been optimized to take advantage of the CPU architecture running on the device I'm testing.

So I took a quick look around zstd and didn't see any arch optimizations. Any plans for these optimizations?

question

Source

jonameson

All 13 comments

I'm curious about your exact numbers & CPU you're using @jonameson

tomByrer on 6 Sep 2016

Hey @tomByrer, so I'm testing now on Nexus 5 device which has a snapdragon 800 CPU.

At the moment my tests are fairly simple with no low level profiler. Just a simple timestamp type of profiling. My use for zstd is to compress and decompress raw premultiplied pixel data. In my tests I'm using 1280x720 image.

Since my post I been doing various types of tests and was able to find something interesting. Essentially images containing complex pixel data zlib was noticeably faster, while simpler pixel data made zstd faster.

Decoding simple 1280x720 image with transparent pixels:

~32ms [zstd]
~40ms [zlib]
~70ms [png decoder]

Decoding complex 1280x720 image with no transparent pixels:

~87ms [zstd]
~80ms [zlib]
~105ms [png decoder]

jonameson on 6 Sep 2016

We are currently lacking a proper platform for ARM benchmark, but that's something we'll work on very soon.

From 3rd party supplied figures, it seems there is a large speed difference between 32 bits vs 64 bits ARM.

See for example #190 .

Cyan4973 on 7 Sep 2016

@Cyan4973, interesting... I did some more tests on a Samsung S6 with arm64 build and decompressing was almost double speed of zlib on most tests. So yeah that confirms the #190 post.

For sure it would be great if some optimizations where made to 32bit builds, however, I understand everything is currently moving to 64bit and the speed differences in my case aren't completely bad.

I'm not too familiar with the zstd algorithms used but there might be some extra horse power that could be added to zstd with some NEON optimizations for ARM.

jonameson on 7 Sep 2016

As a hint for anyone who wants to run ARM benchmarks using hardware they may already own, in my experience it is possible to use any off-the-shelf Android phone or tablet. The general process is like this:

Build a binary using arm-linux-androideabi-gcc from the Android NDK. The --sysroot argument is required; point it to a platform directory in the NDK like /opt/android-ndk/platforms/android-12/arch-arm. Also useful can be something like -march=armv7-a -mfpu=neon -mfloat-abi=softfp, if you want to target a more recent CPU.
Connect Android device to your computer using a USB cable
Enable USB debugging on Android device
Use adb push to copy the binary and sample data to /data/local/tmp on the Android device. I found this works without root permission, but if you have root permission, then other locations will work too.
Use adb shell to get a shell on Android and run the program.

If your device uses 64-bit ARM then you would need to use aarch64-linux-android-4.9 and target at least android-21.

Of course, there are many other options too. I'm just providing this for anyone who may find it useful.

ebiggers on 8 Sep 2016

👍1

Regarding arch optimisations, zlib has "recently" experienced improvements that make use of the SSE2.1 instruction set for newer Intel CPUs (http://www.htslib.org/benchmarks/zlib.html). This significantly improved the speed of zlib, over 20 years since it was first released.

When comparing these newer versions of zlib to Zstd, i feel it is somewhat of an unfair comparison if zlib has been optimized for my CPU and Zstd has not. I'm aware of Zstd's history and I know it's not a new algorithm by any means, it's been long in development; but i'd still be interested to know if the authors think there might be arch optimizations over the coming years that could increase Zstd's speed beyond what it currently is..? Or is it as fast as it's likely ever going to be?

For what it's worth, I tested Zstd on an Intel E5-2670 and it was much faster than zlib (unlike the OPs question), however i'm still curious how close to the max this is likely to be.

Thank you very much :)

JohnLonginotto on 30 Jan 2017

@JohnLonginotto

First, it's important to note that the official version of zlib still isn't very optimized, and is missing both arch-specific and non-arch-specific optimizations. To get those optimizations you have to go to forks like Intel's or Cloudflare's, or to other implementations entirely, like libdeflate (https://github.com/ebiggers/libdeflate).

Zstandard, on the other hand, seems to have been quite well optimized since the beginning and generally includes best practices for compression and decompression algorithms on modern CPUs. Naturally it is still possible to make it faster, but this is more likely to involve introducing hand-written assembly code, and such changes are comparatively more likely to result in small performance improvements rather than large ones.

If anything I think the bias usually goes the other way: people tend to underestimate the performance of zlib (or gzip, or DEFLATE) because they are not using a good implementation of it.

ebiggers on 30 Jan 2017

Yeah, these new implementations are much faster. It's quite odd that their existence isn't a bigger deal since so much relies on gzip. Unfortunately most unpacking binaries (gzip, zcat, pigz, etc) do not dynamically link zlib, and so you experience very little benefit by installing it on a system without really going through and recompiling things to use the new zlib. But it is much faster - although Zstd is still ~x2 faster than that!

EDIT: although i hadn't tested your libdeflate. I'll try that now - love that you have a proper gzip-like program in there! :)

JohnLonginotto on 30 Jan 2017

Yes it seems that GNU gzip has its own fork of the zlib code and doesn't link to zlib. That decision seems to have been made a long time ago, but I don't know why.

Since I don't have a zlib-compatible API in libdeflate and there is no real streaming support yet, for now it's still somewhat of a fringe project, usable by only some applications. It is however the fastest implementation of gzip, zlib, and DEFLATE I know of, at least on x86_64.

I think that to improve things for real there will need to be a much better maintained zlib fork which people agree on, maybe even getting the maintainer of the official version involved in including improvements in the official version. And GNU gzip should be updated to link to zlib, of course.

ebiggers on 30 Jan 2017

👍1

Ah, yeah, i hit a "file too large" error. I was decompressing something around 5Gb. But it's a good project - would be particularly useful in bioinformatics (since all our data is gzip'd) if it proves to be faster than the CloudFare solution :) The author of samtools would be very interested i bet

JohnLonginotto on 30 Jan 2017

I don't think we can do more on this thread.
Let's open a dedicated topic when such a question arise for some specific target.

Cyan4973 on 27 Mar 2018

@Cyan4973
Finally, does ZSTD support some Aarch optimizations such as to use NEON on ARMv8? just like @jonameson mentioned?

haileeshen on 11 Jul 2019

Not directly.

Some portions of code are designed to "auto-vectorize", which works well with clang, but they are still expressed using standard "scalar" code, and it's up to the compiler to use vector instructions if it sees fit.

We know a few places where that happens, but none of them make a big difference.

To make more use of vector instruction, some algorithms should be changed to fit the capabilities of a selected vector instruction set. Unfortunately, that's a very deep change, and there is no guarantee of success.

Cyan4973 on 11 Jul 2019

👍1

Was this page helpful?

0 / 5 - 0 ratings