Zstd: Question of arch ARM optimizations

Created on 22 Mar 2017 · 15 comments · Source: facebook/zstd

Hi Cyan4973,
Do you have any plan for ARM arch optimizations?


All 15 comments

This would be interesting, indeed,
but we would first need benchmarks on real ARM hardware.
Without it, we can only "guess" and test for correctness with emulators.

I'm trying to use zstd on an Android ARM device, and even at compressionLevel 0, it's significantly slower than zlib with compression level Z_BEST_SPEED.

We know that there is a large speed difference between 32-bit and 64-bit ARM CPUs.
Zstandard works much better on 64-bit CPUs.

That being said, being "significantly slower" than zlib is not expected, not even on 32-bit.

It's going to be tough to debug without access to hardware.

I'm running on a Nexus 9, which isn't hard to obtain. I'm compiling 32-bit code.

It's possible that Android's zlib is specifically optimized for ARM. I just tried LZ4, and it was also somewhat slower than zlib with Z_BEST_SPEED.

Note that I was/am referring to compression speed. I haven't tried decompression.

Hardware assist is a possibility.
In that case, it's not just "ARM-optimized" but rather "SoC-optimized", which is more specific.
It would mean the shipped system version of zlib isn't the reference one,
but a specifically built one that triggers hardware assist.

A good way to compare would be to download the reference zlib implementation from Mark Adler
at https://github.com/madler/zlib , compile it, and benchmark, to see if it delivers the same speed as your system one.

Here's the result:

System ZLIB (Z_BEST_SPEED)

03-26 14:11:26.276 11503-11980 I/GraphBuilder: Compressed 124285632 bytes
03-26 14:11:26.276 11503-11980 I/GraphBuilder: Compression took 8111099 microseconds
03-26 14:11:26.307 11503-11979 I/GraphBuilder: Compressed 124452864 bytes
03-26 14:11:26.307 11503-11979 I/GraphBuilder: Compression took 7868846 microseconds


Official ZLIB (Z_BEST_SPEED)

03-26 14:05:42.290 6812-7288 I/GraphBuilder: Compressed 124285632 bytes
03-26 14:05:42.290 6812-7288 I/GraphBuilder: Compression took 11004841 microseconds
03-26 14:05:42.478 6812-7287 I/GraphBuilder: Compressed 124452864 bytes
03-26 14:05:42.478 6812-7287 I/GraphBuilder: Compression took 11178578 microseconds


Zstd (compressionLevel=1)

03-26 14:47:52.850 4480-5054 I/GraphBuilder: Compressed 124285632 bytes
03-26 14:47:52.850 4480-5054 I/GraphBuilder: Compression took 29170605 microseconds
03-26 14:47:53.737 4480-5053 I/GraphBuilder: Compressed 124452864 bytes
03-26 14:47:53.737 4480-5053 I/GraphBuilder: Compression took 29943084 microseconds


LZ4 (acceleration=1):

03-26 14:42:01.062 1482-2389 I/GraphBuilder: Compressed 124452864 bytes
03-26 14:42:01.062 1482-2389 I/GraphBuilder: Compression took 16269208 microseconds
03-26 14:42:02.030 1482-2390 I/GraphBuilder: Compressed 124285632 bytes
03-26 14:42:02.030 1482-2390 I/GraphBuilder: Compression took 17026399 microseconds

Interesting!
These results give us figures.
I guess we need a bit more information on how they were generated,
notably: 32-bit mode (I guess) on a Nexus 9, but which compiler? Which flags?

The result that surprises me most is LZ4 being slower than reference zlib.
That shouldn't happen; I can't find a scenario that explains it.
ARM optimization could be a reason, but there are several other candidates,
compiler version and flags among them.

If you have time, try playing with the LZ4_FORCE_MEMORY_ACCESS definition.
Set it to 0, 1, and 2, and see if it makes a difference.

I'm not sure about the compiler flags. I'm compiling with "-DANDROID_TOOLCHAIN=gcc" and in CMakeLists.txt I have:

add_subdirectory(src/main/cpp/lz4/contrib/cmake_unofficial)
target_link_libraries(graph-builder ${log-lib} lz4_static)

So, I imagine it compiles with whatever flags cmake_unofficial/CMakeLists.txt configures.
Similarly, for Zstd, I have:

add_subdirectory(src/main/cpp/zstd/build/cmake)
target_link_libraries(graph-builder ${log-lib} libzstd_static)

The CMake log says this is the compiler:

arm-linux-androideabi-gcc (GCC) 4.9.x 20150123 (prerelease)

> The CMake log says this is the compiler

Good, this is very important information!

A first thing to ensure is that zlib and lz4 use the same compiler.
I'll take for granted that they do.

There is only so much I can do without locally reproducing your build configuration.
But here is one thing to try:

Try to add #define LZ4_FORCE_MEMORY_ACCESS 1 just before this line :
https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L71

Then do the same with 2 and 0.
Benchmark each version.

If there is a substantial performance difference, then we'll have something to work on.

_Edit :_
This is likely to be the case, see https://godbolt.org/g/BHpzhB .

Good news:

System ZLIB
03-30 15:02:42.150 3576-4866 I/GraphBuilder: Written 124452864 bytes
03-30 15:02:42.150 3576-4866 I/GraphBuilder: Compression took 3734 milliseconds

Official ZLIB
03-30 15:16:45.838 7898-8434 I/GraphBuilder: Written 124452864 bytes
03-30 15:16:45.838 7898-8434 I/GraphBuilder: Compression took 4039 milliseconds

LZ4_FORCE_MEMORY_ACCESS 0
03-30 15:22:35.447 12344-12968 I/GraphBuilder: Written 124452864 bytes
03-30 15:22:35.447 12344-12968 I/GraphBuilder: Compression took 7191 milliseconds

LZ4_FORCE_MEMORY_ACCESS 1
03-30 15:26:15.474 13756-16195 I/GraphBuilder: Written 124452864 bytes
03-30 15:26:15.474 13756-16195 I/GraphBuilder: Compression took 1192 milliseconds

LZ4_FORCE_MEMORY_ACCESS 2
03-30 15:31:07.228 20016-20420 I/GraphBuilder: Written 124452864 bytes
03-30 15:31:07.228 20016-20420 I/GraphBuilder: Compression took 945 milliseconds

So, can we now get zstd that compresses the same as zlib, but faster? :-)

The solution for zstd will be similar:
https://github.com/facebook/zstd/blob/dev/lib/common/mem.h#L89
You can try setting MEM_FORCE_MEMORY_ACCESS to 0, 1, or 2 and measure the difference.

For a sustainable solution though, I'll need to find a way to automate the best setting.
And right now that's unclear.

I can probably force 1 when the compiler is gcc, because this method is always as good as or better than 0.
Its downside is that it is not standard C; it's compiler-dependent, hence non-portable.
But it works fine with gcc, so it could become the "gcc default".

More difficult: I see that value 2 provides measurable speed benefits in your test, which was unexpected. The problem with 2 is that it is a dangerous setting. On the wrong targets, it can lead to segfaults; this depends on hardware, compiler, and even OS.
So 2 should only be applied on specific systems where it has been validated.
That requires knowing which set of macros can be detected.
For an example of the kind of macro needed, see https://github.com/facebook/zstd/blob/dev/lib/common/mem.h#L90

I guess we'll need your help to catch the right set of macros for your system.

Unfortunately, that doesn't bring it to well below zlib speeds yet. At compressionLevel=1:

MEM_FORCE_MEMORY_ACCESS 0
03-31 00:18:18.887 13304-13805 I/GraphBuilder: Written 124452864 bytes
03-31 00:18:18.887 13304-13805 I/GraphBuilder: Compression took 14081 milliseconds

MEM_FORCE_MEMORY_ACCESS 1
03-31 00:13:15.988 10208-10818 I/GraphBuilder: Written 124452864 bytes
03-31 00:13:15.988 10208-10818 I/GraphBuilder: Compression took 4234 milliseconds

MEM_FORCE_MEMORY_ACCESS 2
03-31 00:15:19.329 11710-12197 I/GraphBuilder: Written 124452864 bytes
03-31 00:15:19.329 11710-12197 I/GraphBuilder: Compression took 3601 milliseconds

At least, it confirms the tendency, and the performance improvement over default memory access is quite large.

I'm less surprised by zstd's speed, because its default configuration uses more memory than zlib, which is fine for PCs but a poorer fit for ARM devices. Cache memories tend to be smaller there, and main-memory latency can be dramatic.

zstd can make up for these situations by its wide configuration capabilities.
But it's not automatic (at least not yet), and default settings are still PC-centric.
It's also no longer a simple "performance optimization" game, but rather a different speed / compression trade-off space.

There is a tool in the tests directory called paramgrill,
which is meant to try different parameter configurations.
It's used to create the 22 compression levels of zstd.
Its basic usage is simple: ./paramgrill calibration_sample_filename,
but it can take quite some time to converge (typically several hours),
since it tries to select an optimal configuration for all 22 compression levels.
Alternatively, there are options to test a specific target speed, start from a manually chosen configuration, and so on. But they are badly documented. This tool is not yet meant to be used by third parties, and it's pretty raw...
