Describe the bug
Parameter -T0 detects the number of physical cores, but I think it should detect the number of logical cores instead, to take advantage of hyper-threading.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
zstd should use all logical cores to speed up compression.
Screenshots and charts
As you can see from the data below, parameter -T0 behaves the same as -T2, which is the number of physical cores, as expected from the source code. Only at the default compression level 3 is using the number of physical cores faster than using the number of logical cores; at every other compression level the situation changes, and the higher the compression level, the more visible the speed-up.
Compression level 2:
$ time zstd -2 -f -T0 silesia.tar -o silesia.tar.zst
silesia.tar : 32.88% (211957760 => 69684920 bytes, silesia.tar.zst)
real 0m0.761s
user 0m1.444s
sys 0m0.134s
$ time zstd -2 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar : 32.88% (211957760 => 69684920 bytes, silesia.tar.zst)
real 0m0.754s
user 0m1.480s
sys 0m0.113s
$ time zstd -2 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar : 32.88% (211957760 => 69684920 bytes, silesia.tar.zst)
real 0m0.732s
user 0m2.393s
sys 0m0.148s
Compression level 3:
$ time zstd -3 -f -T0 silesia.tar -o silesia.tar.zst
silesia.tar : 31.47% (211957760 => 66713509 bytes, silesia.tar.zst)
real 0m1.273s
user 0m2.437s
sys 0m0.110s
$ time zstd -3 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar : 31.47% (211957760 => 66713509 bytes, silesia.tar.zst)
real 0m1.282s
user 0m2.423s
sys 0m0.159s
$ time zstd -3 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar : 31.47% (211957760 => 66713509 bytes, silesia.tar.zst)
real 0m1.389s
user 0m4.698s
sys 0m0.156s
Compression level 4:
$ time zstd -4 -f -T0 silesia.tar -o silesia.tar.zst
silesia.tar : 30.94% (211957760 => 65572880 bytes, silesia.tar.zst)
real 0m1.882s
user 0m3.659s
sys 0m0.139s
$ time zstd -4 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar : 30.94% (211957760 => 65572880 bytes, silesia.tar.zst)
real 0m1.906s
user 0m3.638s
sys 0m0.168s
$ time zstd -4 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar : 30.94% (211957760 => 65572880 bytes, silesia.tar.zst)
real 0m1.810s
user 0m6.132s
sys 0m0.162s
Compression level 10:
$ time zstd -10 -f -T0 silesia.tar -o silesia.tar.zst
silesia.tar : 28.11% (211957760 => 59582278 bytes, silesia.tar.zst)
real 0m10.516s
user 0m19.922s
sys 0m0.179s
$ time zstd -10 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar : 28.11% (211957760 => 59582278 bytes, silesia.tar.zst)
real 0m10.571s
user 0m20.508s
sys 0m0.186s
$ time zstd -10 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar : 28.11% (211957760 => 59582278 bytes, silesia.tar.zst)
real 0m8.313s
user 0m28.980s
sys 0m0.205s
Hi @mako2580,
Thanks for posting the issue.
I managed to reproduce your results on my machine
...
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
...
perf stat -r 10 zstd -10 -T0 silesia.tar -o /dev/null
1.79603 +- 0.00911 seconds time elapsed ( +- 0.51% )
perf stat -r 10 zstd -10 -T28 silesia.tar -o /dev/null
1.7640 +- 0.0132 seconds time elapsed ( +- 0.75% )
# as expected, the same results as -T0
perf stat -r 10 zstd -10 -T56 silesia.tar -o /dev/null
1.6867 +- 0.0220 seconds time elapsed ( +- 1.31% )
# a little better than when using physical cores
Unfortunately, the same improvement isn't there for higher levels that use more CPU.
perf stat -r 5 zstd -19 -T0 silesia.tar -o /dev/null
22.875 +- 0.592 seconds time elapsed ( +- 2.59% )
perf stat -r 5 zstd -19 -T56 silesia.tar -o /dev/null
22.463 +- 0.577 seconds time elapsed ( +- 2.57% )
# effectively the same
I'm not an expert on this topic but my understanding is that hyper-threading will only help when the running program doesn't fully saturate your cores (so when it's bound by something other than CPU). This should be the case when compressing large files (which should be I/O bound). Hence the speed improvement for level 10 but not for level 19 isn't terribly surprising to me.
I wasn't around when the decision to use the number of physical cores was made, so maybe @Cyan4973, @terrelln or @felixhandte can provide some context here? Trying different options (not just -T) for specific host machines/data is something zstd users should do to extract the most utility out of the library. As for selecting a good default, I suppose if we can show that -T(#logical_cores) is never worse than -T(#physical_cores) but sometimes quite a bit better, that would be a compelling enough reason to switch?
Retested for compression level 19, and I'm getting the expected results: -T#logical_cores is about 15% faster than -T#physical_cores. What type of storage are you using for the test? I was running the test on a ramdisk to avoid being I/O-bound as much as possible; that might be the reason why you are getting almost the same results even at a high compression level.
$ time zstd -19 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar : 25.12% (211957760 => 53251489 bytes, silesia.tar.zst)
real 1m13.561s
user 2m23.948s
sys 0m0.266s
$ time zstd -19 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar : 25.12% (211957760 => 53251489 bytes, silesia.tar.zst)
real 1m3.259s
user 3m17.664s
sys 0m0.517s
This is a difficult topic.
Basically, the idea that one can use hyper-threading to improve performance has some truth, but it depends on multiple factors, and it's less true for zstd than for most applications.
The main idea is that modern OoO (out-of-order) cores have multiple execution units in each core (and multiple cores per CPU; don't confuse those two levels). These execution units are able to issue instructions in parallel, as long as they are unrelated. At its peak, an Intel CPU can issue up to 4 micro-instructions per cycle. Of course, no one reaches this theoretical limit, and any code featuring > 3 micro-instructions per cycle is considered very well optimized.
In practice, most code is not that optimized, and it's common for many applications to linger in ~0.5 instructions-per-clock territory. That's because the primary performance limitation nowadays is the memory wall. Fetching some random data from memory costs hundreds of cycles, and the core just stalls, starved of data, before being able to proceed with further calculation.
Modern CPUs can re-order instructions to take advantage of memory fetch stalls, but there is a limit, both to CPU and program capabilities. At some point, the program cannot proceed forward, because there is absolutely nothing else to do except wait for the one critical piece of data that every future action depends on.
This is where hyper-threading is helpful, because it forces the core to process 2 threads in parallel, so when one thread is stalled, the other one can make use of the idle execution units. This improves the occupancy rate.
The cost for this capability is not zero: higher processing requirements for the CPU, on top of sharing essential core resources, such as the L1 cache, registers, or the micro-op cache. So it had better be useful.
Now let's talk about zstd.
Remember the number of micro-instructions per cycle? zstd is highly optimized. Critical components of zstd, such as FSE, Huff0, or xxHash, are able to extract more than 3 micro-instructions per cycle. They do so by organizing data dependencies so that multiple instructions can be processed in parallel without depending on each other. This happens within a single thread. That means there is basically no room left for a second thread to tap into "idle" execution units.
That being said, this is actually an "ideal" situation. In practice, one can't completely avoid fetching data from outside the cache as the compression level becomes higher, because zstd uses larger tables and has more history to look into. This necessarily brings it out of cache, and therefore stalls waiting for memory happen.
So, as the average number of micro-instructions per cycle becomes lower, there is more room for other threads to tap into unused resources. So at higher compression levels, I'm not surprised to see that hyper-threading is actually beneficial.
Now, that would be one reason to use logical cores. But there are other considerations to factor in.
Primarily, the "system" within which zstd works also has some other tasks to perform, even if only for keeping the "system" alive (bookkeeping, logging, interrupts, etc.). Therefore, a "default" setting bringing the whole system to a halt looks dangerous.
Also, more threads means more memory. At level 19, the cost is approximately ~100 MB per thread, so doubling the number of threads doubles the memory budget. While some systems may be able to cope with that, others may find it's just too much and trigger an OOM failure, or worse, the OOM killer, with random impact on the system.
Therefore, due to these additional constraints, it felt "safer", or more conservative, to select a default which preserves system resources for other tasks, especially system-critical ones.
Now, it's not that using more threads is forbidden: if someone _wants_ more threads, they can request them. It's more a question of "what's a good default", and I'm sure this can be debated at length.
In the future, if this is an important topic, we can probably pour more resources into it, to provide a more complete and satisfactory answer, with maybe more performance and more control. But there is only so much resource we can spend at any moment in time, and these days we have settled for a "good enough" middle ground by selecting this "number of physical cores" policy. Maybe something that can still be improved upon in the future.
Thanks for the detailed answer. You agree that using the number of logical cores can bring a performance improvement. Not a big one, but it's something. The only problem you see is that there would be no room left for the OS (background threads and free memory). That's a good reason, but I think it is not as critical as it might seem.
First, free memory: you will still run out of memory if you have a CPU with many physical cores and not enough RAM. You are not solving the RAM-consumption problem, you are just moving the line further. And if you do run out of memory, then the next time you run zstd you just specify -T#number_of_threads_you_have_ram_for; but if you have enough RAM, you get maximum performance even with the default.
And second, OS background tasks: I think that might only happen if you set a high process priority, and I think that's not the default case. Even if I wanted to be highly paranoid, I would rather set some constant that reserves unused threads for background tasks; for example, use (#logical_threads - 1) for -T0. Because on systems with a higher number of cores, you leave half of the cores unused, which is too much to reserve for background tasks IMHO.
Primarily, the "system" within which zstd works also has some other tasks to perform, even if only for keeping the "system" alive (bookkeeping, logging, interrupts, etc.). Therefore, a "default" setting bringing the whole system to a halt looks dangerous.
But that's the job of the CPU scheduler. My linux system does not become unresponsive or sluggish when zstd is using all logical processors; instead, the overall CPU consumption of zstd is lowered when other programs also need to consume CPU.
Obviously, I don't mean to say that all systems will handle 100% CPU consumption gracefully. What I mean is that if the user knows it works well for them, there's nothing bad in allowing them to enable this behaviour systematically.
In my case, I was trying to tweak makepkg's compression algorithm. Since /etc/makepkg.conf works like a bash script, I can set COMPRESSZST=(zstd -c -T$(nproc) -18 -) and avoid hardcoding the number of logical cores of my current system. But not all use cases may have the chance to automate this easily.
What you say makes a lot of sense, and I do not believe the default behaviour of -T0 should be changed. So I'd suggest adding a separate option (or a new suffix for -T) to take the count of logical processors instead of physical cores.
I think that adding a separate option is acceptable, but in that case it should be stated more explicitly in --help which option uses logical cores and which uses physical ones. Nevertheless, I think the default behavior users want is to use logical cores, to maximize performance. And I think a lot of software out there has already hardcoded -T0, so changing the behavior of -T0 is the better idea, since it would improve the performance of third-party software as well. Just my opinion.
I'm not comfortable with changing the default for the time being.
In doubt, I prefer to lean towards more conservative (i.e. use less cpu usage).
Note that anyone can take advantage of hyper-threading, but it must be requested explicitly, it's just not the "default".
Yann, I support your decision, but can we have an argument that _explicitly_ requests the count of logical cores, without the caller having to know the _exact_ number of cores?
As I said before, currently in Bash I can do zstd -c -T$(nproc), but I'm sure doing this in e.g. cmd.exe is not so straightforward.
Which command name would make sense? Since it can't be -T#, it would have to use a different convention.
Can't -T# be used with a letter instead of a number? For example: -TL or -Tl.
Not with the current code. At #, it expects an ASCII number.
I guess there are two ways to do this:
- --logical-cores: _when combined with -T0, the number of logical CPU cores is used instead of physical CPU cores._
- --auto-threads={physical,logical} (default: physical): _determines whether the number of working threads with -T0 is the number of physical CPU cores or logical CPU cores._ (This interface is more consistent and easier to extend.)