I thought it does and years ago I saw numa*.cpp source code included in SF but I don't see it anymore, and someone is telling me that SF doesn't utilize NUMA. I have a dual E5-2696v3 system and to me it looks like it supports it just fine as it uses all cores and node affinity groups even with HT enabled.
In cFish I still see numa*.cpp files so I'm starting to have doubts now.
If you use fewer than 64 logical cores on Windows, NUMA is not needed. I'm guessing you might have 72 logical cores?
Yes that is correct. 72 logical cores.
SF takes advantage of NUMA hardware but doesn't use all possible optimization tricks that e.g. cfish does when it comes to thread binding (binding threads to cores) and memory affinity (allocating memory 'near' the core that needs it).
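To make "thread binding" concrete, here is a minimal sketch of the idea, not code from Stockfish or Cfish. The helper names `partition_cpus` and `bind_current_process` are hypothetical, and splitting the CPU ids into equal contiguous chunks is only an assumption standing in for the real node layout, which should be queried from the OS. `os.sched_setaffinity` is only available on some platforms (e.g. Linux), so the sketch falls back gracefully elsewhere.

```python
import os


def partition_cpus(cpus, n_groups):
    """Split CPU ids into n_groups contiguous chunks.

    Assumption: each chunk stands in for one NUMA node. Real topology
    should be read from the OS instead of guessed like this.
    """
    cpus = sorted(cpus)
    size, extra = divmod(len(cpus), n_groups)
    chunks, start = [], 0
    for i in range(n_groups):
        end = start + size + (1 if i < extra else 0)
        chunks.append(cpus[start:end])
        start = end
    return chunks


def bind_current_process(cpus):
    """Restrict the caller to the given CPUs; return False if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. Windows or macOS: no such call in the os module
    os.sched_setaffinity(0, cpus)  # pid 0 means "the caller"
    return True


if __name__ == "__main__":
    # Hypothetical layout: 8 logical CPUs spread over 2 nodes.
    groups = partition_cpus(range(8), 2)
    print(groups)
```

A search worker would call something like `bind_current_process(groups[k])` at startup so that it keeps running on the cores of one node; memory affinity is the complementary step of allocating that worker's data on the same node.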
@vondele Rather than being "tricks", these are simply useful paradigms for effective use of modern multiprocessor machines. It would be nice if Stockfish incorporated some of these, especially since it sadly seems that Cfish is now defunct.
Agreed @LouisZulli. Going forward, more and more processors (and hence users) will be affected by the lack of full NUMA support. The question is whether the maintainers support this initiative. It will likely affect future TCEC events as well.
Agreed. And NUMA hardware's been out for years and there's a lot of users today using such machines. Hopefully SF can utilize every single benefit out of it soon.
I've been wanting to upgrade to something more serious, a 256 thread beast, but that'd be a waste if SF doesn't re-add Numa code. Without NUMA code the engine will run MUCH slower. Is there a thread somewhere why SF removed NUMA? And will SF ever bring it back?
Just look at the speed differences between hardware on IPMan's website when using NUMA:
http://ipmanchess.yolasite.com/amd---intel-chess-bench.php
Why SF would remove NUMA is beyond me.
I'm not aware of specific NUMA code being in SF (except some thread binding on Windows)... maybe long ago. I played a bit with adding thread binding recently and saw no serious effect. Also, running under numactl on the command line makes no difference for me:
./stockfish bench 1024 128 26 default depth
Nodes/second : 186631199
Nodes/second : 185860175
numactl --interleave=all ./stockfish bench 1024 128 26 default depth
Nodes/second : 186285390
Nodes/second : 181855866
numactl --localalloc ./stockfish bench 1024 128 26 default depth
Nodes/second : 186953767
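For anyone who wants to compare such runs without eyeballing them, a small sketch that parses the `Nodes/second` lines quoted above and reports each configuration relative to the plain run. The names `parse_nps`, `relative_to_baseline` and the labels in `runs` are made up for illustration; the figures are copied from the output above.

```python
import re
from statistics import mean

NPS_RE = re.compile(r"Nodes/second\s*:\s*(\d+)")


def parse_nps(text):
    """Extract every 'Nodes/second : N' value from bench output."""
    return [int(n) for n in NPS_RE.findall(text)]


def relative_to_baseline(runs, baseline_key):
    """Mean nodes/second of each run as a fraction above/below the baseline."""
    base = mean(parse_nps(runs[baseline_key]))
    return {name: mean(parse_nps(out)) / base - 1 for name, out in runs.items()}


# Bench output quoted in this thread (2x Epyc 7742, 128 threads).
runs = {
    "default": "Nodes/second : 186631199\nNodes/second : 185860175",
    "interleave": "Nodes/second : 186285390\nNodes/second : 181855866",
    "localalloc": "Nodes/second : 186953767",
}

if __name__ == "__main__":
    for name, delta in relative_to_baseline(runs, "default").items():
        print(f"{name:12s} {100 * delta:+.2f}%")
```

With only one or two runs per setting this says nothing about statistical significance, but every configuration lands within about 2% of the plain run.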
I'm thinking of buying 2xEpyc 7742 amd and from the SF benchmarks I saw on one site the performance was also only about 195,000,000 but when Patrick tested old asmFish from 2017, which had NUMA code, he got 280,000,000. Big difference. Just have a look at that Ipman link I provided and you can see the speeds for this cpu using old asmFish.
See this link: https://openbenchmarking.org/result/1910176-AS-2XAMDEPYC12 and search for stockfish. Right below you'll also see speed for asmFish. Both are very low when compared to the benchmark that Patrick@servetheHome got using asmFish from 2017. They are low because both versions already have NUMA code removed. Ipman fought long and hard to have SF add it and the speed improvement was huge so why it was removed I have no idea.
So I want to upgrade to this hardware but if I'm only going to get 190,000,000 instead of 280,000,000 then I'm wasting my money.
The above numbers are from 2xEpyc 7742, right now with hyperthreading disabled in bios, so only 128 threads. On the same system:
./asmFishL_2017-05-22_popcnt bench 1024 128 26 default depth
Nodes/second : 193904849
Nodes/second : 196844978
not so different (given that the bench positions have changed, etc.). Enabling hyperthreading would likely result in the ~280 with both master and asmfish...
Note that there have been some changes in master (like zeroing the hash before starting the clock), first-touch allocation of hash (and parallel zero), etc. that make things more comparable now than before.
Good morning Joost,
Can i include your asmFish bench 2xEpyc 7742 HT OFF 128cores in my list..and use vondele as name!
Kind regards,
Ipman.
sure
Thanks ,website updated!
Ipman.
@vondele Yes you're right. I totally missed the thread count listed on Ipman's site for this dual cpu. Although I'm surprised the difference is so huge between HT off and on (almost 100 million, wow!). This is close to 50% speed increase!! So my question now is, when have you the time, will SF use all the 256 threads (HT on) properly on both NUMA nodes? I'm guessing it will since the speed increase is so significant but it'd be nice to know for sure.
Also, I guess disabling HT would be a huge crippling. We're often told that HT enabled is bad for chess but how can you resist this speed boost, and would 186,000,000 still be better than 275,000,000 for SF?
@Ipmanchess So that's what the confusion was about. I guess Patrick gave you results with HT enabled and I always assumed that your list only has results with HT off. I shouldn't have ignored the thread count. Now I'll pay more attention, but I think you should keep it to a standard, since without looking up the CPU we may not know whether HT was used or not. In this case 128 threads means HT is off, and 256 threads means HT is on.
@hero2017 you are the first one ;)
When HT is off, the count is cores.
When HT is on, it's threads. You can compare the same systems, e.g. a bench with 32 cores and a bench with 64 threads. If you're not sure, simply google the CPU to find out how many cores/threads it has.
Then you know the difference. All benches in the list are done like that, and it's very clear!
So, with HT enabled, one gets the following (bench 1024 256 26 default depth):
master:
Nodes/second : 256807980
Nodes/second : 259440481
asmfish:
Nodes/second : 270659065
Nodes/second : 272901620
more or less close again.
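As a rough cross-check of the figures posted in this thread, the sketch below averages the two runs per configuration and computes the relative differences. The dictionary layout and the `speedup` helper are made up for illustration. The 128-thread numbers are the HT-off runs quoted earlier, and the asmFish build is from 2017 with different bench positions, so the engine-to-engine comparison is only approximate.

```python
from statistics import mean

# Nodes/second as posted in this thread (2x Epyc 7742).
nps = {
    ("master", 128): [186631199, 185860175],   # HT off
    ("asmfish", 128): [193904849, 196844978],  # HT off
    ("master", 256): [256807980, 259440481],   # HT on
    ("asmfish", 256): [270659065, 272901620],  # HT on
}


def speedup(new, old):
    """Relative gain of mean(new) over mean(old), e.g. 0.25 means +25%."""
    return mean(new) / mean(old) - 1


ht_gain_master = speedup(nps[("master", 256)], nps[("master", 128)])
ht_gain_asmfish = speedup(nps[("asmfish", 256)], nps[("asmfish", 128)])
asm_vs_master_ht_on = speedup(nps[("asmfish", 256)], nps[("master", 256)])
asm_vs_master_ht_off = speedup(nps[("asmfish", 128)], nps[("master", 128)])

if __name__ == "__main__":
    print(f"HT gain, master:          {100 * ht_gain_master:.1f}%")
    print(f"HT gain, asmfish:         {100 * ht_gain_asmfish:.1f}%")
    print(f"asmfish vs master, HT on:  {100 * asm_vs_master_ht_on:.1f}%")
    print(f"asmfish vs master, HT off: {100 * asm_vs_master_ht_off:.1f}%")
```

On these numbers, enabling HT gains roughly 39% for both engines, and asmFish is about 5% ahead of master in either configuration.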
Good morning Joost,
Maybe i better ask ,can i include all future benches i see from your systems in my list ;)
Kind regards,
Ipman.
Please ask first, and note that I don't particularly try to tweak things for good numbers... they're just out-of-the-box. This one is fine.
Thanks..and i will always ask first!
The first bench I get is usually the right one, because I don't know what people are doing: run a bench, close asmFish, re-open it and run a bench again, or leave asmFish open and re-run the bench (these I don't accept). Some people note very clearly what they have done, even tell me their BIOS settings, or find an improvement by tuning the system. The list is just for people to get an idea of how fast these new CPUs are, to help them decide which system to build or buy later on, and to give a nice comparison between Intel and AMD CPUs.
@vondele If I'm not wrong we can drop now the "libnuma-dev" requirement from the "Running the worker on Linux" wiki page.
@ppigazzini indeed, currently it should not be needed.
However, I didn't know it was listed as a requirement, having that library required would enable experimentation on fishtest more easily.
@vondele it's listed only in https://github.com/glinscott/fishtest/wiki/Running-the-worker-on-Linux#minimal-worker-setup . libnuma-dev is not installed in other pages/scripts for linux/windows/aws.
In "Running the worker ..." pages I would like to keep only the information for the CPU contributor, perhaps we should collect in a new wiki page the developer information (e.g. libnuma-dev, configure several gcc on linux etc.)
So it can be removed safely as a requirement, IMO ...
@vondele Could you please run it once more with HT on but with the latest Stockfish dev instead of asmFish? Nobody really uses asmFish anymore. I'm afraid that if I go with this hardware and use SF, it'll be more like 190,000,000 instead of 270,000,000. And if that's true, it means NUMA support in SF still needs work.
Hopefully you can find a moment of your time to run a quick one. It'd really help.
I just noticed you have 'master' results - does master mean SF?
tracked further in #2619