Stockfish: Does SF take advantage of NUMA hardware?

Created on 9 Dec 2019 · 25 Comments · Source: official-stockfish/Stockfish

I thought it does and years ago I saw numa*.cpp source code included in SF but I don't see it anymore, and someone is telling me that SF doesn't utilize NUMA. I have a dual E5-2696v3 system and to me it looks like it supports it just fine as it uses all cores and node affinity groups even with HT enabled.

In cFish I still see numa*.cpp files so I'm starting to have doubts now.

Windows

hero2017

All 25 comments

If you use fewer than 64 logical cores on Windows, NUMA is not needed. I'm guessing you might have 72 logical cores?

MichaelB7 on 10 Dec 2019

Yes that is correct. 72 logical cores.

hero2017 on 10 Dec 2019

SF takes advantage of NUMA hardware but doesn't use all possible optimization tricks that e.g. cfish does when it comes to thread binding (binding threads to cores) and memory affinity (allocating memory 'near' the core that needs it).

vondele on 10 Dec 2019

@vondele Rather than being "tricks", these are simply useful paradigms for effective use of modern multiprocessor machines. It would be nice if Stockfish incorporated some of these, especially since it sadly seems that Cfish is now defunct.

LouisZulli on 10 Dec 2019

👍1

Agree @LouisZulli, going forward more and more processors (hence users) will be impacted by the lack of full NUMA support. The question is whether the maintainers support this initiative. Plus, future TCEC events will likely be impacted as well.

MichaelB7 on 10 Dec 2019

Agreed. And NUMA hardware's been out for years and there's a lot of users today using such machines. Hopefully SF can utilize every single benefit out of it soon.

hero2017 on 15 Dec 2019

I've been wanting to upgrade to something more serious, a 256-thread beast, but that'd be a waste if SF doesn't re-add NUMA code. Without NUMA code the engine will run MUCH slower. Is there a thread somewhere explaining why SF removed NUMA? And will SF ever bring it back?

Just look at the speed differences between hardware on IPMan's website when using NUMA:

http://ipmanchess.yolasite.com/amd---intel-chess-bench.php

Why SF would remove NUMA is beyond me.

hero2017 on 5 Jan 2020

I'm not aware of specific NUMA code being in SF (except some thread binding on Windows)... maybe long ago. I played a bit with adding thread binding recently and saw no serious effect. Also, using numactl on the command line makes no difference for me:

./stockfish bench 1024 128 26 default depth
Nodes/second    : 186631199
Nodes/second    : 185860175

numactl --interleave=all  ./stockfish bench 1024 128 26 default depth
Nodes/second    : 186285390
Nodes/second    : 181855866

numactl --localalloc  ./stockfish bench 1024 128 26 default depth
Nodes/second    : 186953767

vondele on 5 Jan 2020

I'm thinking of buying 2xEpyc 7742 amd and from the SF benchmarks I saw on one site the performance was also only about 195,000,000 but when Patrick tested old asmFish from 2017, which had NUMA code, he got 280,000,000. Big difference. Just have a look at that Ipman link I provided and you can see the speeds for this cpu using old asmFish.

See this link: https://openbenchmarking.org/result/1910176-AS-2XAMDEPYC12 and search for stockfish. Right below you'll also see speed for asmFish. Both are very low when compared to the benchmark that Patrick@servetheHome got using asmFish from 2017. They are low because both versions already have NUMA code removed. Ipman fought long and hard to have SF add it and the speed improvement was huge so why it was removed I have no idea.

So I want to upgrade to this hardware but if I'm only going to get 190,000,000 instead of 280,000,000 then I'm wasting my money.

hero2017 on 6 Jan 2020

The above numbers are from 2xEpyc 7742, right now with hyperthreading disabled in bios, so only 128 threads. On the same system:

./asmFishL_2017-05-22_popcnt bench 1024 128 26 default depth
Nodes/second    : 193904849
Nodes/second    : 196844978

not so different (given that the bench positions have changed, etc.). Enabling hyperthreading would likely result in the ~280 with both master and asmfish...

Note that there have been some changes in master (like zeroing the hash before starting the clock), first-touch allocation of hash (and parallel zero), etc. that make things more comparable now than before.

vondele on 6 Jan 2020
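
The "first-touch allocation of hash (and parallel zero)" idea can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in Stockfish master: `parallel_zero` is a hypothetical name, and the NUMA placement relies on the operating system's default policy of putting a page on the node of the thread that first writes to it.

```cpp
// Illustrative sketch of "first-touch" initialisation; not Stockfish's code.
// parallel_zero() is a hypothetical name.
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Zero `bytes` bytes at `mem` using `threads` workers, each clearing one
// contiguous slice. With a local-allocation policy, each page normally ends
// up on the NUMA node of the worker that first writes to it.
void parallel_zero(void* mem, std::size_t bytes, unsigned threads) {
    if (threads == 0) threads = 1;
    char* base = static_cast<char*>(mem);
    const std::size_t stride = bytes / threads;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threads; ++i) {
        const std::size_t start = stride * i;
        // The last worker also clears the remainder.
        const std::size_t len = (i + 1 == threads) ? bytes - start : stride;
        workers.emplace_back([=] { std::memset(base + start, 0, len); });
    }
    for (auto& w : workers) w.join();
}
```

Clearing the table before the clock starts also keeps the page-faulting cost out of the measured nodes/second, which is one reason older and newer bench numbers are not directly comparable.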

Good morning Joost,
Can I include your asmFish bench (2xEpyc 7742, HT off, 128 cores) in my list, and use vondele as the name?

Kind regards,
Ipman.

Ipmanchess on 6 Jan 2020

sure

vondele on 6 Jan 2020

Thanks, website updated!

Ipman.

Ipmanchess on 6 Jan 2020

@vondele Yes, you're right. I totally missed the thread count listed on Ipman's site for this dual CPU. Although I'm surprised the difference is so huge between HT off and on (almost 100 million, wow!). This is close to a 50% speed increase!! So my question now is, when you have the time, will SF use all 256 threads (HT on) properly on both NUMA nodes? I'm guessing it will, since the speed increase is so significant, but it'd be nice to know for sure.

Also, I guess disabling HT would be a huge crippling. We're often told that HT enabled is bad for chess but how can you resist this speed boost, and would 186,000,000 still be better than 275,000,000 for SF?

@Ipmanchess So that's what the confusion was about. I guess Patrick gave you results with HT enabled and I always assumed that your list only has results with HT off. I shouldn't have ignored the thread count. Now I'll pay more attention, but I think you should keep it to a standard, since without looking up the CPU we may not know whether HT was used. In this case 128 threads means HT is off, and 256 threads means HT is on.

hero2017 on 6 Jan 2020

@hero2017 you are the first one ;)
When HT is off, the list gives cores; when HT is on, it gives threads. That lets you compare the same system, e.g. a bench with 32 cores against a bench with 64 threads. When in doubt, simply google the CPU to see how many cores/threads it has.
Then you know the difference. All benches in the list are done like that, and it's very clear!

Ipmanchess on 6 Jan 2020

So, with HT enabled, one gets the following (bench 1024 256 26 default depth):

master:
Nodes/second : 256807980
Nodes/second : 259440481

asmfish:
Nodes/second : 270659065
Nodes/second : 272901620

more or less close again.

vondele on 7 Jan 2020

👍1

Good morning Joost,
Maybe I'd better ask: can I include all future benches I see from your systems in my list ;)

Kind regards,
Ipman.

Ipmanchess on 8 Jan 2020

Please ask first, and note that I don't particularly try to tweak things for good numbers... they're just out-of-the-box. This one is fine.

vondele on 8 Jan 2020

Thanks, and I will always ask first!
The first bench I get is most of the time the right one, because I don't know what people are doing: run a bench, close asmFish, re-open and run a bench, or leave asmFish open and re-run the bench (these I don't accept). Some people document very well what they have done, even telling me their BIOS settings or an improvement they found in tuning the system. The list is just for people to get an idea of how fast these new CPUs are, to help them decide which system to build or buy later on, and to give a nice comparison between Intel and AMD CPUs.

Ipmanchess on 8 Jan 2020

@vondele If I'm not wrong we can drop now the "libnuma-dev" requirement from the "Running the worker on Linux" wiki page.

ppigazzini on 11 Jan 2020

@ppigazzini indeed, currently it should not be needed.

However, I didn't know it was listed as a requirement, having that library required would enable experimentation on fishtest more easily.

vondele on 11 Jan 2020

@vondele it's listed only in https://github.com/glinscott/fishtest/wiki/Running-the-worker-on-Linux#minimal-worker-setup . libnuma-dev is not installed in other pages/scripts for linux/windows/aws.
In "Running the worker ..." pages I would like to keep only the information for the CPU contributor, perhaps we should collect in a new wiki page the developer information (e.g. libnuma-dev, configure several gcc on linux etc.)

ppigazzini on 11 Jan 2020

So it can be removed safely as a requirement, IMO ...

vondele on 11 Jan 2020

👍1

@vondele Could you please run it once more with HT on but with the latest Stockfish dev instead of asmFish? Nobody really uses asmFish anymore. I'm afraid that if I go with this hardware and use SF, it'll be more like 190,000,000 instead of 270,000,000. And if that's true, it means NUMA support in SF still needs work.

Hopefully you can find a moment of your time to run a quick one. It'd really help.

I just noticed you have 'master' results - does master mean SF?

hero2017 on 25 Jan 2020

tracked further in #2619

vondele on 30 Apr 2020
