This is a follow-up of the hang report in https://github.com/official-stockfish/Stockfish/issues/2291 and the discussion in the pull request https://github.com/official-stockfish/Stockfish/pull/2307
I have pushed a branch here with debugging facilities for our threads, the searching flag, the condition variable states, that sort of things: https://github.com/snicolet/Stockfish/tree/hang3
Features:
0) The UCI thread is numbered 999999 instead of 0 in master. I changed that because we have two threads with idx 0 in master, and it made it a little bit confusing to see which was doing what in the log messages.
1) a DEBUG_HANG macro in misc.h to show/hide all the debug messages in the console
#define DEBUG_HANG 1 && sync_cout << "[DEBUG_HANG] " to show the messages
#define DEBUG_HANG 0 && sync_cout << "[DEBUG_HANG] "to hide the messages
2) a USE_CUSTOM_CONDITION_VARIABLE macro in thread_win32_osx.h to add the possibility to use a custom condition variable implementation.
a) if USE_CUSTOM_CONDITION_VARIABLE is 0, the code is the same as in current master.
b) if USE_CUSTOM_CONDITION_VARIABLE is 1, we use the version provided by @noobpwnftw (a copy of the buggy DLL implementation). But now this compiles on all systems with the same code path: in particular, I managed to simulate the repeatable and deterministic hang with 1 thread on my Mac, on the first position of bench.
3) notify_one() now has a new parameter with the thread index to help debugging
4) wait() has now two new parameters: thread index and sleep duration during the condition loop. This is used in the idle loop in line 179 of thread.cpp :
cv.wait(this->thread_index(), lk, [&]{ return searching; }, 1000);
Examples:
• to see the normal flow of the threads synchronization, compile with DEBUG_HANG 1, USE_CUSTOM_CONDITION_VARIABLE 0, and then type
./stockfish
bench
stop
quit
• to see the buggy flow and the deterministic hang, compile with DEBUG_HANG 1, USE_CUSTOM_CONDITION_VARIABLE 1, and then type
./stockfish
bench
stop
quit
Hope this helps
Example of output:
https://gist.github.com/snicolet/73c0364147412043ac9a86885a5be299
Well, just to note that their implementation is simply wrong. Correct ones must ensure a lock is held during the period of unlocking external lock, entering wait and other notify calls, that is probably not possible without an internal lock in the conditional variable implementation itself.
Problem this will cause is 1) notifies get lost(not the case with semaphores since they support releasing before there are waiters), 2) wrong threads consume notifies(if a thread has taken the external lock and there is an outstanding notify, this thread is guaranteed to receive it, in MINGW's case it is broken and caused the hangs, adding a delay after unlocking the external lock and before trying to consume the notify can make it reliably reproducible).
Example of output with the deterministic hang with one thread:
https://gist.github.com/snicolet/59686f8db4055be1a579d08a65417e4c
@noobpwnftw
My aim is to decorticate the hang to understand exactly the data race scenario, because I would like to rewrite our synchronization with something else than condition variables (atomic<bool> comes to mind, as in https://github.com/snicolet/Stockfish/commit/7da2b2b57b809f8d158c0a7f9a26d937faa2ae53 ). Any racing scenario leading to dead lock with the buggy condition variable implementation is a good test for the robustness of the alternative, IMHO. Many many thanks for your code, of course!!!!
Their recent master branch has reverted the changes that broke their implementation, however no releases have been made yet, these bugged ones will be out in the wild.
There are two ways to reduce the impact it caused but none would guarantee something else won't break given that their entire C++ runtime is also linked to it.
1) Fire extra notifies in case of a wrong thread consumed them.
2) Remove the usage of conditional variables or implement or own.
Best is just to advice users avoid using MINGW once and for all.
They seem to have poor versioning system as there is not an obvious way to tell what version their files are built upon, and no testing on whether their implementations are standard-compliant.
2) wrong threads consume notifies
In the example I have the hang is with only one thread?
@snicolet With bench, there are two consumers, UCI thread and MainThread.
In the same run you will not observe hangs with only go command, because then there will be only one consumer.
@snicolet I'm trying to work with you to figure out the problem with hanging I reported in #2305.
You said to compile with DEBUG_HANG 1, USE_CUSTOM_CONDITION_VARIABLE 1
Where do I specify these commands? I compile with:
make profile-build ARCH=x86-64-bmi2 COMP=gcc -j 36
And can I compile any version that hangs for me or only for this one: https://github.com/snicolet/Stockfish/tree/hang3
@hero2017
DEBUG_HANG macro in misc.h
USE_CUSTOM_CONDITION_VARIABLE macro in thread_win32_osx.h
I think making searching an atomic type is not desired because it makes the threads that are in idle state busy with locking/unlocking and yield is a system call.
Making extra notifications on the other hand is considerably cheaper, since it only turns into a loop when a thread consumes a notify where it shouldn't.
[mingw-w64] recent master branch has reverted the changes that broke their implementation[...]
You are talking about this commit?
https://sourceforge.net/p/mingw-w64/mingw-w64/ci/330025c54b85512d54b6960fad07498365c8fee3/
given that mingw fixed the bug in their implementation, should we just document somewhere which version of mingw is working, which is known buggy, and move on? I think it is not a good idea to try and reimplement something like a condition_variable, because on one system the implementation was buggy.
On the other hand, https://github.com/snicolet/Stockfish/commit/7da2b2b57b809f8d158c0a7f9a26d937faa2ae53 is quite a simplification, removing the mutex and the condition variable. It is much easier to reason about the code.
I agree that's easier to follow, and at first sight, it could just be tested as a simplification.
However, the condition_variable implies (following cppreference) that the thread will be suspended, while this_thread::yield() is an implementation-dependent scheduler hint. Are we sure that we won't be consuming resources of the opponent while we're waiting for our next 'go' command (on multiple implementations)? I would be happy if this were the case.
@vondele @noobpwnftw You are right, my atomic(bool) version is not good enough.
It is not a problem of implicit mutex around the atomic (atomic(bool) is not guaranted to be lock-free by the standard, but in practice it is lock-free on all modern CPUs, see std::atomic<bool>{}.is_lock_free()).
But the semi-buzy waiting loops are consuming a little bit more of CPU than I expected, because it is not really blocking threads but doing a semi-active waiting loop. I called thread::yield to be polite to other threads but I tested it to be significantly more than 0% CPU time even in idle state.
It would pass the simplification test easily against master, because it's cheating on the scheduler :-)
I am now convinced too that the best solution is to document the library bug somewhere in our wiki on the page "building Stockfish on Windows", to provide links to proper versions of the DLL and to move on. It's a pity for the clarity of the atomic(bool) code :-)
IMO the original version is also very clear a demonstration of how to use signaling instead of a polling loop. Pity is that some implementations broke the standard, and we are affected.
I'd suggest replacing custom mutex with standard mutex (I'm sure there is no measurable performance difference at all, in context of how it's currently used in SF. Yes, if you run that mutex lock million times in a loop, there will be a difference, but SF doesn't do that).
And replace std::condition_variable_any with std::condition_variable, which is much easier for libraries to get right.
@mooskagh, I agree with that (and made a branch https://github.com/official-stockfish/Stockfish/pull/2307#issuecomment-531642185 ) however, it was tested to be 2.5% slower (https://github.com/official-stockfish/Stockfish/issues/2291#issuecomment-531706773 ).
Personally, I would take the little performance hit. If not, we should at least augment our Mutex implementation with the needed deletion of operator=
@vondele I however fail to see why using standard mutex could cause a slowdown:
We do not use it anywhere in the hot path, at most it slows our sync_cout a bit and TB probing, provided that bench doesn't probe TBs, where does this 2.5% slowdown come from?
no idea, I was also not expecting any significant change. It is indeed not on a hot path. Maybe it requires one more round of testing.
Edit: note the bench was done with a default depth of 13.... that's very low, especially threaded, and maybe more sensitive to overhead. I think should be repeated with e.g. depth 20 or so.
@mooskagh I agree to use more fool-proof methods within STL, yet this time they messed up pthread_cond_wait which I believe would be used in both places. :)
@vondele @mooskagh new speedup test master vs noCustomMutex using bench 16 1 20
dynamic binaries w/ msys2 g++, older libwinpthread
$ bash bench-parallel.sh ./stockfish-dyn.exe ./stockfish-dyn-noCustomMutex.exe 10
run base test diff
1 1821485 1818860 -2625
2 1814875 1813199 -1676
3 1817987 1814254 -3733
4 1827135 1812951 -14184
5 1817800 1823239 +5439
6 1825939 1819672 -6267
7 1827261 1805849 -21412
8 1820172 1818923 -1249
9 1825814 1827891 +2077
10 1826065 1813757 -12308
base = 1822453 +/- 2822
test = 1816859 +/- 3836
diff = -5593 +/- 5046
speedup = -0.003069
same result using bench
$ bash bench-parallel.sh ./stockfish-dyn.exe ./stockfish-dyn-noCustomMutex.exe 50
run base test diff
1 1935737 1943663 +7926
2 1945433 1944548 -885
3 1949873 1948984 -889
4 1944548 1938372 -6176
5 1947207 1940133 -7074
6 1945433 1948984 +3551
7 1945433 1938372 -7061
8 1949873 1947207 -2666
9 1946320 1948095 +1775
10 1941896 1938372 -3524
11 1945433 1946320 +887
12 1948095 1942779 -5316
13 1940133 1946320 +6187
14 1942779 1948095 +5316
15 1951655 1948095 -3560
16 1949873 1948984 -889
17 1944548 1939252 -5296
18 1941896 1945433 +3537
19 1928746 1946320 +17574
20 1949873 1943663 -6210
21 1948095 1946320 -1775
22 1947207 1949873 +2666
23 1941896 1947207 +5311
24 1949873 1943663 -6210
25 1944548 1944548 +0
26 1945433 1944548 -885
27 1948095 1942779 -5316
28 1946320 1948984 +2664
29 1947207 1942779 -4428
30 1933985 1946320 +12335
31 1948984 1940133 -8851
32 1941896 1942779 +883
33 1943663 1937493 -6170
34 1943663 1948984 +5321
35 1948095 1943663 -4432
36 1933109 1930489 -2620
37 1944548 1947207 +2659
38 1944548 1951655 +7107
39 1945433 1941014 -4419
40 1945433 1948095 +2662
41 1948095 1943663 -4432
42 1941896 1944548 +2652
43 1944548 1949873 +5325
44 1941896 1945433 +3537
45 1950763 1944548 -6215
46 1946320 1924402 -21918
47 1943663 1945433 +1770
48 1948095 1941896 -6199
49 1946320 1946320 +0
50 1944548 1951655 +7107
base = 1944859 +/- 1250
test = 1944365 +/- 1381
diff = -493 +/- 1745
speedup = -0.000254
In the https://github.com/official-stockfish/Stockfish/issues/2291#issuecomment-531706773 the mean speed of the noCustomMutex binary was slow (1854418). I made that test downloading the zip from the branch, now I used git and rebased the noCustomMutex on master.
$ git log --oneline --graph -n 6
* 74c51ad6 (HEAD -> noCustomMutex) Remove custom mutex implementation. No functional change.
* a858defd (vdv-upstream/master, upstream/master, origin/master, origin/HEAD, master) Raise stack size to 8MB for pthreads
* 7b064752 Scale down endgame factor when shuffling
* 843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
* e5cfa14f Assorted trivial cleanups
* a83d1a0e Use queens of either color in RookOnQueenFile
$ git log --oneline --graph -n 6
* a858defd (HEAD -> master, vdv-upstream/master, upstream/master, origin/master, origin/HEAD) Raise stack size to 8MB for pthreads
* 7b064752 Scale down endgame factor when shuffling
* 843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
* e5cfa14f Assorted trivial cleanups
* a83d1a0e Use queens of either color in RookOnQueenFile
* 8a04b3a1 Update Makefile documentation
So the custom Mutex class can be removed entirely, that is good.
new speedup test w/ g++ 8.1.0 built by mingw-w64, static binaries
bench$ bash bench-parallel.sh ./stockfish.exe ./stockfish-noCustomMutex.exe 50
run base test diff
1 1956123 1959712 +3589
2 1958813 1968743 +9930
3 1948984 1964217 +15233
4 1914056 1920941 +6885
5 1954333 1969650 +15317
6 1947207 1960611 +13404
7 1944548 1964217 +19669
8 1958813 1959712 +899
9 1954333 1958813 +4480
10 1944548 1969650 +25102
11 1958813 1959712 +899
12 1958813 1963314 +4501
13 1964217 1962412 -1805
14 1957019 1963314 +6295
15 1961511 1961511 +0
16 1959712 1960611 +899
17 1943663 1964217 +20554
18 1942779 1970559 +27780
19 1959712 1960611 +899
20 1949873 1961511 +11638
21 1955227 1965120 +9893
22 1957019 1967836 +10817
23 1961511 1960611 -900
24 1958813 1961511 +2698
25 1952547 1957019 +4472
26 1948984 1955227 +6243
27 1960611 1961511 +900
28 1952547 1965120 +12573
29 1957916 1957916 +0
30 1937493 1969650 +32157
31 1942779 1948984 +6205
32 1952547 1966025 +13478
33 1957916 1954333 -3583
34 1951655 1966025 +14370
35 1958813 1955227 -3586
36 1958813 1966025 +7212
37 1959712 1969650 +9938
38 1955227 1963314 +8087
39 1953439 1970559 +17120
40 1945433 1962412 +16979
41 1960611 1968743 +8132
42 1959712 1971468 +11756
43 1960611 1968743 +8132
44 1938372 1954333 +15961
45 1955227 1960611 +5384
46 1950763 1956123 +5360
47 1948095 1966025 +17930
48 1948095 1967836 +19741
49 1955227 1967836 +12609
50 1954333 1963314 +8981
base = 1952958 +/- 2367
test = 1962262 +/- 2164
diff = 9304 +/- 2228
speedup = 0.004764
bench 16 1 20$ bash bench-parallel.sh ./stockfish.exe ./stockfish-noCustomMutex.exe 10
run base test diff
1 1847383 1851318 +3935
2 1847705 1852742 +5037
3 1848091 1856245 +8154
4 1849187 1846032 -3155
5 1849897 1854427 +4530
6 1852677 1852677 +0
7 1849897 1856830 +6933
8 1850995 1857286 +6291
9 1847190 1856700 +9510
10 1852418 1854816 +2398
base = 1849544 +/- 1244
test = 1853907 +/- 2126
diff = 4363 +/- 2367
speedup = 0.002359
@vondele @noobpwnftw @snicolet noCustomMutex not rebased is a 2.7% slowdown wrt update master, noCustomMutex binary has nearly the same slow speed showed in test https://github.com/official-stockfish/Stockfish/issues/2291#issuecomment-531706773
I fear that my speedup tests in https://github.com/official-stockfish/Stockfish/issues/2291 could be not valid: the branches downloaded as zip perhaps were not aligned with the master.
Let me know if you want repeat some speedup tests.
$ bash bench-parallel.sh ./stockfish-dyn.exe ./stockfish-dyn-noCustomMutex-norebased.exe 50
run base test diff
1 1948095 1893260 -54835
2 1940133 1879025 -61108
3 1937493 1877364 -60129
4 1947207 1877364 -69843
5 1944548 1883189 -61359
6 1932235 1890732 -41503
7 1935737 1895794 -39943
8 1948095 1882355 -65740
9 1939252 1894949 -44303
10 1947207 1879025 -68182
11 1945433 1880688 -64745
12 1945433 1876535 -68898
13 1934860 1889051 -45809
14 1950763 1884860 -65903
15 1910631 1883189 -27442
16 1945433 1888211 -57222
17 1939252 1894104 -45148
18 1924402 1881521 -42881
19 1939252 1891574 -47678
20 1926137 1855229 -70908
21 1941896 1894949 -46947
22 1934860 1882355 -52505
23 1928746 1894949 -33797
24 1935737 1893260 -42477
25 1948095 1881521 -66574
26 1945433 1880688 -64745
27 1939252 1892416 -46836
28 1945433 1894104 -51329
29 1939252 1887372 -51880
30 1939252 1894104 -45148
31 1945433 1887372 -58061
32 1938372 1894104 -44268
33 1948095 1894104 -53991
34 1929617 1888211 -41406
35 1933109 1886534 -46575
36 1949873 1895794 -54079
37 1945433 1893260 -52173
38 1940133 1887372 -52761
39 1948984 1889051 -59933
40 1948095 1892416 -55679
41 1942779 1892416 -50363
42 1941014 1891574 -49440
43 1944548 1895794 -48754
44 1948984 1889051 -59933
45 1947207 1888211 -58996
46 1933985 1886534 -47451
47 1937493 1883189 -54304
48 1928746 1878194 -50552
49 1948095 1892416 -55679
50 1941014 1893260 -47754
base = 1940411 +/- 2200
test = 1887451 +/- 2077
diff = -52959 +/- 2659
speedup = -0.027293
$ git log --oneline -n 6
78f680be (HEAD -> noCustomMutex, vdv-upstream/noCustomMutex) Remove custom mutex implementation. No functional change.
843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
e5cfa14f Assorted trivial cleanups
a83d1a0e Use queens of either color in RookOnQueenFile
8a04b3a1 Update Makefile documentation
db00e162 Add sse4 if bmi2 is enabled
@ppigazzini thanks for the repeated testing, yes, these results after rebasing make sense. I'll make a pull request with that patch.
@vondele nobody asked for tests on Linux, so I suppose that we are testing custom Windows code.
Anyway here the speedup tests (bench 16 1 20) for Linux and WSL:
$ bash bench-parallel.sh ./stockfish ./stockfish-noCustomMutex 10
run base test diff
1 1244629 1223008 -21621
2 1231417 1246473 +15056
3 1200954 1220446 +19492
4 1236184 1236011 -173
5 1222895 1212823 -10072
6 1242296 1223431 -18865
7 1229903 1222923 -6980
8 1204556 1228620 +24064
9 1263767 1247177 -16590
10 1262112 1248969 -13143
base = 1233871 +/- 13015
test = 1230988 +/- 7956
diff = -2883 +/- 10379
speedup = -0.002337
$ bash bench-parallel.sh ./stockfish ./stockfish-noCustomMutex 10
run base test diff
1 1870721 1869864 -857
2 1869402 1871250 +1848
3 1869930 1873233 +3303
4 1869468 1870193 +725
5 1868084 1874028 +5944
6 1868150 1872175 +4025
7 1872770 1870721 -2049
8 1865652 1866769 +1117
9 1870853 1873035 +2182
10 1870787 1870457 -330
base = 1869581 +/- 1212
test = 1871172 +/- 1301
diff = 1590 +/- 1491
speedup = 0.000851
@ppigazzini yes, this is code under _WIN32 only. Under the PR https://github.com/official-stockfish/Stockfish/pull/2313 there is the question if there is any effect when TB are enabled. Not sure if you can test that, but if you can that would be useful.
in your script you can use something like:
cat << EOF > inp
setoption name SyzygyPath value C:\tablebases\wdl345;C:\tablebases\wdl6;D:\tablebases\dtz345;D:\tablebases\dtz6
bench 128 4 20 default depth
ucinewgame
EOF
cat inp | ./stockfish
to run bench with more special options.
With the Here Document syntax you can pass the input directly to stockfish. This works w/ a script and w/ the command line:
./stockfish << EOF
setoption name SyzygyPath value C:\tablebases\wdl345;C:\tablebases\wdl6;D:\tablebases\dtz345;D:\tablebases\dtz6
bench 128 4 20 default depth
ucinewgame
EOF
@vondele I never used tablebases, I don't know if depth 20 is useful for your test
bench 16 1 20 default depth$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 10
run base test diff
1 1776866 1774840 -2026
2 1778657 1776448 -2209
3 1760029 1759737 -292
4 1758626 1755069 -3557
5 1792387 1789418 -2969
6 1785797 1786580 +783
7 1784954 1781830 -3124
8 1795244 1790811 -4433
9 1795914 1793298 -2616
10 1793298 1792084 -1214
base = 1782177 +/- 8503
test = 1780011 +/- 8383
diff = -2165 +/- 968
speedup = -0.001215
bench 128 4 20 default depth$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 10
run base test diff
1 4170820 4161026 -9794
2 4148547 4128555 -19992
3 4199758 4207346 +7588
4 4188851 4063993 -124858
5 4147759 3886493 -261266
6 4059020 4137195 +78175
7 4101758 4066646 -35112
8 4018434 3962576 -55858
9 3943653 4148282 +204629
10 4208268 4026774 -181494
base = 4118686 +/- 54076
test = 4078888 +/- 61135
diff = -39798 +/- 81034
speedup = -0.009663
script:
#!/bin/bash
_bench () {
$1 << EOF > /dev/null 2>> $2
setoption name SyzygyPath value C:\tablebases\wdl345; C:\tablebases\dtz345
bench 128 4 $depth default depth
quit
EOF
}
if [[ $# -ne 4 ]]; then
cat << EOF
usage: $0 base test depth n_runs
fast test:
$0 ./stockfish_base ./stockfish_test 13 10
slow test:
$0 ./stockfish_base ./stockfish_test 20 10
EOF
exit 1
fi
base=$1
test=$2
depth=$3
n_runs=$4
# temporary files initialization
cat /dev/null > base000.txt
cat /dev/null > test000.txt
cat /dev/null > tmp000.txt
# preload of CPU/cache/memory
($base bench >/dev/null 2>&1)&
($test bench >/dev/null 2>&1)&
wait
# bench loop: SMP bench with background subshells
for k in `seq 1 $n_runs`; do
printf "run %3d /%3d\n" $k $n_runs
# swap the execution order to avoid bias
if [ $((k%2)) -eq 0 ]; then
(_bench $base base000.txt)&
(_bench $test test000.txt)&
wait
else
(_bench $test test000.txt)&
(_bench $base base000.txt)&
wait
fi
done
# text processing to extract nps values
cat base000.txt | grep second | grep -Eo '[0-9]{1,}' > base001.txt
cat test000.txt | grep second | grep -Eo '[0-9]{1,}' > test001.txt
for k in `seq 1 $n_runs`; do
echo $k >> tmp000.txt
done
printf "\nrun\tbase\ttest\tdiff\n"
paste tmp000.txt base001.txt test001.txt | awk '{printf "%3d %d %d %+d\n", $1, $2, $3, $3-$2}'
paste base001.txt test001.txt | awk '{printf "%d\t%d\t%d\n", $1, $2, $2-$1}' > tmp000.txt
# compute: sample mean, 1.96 * std of sample mean (95% of samples), speedup
# std of sample mean = sqrt(NR/(NR-1)) * (std population) / sqrt(NR)
cat tmp000.txt | awk '{sum1 += $1 ; sumq1 += $1**2 ;sum2 += $2 ; sumq2 += $2**2 ;sum3 += $3 ; sumq3 += $3**2 } END {printf "\nbase = %10d +/- %d\ntest = %10d +/- %d\ndiff = %10d +/- %d\nspeedup = %.6f\n\n", sum1/NR , 1.96 * sqrt(sumq1/NR - (sum1/NR)**2)/sqrt(NR-1) , sum2/NR , 1.96 * sqrt(sumq2/NR - (sum2/NR)**2)/sqrt(NR-1) , sum3/NR , 1.96 * sqrt(sumq3/NR - (sum3/NR)**2)/sqrt(NR-1) , (sum2 - sum1)/sum1 }'
# remove temporary files
rm -f base000.txt test000.txt tmp000.txt base001.txt test001.txt
@ppigazzini did I understand correctly you ran the test with table bases? If so, I think this indicates that it is performance neutral also with TB. However:
setoption name SyzygyPath value C:\tablebases\wdl345; C:\tablebases\dtz345
The space should be removed after the ; it might not have found the dtz files.
(as a minor nit, I would also use the _bench() function for the 'preload' phase, so the accesses to the files have been done as well).
@vondele ok, I'll redo the test w/ your corrections.
@vondele
$ ./stockfish-dyn
Stockfish 200919 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
setoption name SyzygyPath value D:\tablebases\wdl345;D:\tablebases\dtz345
info string Found 145 tablebases
bench 16 1 20 default depth$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 10
run base test diff
1 1779758 1776668 -3090
2 1804813 1803345 -1468
3 1800294 1808002 +7708
4 1806774 1802367 -4407
5 1797617 1806099 +8482
6 1800476 1800355 -121
7 1803711 1801330 -2381
8 1804935 1805425 +490
9 1810463 1802917 -7546
10 1808248 1804935 -3313
base = 1801708 +/- 5352
test = 1801144 +/- 5518
diff = -564 +/- 3154
speedup = -0.000313
bench 128 4 20 default depth$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 10
run base test diff
1 3963877 3934106 -29771
2 4115473 4198858 +83385
3 4105453 4076136 -29317
4 4260282 4082531 -177751
5 4092370 4054341 -38029
6 4161935 4128838 -33097
7 4219421 4180818 -38603
8 4172937 4202060 +29123
9 4226200 4096386 -129814
10 4419807 4172611 -247196
base = 4173775 +/- 74988
test = 4112668 +/- 51324
diff = -61107 +/- 60514
speedup = -0.014641
script snippet code
#!/bin/bash
_bench () {
$1 << EOF > /dev/null 2>> $2
setoption name SyzygyPath value D:\tablebases\wdl345;D:\tablebases\dtz345
bench 128 4 $depth default depth
EOF
}
if [[ $# -ne 4 ]]; then
cat << EOF
usage: $0 base test depth n_runs
fast test:
$0 ./stockfish_base ./stockfish_test 13 10
slow test:
$0 ./stockfish_base ./stockfish_test 20 10
EOF
exit 1
fi
base=$1
test=$2
depth=$3
n_runs=$4
# preload of CPU/cache/memory
(_bench $base base000.txt)&
(_bench $test test000.txt)&
wait
# temporary files initialization
cat /dev/null > base000.txt
cat /dev/null > test000.txt
cat /dev/null > tmp000.txt
@vondele you are correct, the space count
$ ./stockfish-dyn
Stockfish 200919 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
setoption name SyzygyPath value D:\tablebases\dtz345; D:\tablebases\wdl345
info string Found 0 tablebases
$ ./stockfish-dyn
Stockfish 200919 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
setoption name SyzygyPath value D:\tablebases\dtz345;D:\tablebases\wdl345
info string Found 145 tablebases
Any chance you could run the parallel case for longer, or redo. Right now we have
diff = -61107 +/- 60514
which is still pretty uncertain.
@vondele I ran also the single thread bench because a multi thread one is always noisy.
Anyway I started a new 4 threads bench with 100 iterations.
@vondele
#!/bin/bash
_bench () {
$1 << EOF > /dev/null 2>> $2
setoption name SyzygyPath value D:\tablebases\wdl345;D:\tablebases\dtz345
bench 128 4 $depth default depth
EOF
}
$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 100
run base test diff
1 4028411 3988264 -40147
2 4156718 3988999 -167719
3 4050217 4158813 +108596
4 4075110 3978422 -96688
5 4176691 4126898 -49793
6 4178989 4166439 -12550
7 4364471 4085587 -278884
8 4264779 4172560 -92219
9 4190011 4155278 -34733
10 4094093 4267374 +173281
11 4172055 4104206 -67849
12 4090508 4303651 +213143
13 4107431 4267457 +160026
14 4135663 4156299 +20636
15 4084970 4120275 +35305
16 4120186 4157927 +37741
17 4076877 4185759 +108882
18 4079553 4123875 +44322
19 4145038 4218113 +73075
20 4248117 4401430 +153313
21 4347036 4188048 -158988
22 4203408 4123750 -79658
23 4137384 4104131 -33253
24 4146845 4202925 +56080
25 4131730 4267634 +135904
26 4120837 4135726 +14889
27 4153182 4174302 +21120
28 4245287 4202612 -42675
29 4192294 4190502 -1792
30 4192049 4160818 -31231
31 4213037 4374423 +161386
32 4208960 4302531 +93571
33 4212250 4175957 -36293
34 4171165 4210936 +39771
35 4269870 4113833 -156037
36 4198744 4205718 +6974
37 4132380 4153210 +20830
38 4159630 4213226 +53596
39 4212473 4297142 +84669
40 4132997 4295945 +162948
41 4146415 4195987 +49572
42 4105173 4155161 +49988
43 4176861 4260353 +83492
44 4296667 4147231 -149436
45 4173323 4332942 +159619
46 4387542 4219682 -167860
47 4249836 4405287 +155451
48 4431291 4171066 -260225
49 4154054 4129157 -24897
50 4152967 4273454 +120487
51 4262182 4164639 -97543
52 4009779 4198586 +188807
53 4381828 4136225 -245603
54 4119134 4147451 +28317
55 4195190 4128934 -66256
56 4207024 4292524 +85500
57 4278349 4101010 -177339
58 4288053 4175406 -112647
59 4230221 4155640 -74581
60 4182236 4233959 +51723
61 4230142 4192374 -37768
62 4076088 4098255 +22167
63 4147207 4094302 -52905
64 4193308 4159861 -33447
65 4159802 4115800 -44002
66 4229147 3973837 -255310
67 4104917 4145674 +40757
68 4277626 4159346 -118280
69 4178098 4369557 +191459
70 4215423 4082302 -133121
71 4124281 4191971 +67690
72 4205383 4237827 +32444
73 4085438 4197867 +112429
74 4340990 4129501 -211489
75 4123887 4258141 +134254
76 4114655 4134322 +19667
77 3998934 4048559 +49625
78 4006870 4142400 +135530
79 4149633 4064129 -85504
80 3987847 4083555 +95708
81 4079099 4019175 -59924
82 4069896 4075554 +5658
83 4046002 4006692 -39310
84 4145279 4120274 -25005
85 4208562 4065266 -143296
86 4058375 4099311 +40936
87 4241879 4136356 -105523
88 4110025 3983375 -126650
89 4260752 4015073 -245679
90 3977838 3978929 +1091
91 3968656 4016923 +48267
92 4117722 4197500 +79778
93 4125479 4150627 +25148
94 4013158 4225544 +212386
95 4036261 4076191 +39930
96 4203485 4119686 -83799
97 4019309 4061404 +42095
98 4107073 4309239 +202166
99 4148978 4182035 +33057
100 4222571 4165275 -57296
base = 4161596 +/- 18125
test = 4161296 +/- 18229
diff = -299 +/- 22476
speedup = -0.000072
Closing this issue now that the reasons for the hang are well understood (buggy external pthread library for GCC 9.1 and 9.2 compiler suite on MinGW64).
Most helpful comment
@vondele @noobpwnftw You are right, my atomic(bool) version is not good enough.
It is not a problem of implicit mutex around the atomic (atomic(bool) is not guaranted to be lock-free by the standard, but in practice it is lock-free on all modern CPUs, see
std::atomic<bool>{}.is_lock_free()).But the semi-buzy waiting loops are consuming a little bit more of CPU than I expected, because it is not really blocking threads but doing a semi-active waiting loop. I called thread::yield to be polite to other threads but I tested it to be significantly more than 0% CPU time even in idle state.
It would pass the simplification test easily against master, because it's cheating on the scheduler :-)