The following two inputs yield different search, as part of the state is in position depends on NNUE but is not changed on option update.
wrong:
position fen rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 1
setoption name Use NNUE value true
go depth 10
ucinewgame
good:
setoption name Use NNUE value true
position fen rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 1
go depth 10
ucinewgame
a fix needs to consider added moves to the position command (i.e. we can't just recreate a Position from the FEN).
needs fixing as part of https://github.com/official-stockfish/Stockfish/issues/2823
I think it should be quite easy to remove nnue from position when we merge your template (or global variable but I like template better) solution. Function Eval::evaluate() has only position as an argument so to know which eval should be run, the bool useNnue has been used as a hack in order to not touch too much of existing code.
might help, but I'm not sure it is the full solution, since the evalList variable might need to be maintained / rewritten ? The global variable kind of works for search (go depth...), since the rootpos is being created in ThreadPool::start_thinking and the evalLists are updated, but for example eval is not so easy (a bit similar to https://github.com/official-stockfish/Stockfish/issues/2579)
For the template approach, I'm actually really surprised this doesn't remove the overhead, as it should eliminate all conditionals. Do you have an idea where this overhead might come from?
Yes, it seems that putting it in pos was actually the most straightforward solution, even if not pretty.
I have no clue about slowdown, I see less than 1% slowdown on my machines (i7-7700 and i7-8700, bmi2 compile, compiled with gcc 9.3.0 on Windows). Base compiled from master, test from nnue-player-wip. 0.6% slowdown would give maybe 1.0-1.5 Elo regression. Maybe it's worse on some other architecture/OS combinations.
make profile-build ARCH=x86-64-bmi2
Results for 20 tests for each version:
Base Test Diff
Mean 2517667 2501356 16311
StDev 10825 9397 14634
p-value: 0,133
speedup: -0,006
@vondele I can see that those measurements you have run were dominated by @noobpwnftw machines and he reported some slowdowns here. So perhaps he sees ~2.5% slowdown which would explain 5 Elo regression.
The test diff has some base changes that is irrelevant to NNUE. The slowdown I measured is about around 1%+-0.5.
This kind of slowdown is fully acceptable imo. But it should only give 2-3 Elo regression. Current 5 Elo at STC will become probably around 3 Elo at LTC so not that bad although it would be nice to cut it by Elo or two. We have people much better versed than me in C++ tricks so I hope they will find something. I may try to look at it but probably not before Friday.
1% would be definitely acceptable (~ 2 Elo). I guess I have to figure out what irrelevant changes there would be...the github diff associated with the test indeed seems to indicate other changes wrt to master, but the bench suggests they are not there..
I've started a new test where the diff looks clean: https://tests.stockfishchess.org/tests/view/5f213305c09435d870cba002 The other tests have the right bench, so I think the displayed diff is somehow an artifact.
The result is a little puzzling. Indeed local testing shows only 1.2 % slowdown using a 'build', and surprisingly a speedup using 'profile-build'. Certainly nothing that could explain a 5Elo change. Possibly this is very machine / compiler dependent.
might you have an idea where this overhead might come from?
In my engine I noticed that even changing the order of elements in the position object (or move stack object or search thread object) has influence on the speed. Probaby by getting better or worse use of cache lines. Maybe the NNUE code does something in this area? Just an idea...
Yeah, I recently only multiplied eval by 2, and somehow got speedup by 2.8%, this is very confusing area.
One thing that might influence the performance is StateInfo struct that got extended by 2 fields. Those are redundant if NNUE is not used. So we would basically need to keep original StateInfo, then have a derived one for NNUE, like StateInfoNNUE : StateInfo and create correct one depending on Use NNUE value. I cannot however even try and test it before Friday.
The result is a little puzzling. Indeed local testing shows only 1.2 % slowdown using a 'build', and surprisingly a speedup using 'profile-build'. Certainly nothing that could explain a 5Elo change. Possibly this is very machine / compiler dependent.
usually for normal sf gcc inlines the evaluation function to a huge monolithical search. I doubt that with eval type being dispatched at runtime the code increase will allow it to inline still (but probably does inline one path with pgo).
I also wonder why having compile time dispatch between old and nnue was not ported from nodchip's implementation? The current way introduces coupling in a sense that each stockfish executable should be accompanied by the net or otherwise not all features will work.
I also wonder why having compile time dispatch between old and nnue was not ported from nodchip's implementation? The current way introduces coupling in a sense that each stockfish executable should be accompanied by the net or otherwise not all features will work.
This is exactly the same as with syzygy. You don't have the files, you don't use the function.
I started a test to see if StateInfo change could be the reason for the 5 Elo regression. I removed NNUE fields from StateInfo and commented out some code just to make it compile. Test here: https://tests.stockfishchess.org/tests/view/5f21b9de2f7e63962b99f41a.
The test looks much better. Adding some padding bytes to make sizeof(StateInfo) % 64 == 0 may help?
So it seems it was StateInfo indeed. Mainly accumulator is quite big as it contains first layer parameters, i.e. 2x256 int16 numbers currently. So with bigger nets it would get even bigger. Original StatInfo size is 176 bytes, with NNUE it's 1312 bytes now. It will be a bit tricky to fix it, I don't know yet what would be the solution.
TEMPLATIZE ALL THE THINGS!!!
Interesting test results with StateInfo... hope we find a solution for that, would be nice.
I found this in do_null_move:
std::memcpy(&newSt, st, sizeof(StateInfo));
In do_move you only copy up to the key member. I don't know if you need to copy more in the null_move but you should try to avoid copy the NNUE parts?
Some simple improvements found and tested, see PR #2873.
@dorzechowski I've merged those changes.
I'll leave the issue open for the original problem. What's your opinion on using the global flag or the state in Position... any preference? I think this issue can be solved easily either way (the eval command needs a few lines of code, but nothing complicated).
The variant with the global flag looks a bit more pleasant to me. It looks nicer to have if (Eval::useNNUE) instead of if (pos.use_nnue()) and also it simplifies the code a little bit. OTOH the template variant changes a lot of things and actually performs worse, so I would not use it now.
right, I'll now work a little on the fishtest coding, but unless you beat me to it, I'll make a PR for the global flag.
Please do, I will be away now for at least 10 hours.
I believe the issue is now fully fixed with #2877 and #2876 combined.