Stockfish: [NNUE] Position state / Use NNUE

Created on 28 Jul 2020 · 24Comments · Source: official-stockfish/Stockfish

The following two inputs yield different search, as part of the state is in position depends on NNUE but is not changed on option update.

wrong:

position fen rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 1
setoption name Use NNUE value true
go depth 10
ucinewgame

good:

setoption name Use NNUE value true
position fen rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 1
go depth 10
ucinewgame

a fix needs to consider added moves to the position command (i.e. we can't just recreate a Position from the FEN).

Source

vondele

All 24 comments

needs fixing as part of https://github.com/official-stockfish/Stockfish/issues/2823

vondele on 28 Jul 2020

I think it should be quite easy to remove nnue from position when we merge your template (or global variable but I like template better) solution. Function Eval::evaluate() has only position as an argument so to know which eval should be run, the bool useNnue has been used as a hack in order to not touch too much of existing code.

dorzechowski on 28 Jul 2020

might help, but I'm not sure it is the full solution, since the evalList variable might need to be maintained / rewritten ? The global variable kind of works for search (go depth...), since the rootpos is being created in ThreadPool::start_thinking and the evalLists are updated, but for example eval is not so easy (a bit similar to https://github.com/official-stockfish/Stockfish/issues/2579)

For the template approach, I'm actually really surprised this doesn't remove the overhead, as it should eliminate all conditionals. Do you have an idea where this overhead might come from?

vondele on 28 Jul 2020

Yes, it seems that putting it in pos was actually the most straightforward solution, even if not pretty.

I have no clue about slowdown, I see less than 1% slowdown on my machines (i7-7700 and i7-8700, bmi2 compile, compiled with gcc 9.3.0 on Windows). Base compiled from master, test from nnue-player-wip. 0.6% slowdown would give maybe 1.0-1.5 Elo regression. Maybe it's worse on some other architecture/OS combinations.

make profile-build ARCH=x86-64-bmi2

Results for 20 tests for each version:

            Base      Test      Diff      
    Mean    2517667   2501356   16311     
    StDev   10825     9397      14634     

p-value: 0,133
speedup: -0,006

dorzechowski on 28 Jul 2020

@vondele I can see that those measurements you have run were dominated by @noobpwnftw machines and he reported some slowdowns here. So perhaps he sees ~2.5% slowdown which would explain 5 Elo regression.

dorzechowski on 28 Jul 2020

The test diff has some base changes that is irrelevant to NNUE. The slowdown I measured is about around 1%+-0.5.

noobpwnftw on 29 Jul 2020

This kind of slowdown is fully acceptable imo. But it should only give 2-3 Elo regression. Current 5 Elo at STC will become probably around 3 Elo at LTC so not that bad although it would be nice to cut it by Elo or two. We have people much better versed than me in C++ tricks so I hope they will find something. I may try to look at it but probably not before Friday.

dorzechowski on 29 Jul 2020

1% would be definitely acceptable (~ 2 Elo). I guess I have to figure out what irrelevant changes there would be...the github diff associated with the test indeed seems to indicate other changes wrt to master, but the bench suggests they are not there..

vondele on 29 Jul 2020

I've started a new test where the diff looks clean: https://tests.stockfishchess.org/tests/view/5f213305c09435d870cba002 The other tests have the right bench, so I think the displayed diff is somehow an artifact.

The result is a little puzzling. Indeed local testing shows only 1.2 % slowdown using a 'build', and surprisingly a speedup using 'profile-build'. Certainly nothing that could explain a 5Elo change. Possibly this is very machine / compiler dependent.

vondele on 29 Jul 2020

might you have an idea where this overhead might come from?

In my engine I noticed that even changing the order of elements in the position object (or move stack object or search thread object) has influence on the speed. Probaby by getting better or worse use of cache lines. Maybe the NNUE code does something in this area? Just an idea...

Matthies on 29 Jul 2020

Yeah, I recently only multiplied eval by 2, and somehow got speedup by 2.8%, this is very confusing area.

One thing that might influence the performance is StateInfo struct that got extended by 2 fields. Those are redundant if NNUE is not used. So we would basically need to keep original StateInfo, then have a derived one for NNUE, like StateInfoNNUE : StateInfo and create correct one depending on Use NNUE value. I cannot however even try and test it before Friday.

dorzechowski on 29 Jul 2020

The result is a little puzzling. Indeed local testing shows only 1.2 % slowdown using a 'build', and surprisingly a speedup using 'profile-build'. Certainly nothing that could explain a 5Elo change. Possibly this is very machine / compiler dependent.

usually for normal sf gcc inlines the evaluation function to a huge monolithical search. I doubt that with eval type being dispatched at runtime the code increase will allow it to inline still (but probably does inline one path with pgo).

I also wonder why having compile time dispatch between old and nnue was not ported from nodchip's implementation? The current way introduces coupling in a sense that each stockfish executable should be accompanied by the net or otherwise not all features will work.

Sopel97 on 29 Jul 2020

I also wonder why having compile time dispatch between old and nnue was not ported from nodchip's implementation? The current way introduces coupling in a sense that each stockfish executable should be accompanied by the net or otherwise not all features will work.

This is exactly the same as with syzygy. You don't have the files, you don't use the function.

I started a test to see if StateInfo change could be the reason for the 5 Elo regression. I removed NNUE fields from StateInfo and commented out some code just to make it compile. Test here: https://tests.stockfishchess.org/tests/view/5f21b9de2f7e63962b99f41a.

dorzechowski on 29 Jul 2020

The test looks much better. Adding some padding bytes to make sizeof(StateInfo) % 64 == 0 may help?

Matthies on 30 Jul 2020

So it seems it was StateInfo indeed. Mainly accumulator is quite big as it contains first layer parameters, i.e. 2x256 int16 numbers currently. So with bigger nets it would get even bigger. Original StatInfo size is 176 bytes, with NNUE it's 1312 bytes now. It will be a bit tricky to fix it, I don't know yet what would be the solution.

dorzechowski on 30 Jul 2020

TEMPLATIZE ALL THE THINGS!!!

sf-x on 30 Jul 2020

👍1

Interesting test results with StateInfo... hope we find a solution for that, would be nice.

vondele on 31 Jul 2020

I found this in do_null_move:
std::memcpy(&newSt, st, sizeof(StateInfo));

In do_move you only copy up to the key member. I don't know if you need to copy more in the null_move but you should try to avoid copy the NNUE parts?

Matthies on 31 Jul 2020

Some simple improvements found and tested, see PR #2873.

dorzechowski on 31 Jul 2020

👍1

@dorzechowski I've merged those changes.

I'll leave the issue open for the original problem. What's your opinion on using the global flag or the state in Position... any preference? I think this issue can be solved easily either way (the eval command needs a few lines of code, but nothing complicated).

vondele on 31 Jul 2020

The variant with the global flag looks a bit more pleasant to me. It looks nicer to have if (Eval::useNNUE) instead of if (pos.use_nnue()) and also it simplifies the code a little bit. OTOH the template variant changes a lot of things and actually performs worse, so I would not use it now.

dorzechowski on 31 Jul 2020

right, I'll now work a little on the fishtest coding, but unless you beat me to it, I'll make a PR for the global flag.

vondele on 31 Jul 2020

Please do, I will be away now for at least 10 hours.

dorzechowski on 31 Jul 2020

👍1

I believe the issue is now fully fixed with #2877 and #2876 combined.

vondele on 31 Jul 2020

Was this page helpful?

0 / 5 - 0 ratings