Stockfish: NNUE ideas and discussion (post-merge).

Created on 6 Aug 2020  ·  142 Comments  ·  Source: official-stockfish/Stockfish

I'll create this new issue to track new ideas and channel early questions.

NNUE

All 142 comments

NNUE makes good use of intrinsics, so we should make sure the appropriate SIMD/vector intrinsics are used wherever the corresponding extensions are available.
This may mean changing Fishtest to identify the instructions available on each worker.

@Viceroy-Sam like https://github.com/glinscott/fishtest/commit/cc30c34c3b3a3713f805976c1a8f8d73f8bb86b7#diff-e8d8184dfbe43a62177e9eb695449cd2R181-R217

Apply lazy threshold before considering NNUE evaluation.

Similar, but I think the value used for the lazy threshold in evaluate is more meaningful, and in rare cases the patch above could use the NNUE eval where the normal eval would return after the first lazy threshold. So this patch could maybe be improved with:

  1. calculate value up to first lazy threshold
  2. if value > t0 then return value
  3. if evalNNUE or value > t1 then return evaluate()
  4. return evalNNUE()
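
The four steps above can be sketched as a small decision function. This is only an illustration, not Stockfish's actual code: `EvalPath`, `choose_eval_path` and `useNNUE` are hypothetical names, step 3 is read here as "NNUE is disabled, or the value exceeds t1", and the comparisons use the absolute value of the lazy score. Both readings are assumptions.

```cpp
// Which evaluation the steps above would end up returning.
enum class EvalPath { LazyClassical, FullClassical, Nnue };

constexpr int abs_value(int v) { return v < 0 ? -v : v; }

// lazyValue: classical eval computed up to the first lazy threshold (step 1).
// t0: first lazy threshold; t1: hybrid threshold (assumed to satisfy t1 <= t0).
constexpr EvalPath choose_eval_path(int lazyValue, bool useNNUE, int t0, int t1) {
    return abs_value(lazyValue) > t0
               ? EvalPath::LazyClassical                   // step 2: lazy exit
               : (!useNNUE || abs_value(lazyValue) > t1)
                     ? EvalPath::FullClassical             // step 3: full classical eval
                     : EvalPath::Nnue;                     // step 4: NNUE eval
}
```

With this ordering, a position that the classical eval would already cut off at the first lazy threshold never reaches the NNUE branch, which is the rare case the comment is worried about.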

@vondele. Yes. Fishtest to collect and show architecture.

I filed an issue about collecting and showing it: https://github.com/glinscott/fishtest/issues/743

Is there an easy way to separate bishop piece types into light square and dark square bishops? The NN might benefit from having OCB information.

This is exclusively a training issue. The proper place is therefore https://github.com/nodchip/Stockfish/

During upload it might be a good idea to state explicitly that only the author of the net is allowed to upload it (if it doesn't already fall under a CC0 license).

Add option to enable or disable lazyeval / hybrideval.

I've seen it gains elo in Fishtest conditions.
But quoting Dkappe:
"it performs worse on nets that don’t conform to stockfish eval, which are all of mine. It also defeats the purpose of training nets that play like other engines."

@gekkehenker the upload page already mentions that.... but yes, we might need a tick box.
(Edit: issue https://github.com/glinscott/fishtest/issues/744)

concerning the extra option, no I don't think we want to do that.

Is there an easy way to separate bishop piece types into light square and dark square bishops? The NN might benefit from having OCB information.

This is exclusively a training issue. The proper place is therefore https://github.com/nodchip/Stockfish/

Thanks, but my question was more about the non-training side. The training code seems generic enough that it won’t need to be modified for new piece types. My concern is that it seems hard to split bishops into two piece types without breaking classical eval or movegen.

The training code seems generic enough that it won’t need to be modified for new piece types.

On the contrary, it's the NN which is generic enough as to need no special treatment. The training, OTOH, does, if you want it to treat the differently-colored bishops differently (since, at present, it is heavily biased towards colorblindness).

concerning the extra option, no I don't think we want to do that.

I personally think this is a disappointing decision. Supporting nets generated in all sorts of ways other than Stockfish evals and searches would make Stockfish more valuable as a research tool, and I think it would ultimately also help progress.

@romstad well, that decision is not set in stone. Furthermore, how the nets are generated is of course left fully open (i.e. trained on SF evals or otherwise). However, what I would be reluctant to have is various different ways to 'integrate' the net in SF, i.e. I'd prefer to have one approach (which could be hybrid like we have now, or pure, or additive, etc), which we pick based on playing strength. So, people should be free to test any (reasonable) net, integrated in any (reasonable) way in SF, and if it passes SPRT we'll be happy to adjust. Having too many choices early on will make it difficult to change in the future (i.e. do we maintain / support N different variants?). At least that's my 2c right now.

concerning the extra option, no I don't think we want to do that.

I don't know what would be the best solution here: on one hand, we don't want to multiply UCI options and complicate the code, but on the other hand, "pure NNUE" could be useful for analysis. In gameplay the new patch is stronger, but at the same time it is much more blind to spectacular sacrifices, because the classic eval kicks in and the move is pruned into oblivion.

For example: r1b1qr1k/2p3pp/4p3/1pb1PpN1/pn3N1P/8/PPP1QPP1/2KR3R w - - 0 1
Here Rd8 actually wins but Stockfish won't realize it now.

Of course changing this because someone might produce a net that is not compatible with Stockfish eval range is not a good idea. But keeping NNUE in Stockfish relevant for analysis is something to consider IMO.

If Stockfish 12 is not an imminent release, could some official statement (blog.stockfishchess.org) be released to explain what has happened with the NNUE merge over the last few days, and maybe outline how NNUE works and where development is at? A lot of people are using different flavors/branches of SF, and that is leading to some inaccuracies in communication.

@Viceroy-Sam we haven't been communicating much beyond the releases. I think the blog, twitter etc, are @daylen. Note also we're still ironing out the wrinkles, and progress is being made very quickly. I'd expect another 10-20 Elo by the end of the weekend.

@Viceroy-Sam @vondele Yeah, I'm happy to make a blog post/twitter with the contents of this commit message (which I think is a pretty good summary!) https://github.com/official-stockfish/Stockfish/commit/84f3e867903f62480c33243dd0ecbffd342796fc

In gameplay new patch is stronger but at the same time it is much more blind to spectacular sacrifices because the classic eval kicks in and the move is pruned into oblivion.

For example: r1b1qr1k/2p3pp/4p3/1pb1PpN1/pn3N1P/8/PPP1QPP1/2KR3R w - - 0 1
Here Rd8 actually wins but Stockfish won't realize it now.

Yes, I've noticed many positions that "pure" SF NNUE solved quickly but that it has become totally blind to since the hybrid patch. It's a little disappointing for me, but then "elo gain is elo gain". The way I see it, it means that with the hybrid patch, SF finds moves faster in many other positions, which is "equally" important from an objective standpoint. Furthermore, if there is an objective overall Elo gain (which we are confident is true given it passed SPRT bounds on fishtest), that by definition means that SF (with the hybrid patch) now finds the "better" move in more positions than without the hybrid patch.

Thinking about it more, it doesn't really make sense to add more UCI switches just to better analyze some positions. It's a situation a bit similar to null move pruning and many other features: in general it brings Elo, but sometimes it hurts. Also, whatever we do, someone will complain that some other obscure option is missing. People analyzing games are free to use a previous build without this change, or even maintain their own fork, which I'm sure a lot of them do.

If at some point pure NNUE becomes stronger, I'm sure someone will simplify it back.

I think we need a huge search retuning, especially for heuristics that use static eval.
Look at fishtest now - basically everything is passing/close to passing...

If nothing else, having a UCI option named "Use NNUE" that, even when true, might in fact _not_ use NNUE seems confusing now. It seems there are now two possible evaluations: classical and hybrid.

Maybe replace "Use NNUE" with either "Use Classical Evaluation" or "Use Hybrid Evaluation". One or the other, and simply true or false.

Your understanding of the word "Use" is incorrect. Usage doesn't imply exclusivity.

@Sopel97 "Use Classical Evaluation" is certainly unambiguous (and is true by default). Set it to false to enable whatever chimerical NNUE-hybrid is currently considered "best".

Or, maybe "Enable NNUE" instead of "Use NNUE"?

Should I even keep testing eval patches with non-nnue settings? I would speculate that optimal static (and the success of patches) is rather different for conventional SF compared to a nnue version (where static is only used to evaluate rather unbalanced positions).

Yes, I propose that patches to the classical evaluation are tested with 'Use NNUE=false', the hybrid mode should not dictate, at least for now, how classical evaluation evolves.

But we somewhat use classical evaluation as a speedup.
Wouldn't it be logical to pass [-3;1] once on NNUE = true to not regress there?

I think I would like to avoid that for now, NNUE is evolving so quickly that this hardly matters, but this can be revised later.

The mixed evaluation approach complicates things and has "only" brought 11 Elo (plus maybe a bit more from further adjustments of the threshold value). But wouldn't you like to keep the NNUE eval separate first, at least for a period of time, and retest the hybrid evaluation again later? With the high number of patches currently passing for NNUE, hybrid eval might be obsolete soon, and it interferes with the current rapid development of the NNUE eval.
With "hybrid" I also mean the approach of many currently tested patches that try to replace part of NNUE eval in certain types of positions with classical eval. That looks like making steps backwards.

If hybrid eval becomes obsolete it will be simplified.

Isn't it a problem the weak points of the NNUE Net can't be improved (or better: improvements can't be tested) when those weak points are being handed over to classical eval?

There will be a struggle between the (evolved hybrid eval / optimized search) of a specific net architecture and other architectures of potentially higher ceiling. Up to a point it can be up to the external NN trainers to come up with convincingly superior stuff, but resource-wise it's an asymmetric battle.

It would be nice for NN developers to have framework access for researching their potential. The 256halfkp selection might be like taking a 2 liter engine, and evolving it and everything around it (chassis etc) for it. A 4 liter engine will not fit and require different stuff. So once our 2 liter car is very tuned it will be very hard to justify the transition.

Obviously there is no easy solution to this local maximum architecture issue, one has to start from somewhere.

So one has to rely upon intuition on what would offer the highest long term potential and ride it all the way, a crucial decision.

Another idea is to initially offer parallel evolution of different architectures with comparable resources for a modest period and narrow down to the most promising one(s).

The other way is doable too but requires discipline: to stop optimization of 256hybrid once elo gains slow down, and transition to a different one by using the same gear and alter it. This might be more efficient if the optimal gear is similar, but it will be emotionally hard to step down a few dozens of elo and allocate effort there.

But it also can't be taken for granted that "bigger is always better". This might be true for UCT, which needs increased eval accuracy for filling the generalised gaps, but for SF it can well be that the highest ceiling is offered by a challenging synthesis of roles, as certain stuff could be done more efficiently outside of a NN.

Much like a F1 car using a mix of automatic, semi-automatic and manual stuff. If the driver is good he can do some stuff better than automatic modules (or good enough and profit from less weight).

@TonHaver right now training doesn't take hybrid into account. So if the weak point becomes better, we might be able to e.g. change the threshold or simplify it all away. I personally would be very curious to see what happens when one optimizes a net taking into account that its training data only needs to contain positions that it will later encounter in the hybrid approach. I suspect this criterion cuts out many really uninteresting positions (very unbalanced, essentially won anyway).

I think we need a huge search retuning, especially for heuristics that use static eval.
Look at fishtest now - basically everything is passing/close to passing...

Another run of xoto’s searchconsttune would be nice.

Looks like incredible progress already:
https://tests.stockfishchess.org/tests/view/5f2f0ff49081672066536b29

@vondele By the way, it might be useful to ensure the gains persist with SMP and scaling - there are some reported issues that since the hybrid patch, scaling is hurt badly. Any chance to run an 8-core RT?

I wonder if a smaller architecture that's as fast (or faster) than the classic eval has the potential to replace it in unbalanced positions, too.

@ssj100 those reports are very likely small samples only. Right now, it would be a waste of resources to run another SMP. There will hopefully be another wave of patches, and the next RT will be SMP. More importantly, we have an important issue to fix for SMP (https://github.com/official-stockfish/Stockfish/issues/2933), since right now, we (and all other NNUE branches) probably have wrong results (likely with little impact).

From a previous post I know that @vondele doesn't like the idea of having an option to turn hybrid mode on/off. Reading some forum posts, people are still discussing it and would love to have that option, so that they can analyze games and/or do some testing. I think that kind of use/testing may later help SF become stronger, since they can gain more knowledge anyway. Note that the majority don't know how to code or compile SF.

Perhaps we can help them by using some "hidden" options which won't be listed in response to the "uci" command: normal people won't know about them or use them, but if someone really wants to, they can find out and use them. Thus we can help people and solve the dilemma while still keeping the policy/mainstream.

I don't really understand what it is doing yet, but I noticed that rotate180() really rotates over 180 degrees (^ 0x3f), which seems rather unnatural for chess.

(This may have to do with castling rights still not being taken into account by the NN? Or am I mistaken there.)

On an entirely different note, how will patches to the classic eval now be tested? If in the NNUE/hybrid mode, then I would expect basically any patch that speeds up the classic eval (by removing features) to pass (and to actually gain Elo, at the cost of classic mode).

Classical eval patches will be tested with 'NNUE false', we'd like to keep it in good shape.

Castling rights are not yet taken into account, but there are extensions of the network architecture that do take them into account. Testing those architectures is for the (near?) future, once we have some experience with the current setup and it is stable.

I haven't checked the role of rotate180 yet.

I suspect that the 180 degree rotation might come from the fact that shogi has a point symmetric starting position, so if there is no other reason for that choice I would agree that reflecting the axial symmetry of the chess starting position in that transformation would make sense.
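
To make the difference concrete, here is a minimal sketch comparing the two transformations on a 0..63 square index. It assumes the usual A1 = 0, ..., H8 = 63 numbering; `flip_rank` is a hypothetical name for the axial alternative, and only the `^ 0x3f` form is taken from the comment above.

```cpp
// Squares are numbered A1 = 0, B1 = 1, ..., H8 = 63 (file = sq & 7, rank = sq >> 3).

// Point reflection through the board centre: mirrors both file and rank.
constexpr int rotate180(int sq) { return sq ^ 0x3f; }

// Axial reflection: mirrors the rank only and keeps the file.
constexpr int flip_rank(int sq) { return sq ^ 0x38; }

constexpr int E1 = 4, D8 = 59, E8 = 60;
```

Under the rotation, a king on e1 lands on d8, whereas the rank flip maps it to e8, the black king's actual starting square. That is the sense in which the rotation fits shogi's point-symmetric setup better than chess.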

3 LTC tests passing with new net. Lowering hybrid threshold and adding new term and raising hybrid threshold. Raising the threshold should be good at VLTC, where the speedup isn't as important. With the new net it might not have the pawn blind spot anymore?

Can someone make a branch that loads two different nets, as a good base for these types of experiments?

The question is not quite clear to me.

Note that this patch https://github.com/Vizvezdenec/Stockfish/compare/add890a10b...5129aab83a almost certainly also increases the hybrid threshold on average. You can use dbg_mean_of(foo); in the code to see the average value of a term, during a bench.

I can see how that could be. So how do we know if the Elo gain is more from the new term, or from raising the threshold, or because that net was stronger or weaker in positions with many pawns...
I guess the question is how to merge them or decide what must be retested.
Edit: now I think the new net is probably not a factor. But there is still 480 or 600 as threshold.

With pull requests changing parts of the code that are somehow related, or obviously overlapping, it has been up to the developers to suggest the best path forward, and ultimately up to the maintainer to decide. Note, that none of the tests has passed yet, so the question is a bit premature. So far, with the wave of patches after NNUE merge, I have committed most of them without requests for retesting. Once new-patch-rate goes a bit down, the most suitable gets merged first, the other one is rebased and retested. In this case, assuming all three pass, I would presumably consider to merge the new net and one of the patches that changes the bounds. I would have to see precise results, and maybe hear some background info by @Vizvezdenec to decide which one goes. The other one could be retested, presumably in a form with tweaked params.

I honestly think that my patch is somewhat logically better because I always preferred some scaling and not flat thresholds.
For example, an extra exchange w/o pawns is probably smth classical eval can handle (we even have special eval functions for these cases), while positions with the most pawns are usually the hardest for it to evaluate properly, and this patch directly scales the use of the "better" eval with pawn count.
My patch starts to use NNUE at 7+ pawns more often compared to jjosh's one, and honestly I think this is the area where you want to use NNUE the most.
I also wanted to scale this threshold up with npm, which also looks really logical.
Another unrelated thing - your patch wasn't looking to pass 2nd STC - https://tests.stockfishchess.org/tests/view/5f2f2f249081672066536b48
If anything I prefer to merge my patch (if it passes ofc) and then test some stuff as a simplifications on top of it.
Also note that at high pawn counts threshold raises up to "knight - pawn" which is important for a lot of tactics.

Yes, I agree we probably need a lot more eval terms. However, I would like to get the current threshold tuned first, so that further complications are definitely stronger and can't be simplified away right away. I think the complexity term is a good starting point.
Maybe it's possible to do a small SPSA tune of just the threshold and your new term and see what it outputs? If the new term is close to zero, then a simplification patch would probably pass?

I think we need to wait for some patch to pass.
If it will be mine you can test flat 600 as a simplification (or maybe even higher value, why not).
If it's yours I will just stop my test.
Note that I've seen everything, like tests not passing LTC after 2.91 LLR ... So we need to wait for the patches to converge first :)

That sounds pretty reasonable. Good luck with your patch, it is close :)

Is it likely that adjudication rules contribute to the success of swapping to classical eval, by triggering more adjudicated wins of positions that are/could be drawn despite the high eval?

I think it's alarming and should be investigated, because, for example, a fortress that is more correctly (modestly) evaluated by NNUE will be granted a draw, while an incorrect high evaluation will be granted a win.

And in general maybe we would like more accurate/less aggressive adjudication?

@NKONSTANTAKIS
Interesting example, but to "win" this fortress position by adjudication, it seems hybrid SF would need to have the agreement of its opponent NNUE SF that it is winning. (At least I suppose that both engines have to show a winning/losing eval for adjudication to kick in.)

Assuming NNUE SF's eval is more accurate in a particular game, I don't easily see how it could lose points to hybrid SF because of the adjudication rules. But there may be a scenario I am not thinking of...

If I might have another go for an UCI option to set NNUE to always on, I think the solution Crystal is using is very elegant, and satisfies users like me that like to use Stockfish for analysing purposes.

Please take a look at https://github.com/jhellis3/Stockfish/commit/15c515567a4d4d1af207452fed08b2d85fc7f4b0

I think we can do smth like th.marked in reduction but for positions that use "classical" eval.
If position was already evaluated by classical eval by other thread, evaluate it by NNUE instead - maybe it will be an elo gain at SMP.
I'm trying something in this direction but my knowledge of SMP is too bad to do it in any sort of robust way.

I doubt this is easy. One would need essentially an additional bit in the transposition table for that, and it is unclear if it would gain.

This could work okayish with a bloom filter. But I'm not sure if the computational cost required wouldn't just make it better to use NNUE all the way.

Well I'm just giving a basic idea :) Since we have 2 evals, then maybe we can try this sort of stuff.
Interaction with transposition table is yeah, not obvious, but it's smth to work on I guess.

Is it possible that with many threads the hybrid eval is elo negative?
Not the most conclusive results, but maybe concerning.
Score of (Hybrid)stockfish_20080621_x64_bmi2 vs (NonHybrid)stockfish_20080616_x64_bmi2: 179 - 167 - 1654 [0.503]
Elo difference: 2.1 +/- 6.3, LOS: 74.1 %, DrawRatio: 82.7 %

2000 of 2000 games finished.
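
As a sanity check, the reported figures can be reproduced from the win/loss/draw counts with the usual logistic Elo formula. The formula is standard, but that the tool producing this output uses exactly it is an assumption here.

```latex
s = \frac{W + D/2}{N} = \frac{179 + 1654/2}{2000} = 0.503,
\qquad
\Delta_{\mathrm{Elo}} = -400 \log_{10}\!\left(\frac{1}{s} - 1\right) \approx 2.1,
\qquad
\text{DrawRatio} = \frac{1654}{2000} = 82.7\,\%
```

The error bar of 6.3 is about three times the measured difference, so the result is compatible with zero, as the comment itself notes.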

https://drive.google.com/file/d/1gvEKMrbwLlhRp48Aax1g_ja0gfzuomhV/view?usp=sharing

CPU: i5-10400
TC=5+0.05, 12 threads.
Noobbook 3 moves
Games are adjudicated with 6 pieces on board.

Binaries are already outdated...
But if LTC passed by 11.30 elo, then this result may point towards SMP losses inherent to hybrid.

Current regression tests may give a partial answer: currently SMP is lower than single-core, which was never the case since SF 11.

As for 1-core after the recent patches I thought it time to investigate again NNUE use only, but it failed rather quickly:
https://tests.stockfishchess.org/tests/view/5f32b00d66a893ef3a025e11

There is little reason to believe SMP influences the effect of hybrid eval. Who knows, however... IMO the current small difference between single-threaded and SMP could also be attributed to things like sss, and the fact that we're already in 'extreme' territory: the W/L ratio is near 10, and the draw rate difference between both tests is near 5%.
To answer the question, I've started a SPRT LTC SMP: https://tests.stockfishchess.org/tests/view/5f32dc4366a893ef3a025e4c

Hmmm, simplification passed? :-)

AVX512 fleet did help :-)

So make the threshold dependent on the cpu's instruction set?

just launch the raspberry pi 3 fleet for next test.

Just change Use NNUE from check to spin and use it as the threshold ;-)

Or threshold depending on ARCH. modern - normal, AVX2 - higher, BMI2 - even higher, AVX512 - infinity.

seems like a good plan for most constants in search...

If you can classify workers in the future, then it is possible to decide optimal values for those, I do have sufficient number of workers of each kind...

So our search will be
probCutBeta = ARCH == #archname1# ? 145 - 40 * improving : ARCH == #archname2# ? 128 - 33 * improving : 110 - 28 * improving;
etc.
?

In the future all workers will have at least AVX2. Anything older than that should remain supported but does not need to be optimised for, I would think.

There is one thing, I think at some point people still want to use a standalone fully functioning binary(which includes a net). This makes integration a lot easier for example deployment on devices/browsers.

I've seen somewhere a patch which used a linker hack to embed the net as an object file in the binary. Not sure if somebody can find it back. Not sure if we really want that..

I mean we can just bin2hex into a header file and #include it on-the-fly with a makefile flag, which compiles a self-contained binary including the default net. There are cases where downloading files on the machine that runs the engine is not trivial, or loading an external file is not possible due to platform limitations.
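
A minimal sketch of what such an embedded header and its consumer could look like. Everything here is hypothetical: the header name, `kEmbeddedNet`, `NetView` and `embedded_net` are illustrative names, and the four bytes stand in for the real net data that a bin2hex step would emit.

```cpp
#include <cstddef>

// Hypothetical contents of a generated header (say "embedded_net.h"). A real
// net would be many megabytes of bytes written out by the bin2hex step.
constexpr unsigned char kEmbeddedNet[] = { 0x01, 0x02, 0x03, 0x04 };
constexpr std::size_t kEmbeddedNetSize = sizeof(kEmbeddedNet);

// Read-only view over the embedded bytes, so a loader can parse the net
// from memory instead of opening a file.
struct NetView {
    const unsigned char* data;
    std::size_t size;
};

constexpr NetView embedded_net() { return NetView{kEmbeddedNet, kEmbeddedNetSize}; }
```

The loader would then need a code path that reads from a memory range rather than a file stream; whether that is worth the extra build time is exactly the trade-off discussed in the following comments.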

so for reference that's the patch I've seen: https://tests.stockfishchess.org/tests/view/5f135fc1da64229ef7dc173d i.e. https://github.com/gekkehenker/Stockfish/compare/ffebd6f652...62cbeab1f0

not saying I like this approach, but it is at least the first patch I've seen.

Clearly, make net could also write a bin2hex-based source file, though compile time probably won't improve with ~20MB of sources.

We can add https://github.com/nodchip/Stockfish/issues/76 to the stack...

that's how I would do it too. Using the linker is the best.

Using include with bytes for that is... let's just say not ideal. Strings are better but MSVC has a very small limit on string literal length.
https://mort.coffee/home/fast-cpp-embeds/

Well, MSVC also doesn't have ld so one way or the other. Just trying to simplify things for end users who will just load it and run.

So, should we remove hybridisation (at least for now)?
If so, I would suggest running truncated pseudo-RTs at LTC / SMP LTC (e.g. with 1/4 of the usual games) to see if their results align now.

Idea: At the root position before entering search, evaluate every legal move with NNUE and sort the moves. This should provide very good move ordering right off the bat.

well now I'm trying to partially enable some threads to do clean NNUE search, one passed both STC and LTC SMP, other passed STC and is looking good at LTC.
The problem is with hash table, we will have 2 evals writing god only knows what there...

Can't you write the NNUE evals at depth + 2 or 3 so it will tend to overwrite the worse evals?
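
The suggestion can be illustrated with a toy depth-preferred replacement rule. This is a sketch under simplifying assumptions: `stored_depth` and `should_replace` are hypothetical helpers, and Stockfish's real transposition table replacement logic is more involved than a single depth comparison.

```cpp
// Depth recorded in the table: entries coming from NNUE-evaluated search
// get a bonus so they look "deeper" than they are.
constexpr int stored_depth(int searchDepth, bool fromNnue, int nnueBonus) {
    return searchDepth + (fromNnue ? nnueBonus : 0);
}

// Simplified depth-preferred replacement: overwrite when the new entry's
// recorded depth is at least the old entry's recorded depth.
constexpr bool should_replace(int newStoredDepth, int oldStoredDepth) {
    return newStoredDepth >= oldStoredDepth;
}
```

With a bonus of 2, an NNUE entry searched to depth 10 would replace a classical entry searched to depth 11, but not the other way around. The cost is that the recorded depth no longer reflects the real search depth, which may affect cutoffs elsewhere.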


Maybe there's something to take from these tuning tests too. Tuned at STC and at STC they pass, but at LTC they fail quickly:

Tuning: https://tests.stockfishchess.org/tests/view/5f3105c190816720665374ca

https://tests.stockfishchess.org/tests/view/5f3396e066a893ef3a02638c
https://tests.stockfishchess.org/tests/view/5f33bb9366a893ef3a0263ae

https://tests.stockfishchess.org/tests/view/5f3396fe66a893ef3a02638f
https://tests.stockfishchess.org/tests/view/5f33a9db66a893ef3a02639a

So maybe NNUE is much more sensitive to search values depending on TC? or amount of threads used? Almost certainly 1-thread has gained much more Elo since the NNUE introduction as seen in the regression tests:

1-Thread:
https://tests.stockfishchess.org/tests/view/5f2c5a25b3ebe5cbfee85b8e
https://tests.stockfishchess.org/tests/view/5f32844966a893ef3a025dde

From 83.42 to 125.60 Elo

8-Thread
https://tests.stockfishchess.org/tests/view/5f2c5a52b3ebe5cbfee85b91
https://tests.stockfishchess.org/tests/view/5f32846e66a893ef3a025de0

From 86.10 to 111.76 Elo

Will unfortunately need to 1) disable HT on all workers, 2) not run NNUE on more than the available physical cores, 3) take AVX downclocking of the whole CPU into consideration. Currently only 1-core mixed tests are "not affected", since everything balances out at scale, but multicore tests show obvious corner-cutting from chip vendors.

@noobpwnftw that would be something to investigate first. While the eval is avx, roughly 70% of the time goes still to search, movegen etc, that are integer/branching dominated. It could be that these actually complement/interleave? I guess we need to do measurement of the value of hyperthreading, not only for nps, but also for playing strength.

I think a non-hybrid repeat of the same multithreaded RT will be very useful for identifying how much of its underperforming is due to hybrid and how much due to scaling. Mainly due to 1-core search optimization.

Regarding the latter due I'm confident that a wide multithreaded VLTC tune will do wonders. Even with just 50K games.

A 3rd factor for underperforming is approaching the limit of winnable book exits. 8-threads SF11 can probably draw a considerable amount of openings vs any opponent and 8-moves book is low spread.

I do not think, however, that it is worth investigating for now, as there may be many non-hardware-related optimizations and bug fixes that may produce convincingly positive Elo. Not sure what's the matter with hybrid at this point: removing it has been a non-regression at best so far (with the help of a full AVX512 fleet), and apparently less affected by HT, ARCH, multi-core.

As for the inconsistency, it is well understood (at least to me) to be due to different worker types, and I recommended that one can do further optimizations by tuning on specific hardware, which has been done, for example, by porting to C or rewriting in assembly, while for the official branch we should just keep it as generic as possible so that it works "good enough" for everyone.

So I don't understand what is the goal people trying to achieve by measuring optimal "NNUE threshold" up to decimal accuracy, or simplify it away only to wait for someone to reintroduce it, ego again?

@syzygy1 Right, and I'm not worried anymore. I can think of a scenario but it will be rare so its effect minimal. I will expand just for theoretical interest.

Fortresses often lead to different types of fortresses, so if nnue evals the initial type correctly low (due to it appearing more in training, being closer to start pos) it might avoid it for a different variation, especially so as sergiovieri training is mainly outcome-based. Classical high eval will opt for more fortresses, part of them transforming to different types. It will take just a single high eval nnue misevaluation of a subsequent fortress to be awarded a win.

So I'm not worried about this showing in Elo, but at the same time I believe that increased adjudication accuracy can only be healthy. It might introduce a small but very bad type of biased noise in unforeseen ways.
The extra cost should be minimal.

Am I wrong to think that NNUE has endianness problems?
For example, ReadParameters() reads a uint32_t by reading 4 chars into the memory location of the uint32_t. The outcome is dependent on the machine's endianness.
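
One common way to avoid that dependence is to assemble the value from individual bytes rather than copying them into the integer's memory. The sketch below assumes the net file stores values least-significant byte first; `read_u32_le` and `kSampleBytes` are hypothetical names, not functions from the NNUE code.

```cpp
#include <cstdint>

// Assemble a 32-bit value from four bytes stored least-significant first.
// The shifts operate on values, not memory, so the result is the same on
// little-endian and big-endian hosts.
constexpr std::uint32_t read_u32_le(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Sample bytes as they would appear in a file written by a little-endian machine.
constexpr unsigned char kSampleBytes[4] = { 0x78, 0x56, 0x34, 0x12 };
```

Here `read_u32_le(kSampleBytes)` yields `0x12345678` regardless of the host, whereas copying the four bytes directly into a `uint32_t` would give `0x78563412` on a big-endian machine.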

@syzygy1 I've turned this in an issue, better discussed there #3007

I did not fully research this point yet, but it seems to me that skipping the NNUE eval in hybrid mode is also skipping the incremental update part of the evaluation. There is some code in the NNUE part to trigger the incremental updates such as "update_eval()" but it is never called (as far as I can tell).

Am I missing something?

This needs a careful look, which I haven't done yet, but I don't think it is a problem, since during search we don't call evaluate for each node we visit, for several reasons (most obviously TT). This seems like it would just be another reason. How this works exactly, I haven't checked.

However, this is maybe the 100Elo bug we've been looking for ;-)

I guess you are right that the same (non?)problem should exist in non-hybrid nnue mode.

If this problem is real, some speed gain seems possible.

The code in do_move() that sets computed_accumulation to false probably ensures that no incorrect nnue evals are returned (e.g. by calling nnue_evaluate() on a child node if the current node is not being nnue_evaluate()d).

At the moment the accumulator is copy-make (part of StateInfo). It might make sense to make it make-unmake (part of Position).

The first NNUE version of Stockfish calls RefreshAccumulator() from Transform() about 1 in 3 times. This is more than I had expected.

I'll prepare something which should either speed things up or show that I don't quite understand what is happening ;-)

I created a simple patch that calls update_eval() at the beginning of do_move() and do_null_move(). This seems to eliminate all refreshes but the speed up seems to be rather small. Anyway, I have submitted a test.

Edit: https://tests.stockfishchess.org/tests/view/5f39a68ce98b6c64b3df4218

I noticed that the author of YaneuraOu, in this tweet and linked article, is questioning the optimal NNUE network size. Although it is aimed at Shogi, the question is relevant to Stockfish too.

Article in Japanese
Article translated into English: Is the default network size of the NNUE evaluation function optimal?

Yes, I've mentioned elsewhere that I believe we're ready to start experimenting with different net sizes and input features.

I also think we should contribute to the learning repository (https://github.com/nodchip/Stockfish) to make the code robust and easy to use, so that more people can experiment with net building and training. A good starting point is the scripts by @sergiovieri https://github.com/sergiovieri/Stockfish/tree/scripts/nnue which, however, rely on slightly modified learner sources.

Any pull requests for my repository are welcome.

Since nps is quite different between the beginning and end of the game, I wonder if we should allocate a little more time to the opening now. Maybe @protonspring has an intuition on how to change things..

I will slowly port my changes to nodchip's repo, so that my scripts will be compatible. However, I'm very busy with other things right now, so it will take some time.

Not directed at you Sergio, but to the community as a whole... let me say why I'm pushing to support and help @nodchip with his branch. I believe we have received with NNUE a wonderful gift, and I feel it is important that we give back. This will be a beneficial journey for the full community.

Is SF-dev not going to include the learning part?

Not for now at least; the decision was to maintain that in the nodchip repo. The merge would never have happened (on that short timescale) if the scope of the project had been too broad. It is a lot of effort already.

Computer shogi developers studied many things from Stockfish in the past. And we have also studied many things from the Stockfish NNUE project. This project is beneficial both for the Stockfish community and the computer shogi community.

Actually I think we can try to tune piece values.
They are used in some heuristics in search and are added to static eval, since now we have NNUE static eval they can be quite far off (?)

I'd like to tell you some things about further NNUE improvements.

I contacted yaneuraou via Twitter (the author of blogs about shogi and more recently NNUE and SF), who claimed that if some changes were made to SF (ideas from shogi engines) it could be 200-400 Elo stronger. We've seen that this worked with NNUE, so I asked him for more details/techniques about those ideas, to see if they could be used in SF. He answered me and said he'd make a new post about it. He finally did: http://yaneuraou.yaneu.com/2020/08/21/3-technologies-in-shogi-ai-that-could-be-used-for-chess-ai/

The main ideas are:

  • Stochastic optimization of parameters (and he adds some links); although we already have SPSA...
  • Switching the evaluation function (make different nets for the different stages of the game, as a small net increases its level very slowly when its limit is near). I think this is the most interesting one to consider
  • Automatic generation of opening book; SF would generate a book while playing (I don't know if this would be acceptable for tournaments like TCEC)

He gives more details in the blog, and I think we should take a look at them and see if they are worth testing. NNUE worked for us, maybe these might as well...

The third idea sounds like Brainfish.

2nd idea is probably the most promising.

For starters, here is a very easy way to proceed:

1) train one NN only on the first 20 moves of games, or all positions with more than 10 pawns (mg_value).
2) train another NN only on later moves of games, or all positions with 10 or less pawns (eg_value).

Then scale mg_value and eg_value as we did before.
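
A minimal sketch of what such scaling could look like, assuming a linear interpolation by game phase. The `blend` function, the `phase` argument, and the phase resolution of 128 are illustrative assumptions, and the sketch leaves out any endgame scale factor.

```python
PHASE_MIDGAME = 128  # assumed phase resolution, for illustration only

def blend(mg_value: int, eg_value: int, phase: int) -> int:
    """Interpolate linearly between the two net outputs.

    phase == PHASE_MIDGAME means "pure middlegame", phase == 0 "pure endgame".
    Note: Python's // floors, whereas C++ integer division truncates toward zero.
    """
    if not 0 <= phase <= PHASE_MIDGAME:
        raise ValueError("phase out of range")
    return (mg_value * phase + eg_value * (PHASE_MIDGAME - phase)) // PHASE_MIDGAME

print(blend(100, 20, 128))
print(blend(100, 20, 64))
print(blend(100, 20, 0))
```

With this scheme both nets have to be evaluated in every position that is not at either extreme of the phase range, which roughly doubles the inference cost there.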

Later, I'd expect to have different NN for different endgames, but this is a bit far in the future.

My thought is that training should not be based on move number but be more aligned with piece count: 32 to 24, 24 to 16; endgame training would be from piece count 16 to 12, 12 to 8, or something similar.

I had the same idea as MichaelB7: make an NN (ideally) for every piece count from 3 (I don't think we need an NN for K vs K) up to 32. An even more "heavy but precise" approach would be to make an NN for every possible material configuration with a certain piece count.
Well, I know it's not an easy thing to do, but I'm talking about "ideally".
For now, even something like a separate endgame NN would be a good thing, I guess? There is a lot of room for experimenting.

these things on piececount are already being tried by Sergio and tttak, btw.

It's not urgent by any means, but Brainfish-like book building is something that should be tackled at some point imo. Regardless of whether it would be allowed in competitions or not, it's just knowledge that an engine could use, especially since the opening phase has been very much neglected so far. Brainfish is an interesting project, but its learning platform is not open, and being dependent on one person doing updates is a dead end.

The third idea is indeed Cerebellum (BrainFish only being a vehicle for it). The only minor issue in the original post by yaneuraou is that the opening book tree cannot be recalculated (perfectly) with a pure minimax search, because it is not a "tree", but a graph with loops / repetitions.

@dorzechowski, I think it could be possible to make the Cerebellum platform open (I cannot decide this by myself alone). I've never thought about this, because using such a self-generated library in engine tournaments, competitions etc. was usually widely rejected, except by Stefan Pohl. Maybe because it was never possible to distinguish such an automatically self-generated opening book library from a handcrafted book, because of the public opening book format used.

Open implementation of idea number 3: https://github.com/noobpwnftw/chessdb
Full data is available, so you can work on your own back-propagation method: the data at the very leaves is simply the depth-24 result of a fairly recent SF, and the shape of the tree is developed by depth-7 MultiPV with a 200cp margin or a minimum of the top 5 moves.
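
One possible reading of that expansion rule, as a sketch only: keep every move within the margin of the best score, but never fewer than the top five. The function `select_moves` and its parameters are hypothetical and are not taken from the chessdb code.

```python
def select_moves(scored_moves, margin_cp=200, min_moves=5):
    """Pick moves to expand from (move, score_cp) pairs.

    Keeps all moves within margin_cp of the best score; if that yields
    fewer than min_moves, falls back to the top min_moves by score.
    """
    ranked = sorted(scored_moves, key=lambda ms: ms[1], reverse=True)
    if not ranked:
        return []
    best = ranked[0][1]
    within = [m for m, s in ranked if best - s <= margin_cp]
    if len(within) >= min_moves:
        return within
    return [m for m, _ in ranked[:min_moves]]

scores = [("e4", 50), ("d4", 40), ("a3", -100), ("h4", -200), ("g4", -300), ("f3", -400)]
print(select_moves(scores))
```

Here only three moves are within 200cp of the best one, so the minimum of five applies.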

Regarding the first point, @nodchip told me about the implementation he used. I took his scripts and created a repo (https://github.com/unaiic/optimizer) where we can adapt them to SF and see how it goes. The scripts make use of Hyperopt, although we could also use Optuna; we should see what works best in this case. I'll also mention this in fishtest, as they'll surely have more expertise with this :)

Close to the 2nd idea, I previously implemented HalfKP_GamePly40x4 and HalfKP_PieceCount.
They subdivide HalfKP into 4. (features : 41024 * 4 = 164096)

I'm not familiar with chess myself, so there may be a better way to implement it.
I heard that Sergio trained HalfKP_PieceCount a bit, but it didn't work well.

NNUE net can be split into two parts: feature_transformer and network.
The above implementations switch the feature_transformer part [164096->256x2], but the network part [256x2-32-32] is the same in all phases.
So it may be a little different than really switching 4 nets.
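
The subdivision can be pictured as an offset into a four-times-larger feature space. The sketch below is an illustration, not the actual HalfKP_PieceCount implementation: the bucket boundaries in `piece_count_bucket` are hypothetical, and `halfkp_index` stands for whatever index the plain HalfKP feature set would produce.

```python
HALFKP_FEATURES = 41024  # per-bucket feature count quoted above
NUM_BUCKETS = 4

def piece_count_bucket(piece_count: int) -> int:
    """Map a piece count (2..32) to one of 4 buckets (boundaries are hypothetical)."""
    if not 2 <= piece_count <= 32:
        raise ValueError("piece count out of range")
    return min((32 - piece_count) // 8, NUM_BUCKETS - 1)

def subdivided_index(halfkp_index: int, piece_count: int) -> int:
    """Shift a plain HalfKP index into the slice that belongs to the bucket."""
    if not 0 <= halfkp_index < HALFKP_FEATURES:
        raise ValueError("HalfKP index out of range")
    return piece_count_bucket(piece_count) * HALFKP_FEATURES + halfkp_index

print(NUM_BUCKETS * HALFKP_FEATURES)  # total input features
print(subdivided_index(123, 32), subdivided_index(123, 10))
```

Because only the input index changes, the later layers are shared across buckets, which matches the remark that this is not quite the same as switching between four full nets.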

Naive question, but how does training work currently? Just feed the net with a large number of positions and desired evaluations where the desired evaluations are determined by relatively shallow SF (SF-dev?) searches?

The more SF-dev fiddles with the NNUE scores (hybrid, multiply by 5/4, dampen with rule50_count()), the more problematic this would seem to get.

To train, you need a large amount of data (around 1 billion positions is the norm right now), each entry containing a FEN + eval + game result. It's usually depth 8 to 14 data. In training, the parameter "lambda" is used to give more importance to the eval or to the game result (1 is pure eval and 0 is pure result).
This data can either be generated by the training binary or converted from any other source (there are PGN converters, for example). You can even use non-SF data.
If you use the training binary to generate data, you can choose to create data from the classical eval or from a net by switching the Use NNUE UCI option on and off. The latest source from nodchip uses hybrid when Use NNUE = true (not sure a lot of people have experimented with that yet).
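
As a rough illustration of the role of lambda, the sketch below blends an eval-derived win probability with the game result. It is a simplified assumption, not the learner's actual loss code: `EVAL_SCALE`, `win_probability`, and `training_target` are hypothetical, and the game result is taken to be 1, 0.5, or 0 from the side to move's point of view.

```python
import math

EVAL_SCALE = 600.0  # hypothetical centipawn-to-probability scale

def win_probability(eval_cp: float) -> float:
    """Squash a centipawn eval into (0, 1) with a logistic curve."""
    return 1.0 / (1.0 + math.exp(-eval_cp / EVAL_SCALE))

def training_target(eval_cp: float, game_result: float, lam: float) -> float:
    """lam = 1 uses only the search eval, lam = 0 only the game result."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must be in [0, 1]")
    return lam * win_probability(eval_cp) + (1.0 - lam) * game_result

print(training_target(0, 1.0, 1.0))  # pure eval
print(training_target(0, 1.0, 0.0))  # pure result
print(training_target(0, 1.0, 0.5))  # equal mix
```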

I understand the decision to keep the net-training code in the nodchip repo. However, it seems to me that the net-training process is presently not set up to be in the spirit of the Stockfish project. That's because presently, so far as I can tell, net-training itself is done by individuals working individually, rather than by a community working together. This is not at all a criticism of the present net contributors. My hat is off to them in respect and gratitude for all their work and time, especially @sergiovieri. Rather, it seems to me that an open-source project like SF would benefit from being transparent and collaborative in the net-training process too, as something done as a team in community. As a team we have people (e.g. DragonMist) who have chess knowledge about important chess databases. We have good connections with the Brainfish/Cerebellum folks. And noob's database is also potential resource for net-training. These skills, connections, and resources, among others, if all brought together in a shared collaborative project, could potentially improve the net-training process. If our net-training process were collaborative and participatory, we could learn from each other what works, and what doesn't, and build on these discoveries. That seems more in keeping with the spirit of the Stockfish project.

https://github.com/sergiovieri/Stockfish/commit/da601927a61d1855bafa08ba6d771727cd1b5b91#commitcomment-41720841

Yes, I agree that the process could be improved; in particular I would love to see an end-to-end workflow documented based on the nodchip repo (i.e. without local hacks). Only when we have well-established workflows will it become possible to set up a framework like fishtest to do e.g. data generation. I do note that there is quite some online collaboration, but it is taking place on Discord: https://discord.gg/c4aBQt (should be a short-lived invite to Discord, don't know how this goes otherwise).

I don't have any strong opinions for now. I will follow a process once the community decides on it.

Thanks @vondele. The SF discord is much more active than when I visited it a month ago, and I'm glad to see that. Perhaps we could think about a list of preliminary questions we would need answered in order to arrive at a well-established workflow for distributed development of nets. The SF discord crew would likely help conduct tests to answer those questions. And someone might even volunteer to be a designated point-person for the project of developing the needed workflow.

Jouni did a test, and I confirmed the result, of NNUE without Syzygy vs Classic+Syzygy. No conclusion can be drawn since both tests were so small. It's just an idea, and I hope someone or Fishtest can do some serious tests.

http://talkchess.com/forum3/viewtopic.php?f=2&t=74880

I'm thinking how to make progress in my repository. My current procedure is:

  1. Implement ideas posted in the Issues in my repository.
  2. Merge pull requests from other developers.
  3. Create a branch to merge the official stockfish master.
  4. Post to fishtest to check if there are regressions in the engine part.
  5. Merge the branch to the master of my repository.
  6. Release a new binary set.

There are at least two problems in my procedure.

One is that we cannot avoid introducing bugs when the engine part changes. Recently, some developers reported that the learn command does not work well. I guess this is because of the introduction of hybrid eval, but I have not investigated. I think similar regressions will happen in the future. We need concrete methodologies to test the program to avoid these kinds of regressions.

The other is that it is hard to detect bugs in the training data generator and trainer. When I implement a new feature in the training data generator or trainer, I add a new option and leave the new feature disabled by default. This avoids the new feature being implicitly enabled when someone uses a new machine learning binary, which would change the results of training data generation or training. It also avoids users encountering new bugs in new features. But we cannot avoid introducing bugs in the new features themselves. We also need concrete methodologies to test the program to avoid bugs in new features.

Are there any ideas?

Re the learn command... Besides hybrid we also use the 50 move rule to damp down the evals. I don't know if the 50 move rule is or should be encoded as a NNUE feature.

The current parameter sets seem to have reached their ceiling, at least without extreme effort like SPSA on all parameters. And that's good, because it means a new net architecture can be evaluated in a few days. OTOH, if extreme efforts are spent on current arch, there will not be a chance for a better arch to be proved as such.

I suggest to try changing the activation function to clamped quadratic. This would allow the network to represent products of layer outputs (using the formula 4*a*b=(a+b)^2-(a-b)^2)
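
The identity can be checked numerically. In the sketch below, `clamped_quadratic` is a hypothetical definition (clamp to [-1, 1], then square), since the exact clamp range is not specified above; under that definition the product is recovered only while both a+b and a-b stay inside the clamp range.

```python
def clamped_quadratic(x: float, limit: float = 1.0) -> float:
    """Hypothetical activation: clamp to [-limit, limit], then square."""
    c = max(-limit, min(limit, x))
    return c * c

def product_via_squares(a: float, b: float) -> float:
    """Recover a*b from two squared terms: 4ab = (a+b)^2 - (a-b)^2."""
    return (clamped_quadratic(a + b) - clamped_quadratic(a - b)) / 4.0

print(product_via_squares(0.25, 0.5), 0.25 * 0.5)  # inside the clamp range
print(product_via_squares(0.75, 0.75), 0.75 * 0.75)  # a+b is clamped, so they differ
```

The second call shows the limitation: once a sum or difference saturates, the output is no longer the exact product.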

I'm thinking how to make progress in my repository.
Are there any ideas?

@nodchip this is indeed a difficult problem to solve. Some form of automated testing will be needed, be it in the form of unit tests for those parts that are suitable, or in the form of regression tests. I believe that one needs to establish a few scripts (probably python glue) that run all the steps of data generation and training (obviously on small amounts of data and small number of training steps, i.e. minutes of run time), that can be tested, checking if basic properties of e.g. the optimization are in a reasonable range. While this can't capture everything, it will catch some things.

In my experience, adding options or optional features will often lead to more bugs, especially if not all options, and combinations of options, are tested carefully.

@vondele Thank you for the advice. I will create a semi-automated test first. The test will generate training data, generate validation data, train, and check whether the cross entropy has decreased. But I think the test will take a long time. I will think about whether we can make the test fully automatic.

I agree that adding options or optional features will often lead to more bugs. On the other hand, many options would be helpful for experiments. I will think about what we can do to avoid introducing bugs when we add options or optional features.
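
The shape of such a check can be sketched on a toy problem. Everything here is a stand-in: a one-feature logistic model replaces the net, synthetic data replaces generated positions, and `smoke_test` is a hypothetical name. The only point is the structure: generate training data, generate validation data, train briefly, and compare validation cross entropy before and after.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, data):
    """Mean binary cross entropy of the model p = sigmoid(w*x + b)."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(data)

def make_data(rng, n):
    """Synthetic samples: label is 1 when x > 0, else 0."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 1.0 if x > 0 else 0.0))
    return data

def train(data, steps=200, lr=0.5):
    """Plain full-batch gradient descent, starting from w = b = 0."""
    w = b = 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

def smoke_test(seed=0):
    """Return validation cross entropy before and after a short training run."""
    rng = random.Random(seed)
    train_set, val_set = make_data(rng, 200), make_data(rng, 100)
    before = cross_entropy(0.0, 0.0, val_set)
    w, b = train(train_set)
    after = cross_entropy(w, b, val_set)
    return before, after

before, after = smoke_test()
print(before, after)
assert after < before, "training did not reduce validation cross entropy"
```

A fixed seed keeps the run reproducible, so a failure points at a code change rather than at random variation.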

I think one of the challenges of the learning code is that lots of experimentation is needed to make progress, hence the options. Somehow it makes sense to encourage the experimentation, while at the same time establishing the basic scheme that is known to work. For SF we have a rather clear procedure when to modify master, which simplifies things tremendously when it comes to maintaining stability. That seems, at this point, still more difficult for the learning code. There must be some expertise concerning this question in the ML community, we're not the first to run into this problem.

A clear procedure is always a beneficial overhead, regardless of the number of moving parts involved. The more complicated the way things work together, the more reasons there are for things to go wrong, and the harder it is for someone else to reproduce the results. If people can precisely describe the steps they took to conduct their experiments, then they should have no trouble having others review their work and verify the results.

Meanwhile, tooling for performing ML has nothing to do with how one should conduct experiments. Tools are only required to perform the work as described and have nothing to do with whether one produces a good net or not; these are completely different things.

People run into such problems when conducting ML because, when they fiddle with things here and there, no documentation or record is made for later reference, just as when one edits code with no version control and no backup. Due to the nature of ML, it may not outright fail to compile or crash, but just produce some weird or not-good-enough results.

Hello all, I had a dataset of 13 million chess positions and their evaluations lying around here if it will help train NNUE.

I'll close this issue, new ideas please post in the forum, as an issue, or ideally as a patch or PR.
