Stockfish: SF NNUE

Created on 10 Jun 2020 · 183 Comments · Source: official-stockfish/Stockfish

There has been much discussion on SF NNUE, which apparently is already on par with SF10 (so about 70-80 elo behind current sf dev). People have been saying it can become 100elo stronger than SF, which would basically come from the eval. Since the net is apparently not very big, maybe someone can study the activations of each layer and see if we can extract some eval info from it? In any case, it's probably worth looking into this since it shows so much promise.

NNUE

All 183 comments

I don't know if it's the direction the devs want to go in, but I think integrating ML into SF should be considered given the impressive results.

We should be open-minded and see how things evolve... it is an interesting development. Let's see how the code base evolves, the performance goes, etc. Once we have some data and understanding, we should see what the opportunities are.

Given that attempts to tune Stockfish to match Leela's evaluations have failed in the past, I'm not entirely sure that you can extract much useful information from another similar black box, especially since neural networks have convolution structures that make them useful and less compressible.

EDIT: I found out (anecdotally) that this Neural net doesn't use convolutions. If you want to investigate, you should probably ask on the Stockfish discord or the fork mentioned by vondele below.

I don't know much about SF NNUE. What is it? Does NNUE stand for something?

So it's been claimed on discord that NNUE is now 34elo stronger than SFDev.

I don't think anybody claimed that besides the occasional SSS result.

NNUE definitely is much worse at 10+0.1 STC, but does quickly gain elo on SF_dev as TC increases.

Just for reference, this issue refers to the fork being developed here: https://github.com/nodchip/Stockfish with an eval function based on a neural net architecture.

Data is sounding more and more convincing on this (look at jjosh and lkaufman posts):
http://talkchess.com/forum3/viewtopic.php?f=2&t=74366&start=10#p850204

"Anecdotally", I have several test positions which SF consistently takes up to 50-100 billion nodes or more (or sometimes never finds it) to find the correct move, that SF NNUE finds within a few million nodes. The difference is night and day.

Is there any chance fishtest resources could be used for this? Or if we could somehow run one of these "patches" (SF NNUE) against "master" with SPRT elo bounds at 180+1.8? I think it might pass very fast!

@ssj100 But look at the number of games though. It's not even thousands of games, just dozens. That's hardly convincing at all. I would, however, love to see an LTC match of NNUE vs SF, though I don't know if it's supported by fishtest (probably not).
@vondele

well, I think we should slowly start to think about how we can utilize fishtest to train networks and stuff like this.
This stuff seems to be really promising if it plays on level "not really worse than master" on LTC and CPUs that support AVX in just few weeks of training.
Sure, most of our hardware is quite old, but we have some modern CPUs and it can be trained even on older ones just slower...
So, what I think should be done :) - we should start to train some nets ourselves, maybe have 2 separate code bases or (even better) one code base with NN and handcrafted eval and UCI parameter to switch between them - people with older CPU can stay on handcrafted eval and people with modern CPUs can utilize NNUE.
I honestly think that NNUE will be the future: the newest CPUs make it pretty fast, and it can help to just walk over the corner cases that corrupt SF play a lot. Honestly, the fact that NNUE plays at reasonable strength in its really early days is one of the main reasons why I basically stopped writing eval patches :)
I know all I say will require quite a lot of work from both developers and maybe even fishtest maintainers, but some day it still needs to be done, imho.

"Cornercases that corrupt SF play a lot" I'll bet there's equally many (if not more) corner cases to be met with the NNUE architecture, given that even leela has lots of trouble with its own kind of corner cases, especially those of which that are both distant to mate and require pruning exponentially larger search trees. Current SF has a reasonable combination of search code and eval code to be able to direct it to finding improvements in obscure endgames and make those problems far less difficult by deliberation. This may make it easier to identify and fix specific problems. In my experience with neural networks, specific problems are far harder to fix when trying to generalize evaluation.

Also, NNUE may not provide a higher ceiling than handcrafted evals because of the inefficiency of information packing in Neural Networks as opposed to formal handcrafted evaluation. NNUE can only be so large of a network that it'll probably hit its limit and it will stop improving after a certain point, much like how Leela's network architecture has hardly improved since it first had squeeze and excitation (SE) nets. That said, it's easier to train this NNUE than Lc0 because it's got so many fewer variables, so designing improvements (in the short term at least) may come easier to it.

So I'd still be a bit skeptical (even though I predict NNUE will be better in the near future) of the long-term implications of NNUE. I fear that SF could get stuck in a local minimum with NNUE when the NN stops improving, and people would lose interest in the SF project instead of returning to the handcrafted evaluations with a higher Elo ceiling.

If AlphaZero came 2 years earlier and blew everyone out of the water then, it probably would have made many people abandon SF instead of realizing there is still great potential for handcrafted evaluations.

The SF project is probably one of the largest (if not the largest) open source projects of handcrafted feature recognition and in my opinion it would be a shame if it were just to become an exhibit in a github museum.

All this said, it's just my experience from watching from the Lc0 stand of things.

The difference is that 80% of elo sf gains are improvements of search. So even if eval will be "stuck" - well, it's not THAT big of a deal, tbh.
Also no one prohibits you from continuing to improve handcrafted eval if nn will get stuck.

I don't think handcrafted evaluation should be abandoned, as the possibility of it having a higher ceiling remains. That being said, as Viz mentioned, handcrafted search appears to be "unthreatened" anyway, so the "SF project" won't become an "exhibit in a github museum" regardless. People shouldn't forget that a big reason of why SF NNUE is so strong already is because of its strong search. For example, I'd predict that if Komodo NNUE was released (Komodo being the 2nd strongest CPU-alone engine), it would still get crushed by native SF.

However, my point was that it may be prudent to do some "testing on fishtest" for the NNUE component, if just to become adept to using/testing/training it. The handcrafted eval component should still continue as much as possible, but perhaps when it comes to submitting SF for tournaments etc, the strongest version of SF should be submitted at the time (whether it's native SF or SF NNUE).

From watching the games currently played at CCCC I get the feeling that NNUE will over-evaluate certain endgames and native evaluation would somehow have to take over anyway (to gain elo, that is.) Some stark misevaluations make native SF a more reliable component of the engine in certain cases. That said, search behavior could end up being weird if there was a huge mismatch between NNUE evaluations and native evaluations. What I imagine might happen is that certain endgames get left to some specialized threads which take care of the native evaluations while the other threads search elsewhere with NNUE to prevent holdup. Dynamically updating which threads take care of which might improve behavior.

(e.g. NNUE seemed to evaluate a drawn KRPPPVKRPP endgame +3 while native SF was able to evaluate it at +1)

Problem is you don't really have a way to decide which eval is correct and which is not even with shallow search. With native eval, people spot certain problems and write patches, they still often break more stuff than they fix by failing fishtest, so how is NNUE going to magically make this problem disappear is beyond me.

Those misevaluations are mostly the result of the data it's been trained on.*
It's at the end of the day still a net that has only seen a lot of depth 8 games and a bunch of depth 12 games.

Things should eventually improve, once we can get fishtest, or Leela or Noob's data to work.

Anyway, I turned skeptical about its scaling after seeing a fixed node test at 1m, 10m and 20m. But maybe Jjoshua's net has fixed that.
We'll see over at TCEC, Jjosh's net should be stronger than mine and TCEC is less likely to bork settings than CCC.

*But a lot of them will exist even if we use deeper data, SF evaluating a draw endgame as +1 is just as wrong as Leela saying +0.8 or NNUE +3.4.

what kind of training data should those games be? All fishtest LTC games are available with scores for each position, roughly depth 20-25 that is, that's literally billions of scored positions.

A few others have experimented with the data but had some strange behaviour.
Either because they weren't converted correctly or maybe an issue with the learning function itself.

concerning settings and nets, it would be useful if the nodchip github repo would indicate in the readme what the current optimal settings are, and give a download link to the current best net. I gave up trying to find the info when I wanted to test the fork. I know that there is, of course, a variety of opinions on these topics, but for people that want to get something running quickly, that would be very helpful.

@gekkehenker it's much harder* to tune a neural network to give desired relative evaluations than it is for the handcrafted alternatives.**

*might have to be proven to be known true, but stockfish's evaluations are tuned to beat other versions of itself. that makes the patches that pass alive out of fishtest very good at introducing adversarial play, which a small neural network trained on external data could not provide to such high fidelity. what ends up happening against stronger or "drawish" opponents is the neural network tends to prefer things which itself cannot evaluate properly instead of being able to focus on generating play from its own internal strengths.

**"handcrafted alternatives" rely on far more concrete values to evaluate a position, so the effect of any small difference in evaluation that might find a win or draw is magnified. Also, the deeper the search, the more the neural network's false positives affect how the edges of the search behave, especially in drawn endgames bound by the 50-move rule.

@noobpwnftw being able to distinguish when our handcrafted evaluations are better to use could rely on a table of precalculated values loaded from a file, which would let us determine which evaluation method is better for a given number and type of pieces on the board. We can create such an evaluation-accuracy piece table by using the mean squared error of an evaluation against the result of the game, for which we might have to figure out how the new network's evaluations convert to "actual" win percentage. One potential downside is that it might get a bit messy if different networks have different strengths.
Then again, maybe there's a lot of slowdown in figuring out which pieces are on the board and loading the table. Maybe simply using the number of pieces on the board, or some value that measures how much the tree is branching, is enough.
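
The table idea above could be sketched roughly like this. It is only an illustration: the logistic conversion from centipawns to expected score, its scale of 400, and every name in the code (`expected_score`, `AccuracyTable`, the labels "nnue" and "classical") are assumptions made for the example, not anything taken from Stockfish or the NNUE fork.

```python
from collections import defaultdict


def expected_score(cp: float, scale: float = 400.0) -> float:
    """Map a centipawn eval to an expected score in [0, 1].

    The logistic shape and the scale are assumptions; each eval or net
    would need its own fitted conversion.
    """
    return 1.0 / (1.0 + 10.0 ** (-cp / scale))


class AccuracyTable:
    """Squared error of each eval against game results, per piece count."""

    def __init__(self) -> None:
        # piece_count -> eval name -> [sum of squared errors, sample count]
        self._stats = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))

    def add(self, piece_count: int, name: str, cp: float, result: float) -> None:
        """result is 1.0, 0.5 or 0.0 from the point of view of the eval."""
        entry = self._stats[piece_count][name]
        entry[0] += (expected_score(cp) - result) ** 2
        entry[1] += 1

    def mse(self, piece_count: int, name: str) -> float:
        total, n = self._stats[piece_count][name]
        return total / n

    def best(self, piece_count: int) -> str:
        """Name of the eval with the lowest mean squared error."""
        evals = self._stats[piece_count]
        return min(evals, key=lambda name: self.mse(piece_count, name))


# Made-up sample: a drawn 9-piece rook endgame scored +3.0 by one eval
# and +1.0 by the other.
table = AccuracyTable()
table.add(9, "nnue", 300.0, 0.5)
table.add(9, "classical", 100.0, 0.5)
print(table.best(9))
```

A real table would need many positions per bucket, and probably a finer key than piece count alone.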

This link contains a few Windows compiles (popcnt, avx2, bmi2) and my current strongest net:

https://workupload.com/file/ggEUrvNVgmH

It seems like the latest binaries (same goes for the binaries on Nodchip's repo) fixed a few bugs.
No longer need to adjust slowmover, 100 works perfectly now.
Extreme elo gain, on older binaries my nets were always 100+ elo weaker than SF. They now test stronger than SFDev...

It's roughly as simple as SF now. UCI option "evalfile" has to point towards the NN file.
In files above it's by default "eval\nn.bin", but this can be changed to anything now. As long as it points towards the correct binary file.
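
As a rough illustration of that setup, these are the commands a GUI or script would send to point the engine at a net. `setoption name ... value ...` is standard UCI syntax; the option spelling `EvalFile` is the one used in the cutechess command line elsewhere in this thread, and the helper name `uci_setup` is made up for this sketch.

```python
from typing import List


def uci_setup(eval_file: str, threads: int = 1) -> List[str]:
    """Return the UCI commands that select a net file before a game starts."""
    return [
        "uci",
        f"setoption name EvalFile value {eval_file}",
        f"setoption name Threads value {threads}",
        "isready",
    ]


for line in uci_setup(r"eval\nn.bin"):
    print(line)
```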

There's sadly not a lot of centralized information because it was originally nothing more than a quick port to test if NNUE works in chess too. Whatever I know is built upon quick instructions from Twitter, looking through the learner.cpp code and google translated YaneuraOu docs:

https://twitter.com/nodchip/status/993432774387249153
https://github.com/nodchip/Stockfish/blob/master/src/learn/learner.cpp
https://github.com/yaneurao/YaneuraOu/blob/master/docs/USI%E6%8B%A1%E5%BC%B5%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89.txt

Just thought it'd be important to post some real results in my testing so far.

I've been testing with these conditions for many years, including with SF8, SF9, SF10, SF11, SF12dev, H5-6, K10-14.

  1. These are the general conditions:
    -GUI = cutechess
    -1-core
    -No TB
    -Time Control = 60 seconds +0.6
    -Book = Balsa_v500.pgn (500 lines mainly up to 5 moves)

  2. This is the information for each engine:
    -SF = from abrok compile "July 11" 2020, all default settings
    -SF NNUE binary component = from nodchip compile "July 13" 2020, all default settings (it's important to use this binary, as older binaries were 50-100+ elo weaker for some reason)
    [This means both engines are using a very recent version of SF's "search code". As already discussed/mentioned in many places, the functional difference between each engine is that the abrok SF obviously uses SF's "eval code", while SF NNUE completely disables this "eval code" and uses code from a trained net ("nn.bin")]
    -SF NNUE net component = gekkehenker net from 27 June 2020 (which was created entirely using SF self-play games with a binary from June 2020)
    *Start position SF speed: ~1800Mnps
    *Start position SF NNUE speed: ~1100Mnps (~60% of SF speed)

  3. Here is the result so far:
    SF NNUE vs SF: 78 - 53 - 369 [0.525]
    Elo difference: 17.39 +/- 15.54
    500 of 1000 games finished.
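
For anyone wanting to check result lines like the one above, here is a minimal sketch of the arithmetic. The Elo figure follows from the standard logistic formula. The error margin uses a normal approximation on the per-game variance, which is my assumption about how such margins are computed; it comes out within a few hundredths of the quoted +/- 15.54.

```python
import math


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, losses: int, draws: int):
    """Return (elo, margin), with margin as a 95% half-width in Elo."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + d / 2.0
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = w * (1.0 - mu) ** 2 + l * mu ** 2 + d * (0.5 - mu) ** 2
    half_width = 1.959964 * math.sqrt(var / n)
    low = elo_from_score(mu - half_width)
    high = elo_from_score(mu + half_width)
    return elo_from_score(mu), (high - low) / 2.0


elo, margin = elo_with_margin(78, 53, 369)
print(f"Elo difference: {elo:.2f} +/- {margin:.2f}")
```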

I'm going to let it run to 1000-games mainly just for future consistency.
Some musings:

  1. You can already see SF NNUE is very likely about on par with latest SF (possibly better)
  2. The NNUE concept has likely only (publicly) been experimented with in the last few weeks in computer chess
  3. @gekkehenker literally only spent a few days creating the "eval net" above and using very limited hardware resources (literally one computer with one CPU - 6 cores/12 threads)
  4. If 1. is true, this effectively means gekkehenker has, by himself, literally managed to match (or possibly surpass) the elo strength of SF's "eval code" within a few days and with a tiny fraction of "CPU hours" of fishtest. That is, he has done what SF/fishtest (with hundreds of developers, thousands of "CPU-years" and about 12-years of hand-crafted coding/testing) has managed in a fraction of time and resources
  5. It remains to be seen if scaling for SF NNUE is good, but all the data out there so far strongly suggests that it is
  6. I can only imagine what fishtest and the SF community can achieve together with its ample resources and incredible developer talent
  7. One way forward would be to split fishtest resources, to something like as follows (assuming a default of about 1500-cores is available):
    -1000-cores to continue handcraft search improvement patches
    -100-cores to continue handcraft eval improvement patches
    -400-cores to train "NNUE"
    (Clearly this proportion can be changed accordingly as per the optimal needs etc)

Anyway, thanks to @gekkehenker and nodchip for continuing to share their knowledge publicly!

I didn't have much luck with anything I tried so far but with the link from @gekkehenker low TC
is testing great for me. Using settings close to fishtest, 10+0.1 same book and default settings:

Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 2742 - 1735 - 5595 [0.550]
Elo difference: 34.85 +/- 4.51

I'm not really sure I understand/trust it completely though. I did try to double check everything
but can't see anything obviously wrong. I'm going to test 20+0.2 now.

Yes, the first time I saw the results of the new binaries I couldn't believe them either.
"I must have done something wrong" is what I thought.

In an era where a 5 elo patch is believed as too good to be true, a 30 elo "patch" must be impossible to believe.

Your result is "consistent" with basically every test done so far (including mine) that used nodchip's binaries (or equivalent) from July 11th or later. Again, testing with the newer binaries is crucial (probably stick with July 13th binary until we're absolutely certain of the strength improvement), as older binaries were for some reason 50-100+ elo weaker - SF is so far ahead of the rest that it was still a relatively strong engine, around the level of Komodo 14.

It appears that the elo difference at 10+0.1 (and likely even shorter TC) is likely bigger than at 60+0.6. The elo difference seems to be around 30-50 at the shorter TCs, and around 15-35 at the longer TCs. It'd be interesting to see if fishtest can verify these numbers - ideally test at its usual TC for patches - 10+0.1 and 60+0.6 with 1-thread, and 5+0.05 and 20+0.2 with 8-threads, all to 40,000 games each or similar.

Yeah fishtest tests would be quite something if that is possible. My own test for 20+0.2 I stopped when it was giving a similar result:

20+0.2: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 506 - 292 - 1224 [0.553]
Elo difference: 36.91 +/- 9.47

and then I started the more interesting 60+0.6 and that, while with little amount of games so far, did as well:

60+0.6 hash64: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 204 - 105 - 663 [0.551]
Elo difference: 35.51 +/- 12.23

Just to follow up on my testing from above. The 1-core test finished as follows:

SF NNUE vs SF: 161 - 103 - 736 [0.529]
Elo difference: 20.17 +/- 11.02
1000 of 1000 games finished.

2-core test with exactly the same conditions as above, currently showing even better results, although sample sizes are tiny to draw any conclusions about scaling:

SF NNUE vs SF: 81 - 30 - 327 [0.558]
Elo difference: 40.64 +/- 16.14
438 of 1000 games finished.

So, with the net from @gekkehenker (c157e0a5755b63e97c227b09f368876fdfb4b1d104122336e0f3d4639e33a4b1 nn.bin) and current master (https://github.com/nodchip/Stockfish.git 7a13d4ed60b09a9ce1b5aee46aa2a596bc4ca0fd) I get the following results:

STC (10.0+0.1 @ 1 thread)
Score of master vs nnue: 940 - 2206 - 3973  [0.411] 7119
Elo difference: -62.4 +/- 5.3, LOS: 0.0 %, DrawRatio: 55.8 %

LTC (20.0+0.2 @ 8 thread)
Score of master vs nnue: 189 - 463 - 1332  [0.431] 1984
Elo difference: -48.3 +/- 8.7, LOS: 0.0 %, DrawRatio: 67.1 %

That's a bit better than the results posted previously. The cutechess cmdline is quite standard:

./cutechess-cli -repeat -rounds 10000 -games 2 -tournament gauntlet -resign movecount=3 score=400 -draw movenumber=34 movecount=8 score=20 -concurrency 15 -openings file=noob_3moves.epd format=epd order=random plies=16  -engine name=master cmd=stockfish.master -engine name=nnue cmd=stockfish.nnue option.EvalFile=/home/vondele/chess/match/nn.bin -ratinginterval 1 -each tc=10.0+0.1 proto=uci option.Threads=1 -pgnout nnue.pgn

Tests on ccc seem to indicate that nnue can't handle more than 64 threads though? Is that true or is ccc nnue set up incorrectly? Anyway, I highly doubt blitz tests represent the true strength difference at vltc (I'm talking about tcec conditions). I expect at best +20 elo in those conditions (which by the way was my prediction on how much better leela was back when a horde of leela fans were claiming +50 at least).

well it is unlikely that fundamentally nnue would show worse threading behavior. After all, this is just changing eval, which is really threading-independent. However, there could be threading related bugs, or new threading-related bottlenecks that haven't been found. That could happen in a relatively new code. Another thing to consider is that there might be a difference in performance wrt. hyperthreading as the nnue has different characteristics (e.g. avx2 intensive). A first test at a higher thread count here seems fine:

VLTC (20.0+0.2 @ 16 threads)
Score of master vs nnue: 292 - 698 - 2202  [0.436] 3192
Elo difference: -44.4 +/- 6.6, LOS: 0.0 %, DrawRatio: 69.0 %

On CCC it was previously running on WINE, hence the 64 thread limit.
It's running on 90 threads now.

well I think that 40+ elo perf is enough to justify putting effort into it :)

One way of possible integration would be to keep the UCI option to skip loading NN but have the normal eval used as the base, also used in case a NN file is not present. That way on fishtest people could also test against it by setting options. Currently in NNUE if you skip loading eval it will just have none at all.
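
A minimal sketch of that fallback, just to make the idea concrete. Everything here is hypothetical: `classical_eval` and `make_nnue_eval` are placeholders that return 0, and the `use_nnue` flag stands in for whatever UCI option would control it.

```python
import os
from typing import Callable, Optional

Evaluator = Callable[[str], int]  # takes a FEN string, returns centipawns


def classical_eval(fen: str) -> int:
    """Placeholder for the handcrafted evaluation."""
    return 0


def make_nnue_eval(path: str) -> Evaluator:
    """Placeholder for loading a net; a real loader would parse the file."""
    def nnue_eval(fen: str) -> int:
        return 0
    return nnue_eval


def select_evaluator(use_nnue: bool, eval_file: Optional[str]) -> Evaluator:
    """Use the net only when requested and present, else the classical eval."""
    if use_nnue and eval_file and os.path.isfile(eval_file):
        return make_nnue_eval(eval_file)
    return classical_eval
```

With this shape, a test framework could run the same binary in both modes just by flipping the option.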

Beyond that the fact that the normal eval gets about double the nps could mean that it's still more efficient in some form, maybe for endgames? Lazy eval comes to mind as an example of an elo-gaining change in evaluation depending on the game.

Meanwhile some limited results at 20.0+0.2 @ 250 threads, looks consistent with the other numbers so far.

Score of master vs nnue: 13 - 34 - 153  [0.448] 200
Elo difference: -36.6 +/- 23.0, LOS: 0.1 %, DrawRatio: 76.5 %

Played 22,000 games at TC 10s+1s with sf-nnue-bmi2-256halfkp: http://ipmanchess.yolasite.com/i9-7980xe.php
+37 Elo; Ordo shows +39.6 Elo above Stockfish 11!

1) sf-nnue-bmi2-256halfkp 3530.1 : 22000 (+14921,=6644,-435), 82.9 %

vs.                                :  games (     +,    =,   -),   (%) :    Diff,    SD, CFS (%)
Stockfish 11 x64 bmi2              :   1000 (   281,  631,  88),  59.6 :   +39.6,   3.1,  100.0
Stockfish 10 x64 bmi2              :   1000 (   389,  547,  64),  66.3 :   +87.9,   3.0,  100.0
asmFishW 2018-06-12 bmi2           :   1000 (   379,  574,  47),  66.6 :  +113.2,   2.8,  100.0
Komodo 14 64bit bmi2               :   1000 (   499,  457,  44),  72.8 :  +174.1,   3.2,  100.0
Houdini 6.03 Pro x64 bmi2          :   1000 (   536,  431,  33),  75.2 :  +175.1,   3.1,  100.0
Komodo 13.3 64bit bmi2             :   1000 (   531,  427,  42),  74.5 :  +187.5,   3.0,  100.0
Ethereal 12.13  x64 pext           :   1000 (   658,  332,  10),  82.4 :  +285.1,   2.9,  100.0
Ethereal 12.00 x64 pext            :   1000 (   693,  293,  14),  84.0 :  +294.8,   3.0,  100.0
Komodo 13.2.5 x64 bmi2 MCTS        :   1000 (   732,  260,   8),  86.2 :  +308.5,   2.7,  100.0
Komodo 13.3 x64 bmi2 MCTS          :   1000 (   719,  268,  13),  85.3 :  +310.5,   3.2,  100.0
Xiphos-0.6-w64-bmi2                :   1000 (   671,  318,  11),  83.0 :  +310.7,   3.2,  100.0
Fire 7 x64 popcnt                  :   1000 (   700,  293,   7),  84.7 :  +325.1,   3.1,  100.0
Xiphos-0.5.3-w64-bmi2              :   1000 (   699,  291,  10),  84.5 :  +329.8,   3.0,  100.0
rofChade 2.3 bmi2                  :   1000 (   780,  211,   9),  88.5 :  +378.9,   3.1,  100.0
Laser 1.7 bmi2                     :   1000 (   822,  174,   4),  90.9 :  +407.8,   3.0,  100.0
Fire 6.1 x64 popcnt                :   1000 (   818,  176,   6),  90.6 :  +409.1,   3.0,  100.0
rofChade 2.203 bmi2                :   1000 (   799,  199,   2),  89.8 :  +420.9,   3.2,  100.0
Defenchess 2.2 pop                 :   1000 (   825,  170,   5),  91.0 :  +434.6,   3.1,  100.0
Ginkgo 2.18 bmi2                   :   1000 (   844,  150,   6),  91.9 :  +440.2,   3.1,  100.0
Ginkgo 2.1 bmi2                    :   1000 (   840,  154,   6),  91.7 :  +446.7,   3.1,  100.0
Booot 6.4 x64 pop                  :   1000 (   856,  140,   4),  92.6 :  +453.1,   3.2,  100.0
RubiChess 1.7.2                    :   1000 (   850,  148,   2),  92.4 :  +455.3,   3.3,  100.0

@Ipmanchess can you specify exactly which version of the code and the net you used (git sha, sha256sum of net?). That should help to understand the difference 39 Elo vs SF11 or >40 Elo vs SFdev. This might also be a book effect (I've been using the noob_3moves.epd book).

But the result is +281, =631, -88 (59.65% vs SF 11). Isn't that like a +68 Elo performance? Or is the Elo calculator I use simply for a different Elo calculation?
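
For reference, a quick sketch of the head-to-head conversion for the Stockfish 11 row of the table above, using the standard logistic Elo formula. The Diff column in that table is a rating difference fitted over the whole pool of opponents, so it does not have to match the head-to-head number.

```python
import math


def score_fraction(wins: int, draws: int, losses: int) -> float:
    """Score as a fraction of games, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + draws + losses)


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def score_from_elo(elo: float) -> float:
    """Expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


s = score_fraction(281, 631, 88)
print(f"{100 * s:.2f}% -> {elo_from_score(s):+.1f} Elo head-to-head")
print(f"+39.6 Elo would correspond to {100 * score_from_elo(39.6):.1f}%")
```

So the +68 figure is what the 1000 games against SF11 imply on their own, while a +39.6 difference would correspond to a score of only about 55.7%.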

Note also that NNUE does not have contempt.

The code currently doesn't work with contempt (changing contempt doesn't change evals at all), so it could just be underperformance against weak opponents?

I just tested 384 sized first layer, 30mb net (which is 50% larger ) stockfiNN 0.1 with a fixed binary from 7-14 at 10s+0.1s and despite even further slowdown it still beats sf-dev, 1000 games. It still gets almost 60% speed of sf-dev on my Zen2 arch

Score of stockfinn1 vs stockfish_20070321_x64_modern: 356 - 226 - 418 [0.565]
Elo difference: 45.4 +/- 16.5, LOS: 100.0 %, DrawRatio: 41.8 %
repeated
Score of stockfinn1 vs stockfish_20070321_x64_modern: 376 - 223 - 401 [0.577]
Elo difference: 53.6 +/- 16.7, LOS: 100.0 %, DrawRatio: 40.1 %
Score of stockfinn2 vs stockfish_20070321_x64_modern: 368 - 196 - 435 [0.586]
Elo difference: 60.4 +/- 16.2, LOS: 100.0 %, DrawRatio: 43.5 %
Score of sf-nnue-avx2-256halfkp-Pleomati 7-9 vs stockfish_20070321_x64_modern: 711 - 444 - 845 [0.567]
Elo difference: 46.7 +/- 11.6, LOS: 100.0 %, DrawRatio: 42.3 %

Score of stockfinn1 vs stockfish_20070321_x64_modern: 140 - 79 - 101 [0.595]
Elo difference: 67.1 +/- 31.9, LOS: 100.0 %, DrawRatio: 31.6 %

Score of stockfinn2 vs stockfish_20070321_x64_modern: 139 - 82 - 99 [0.589]
Elo difference: 62.6 +/- 32.0, LOS: 100.0 %, DrawRatio: 30.9 %

sf-nnue-avx2-256halfkp-Pleomati 7-9 bundled with gek 2706
Score of sf-nnue-avx2-256halfkp-Pleomati 7-9 vs stockfish_20070321_x64_modern: 125 - 88 - 107 [0.558]
Elo difference: 40.4 +/- 31.2, LOS: 99.4 %, DrawRatio: 33.4 %

Score of stockfinn2 vs stockfish_20070321_x64_modern: 448 - 207 - 345 [0.621]
Elo difference: 85.4 +/- 17.7, LOS: 100.0 %, DrawRatio: 34.5 %

Massive elo gains, with regular and draw reducing books!

strange since we calculate contempt effect in search.cpp which shouldn't be really changed (?)

but it is a bonus added in evaluate.cpp to the actual score. This for example never happens with nnue
https://github.com/official-stockfish/Stockfish/blob/master/src/evaluate.cpp#L834

Ah, I think they changed where it was. Then yeah, it may be a (lack of) contempt effect. Against sf11 nnue shows 70 elo in this test, slightly lower than it should, but this is also (I guess) due to lack of contempt/luck/etc.

1 engine on CCRL blitz currently.

http://ccrl.chessdom.com/ccrl/404/cgi/engine_details.cgi?print=Details&each_game=1&eng=Stockfish%2BNNUE%20150720%2064-bit%204CPU#Stockfish%2BNNUE_150720_64-bit_4CPU

Elo difference on ipmanchess might be smaller than H2H because it isn't stomping the weaker engines quite as hard as you'd expect it to based on its SF11 performance. Probably contempt.

Thanks for testing this @vondele! The "20.0+0.2 @ 8 thread" result is identical conditions to fishtest SPRT SMP LTC tests, and I'd gather would have passed the SPRT bounds in less than 1000-games?

And yes, I think the different absolute results are likely at least due to different books etc.
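
To make the SPRT question a bit more concrete, here is a rough sketch of a log-likelihood ratio for the 20.0+0.2 @ 8 threads result, seen from nnue's side. It uses a common normal approximation on win/draw/loss counts. The Elo bounds of [0, 2] and alpha = beta = 0.05 are assumptions picked for illustration; fishtest's actual bounds and implementation may differ.

```python
import math


def expected_score(elo: float) -> float:
    """Expected score for a given logistic Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins: int, losses: int, draws: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mean = w + d / 2.0
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = w * (1.0 - mean) ** 2 + l * mean ** 2 + d * (0.5 - mean) ** 2
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


alpha = beta = 0.05  # assumed error rates
lower = math.log(beta / (1.0 - alpha))
upper = math.log((1.0 - beta) / alpha)

# nnue's point of view: 463 wins, 189 losses, 1332 draws.
llr = sprt_llr(463, 189, 1332, elo0=0.0, elo1=2.0)
print(f"LLR = {llr:.2f}, bounds = ({lower:.2f}, {upper:.2f})")
```

Under these assumed bounds the final LLR sits above the upper threshold, so the test would have been accepted at some point before the match ended. Exactly when depends on the order in which the results came in.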

and last number from my side for today, using a bit a longer TC (120.0+1.2)

Score of master vs nnue: 364 - 904 - 2798  [0.434] 4066
Elo difference: -46.4 +/- 5.9, LOS: 0.0 %, DrawRatio: 68.8 %

@vondele, you can always find some comments/info under Testings and choose the right system: http://ipmanchess.yolasite.com/testings-i9-7980xe.php and I also use noob 3moves on my i9 7980XE.

I accidentally ran my 1000 book twice, but I now have results for the same setup (10s+0.1s) with net 2706:

Score of sf-nnue-avx2-256halfkp-Pleomati 7-9 vs stockfish_20070321_x64_modern: 711 - 444 - 845 [0.567]
Elo difference: 46.7 +/- 11.6, LOS: 100.0 %, DrawRatio: 42.3 %

It finished 1 Elo ahead of stockfinn1, obviously within error bars.
EDIT: Updated previous post to have all results!

This is very exciting and all, but what now? Do we just completely abandon handcrafted eval?

Or do we keep trying to improve it? With nnue being only 60% as fast as regular sf, if handcrafted eval can be improved to even just 80% of nnue, regular sf would be on top again.

we can leave it as is and maybe have patches on eval running with lower prio. But it's up to maintainers ofc

Handcrafted eval is very hard to develop, SF10->SF11 gained only 15 eval elo in 1 year. And now we have the first primitive nets already so much superior with a search which is completely optimized for the handcrafted eval. So the search definitely has to split as well, in order to further unlock the NNUE potential.

It can be a shocking realisation that handcrafted eval was abruptly obsoleted. Its asset was speed, so it could battle neck and neck with Leela, but NNUE is 60% as fast, not 1000 times slower.

So I would say to prioritize optimizing SF NNUE, but of course the emotional attachment is understandable, as is developing eval for the fun of it, so why not keep all options open.

@adentong Let's sloppily say x2 speed = 50 elo, and NNUE is 50 elo ahead of vanilla. Let's also pretend that SF search is equally efficient for NNUE and vanilla. This more or less means NNUE eval is 100 elo ahead, so if we increase eval progress to 20 elo/year instead of 15, it will take us 5 years to reach the current performance of NNUE, which took a few weeks on a home PC.

So I think I can safely abandon my successful career of translating chess-oriented logic into rationale, which in turn had to be translated into coding logic, and focus on areas where I truly shine, such as statistics :kappa:
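The back-of-envelope estimate above can be written out as a short calculation. This is only a sketch of that sloppy reasoning, using the thread's own rough figures (50 Elo per speed doubling, NNUE at about 60% of the nps, a lead of about 50 Elo); the constant names are made up for illustration. With the 60% figure instead of a full halving, the eval gap comes out a little under the rounded 100.

```python
import math

ELO_PER_DOUBLING = 50.0   # "x2 speed = 50 elo" (rough figure from the comment)
NNUE_SPEED_RATIO = 0.6    # NNUE nps relative to handcrafted eval, as quoted in the thread
NNUE_LEAD = 50.0          # NNUE's lead over vanilla, rounded

# Elo that NNUE gives up by searching fewer nodes per second.
speed_handicap = ELO_PER_DOUBLING * math.log2(1.0 / NNUE_SPEED_RATIO)
# Eval advantage = observed lead + the speed handicap it had to overcome.
eval_gap = NNUE_LEAD + speed_handicap

print("speed handicap: %.0f Elo" % speed_handicap)
print("eval gap:       %.0f Elo" % eval_gap)
print("years at 20 Elo/year for a rounded gap of 100: %.0f" % (100.0 / 20.0))
```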

No need to abandon 'hand-crafted' eval IMO
A UCI option could be used to simply turn NNUE 'on,' & 'off'

I agree with Norman.
Also we should think about how we can use fishtest to actually train NNUE nets. I think we can achieve much better results with fishtest resources than someone using his 5 machines for this... But it requires a lot of work from maintainers and fishtest admins ofc.

change evaluate.cpp line 895
to UCI option

#if !defined(EVAL_NNUE)

Value Eval::evaluate(const Position& pos) {
  return Evaluation(pos).value();
}

#endif // defined(EVAL_NNUE)

(and do the same in evaluate_nnue.cpp)
should work I think

I'll be reading a bit the code and try to generate my own net. That seems like a good step for any of the devs interested in this technology.

It is not just about having good Elo performance; we need people who understand the code and can maintain (or bug fix) and refine it. In lines of code, it roughly doubles the current code base, but there are various parts of the code that are not directly the engine (i.e. the learning infrastructure). We already have a few SF regulars active on the code, so that is a good start.

Right now, there are still non-Elo-related refinements that can easily go on in the @nodchip branch, for example, making sure it passes the typical CI process, or improving the comments, or making sure all architectures are supported at least in a basic form.

> I'll be reading a bit the code and try to generate my own net. That seems like a good step for any of the devs interested in this technology.

Thanks @vondele - I'm normally not a fan of Discord (either), but would you consider participating there?

@vondele While reading the code and trying to understand it I have done a bit of work that I have pushed to my branch of SF NNUE here: https://github.com/dorzechowski/Stockfish-nnue/tree/nnue-player-wip. Maybe it can be useful.

  • I removed code that is not relevant to playing, so all data generation and network learning is not there and the code is substantially smaller.

  • I refactored and cleaned up a lot of code that was added to SF codebase. The whole NNUE part is almost untouched (apart from cosmetic changes, renamings, etc.) but is now quite well isolated I think.

  • The same executable should handle both NNUE and dev code and eval variant can be switched by UCI option. So in makefile the standard 'build' and 'profile-build' target should be used, all nnue targets are removed.

  • Executable crashed for me compiled using C++14, but works well with C++17 with the same gcc 9.3.0 (on Windows). There are some C++17 concepts in the code now as I used it anyway but they are not necessary. There are 2 compiler warnings left.

  • It plays well (with gek2706 net 256x2), I didn't have any crashes so far on both NNUE and dev eval variants. I get nps ratio around 67% on bench 128 1 20.

@dorzechowski useful work indeed. I'll have a look, but might only get to this for real next weekend.

The fact that sf nnue requires a more recent compiler might actually make this a little more difficult to deploy on fishtest, some people are still with older toolchains. Eventually, a first step could be to test such a variant of the code for non-regression on fishtest (or accurately measure the Elo loss), before one tries to tackle the more tricky aspect of enabling testing with different nets.

@vondele I believe C++17 version is much more careful with alignment which is crucial for AVX2 instructions. The makefile in my branch requires C++17 but even if it doesn't compile on some machines, it's fine for now. Unfortunately we cannot choose requested CPU capabilities on creating the test but we should start trying anyway I think.

I think a good first test would be to treat it as "normal" Stockfish. I may change default UCI option to not use NNUE and it should at least compile, pass bench test and run some quick master vs master test on some machines. If this works then it would be possible to proceed from there. What do you think?

yes, test as a 'normal' Stockfish and do some non-regression test as a first step was what I was suggesting.

I can see the need for C++17... and in principle don't object. Just that it might be problematic for some older machines on fishtest. However, that's something we can eventually try to fix / workaround.

Great, I will try to push a test in a minute.

@vondele Unfortunately, fishtest doesn't let me create a test. I get:
[image]
I don't know what the problem with bench is; it looks correct to me. Here are my parameters:
[image]

@dorzechowski I think it assumes that the master branch of the test repo (dorzechowski/Stockfish-nnue) is actually the SF master with the matching bench.... is that the case?

@vondele Yes, it's updated to the latest master. But it complains about bench of base master, not test branch, I'm confused.

Base will still be a branch from your repository (not from official-stockfish)... Thus, if I look at https://github.com/dorzechowski/Stockfish-nnue/commits/master (which is what it will pick up as base), it will presumably fail to find the proper base signature (i.e. the latest bench in that branch won't match your number).

Ah, that's right! I have correct master in Stockfish repo but not in Stockfish-nnue. I will push it to my Stockfish repo and try again.

Edit: test pushed!

looks like it fails to start. I guess the next hurdle is that the ARCH=x86-64-modern option has been removed, which is what the workers will use by default (IIRC). Maybe that could be hacked around using the x86-64-sse42 options in the makefile.

Oh, I didn't notice that some options were removed. I will reintroduce x86-64-modern and try again. It's ok just for trying if it compiles but actual NNUE will be very slow on this arch.

Edit: pushed again.

yes, sure. Haswell introduced avx2, roughly 7 years ago, so I'm not too concerned if that would be the required 'modern' for NNUE.

A version of SF being tested on Fishtest does not have to be the strongest possible compile since it will only play against a small modification of itself. It is much more important that the barrier of entry for testers is low. So I think making C++17 mandatory would be a bad idea.

The existence of CFish shows that SF has zero need for any of the fancy C++ stuff.

@dorzechowski the current Elo performance (https://tests.stockfishchess.org/tests/view/5f154f61da64229ef7dc17ca) seems to come from a rather significant slowdown on the branch, just as measured by the nps of a bench (about 12% for me, roughly 26 Elo). I guess the origin of that would be useful to figure out & fix.

@vondele This is a bit unexpected as I haven't seen any slowdown with my compiles (gcc 9.3.0, ARCH=x86-64-bmi2, Windows, CPU i7 Kaby Lake). Here are my results from fishbench (base is SF master, test is my nnue-player-wip branch):

Results for 20 tests for each version:

            Base      Test      Diff      
    Mean    1837380   1829760   7620      
    StDev   43427     40355     9416      

p-value: 0,209
speedup: -0,004
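The two summary figures can be reproduced from the table. The speedup is just the relative difference of the means. For the p-value, one reading that matches the reported 0,209 is a one-sided normal test in which the Diff "StDev" acts as the standard error of the mean difference; that is an assumption about how fishbench computes it, not something stated in the thread.

```python
import math

base_mean, test_mean = 1837380, 1829760   # nps means from the table
diff, diff_sd = 7620, 9416                # "Diff" column: mean and StDev

# Relative speed of the test build compared with base.
speedup = (test_mean - base_mean) / base_mean

# ASSUMPTION: the Diff StDev is used as the standard error of the mean difference.
z = diff / diff_sd
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal

print("speedup: %.3f" % speedup)
print("p-value: %.3f" % p_one_sided)
```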

I noticed that all machines running this task on fishtest use Linux, but I cannot really test on Linux right now. Also, what CPU do you have and which ARCH did you use to compile it?

Strange... I used make -j ARCH=x86-64-modern profile-build on Linux, using gcc version 9.3.0. I'll check again, maybe it was a pilot error.

I see, likely due to different compiler flags being passed on master and branch, so a makefile issue. I have

-Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-use -fno-peel-loops -fno-tracer -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_POPCNT -DUSE_SSE2 -flto 

vs.

-Wall -Wcast-qual -fno-exceptions -std=c++11 -fprofile-generate -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -flto

Indeed, it seems to change the popcnt part of the Makefile
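A quick way to see exactly which flags differ between the two lists is a set difference. A small sketch: the two strings are copied from the flag lists above (the first one, with -std=c++17, is presumably the branch build), and the variable names are made up for illustration.

```python
first = ("-Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-use -fno-peel-loops "
         "-fno-tracer -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse "
         "-DUSE_POPCNT -DUSE_SSE2 -flto").split()
second = ("-Wall -Wcast-qual -fno-exceptions -std=c++11 -fprofile-generate -pedantic "
          "-Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt "
          "-DUSE_POPCNT -flto").split()

# Flags present in one list but not the other.
only_first = sorted(set(first) - set(second))
only_second = sorted(set(second) - set(first))

print("only in first: ", " ".join(only_first))
print("only in second:", " ".join(only_second))
```

Leaving aside the language standard and the profile-generate/profile-use stage, the second list carries -msse3 and -mpopcnt, which the first one lacks.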

I guess that we also need to merge the HEAD of https://github.com/nodchip/Stockfish, and add "sse3 = yes" to "x86-64-modern", because dorzechowski's Makefile does not add "-msse3" when "popcnt = yes".

https://github.com/nodchip/Stockfish/blob/master/src/Makefile#L104
https://github.com/dorzechowski/Stockfish/blob/nnue-player-wip/src/Makefile#L397
https://github.com/official-stockfish/Stockfish/blob/master/src/Makefile#L330

@nodchip Thanks, I fixed my Makefile with the changes you pointed out. @vondele I pushed the change, can you retest speed?

I didn't really do anything in Makefile except getting rid of nnue targets, I must have missed that it was changed before. I only used bmi2 arch and was happy with the performance.

@dorzechowski yes, looks good now.

@vondele Great, I pushed the test again (stopped it before). Fingers crossed.

@noobpwnftw any idea why your workers are not able to join the test https://tests.stockfishchess.org/tests/view/5f156bf5da64229ef7dc17de ? One possible reason would be the used gcc version (needs to support C++17). If so, what do you use?

@vdbergh We don't really use much fancy C++17 stuff syntax-wise (and what we may have, we can live without). The point is that older versions don't handle AVX2 instructions properly. Sources can be compiled, but if AVX2 data is not aligned correctly, the executable crashes at runtime. It's not that it works but slower; it exits with a core dump.

@vondele They use devtoolset-7. gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)

Going to update to devtoolset-9. Workers going offline for the change.

@dorzechowski the test looks good, basically consistent with no significant slowdown for normal running. A quick local test with nnue enabled shows performance comparable to what I've seen before (right now: -29.8 +/- 18.7 after about 500 games).

@vondele That's good to hear. Now if we could make a chosen nn.bin available for workers, we could in principle just set two UCI options: EvalFile=path/nn.bin and Use NNUE=true and try to run it on fishtest without any other changes. So far AVX2 code wasn't really executed, even if present in the binary, so I expect many rough edges depending on the CPU/compiler/Makefile ARCH combination. It may even run SF dev with no problem and crash running NNUE.

Perhaps the nets should be provided the same way as books, i.e. downloaded once from trusted Stockfish or fishtest repo. Certainly not good to make workers risk downloading a big binary file from some random github place. Is there a way to tell workers to download a specific file from official repo, provide checksum, etc.?

Why is it slower even with options turned off?

I think the guys doing the training and producing the nets would be looking to test different evals....
so tracking that (via name/version etc.) would be very useful of course...
Are you considering something to track nn.bin development as well?

@noobpwnftw as far as I can tell the speed is essentially the same.

@noobpwnftw If it's slower for you, double check compile options. Speed is the same for me, see above https://github.com/official-stockfish/Stockfish/issues/2728#issuecomment-660908228.

Do you want bmi2 build on AMD? I currently use x86-64-bmi2 on Intels and x86-64-modern on AMDs.
Currently it shows consistently slower performance on Intels with bmi2. Compiler version is: gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC). Is that something related to Makefile?

@FireFather @dorzechowski the next step is non-trivial, i.e. integration in fishtest. I don't know yet how to best do this, it will definitely need some fishtest development. At least the following issues need to be resolved:

  • be able to build the right architecture, probably x86-64-avx2 (on AMD) or x86-64-bmi2, but we need to be careful not to break backwards compatibility (e.g. to build SF11 for regression tests), when the makefile target is not available (x86-64-modern is default). We could upgrade what x86-64-modern implies (e.g. make it x86-64-avx2).
  • how to deal with nets. Having a fixed net would be quite doable (e.g. upload to the books repository), but I agree we want to be able to do SPRT tests with new net proposals (probably with adjusted bounds). The nets are not too big and could be stored in git alongside the code, but that has disadvantages. I do think we need to avoid using a default name (nn.bin); we could use something like nn.bin-12digitSHA.

suggestions welcome.
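As one possible shape for the naming idea above, a worker-side helper could derive the file name from the SHA-256 digest and refuse a download whose digest does not match. This is a minimal sketch under stated assumptions: the helper names and the exact "nn.bin-" plus 12 hex digits format are illustrative, not an agreed fishtest convention.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def net_name(digest, digits=12):
    """Hypothetical naming scheme: 'nn.bin-' followed by the first hex digits of the digest."""
    return "nn.bin-" + digest[:digits]

def verify_net(path, expected_digest):
    """True if the file on disk has the expected SHA-256 digest."""
    return sha256_of_file(path) == expected_digest.lower()

# Digest of the gekkehenker net quoted earlier in the thread.
print(net_name("c157e0a5755b63e97c227b09f368876fdfb4b1d104122336e0f3d4639e33a4b1"))
```

Tying the name to the content means two different nets can never silently share a file name, which is the main problem with a default nn.bin.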

@noobpwnftw These are my CXXFLAGS for x86-64-bmi2 from my branch Makefile: -Wall -Wcast-qual -fno-exceptions -std=c++17 -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE42 -msse4.2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE3 -msse3 -DUSE_SSE2 -DUSE_PEXT -mbmi2 -flto

For AMD Zen2 there is ARCH=x86-64-avx2; on older AMD CPUs AVX2 is slow, so they shouldn't be used for NNUE.

Mine says:

CXXFLAGS: -Wall -Wcast-qual -fno-exceptions -std=c++17  -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE42 -msse4.2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE3 -msse3 -DUSE_SSE2 -DUSE_PEXT -mbmi2 -flto
LDFLAGS:  -m64 -Wl,--no-as-needed -lpthread -Wall -Wcast-qual -fno-exceptions -std=c++17  -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE42 -msse4.2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE3 -msse3 -DUSE_SSE2 -DUSE_PEXT -mbmi2 -flto

/proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping        : 3
microcode       : 0xd6
cpu MHz         : 3699.865
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch invpcid_single intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear spec_ctrl intel_stibp flush_l1d
bogomips        : 6816.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
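Given a dump like the one above, a worker could pick a build target from the vendor and flags lines. The sketch below follows the rule of thumb mentioned in this thread (x86-64-bmi2 on Intel, x86-64-avx2 on AMD with AVX2, x86-64-modern otherwise); the function names are hypothetical and this is not how the fishtest worker actually chooses its ARCH.

```python
def parse_cpuinfo(text):
    """Parse 'key : value' lines, keeping the first occurrence of each key."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in info:
            info[key] = value.strip()
    return info

def pick_arch(cpuinfo_text):
    """Suggest a Makefile ARCH from /proc/cpuinfo text (rule of thumb only)."""
    info = parse_cpuinfo(cpuinfo_text)
    flags = set(info.get("flags", "").split())
    intel = info.get("vendor_id") == "GenuineIntel"
    if intel and {"avx2", "bmi2"} <= flags:
        return "x86-64-bmi2"
    if "avx2" in flags:
        return "x86-64-avx2"      # e.g. AMD, where pext is slow
    if {"popcnt", "pni"} <= flags:  # Linux reports SSE3 as 'pni'
        return "x86-64-modern"
    return "x86-64"
```

On Linux this could be called as `pick_arch(open("/proc/cpuinfo").read())`; for the i7-6700 dump above it would suggest x86-64-bmi2.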

@noobpwnftw I seem to measure 1.5% slowdown using bmi2 (on zen2, branch vs master)... is that similar to your number ? There are a few extra branches (basically check for nnue being enabled) that could cause a slight slowdown. Other reason could of course be code generation due to the different flags.

@vondele Yes, a minor one, but it does run a bit slower for whatever reason.

Options are the same I used. You have Skylake, I tested on Kaby Lake but it should be the same basically. Obviously some very slight slowdown is expected but this is no problem.

A Git repo containing frequently changing binary files should not be cloned, for obvious reasons. However, you can still put many files there and have Github host them for you, like the current book repo. Fishtest can implement an extra input field for such a file, to be downloaded upon use and cached locally.

Nice.
And to submit an eval...maybe a button or link to open a dialog box for uploading the file, which would get placed in the testing queue.
Currently nnue allows the nn.bin to be named anything...
So perhaps a strict naming convention...some unique identifier may be needed.

We may also need to ask net file creators to describe how they created their net files. In detail:

  • How were the training data generated?

    • If they were generated with the training data generator, what were the commands?

    • If they were converted from pgn files, what are the original files?

  • How was the net trained?

    • What were the commands?

    • What book was used?

    • What endgame table was used?

These are necessary to confirm reproducibility. Other net file creators can learn from those descriptions.

By the way, I will stop modifying my repository, because I don't want to interfere with the work in this thread.

If there are questions, or anything I can help you with, please feel free to ask me.

@vondele To start working on it on fishtest, I suggest taking one step at a time, use just one eval file for the time being and place it manually in the fishtest repo. Known good net is nn-256-gek2706-c157.bin, sha256: c157e0a5755b63e97c227b09f368876fdfb4b1d104122336e0f3d4639e33a4b1. If at least some workers can run it, it's a good start.

Then maybe push the branch nnue-player-wip to the official Stockfish repo (maybe with a better name), so that people can start working on it. There are certainly a lot of things that can be done before starting to test different nets, even in terms of just further adapting, optimizing and cleaning up the code. There are also obvious QOL improvements to make, such as reading network size/architecture from the file header (it's all hardcoded now and needs recompiling) or supporting gzipped nets (the eval file is pretty sparse and compresses easily to at least 50% size).
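To illustrate the gzipped-net idea: a loader can look at the first two bytes of the eval file and decompress on the fly when it sees the gzip magic number, so plain and compressed nets are both accepted. This is a sketch of the approach only (the function name is made up); the engine itself would need the equivalent in C++.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream

def read_eval_file(path):
    """Return the raw net bytes, decompressing if the file is gzipped."""
    with open(path, "rb") as f:
        magic = f.read(2)
    opener = gzip.open if magic == GZIP_MAGIC else open
    with opener(path, "rb") as f:
        return f.read()
```

Detecting by content rather than by file extension keeps the UCI EvalFile option unchanged: the same path setting works whether or not the net was compressed.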

Dear @dorzechowski , did you consider that parameter sets are not given by aliens, but are generated using the code you happily removed, which therefore needs to be maintained anyway, and your removal is just busywork?

@sf-x Feel free to fix it in your branch. I was clear from the start that I wanted to have code only for playing.

From my side, I'll be discussing with @snicolet how we can best proceed later this week. However, I will be off-line until the weekend, so some patience, please, but we'll eventually come up with a plan. I appreciate @nodchip's support, and we'll try to keep things in sync. Having a few people study the branch by @dorzechowski seems like a good idea; this appears to be a good starting point.

However, it needs to be appreciated that this is a rather large project, which is best done in steps, to keep master in production-quality shape. Starting with playing-only capability seems quite natural, and already at that point needs some changes to the fishtest infrastructure that will require some effort to implement. Having our normal SPRT basic process for changes to the code will be important to make sure that we keep having steady progression.

Just a reminder that this code _could be_ useful https://github.com/glinscott/fishtest/pull/547 .

yes, @protonspring work could be useful in this context.

IMO, nodchip's NNUE source code is ready to be integrated immediately. It's been very well distributed and widely tested for months now. The whole codebase is wholly functional and integrates perfectly to sf-dev (it's been kept current), all this mainly due to nodchip's attention to detail.

I really think there's no need to wait. Of course I'm not talking about fishtest...just the simple testing and integration of nnue into the master branch. It was written in a very similar manner/style to SF and would be super easy...just a handful of simple changes to Stockfish dev in order to utilize a UCI option to turn NNUE 'off' and 'on'. I (and others) have done this and it works great.

You keep jumping to conclusions. So far, there is one fixed-game test that looked fine (which is not even a non-regression test), against two different branches of master (where the new branch contains master changes other than NNUE and was tested against an older master), and there is no PR filed that is ready to merge. Bottom line: which exact branch of the NNUE fork is ready to merge is still unclear to me, your tested one or nnue/master?

I agree that there is no need to wait for someone to clarify those.

I'll announce this weekend what the plan for integration is, however it will start from the nnue-player-wip branch discussed above.

As I said before, we'll go in steps and it needs updates to the fishtest infrastructure, which can't be done immediately but are essential.

> yes, sure. Haswell introduced avx2, roughly 7 years ago, so I'm not too concerned if that would be the required 'modern' for NNUE.

Intel is still segmenting on AVX, so there are Comet Lake (i.e. another Skylake re-release…) CPUs released this very year that don't have AVX.

> be able to build the right architecture, probably x86-64-avx2 (on AMD) or x86-64-bmi2

A better idea would be to enable BMI2 but avoid the pext and pdep instruction specifically on AMD, perhaps by providing x86-64-bmi2-pext and x86-64-bmi2-no-pext "architectures"? The remaining BMI and BMI2 instruction are fine, and indeed sped things up the last time I benchmarked it, it's only these two that absolutely murder performance due to being microcoded.

Nothing is ready to be merged to master and also it's not "super easy" at all. The thing to do first is to push nnue stuff as a branch to Stockfish official repo in similar manner as we have for example a branch for cluster version. Then we can work on it, create pull requests for this branch, etc. It's still a long way to go before it can be merged into master and maintainers shouldn't (and I'm sure won't) give in to the pressure from overhyped individuals who demand someone(TM) to do it now.

I suppose that's addressed to me. Please note: I didn't demand anything...I was simply responding to vondele's "suggestions welcome".

The wip branch contains just a tiny part of the codebase, and doesn't work, i.e. loading the nn.bin causes it to crash. I'm just not understanding the value of starting this way. If anyone is kind enough to explain without insults, it would be appreciated.

> You keep jumping to conclusions. So far, there is one fixed-game test that looked fine (which is not even a non-regression test), against two different branches of master (where the new branch contains master changes other than NNUE and was tested against an older master), and there is no PR filed that is ready to merge. Bottom line: which exact branch of the NNUE fork is ready to merge is still unclear to me, your tested one or nnue/master?
>
> I agree that there is no need to wait for someone to clarify those.

What is the value of the one test? It didn't actually use any of the nnue code...that's where I'm at a loss to understand.
nodchip's repository (the original) is the only one that should be considered IMO...

@FireFather for me it seems possible to load a network and play with that branch. However, maybe there is a sequence of commands that fails, in which case it needs fixing.

The rationale for starting from the player part only is because that will be well testable. We will be able to test all code changes with the usual procedure of running SPRT tests, guaranteeing steady Elo progress and a lean codebase. This part of the code should also be 'easily' integrated with our standard CI procedure to catch bugs.

The additional tools needed to generate training data, and to train networks have different metrics to judge quality, like many of the other tools that are in use to develop engines. Right now it seems best to have those evolve independently, even if we'll try to avoid divergence.

Understood and agree completely...
The codebase is however full of (built with) dozens of preprocessor directives, one of which is #define EVAL_NNUE
so, for example, defining only that creates a 'player only' compile, with none of the extra training code included.

The wip branch I saw had many many changes, many hundreds of lines of code deleted, different namespace, etc.
It just seems overly complicated, labor intensive, and completely unnecessary to me. Reassembling it all, restoring all the individual deleted lines of code, and getting the training code to work is possibly going to prove much more difficult IMO.

Anyway, I am fully aware you guys are capable and need time...
Thanks

Well I hope this will be reasonably soon, I want to try my ideas in search on a branch that will be actually the strongest one ;)

@FireFather I didn't have any crash on loading nn.bin. Please describe steps to reproduce.

Keeping code full of #define EVAL_NNUE sections is not a promising way forward imo, and actually there would be no hope of having such things in master. Otherwise we would need to produce two different binaries every time to keep it in sync, even worse, we would need to ask abrok guy to do this for us and then hear complaints all the time that a binary for a specific CPU/eval combination is not there. Also in every single test telling workers which version to compile would quickly become a nightmare. Using NNUE or not should be optional for end user and there should be one binary to rule them all.

I believe introducing new code, especially that huge, should be kept to the minimal essentials that are necessary for playing games and can be more easily maintained and adapted to SF standards. Otherwise the first thing to do would be to remove those thousands of lines and run a simplification test. There is always time later to add more things, although I don't think the Stockfish engine is the place for all the code for network creation and training; this should be a separate tool.

My branch is absolutely not perfect, in fact it has WIP in the name. If there are some better options, they should obviously be used instead but frankly I don't see any better concrete suggestions as yet.

Of course I'm not recommending keeping the #defines. My recommendation is to compile everything together, and remove/replace the preprocessor directives as UCI options. In this manner achieving one binary...keeping the #defines (temporarily) to exclude code. Pretty simple approach.

I agree, for now, having a 'play' only binary makes sense...my point is you can have that now by changing one line.

As far as "there would be no hope of having such things in master", they do exist there already I believe.

But if you want to re-write everything as you go...that's your decision. But it's hard to understand, since the original code by nodchip is so well done and perfectly compatible and functional now.

Feel free to do it your way, point maintainers to your branch and let them decide.

There are no #define blocks excluding code in Stockfish as far as I know.

Everyone likes nodchip's repo and appreciates his work but then there is nothing stopping anyone from using it. We talk about merging it to master and testing it on fishtest and this is not that easy. There are many things to be adapted, both in fishtest and introduced code which should conform to Stockfish coding style, pass a review, SPRT tests and CI tests before being integrated.

IMO It's not obvious whether completely separating out the training code will spare or create more difficulties. For example it would be nice to at least share some header files to keep common data types and struct definitions in sync. A relevant reference from the past is the Tuning branch. It started out separately but was eventually(recently) merged into master due to high cost of maintenance. Would having one code base but separate make targets for playing and training executables be a helpful alternative? I'm glad @vondele and @snicolet are taking their time to make this decision carefully and looking forward to contributing to NNUE.

@mstembera It's true that those things will need to be kept in sync eventually. It's just my belief that it's better to have possibly smallest and easiest first step, make it clean, make it work and only then add on to it when we know what to add, how and why. I also have full faith in our maintainers to come up with the right approach.

One more thing:
Nodchip's repository is kept synced and up to date on a regular basis. It is essentially very recent stockfish dev with nnue added. All of it has been compiled, used and tested thoroughly on a very wide spectrum of hardware and OSes. Every part of it has been put to the test for at least 2 months now. It's even played in TCEC. There are no bugs and I have never seen it crash, except in the very beginning when it couldn't find nn.bin.

Acknowledging that and accepting the changes a bit more freely would save tons of work, time and energy. A branch could be created and the current NNUE repository used 'as is'...

Concerning the #defines..the codebase relies heavily upon these preprocessor directives (which can be very effective), for ex: he uses a script to produce 30 or more binaries each release...(every couple of days). But obviously only a few of these would be needed for a 'play-only' release...the same ones that are produced for regular SF. (each having a new UCI option for UseNNUE or something similar) .

PS There's currently a freeze on nodchip's repository for the past couple days...he's aware you guys are starting to integrate it.

The combined said "testing" effort is however nowhere close to a single LTC worth of effort on fishtest. While people are in the process of adding functions to fishtest so that the code can be tested with a trained net, the effort of introducing a minimal player-only code is WIP, and so far haven't passed any of the standard procedures for a patch to get merged.

Your suggestion is essentially skipping all testing procedures and dev-ops people agreed upon, and ask people do what you and your friends say and so on. What does it matter to you to have such a branch created "here and now"? Are you unable to use whatever repo/fork/branch that is already there and do what you want? If it is for some kind of ego then I think it is meaningless, if it is for better maintenance or quality control, then procedures have to be followed instead of "save tons of work, time and energy".

Also I don't see any point in anyone doing a feature freeze while we work on an integration; it isn't like the patch is final and nothing can change afterwards.

I just can't make sense of it: on the one hand you think the other repo is already perfect, while on the other hand you want it copy-pasted here without any oversight, testing or further work, so what goal do you want to achieve?

I don't think it advisable to skip testing procedures, at all
I don't think it's a good idea to copy/paste without oversight...
run all the tests necessary...tests are good

I'm suggesting to keep it simple, not cutting the code into a hundred pieces only to rewrite/reassemble it.
It's fully functional now.

What does it matter to me? I don't know, why does it matter to anybody?

So why exactly do you think people rewrote it instead of copy-pasting? Do they do it simply for fun or just wanted to annoy you?

Honestly what's the use in arguing about it? The maintainers will decide what to do and that should be that.

Because there are more people who don't talk while doing the work, and there are people who just keep "suggesting" you should do this and that. Now you don't like people talking too much?
I hate to break this to you but this is likely what to expect for the upcoming NN hype, and I'm going to return the favor at the same level.

@noobpwnftw Not sure if you were replying to me, but if you were I have to say I'm not sure I understand what you mean. Regardless everyone needs to calm down here.

@vondele: Isn't it possible, for the purpose of a single test, to add the NNUE net as a patch to the sources of @dorzechowski's branch, add the path to it in the sources, and run it vs master on fishtest right now, without any changes to fishtest?

@adentong My point is that people are working to get it integrated, there are a lot of discussions offline, just not everyone live updates what they are doing 24/7. There are people who are just unsatisfied about almost everything; they are free to express their opinions and so am I.

Is it even sensible to Fishtest NNUE against master? The ratio of NNUE v master performance is hardware dependent, and e.g. NNUE may pass on hardware A but fail on B. It is a lesser version of the GPU v CPU problem.

@noobpwnftw Oh yes I 100% agree. Everyone's free to voice their opinions and I was by no means saying otherwise, though personally I would have just ignored any overzealous suggestions instead of arguing. But hey you're free to do what you want.

I've just made sure that every fishtest worker I have is fully capable of running NNUE with AVX2 + BMI2 (which, if I understand correctly, is all it needs to perform well). Although I have mentioned that on AMD certain BMI2 operations are just slower, I can still update the build commands should there be a new one that overcomes the specific problem.

I've started an issue to track the merge and tasks https://github.com/official-stockfish/Stockfish/issues/2823
Let's keep that issue specific to work on the merge; I'll keep this one open for more general discussion.

Let's have fun with this new development!

just a note, testing clang++ (10.0.0-4ubuntu1) vs gcc (Ubuntu 9.3.0-10ubuntu2), I find that clang gives about 4% more nps on x86-64-avx2 for NNUE bench.

How do we solve a (hypothetical) issue that we get a rock-paper-scissor situation? A > B > C > A...

After passing SPRT against old Devnet, quick regression test against HC + selected previous nets as a sanity check?

that was already raised as an issue elsewhere. However, that's not very different from the current situation we have with patches... we don't see it often. Probably up to the maintainer to break the cycle, and wait for D >> A,B,C ?

We can always check for regressions also against Stockfish with the standard eval, perhaps adjusting TC to make it a closer match. For example, if at STC the NNUE version is stronger by about 50 Elo, the "classic" SF could get 25% more time and be a valid quasi-independent 3rd party to check for regressions against. I don't think this will be a problem. The only change needed would be to introduce asymmetric TC in fishtest.

I would wait for the situation to occur before we find solutions for it ... other things need to be sorted out first.

@vondele The modern CPU compile available at abrok.eu is not an SSE 4.1 build. Please can you make sure that SSE 4.1 compiles are available on abrok.eu henceforth? Thank you very much for your help.

@vondele Before merge please also consider releasing the last non-nnue SF version as Stockfish 12. Even if we don't have +50 Elo there yet, the current 25-30 Elo is not bad at all (and there will be at least few more patches in). More importantly, it would serve as a nice reference point.

Is there an issue for the fact that contempt doesn't change evaluation currently? Last I checked it didn't really do anything, but probably could do a simple addition to output of eval? Whether or not that gains elo is a different question, but if it doesn't it would default to zero, and it would still influence whether 3 folds are taken.

I don't like the idea of making the latest sf dev as sf 12, regular sf should also keep developing normally and sf 12 should only be released when that +50 elo target is achieved.

@Amplaytro I'll contact the abrok owner. modern will automatically be 'OK', but it would be nice if avx2 builds are available.

not yet sure about sf12 prior to the merge, spontaneously, I would have said a while after the merge would be better.

Contempt code currently resides in eval.
This could be reverted back to being situated inside the search code. But that's something for Fishtest in the future.

yes, concerning contempt, that is not there with NNUE. Notice also our contempt is not just a change of drawvalue or some such thing... this likely can't be easily 'added' to NNUE (but it presumably is built-in in part during the training phase, i.e. the net should have learned the contempt it was trained with). Stuff for future work :-)

I won't claim to know anything beyond the current domain of knowledge, but will some search improvements hinder Stockfish and improve Stockfish nnue? Seems to me that search optimizations were built around evaluations of fixed sizes, and sharing the search code between classical evaluation and NNUE evaluation might create some rather difficult decisions regarding certain search methods.

For example, if for some reason the Stockfish project thought it was a good idea to include a very large network, other search methods as wild as an MCTS-like method could be considered due to the MCTS bottleneck decreasing.

Are we not planning for that type of specialization by just keeping everything that search needs in a single file, or splitting the search files between NNUE and classical?
If it were somehow more practical with less slowdown to keep NNUE and classical search in one file when attempting to combine the best of both worlds, would "blindly" splitting the search files lead to an impractical and unnecessary code overhaul?

If we merge, my current feeling is that search improvements should be directed to the single method that performs best. Doubling the (human) resources needed to develop and maintain the code is not a good idea. I still need to decide how to proceed after an eventual merge, so that we make progress quickly without disrupting code too much. I think the first thing to do is to focus on improving the networks till we have more or less a steady state. This could be reached in a couple of weeks. Afterwards we start improving search to match the new eval and start exploring further evolution of the functional form of the new eval (i.e. new networks with new features, or combinations with the classic evaluations). Other opinions welcome.

I also think testing search patches against whichever version of eval performs best is the way to proceed.

Once you start having lots of search patches based on a certain network size and set of input parameters there is a good chance it will get locked into that one since it will be too much effort for one person to switch all of it and to reoptimize similarly. It would require a big or universal improvement.
Unless there is a method to switch architectures based on a smaller fixed amount of compute (which is common for tuning NN architectures).

that's indeed something to figure out. I would expect that initially the big changes to the engine will come predominantly from changes in net, and our standard tuning of search will just work more or less. It will be interesting to see how much there is to be gained from adjusting search. I would naively assume that the net has an eval somewhat more like a low depth search (as that's the training input), and thus the actual search might have to look a bit more like our usual high depth search, so somewhat less pruning and the like. However, that's pure speculation until we can test.

I worry what will happen to sf’s successful incremental development model if sf-nnue becomes the primary focus. I am also not yet convinced that sf-classic is dead.

One thing I noticed from nodchip’s description of the nn tuning algorithm is that it could be equally applied to SF’s classic eval. Another thing is that part of the Elo gain in sf-nnue is due to avx2. Again could also try to use avx2 in sf-classic.

The avx2 related Elo gain comes from the implementation of the NNUE evaluation (i.e. network), nothing to gain in classic mode.

I also hope that classical eval keeps on improving. This is for sure something I try to keep possible. Evaluation patches will still be tested the normal way. As long as NNUE and classic evaluation are separate there is no problem. It will be more delicate if we hybridize.

Concerning the successful incremental development, also that is an important goal. I'd like all patches to pass our usual sprt testing. However, that will require that we have one goal, not a combination of two goals (e.g. search can only be optimized for one evaluation method, not two at the same time).

Well in principle I do not see why vector instructions could not be used to speed up a traditional eval also... In some sense the mg/eg mixed values are already a primitive vectorisation.

In principle, using avx2 for the classical eval is probably possible, but likely very difficult, and speedups small. The matrix-vector operations needed for the NN are very suitable however. Hardware will likely make the NN evaluation even faster in the future.

I would prefer to lock onto one network architecture, at least in the first period, and see how far we can get. Mainstream network architecture in shogi is halfkp_256x2-32-32, the same we have now in the branch. After that we could play with different layer sizes to try for example halfkp_384x2-32-32 but as a trivial change of a constant, without changing the input layer. When we have a strong baseline and good understanding what we are doing, we can try other things.
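To put a rough number on what changing that constant means, here is a minimal sketch that counts weights and biases for a halfkp_256x2-32-32 style net. It assumes the HalfKP input has 64 × 641 = 41024 features per perspective, and that the two perspectives share one feature transformer whose outputs are concatenated before the 32-32-1 layers; both are assumptions about the architecture rather than something stated in this thread, and `param_count` is a made-up helper name.

```python
# Assumption: HalfKP uses 64 king squares x 641 piece-square features.
HALFKP_INPUTS = 64 * 641


def param_count(ft_width, hidden=(32, 32), inputs=HALFKP_INPUTS):
    """Count weights + biases for a halfkp_<ft_width>x2-<hidden...> net.

    Assumes one shared feature transformer (inputs -> ft_width) applied to
    both perspectives, whose outputs are concatenated (2 * ft_width) and fed
    through the dense hidden layers to a single output value.
    """
    counts = {"feature_transformer": inputs * ft_width + ft_width}
    prev = 2 * ft_width
    for i, width in enumerate(hidden, start=1):
        counts[f"hidden_{i}"] = prev * width + width
        prev = width
    counts["output"] = prev + 1
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    for width in (256, 384):
        c = param_count(width)
        share = c["feature_transformer"] / c["total"]
        print(f"halfkp_{width}x2-32-32: {c['total']:,} parameters, "
              f"{share:.1%} of them in the feature transformer")
```

Under these assumptions almost all parameters sit in the first layer, so going from 256 to 384 grows the net by roughly half even though the code change is a single constant.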

It's very easy to come up with basically infinite combinations of inputs and layer number/sizes and this is not the way to proceed imo. I'm worried to see many such attempts thrown monkey style at fishtest without any consideration because some people seem to think fishtest can handle everything and give an answer in 5 minutes.

A credit system can be implemented so that one must contribute enough CPU hours before a test can be started: each test costs some credits, and any successful test rewards bonus credits, with the average test pass rate used as the reward multiplier.
Not necessarily a complex system, just enough to prevent people from spamming junk without contributing anything.

I agree (as mentioned above), we should first test the now supported halfkp_256x2-32-32 and maximize its performance. Let's first assume there will be considerate use of the resources, before we implement/enforce policies. As before, we have approvers that can step in if the resources are not wisely used, but otherwise, I expect we will get a long way by just communicating what we think is the right approach.

This is further complicated by the fact that larger networks have been shown to be better at longer TC and smaller ones at shorter TC. (Not proven for SF NNUE yet, but shown for other NN-based engines.)

It helps there that extra search speed is likely worth less and less compared to more knowledge in the net, the more nodes there is time for, once the numbers get big.

I really hope we don't end up having 100+ MB networks running at 10% of current Stockfish speed, I would put the limit at 50% slowdown. One of Stockfish's trademarks is that it's a fast and deep searcher. For slow engines we should refer to other projects.

@vondele Re your comment about the NNUE eval being like a low depth search: I ran a couple of 10k-game fixed depth matches.
SF Depth 5 vs NNUE Depth 1
4494 - 4910 - 596 [0.479] 10000 -14.5 +/- 6.6, LOS: 0.0 %, DrawRatio: 6.0 %
SF Depth 6 vs NNUE Depth 1
6456 - 2879 - 665 [0.679] 10000 130.0 +/- 7.0, LOS: 100.0 %, DrawRatio: 6.7 %
So it looks like it's just slightly better than a depth 5 search.
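For anyone who wants to check result lines like these, here is a minimal sketch of how the score and Elo figures can be reproduced from the win/loss/draw counts. It assumes the three numbers are wins - losses - draws from the first engine's point of view, uses the usual logistic Elo formula, and treats the ± value as a 95% interval from a normal approximation; the tool that produced the lines above may compute its margin slightly differently. `elo_from_wld` is a made-up helper name.

```python
import math


def elo_from_wld(wins, losses, draws, z=1.96):
    """Return (score, elo, margin) for a match result.

    score  : fraction of points scored, draws counting as half a point
    elo    : Elo difference implied by that score (logistic model)
    margin : half-width of the interval at z standard errors (assumed 95%)
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games

    def to_elo(s):
        return -400.0 * math.log10(1.0 / s - 1.0)

    # Per-game variance of the result (1, 0.5 or 0 points) around the mean.
    variance = (
        wins * (1.0 - score) ** 2
        + losses * (0.0 - score) ** 2
        + draws * (0.5 - score) ** 2
    ) / games
    dev = z * math.sqrt(variance / games)
    margin = (to_elo(score + dev) - to_elo(score - dev)) / 2.0
    return score, to_elo(score), margin


if __name__ == "__main__":
    # SF Depth 5 vs NNUE Depth 1; the thread reports [0.479], -14.5 +/- 6.6
    print(elo_from_wld(4494, 4910, 596))
    # SF Depth 6 vs NNUE Depth 1; the thread reports [0.679], 130.0 +/- 7.0
    print(elo_from_wld(6456, 2879, 665))
```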

Interesting result. Maybe worthwhile to see what happens at Depth 10 vs Depth 6 (or similar offset in depth).

Some new unexpected results given the first ones.
SF Depth 6 vs NNUE Depth 6
2972 - 6161 - 867 [0.341] 10000 -114.8 +/- 6.8, LOS: 0.0 %, DrawRatio: 8.7 %
SF Depth 7 vs NNUE Depth 6
6092 - 2837 - 1071 [0.663] 10000 117.4 +/- 6.7, LOS: 100.0 %, DrawRatio: 10.7 %
Looks like the difference between the evals here is less than 1 ply worth of search.
On what depth training data was the network trained?

I think it was trained on depth 8 or depth 12 (@gekkehenker ?). However, I think this should not be too surprising: we know the Elo gain at STC depths is something like 30-60 Elo, which is less than what 1 ply of depth is worth (at around STC depths).

Net was trained on both depth 8 and depth 12 games.
Was first fed the depth 8 games only, then trained the resulting net on depth 12 games.

@vondele thanks for your hard work in getting NNUE merged - just wondered what SV net is being run on fishtest now?

@vondele thanks - just wondered what the corresponding net number etc is from here:
https://www.comp.nus.edu.sg/~sergio-v/nnue/

Also which binary is used?

don't know, you should be able to find it from a matching sha256sum netname | cut -c1-12

nn-97f742aaefcd.nnue is 20200801-1515.bin
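If sha256sum isn't at hand, the same lookup can be sketched in Python. This assumes the fishtest net name is `nn-` plus the first 12 hex characters of the file's SHA-256 plus `.nnue`, which is inferred from the command and the example name above rather than from any documented rule; the helper names are made up.

```python
import hashlib


def net_name_from_bytes(data: bytes) -> str:
    """Build the assumed nn-<first 12 hex chars of sha256>.nnue name."""
    digest = hashlib.sha256(data).hexdigest()
    return f"nn-{digest[:12]}.nnue"


def net_name_from_file(path: str) -> str:
    """Same as above, but hashes a file in chunks to keep memory use low."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return f"nn-{sha.hexdigest()[:12]}.nnue"
```

Running `net_name_from_file` over each downloaded .bin file and comparing the result with the name shown on fishtest should identify the matching net, under the naming assumption above.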

has anyone tried to use NNUE in FRC? doesn't seem to work for some.

Hmm worked OK for me here: https://lichess.org/yV7J1imd

I haven't tried but in principle it should work. NNUE only touches eval. Also the classical eval had almost no special handling of FRC (one term if I recall correctly).

In my experience NNUE will play some FRC positions and crash in the rest.

Hmm, it will be the added code in position that might be wrong in that case.

I am behind the times. . . is this really ~90 ELO better than master on the same hardware?

Correct - this will be a 100+ Elo gain merge or so - give or take a few Elo.

The mother of all merges.

90 elo conservatively.

On a modern CPU with normal LTC conditions and a PGO build it's a bit stronger than that ;)

Note there are certain incompatibilities on old hardware that would make it significantly less efficient.

Also, there are hints there is some significant elo compression at very long time controls with increment.

Also note that contempt has yet to be implemented, which has the potential to present itself as an ELO gainer.

...ALSO note that it's likely just much stronger from the start position than it is from some many-ply-long books, but that claim has yet to be sufficiently backed up.

I would not get too excited about contempt. Contempt was designed for use against weaker engines. Against equal or stronger engine, it’s just about worthless. So the only thing contempt does is squeeze a few extra elo out of much lower rating opponents. I would be hard pressed to say contempt makes it better - it squeezes a few Elo out of weaker opponents. It falls into the realm of being a vanity of vanities.

@MichaelB7 Not having contempt cost SF the qualification into the TCEC SuFi one season.

NNUE evaluation has been merged, I'll close this issue. Thanks for the discussion.
