Stockfish: NNUE/NNUE hybrid

Created on 28 Oct 2020 · 7 Comments · Source: official-stockfish/Stockfish

I was able to train a 192-24-24 (flip) net which is somewhere between 30 and 60 elo stronger (STC) than the last commit before the merge. The idea to have a smaller net as a lazy eval is basically as old as the project itself. Here are the speeds for comparison (bmi2 CPU):

commit before nnue d24 128 mb hash: nps 1784335 (100%)
192-24-24 pure d24 128 mb hash: nps 1565392 (87.7%)
256-32-32 pure d24 128 mb hash: nps 1301399 (72.9%)
(Note: the NNUE speeds are using a version without the latest accumulator update which gave about 2-3% more speed)

I think a hybrid of a 256x net and HCE is flawed in that HCE gets worse the more pruning we introduce. A net that is as fast as HCE but has a higher Elo per node becomes more attractive the more pruning is introduced. My goal is a net as fast as classical SF with a higher Elo per node. Next I will try a 176-24-24 net.

Here are the nets that I produced (I don't know which one is stronger):

(I will not try to code a NNUE/NNUE hybrid as I lack expertise and motivation)

Edit: Links should be working again.

Most helpful comment

Alright, I have 2 big updates:

  • All nets from 32-16 up to 128-16 are easily winning against SF-classical by about 80 to 100 elo after 1b fens of training (rotate halfkp) while being faster than the classical eval. With the current speedups from master compared to nodchip even 128-16-16 will be faster.
  • @Sopel97 did a lot of work trying to implement the multinet idea.

I'd like to show you the speeds various nets achieved on my bmi2 machine with nodchip's repo in "pure mode". Note: nodchip's repo is much less optimised than current master and especially so for the net sizes that I tested. With that in mind, here's the list:

commit before nnue d24 128 mb hash: nps 1784335 (100%)
32-16 pure d24 128 mb hash: nps 2238685 (125%)
64-16 pure d24 128 mb hash: nps 2116030 (118%)
96-16-16 pure d24 128 mb hash: nps 1885670 (105%)
128-16 pure d24 128 mb hash: nps 1829810 (102%)
160-16-16-16-16 pure d24 128 mb hash: nps 1571094 (88.0%)
256-32-32 pure d24 128 mb hash: nps 1301399 (72.9%)
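
For anyone who wants to recompute the percentages for their own machine, here is a minimal sketch of the arithmetic. The function name `relative_speed_percent` is made up for this example; it just divides a net's nps by the pre-NNUE baseline and rounds to one decimal place.

```cpp
#include <cassert>
#include <cmath>

// Relative speed of a net versus the pre-NNUE baseline, in percent,
// rounded to one decimal place. Hypothetical helper, not part of any repo.
inline double relative_speed_percent(long long nps, long long baseline_nps) {
    assert(baseline_nps > 0);  // the baseline must be a real measurement
    const double ratio =
        static_cast<double>(nps) / static_cast<double>(baseline_nps);
    return std::round(ratio * 1000.0) / 10.0;
}
```

Calling `relative_speed_percent(1301399, 1784335)` reproduces the 72.9% shown for 256-32-32; the list above rounds the faster nets to whole percent.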

I had some nets with ReLU layer sizes of 24, but these don't optimise well, especially on AVX512, so I removed them from this list.

Now for the second and even more important part here's the description of the initial multinet branch:

https://github.com/Sopel97/Stockfish/tree/multinet

This branch allows training a network composed of one feature transformer followed by two (or more) different layer stacks.
Only one layer stack is trained at a time and has to be selected at compile time.
The layer stack to be trained is chosen by specifying `net=N` for make, where `N` is the 0-based index of the layer stack.
The default halfkp setup in this branch contains two network stacks:
```cpp
// Input features used in evaluation function
using RawFeatures = Features::FeatureSet<
    Features::HalfKP<Features::Side::kFriend>>;

// Number of input feature dimensions after conversion
constexpr IndexType kTransformedFeatureDimensions = 256;

namespace Layers {

    // Network 0 (main): 2x256 -> 32 -> 32 -> 1
    using InputLayer = InputSlice<kTransformedFeatureDimensions * 2>;
    using HiddenLayer1 = ClippedReLU<AffineTransform<InputLayer, 32>>;
    using HiddenLayer2 = ClippedReLU<AffineTransform<HiddenLayer1, 32>>;
    using OutputLayer = AffineTransform<HiddenLayer2, 1>;

    // Network 1 (secondary): 2x64 -> 16 -> 16 -> 1
    using InputLayerB = DoubleInputSlice<64, kTransformedFeatureDimensions>;
    using HiddenLayer1B = ClippedReLU<AffineTransform<InputLayerB, 16>>;
    using HiddenLayer2B = ClippedReLU<AffineTransform<HiddenLayer1B, 16>>;
    using OutputLayerB = AffineTransform<HiddenLayer2B, 1>;

}  // namespace Layers

using Network = NetworkSet<Layers::OutputLayer, Layers::OutputLayerB>;
```
The network at index `0` is the normal halfkp, the network at index `1` is 64-16-16-1.
DoubleInputSlice allows taking two input slices from the feature transformer output.
The arguments are `<HalfSize, Stride, Offset=0>`. Stride should be equal to kTransformedFeatureDimensions
so that we take 64 entries from each feature transformer output.
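
To make the slicing concrete, here is a rough sketch of what a layer with those three template arguments could do. `DoubleInputSliceSketch` and its `propagate` signature are made up for this example and are not the code from the branch; the sketch also assumes the feature transformer output is stored as one perspective's values followed by the other's.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for DoubleInputSlice<HalfSize, Stride, Offset>:
// copies HalfSize values out of each perspective's part of the
// feature transformer output, starting at Offset within that part.
template <std::size_t HalfSize, std::size_t Stride, std::size_t Offset = 0>
struct DoubleInputSliceSketch {
    static_assert(Offset + HalfSize <= Stride,
                  "the slice must fit inside one perspective");
    static constexpr std::size_t kOutputDimensions = 2 * HalfSize;

    // `transformed` holds 2 * Stride values (assumed layout):
    // perspective 0 in [0, Stride), perspective 1 in [Stride, 2 * Stride).
    template <typename T>
    static std::array<T, kOutputDimensions>
    propagate(const std::array<T, 2 * Stride>& transformed) {
        std::array<T, kOutputDimensions> out{};
        for (std::size_t i = 0; i < HalfSize; ++i) {
            out[i] = transformed[Offset + i];                      // first perspective
            out[HalfSize + i] = transformed[Stride + Offset + i];  // second perspective
        }
        return out;
    }
};
```

With `HalfSize = 64` and `Stride = 256` this yields 128 values, which matches the 64-16-16-1 secondary net described above.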

For this architecture it is specified that the feature transformer has frozen weights and biases when the
network index is different than `0` - so `0` is assumed to be the main net, and `1` the auxiliary one.

  • With `Use NNUE value pure` this branch plays with the net selected for training.
  • With `Use NNUE value false` this branch plays with classical.
  • With `Use NNUE value true` this branch plays with a hybrid, where main evaluation is done by network `0` and classical evaluation is replaced by network `1`.
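
The three settings can be sketched as a simple dispatch. Everything here is illustrative: the `UseNNUE` enum, the `Evaluators` struct and the `prefer_fast_eval` flag (standing for "the usual hybrid would have picked classical here") are assumptions made for this example, not names from the branch.

```cpp
#include <cassert>
#include <functional>

// Hypothetical mirror of the three `Use NNUE` option values.
enum class UseNNUE { False, True, Pure };

// Hypothetical bundle of the available evaluation functions.
struct Evaluators {
    std::function<int()> classical;  // hand-crafted eval
    std::function<int()> net0;       // main network
    std::function<int()> net1;       // secondary (smaller) network
    std::function<int()> trained;    // net selected with net=N at compile time
};

// Picks the evaluation according to the option value.
inline int evaluate(UseNNUE mode, bool prefer_fast_eval, const Evaluators& e) {
    switch (mode) {
    case UseNNUE::Pure:
        return e.trained();
    case UseNNUE::False:
        return e.classical();
    case UseNNUE::True:
        // Hybrid: network 1 takes the place of the classical eval.
        return prefer_fast_eval ? e.net1() : e.net0();
    }
    assert(false && "unknown mode");
    return 0;
}
```

How `prefer_fast_eval` gets decided is left out on purpose; the sketch only shows which evaluator each setting ends up calling.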

To train this double network first one needs to get a good net `0` (`make net=0 ...`), and then
train on top of it (don't skip loading eval) with a trainer for the second net (`make net=1 ...`).
Note that when training network `1` the feature transformer is frozen and no factorizer
is used (which is indicated by the summary output from before training starts)

This, however, turned out not to work, as the ReLU layers can't properly increase move accuracy during training.
Luckily we have two other branches, _multinet2_ and _multinet3_. I can't explain what _multinet2_ does, but _multinet3_ has a 256+128 feature transformer: the 256 slice is used with the 32-32 ReLU layers, which do not use the other slice. With `Use NNUE value true`, whenever classical eval would be chosen we instead use the 128 slice with the smaller ReLU layers.

It is possible to copy the current master net to a _multinet2_ net. I am currently trying that out. _multinet3_ will probably be able to yield a bigger gain but will need to be trained from scratch, which I will also try.

Keep in mind that it is very possible there are still bugs to be squashed; these branches are just a day old.

The last few days have been very busy on the Stockfish discord and this is one of the results.
A huge thanks to @Sopel97 for providing the code and another huge thanks to @vondele and his monster machine for providing the needed training data. @noobpwnftw has also found some improvements to gensfen. Things are moving forwards.

All 7 comments

With the links now dead, I suspect this can be closed.

It would be nice to have the ability for other sized nets. I would bet on it being the next significant gain. However, how do we synchronize SF w/ the training code? Some discussion about plans going forward is needed.

> With the links now dead, I suspect this can be closed.

I accidentally deleted them from my Google Drive. I can put them back, although I have some other nets in the works. Give me one or two days and I'll have an even stronger net in the 192-24-24 arch. I will then make rotate nets, as they seem to be stronger with halfkp (which kind of makes sense, as halfkp was designed with rotational symmetry in mind). I succeeded in creating an even smaller net with an Elo gain over pre-NNUE. I'm still exploring how small the nets can get and how much speed is gained.

Keep us posted. Thanks.

The multinet approach is based on two new layer types:

  1. DoubleInputSlice. The feature transformer outputs kHalfDimensions values, and this is done twice, once for each perspective, so in the end it produces kHalfDimensions*2 values. What DoubleInputSlice does is take N < kHalfDimensions values from the first kHalfDimensions values and N values from the second kHalfDimensions values. Effectively it allows splitting the output of the feature transformer into two disjoint sets.
  2. SemiAffineTransform (or maybe better named MultiAffineTransform). This layer replaces one big affine transform with multiple smaller ones. While this reduces the expressive power, it allows reusing the full feature transformer output instead of just some part of it, while being faster than a single affine transform. This approach can be further improved by changing the layout, and will be improved in this manner before the final version. The improved version would change the layout from 0000111122223333 to 0123012301230123 so that it requires fewer horizontal additions in the end and is better suited for SIMD (note that we are free to choose what we sum where; we're not constrained by any particular shape).
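
As a rough illustration of the second layer type, here is a sketch of a grouped affine transform. It is not the code from the branch: `GroupedAffineSketch`, `blocked_group` and `interleaved_group` are made-up names, and the sketch assumes that each group of outputs only sees its own contiguous block of inputs, and that the digit strings above give the group index of each input.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Group index of input i in the blocked layout (0000111122223333).
constexpr std::size_t blocked_group(std::size_t i, std::size_t inputs,
                                    std::size_t groups) {
    return i / (inputs / groups);
}

// Group index of input i in the interleaved layout (0123012301230123).
constexpr std::size_t interleaved_group(std::size_t i, std::size_t groups) {
    return i % groups;
}

// Hypothetical grouped affine layer using the blocked layout:
// In * Out / Groups multiply-adds instead of In * Out for a dense layer.
template <std::size_t In, std::size_t Out, std::size_t Groups>
struct GroupedAffineSketch {
    static_assert(In % Groups == 0 && Out % Groups == 0,
                  "groups must divide inputs and outputs evenly");
    static constexpr std::size_t kInPerGroup = In / Groups;
    static constexpr std::size_t kOutPerGroup = Out / Groups;

    std::array<std::int32_t, Out> biases{};
    // weights[o][k]: weight of the k-th input inside output o's group.
    std::array<std::array<std::int8_t, kInPerGroup>, Out> weights{};

    std::array<std::int32_t, Out>
    propagate(const std::array<std::uint8_t, In>& input) const {
        std::array<std::int32_t, Out> out = biases;
        for (std::size_t o = 0; o < Out; ++o) {
            // First input belonging to the group that output o is part of.
            const std::size_t base = (o / kOutPerGroup) * kInPerGroup;
            for (std::size_t k = 0; k < kInPerGroup; ++k)
                out[o] += weights[o][k] * input[base + k];
        }
        return out;
    }
};
```

Under these assumptions a 512->16 layer split into 4 groups needs 512 * 16 / 4 = 2048 multiply-adds, compared with 8192 for a dense 512->16 layer.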

There were 3 approaches coded:

  1. multinet branch https://github.com/Sopel97/Stockfish/tree/multinet - Main network works as usual. The secondary network uses DoubleInputSlice to take a part of the feature transformer output and then does full affine transforms on top. Initial tests were on a buggy version, so there are no definitive results, but I don't think a reasonable net can be taught from just a part of the (frozen) feature transformer.
  2. multinet2 branch https://github.com/Sopel97/Stockfish/tree/multinet2 - Main network works as usual. The secondary network uses the full feature transformer output but starts with a faster SemiAffineTransform layer (by default it does 4 separate smaller affine transforms), also with fewer outputs (512->16->16->1 instead of 512->32->32->1).
  3. multinet3 branch https://github.com/Sopel97/Stockfish/tree/multinet3 - The feature transformer is made larger; it now contains N outputs per feature for the main net and M outputs per feature for the secondary net. The first try is 256+128. The main net uses DoubleInputSlice to view the first 256 values of each perspective's output (256*2 in total); the secondary network views the remaining 128 values of each perspective (128*2 in total) and has smaller affine transforms.
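
For the third approach, the index ranges each net would look at can be worked out as below. This is only a sketch under an assumed memory layout (each perspective stores its 256 main values followed by its 128 secondary values, and the two perspectives are stored back to back); `Range` and `slice_range` are made-up names.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMainHalf = 256;       // main-net outputs per perspective
constexpr std::size_t kSecondaryHalf = 128;  // secondary-net outputs per perspective
constexpr std::size_t kStride = kMainHalf + kSecondaryHalf;  // 384 per perspective

// Half-open index range [begin, end) into the feature transformer output.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Range seen in `perspective` (0 or 1) by a slice that starts at `offset`
// inside that perspective and is `half_size` values wide.
constexpr Range slice_range(std::size_t perspective, std::size_t offset,
                            std::size_t half_size) {
    return {perspective * kStride + offset,
            perspective * kStride + offset + half_size};
}

// Main net: first 256 values of each perspective.
constexpr Range main_range(std::size_t perspective) {
    return slice_range(perspective, 0, kMainHalf);
}

// Secondary net: the remaining 128 values of each perspective.
constexpr Range secondary_range(std::size_t perspective) {
    return slice_range(perspective, kMainHalf, kSecondaryHalf);
}

static_assert(secondary_range(1).end == 2 * kStride,
              "the four ranges cover the whole 768-value output");
```

In terms of the `<HalfSize, Stride, Offset=0>` arguments mentioned earlier in the thread, this would correspond to something like `<256, 384, 0>` for the main net and `<128, 384, 256>` for the secondary one, if the branch uses that layout.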

The third approach is the most promising and sound. Hopefully it turns out well.

I'm currently feeling very sick so I will pause my work on multinets for now and not post updates.
Feel free to continue without me.
