Stockfish 🚀 - NNUE merge | bleepingcoder.com

some tasks needed for integration:

[x] create network server with download/upload api
- [x] merge https://github.com/glinscott/fishtest/pull/736
- [x] delete now unused repo (https://github.com/official-stockfish/networks)
- [x] make readme documenting process for NNUE testing (https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#nnue-net-tests)
- [x] all nets use a nn-sha256sum[12 digits].bin name for reference from engine
- [x] upload API (https://tests.stockfishchess.org/upload)
- [x] add first net (and more: https://tests.stockfishchess.org/nns)
[x] create nnue-player-wip branch in Stockfish repo, work in this branch till ready for a merge.
- [x] create branch (https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip)
- [x] review code comments throughout
- [x] go through open issues
- [x] remove non-player related code
- [x] properly adjust readme
- [x] clamp abs(NNUE eval) < VALUE_TB_WIN_IN_MAX_PLY
- [x] fix eval / evalnn output
- [x] no compiler warning generated
- [x] fix debug=yes build
- [x] passes the usual CI tests (sanitizers, valgrind, etc) when run by hand
  - [x] verify bench
  - [x] verify perft
  - [x] verify reprosearch
  - [x] instrumented valgrind
  - [x] instrumented valgrind-thread (very slow! (multiple hours))
  - [x] instrumented sanitizer-undefined
  - [x] instrumented sanitizer-thread
- [x] verify equal bench on all (18!) architectures supported (done except for armv7, armv8, ppc-32)
- [x] adjust bench to cover both handcrafted and NNUE evaluation (needs solution for network download as part of CI)
- [x] consider random / synthetic network to perform bench / pgo.
- [x] decide when to init_nnue() (startup, EvalFile, Use NNUE, is_ready, cmdline, ucinewgame, ....), add assert.
- [x] changes in core engine
  - [x] minimize
  - [x] coding style
  - [x] minimize performance impact on handcrafted running style
- [x] UCI default net name matching a net in networks repo.
- [x] complete/maintain AUTHORS file
- [x] integrate in CI
  - [x] deal with std::aligned_alloc
  - [x] perform NNUE testing in CI (needs net download solution)
  - [x] update appveyor / MSCV to use C++17
  - [x] update travis environment to support C++17
  - [x] unfreed memory (fixed with newer toolchain)
  - [x] clang build (needs at least clang 7.0)
  - [x] gcc build (needs at least gcc 7.4)
  - [x] valgrind uninit memory on loading TB
  - [x] valgrind uninit memory
  - [x] memset
  - [x] memcpy
- [x] figure out reason for bench change wrt to nodchip repo (we are up-to-date with master on the branch).
- [x] test performance locally.
- [x] Figure out if we need to update Position objects on a 'Use NNUE' change (https://github.com/official-stockfish/Stockfish/issues/2859)
- [x] SPRT test classical eval impacted less than 5 Elo on fishtest (looking good after https://github.com/official-stockfish/Stockfish/pull/2873, STC: https://tests.stockfishchess.org/tests/view/5f2906a2a5abc164f05e4c5b LTC: https://tests.stockfishchess.org/tests/view/5f2906cfa5abc164f05e4c5d)
- [x] SPRT test NNUE more than 30 Elo gain on fishtest (skipping SPRT. See >80Elo gain: https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c and https://tests.stockfishchess.org/tests/view/5f290229a5abc164f05e4c58)
- [x] Merge master
- [x] squash commits ?
- [x] create PR
[x] update fishtest to support NNUE testing
- [x] communicate / document / update compiler requirements to C++17
- [x] communicate / document removal of custom_make.txt
- [x] download networks from network server as required by binary.
- [x] build code with proper architecture (avx2, bmi2, ?)
- [x] instruct cutechess to do the right thing (i.e. Use NNUE, find network)
- [x] adjust minimal nps for running ?
- [x] decide how to deal with hardware that is slower for nnue.
[x] contact abrok, possible new targets for build (but see https://abrok.eu/stockfish-nnue/)

vondele on 25 Jul 2020

🎉11

Let's use this issue specifically for synchronizing work on these or related tasks. General discussion use https://github.com/official-stockfish/Stockfish/issues/2728

vondele on 25 Jul 2020

debug=yes build is broken it seems.

vondele on 25 Jul 2020

A quick verification / benchmark (single bench run) on the targets I can test easily (on zen2 architecture), gives a feeling for the order of magnitudes. Good thing is all benches match. If people can test x86-64-avx512 armv7 armv8 ppc-32 ppc-64 that would be appreciated.

| target | nodes | nodes NNUE | nps | nps NNUE |
| --: | --: | --: | --: | --: |
| general-32 | 4578298 | 3377227 | 1725705 | 134486 |
| x86-32-old | 4578298 | 3377227 | 1783520 | 133497 |
| x86-32 | 4578298 | 3377227 | 1758178 | 133635 |
| general-64 | 4578298 | 3377227 | 2423662 | 308000 |
| x86-64 | 4578298 | 3377227 | 2485503 | 436108 |
| x86-64-modern | 4578298 | 3377227 | 2611693 | 424648 |
| x86-64-sse3 | 4578298 | 3377227 | 2470749 | 1131778 |
| x86-64-sse3-popcnt | 4578298 | 3377227 | 2675802 | 1199299 |
| x86-64-ssse3 | 4578298 | 3377227 | 2577870 | 1166975 |
| x86-64-sse41 | 4578298 | 3377227 | 2445672 | 1243914 |
| x86-64-sse42 | 4578298 | 3377227 | 2549163 | 1113126 |
| x86-64-avx2 | 4578298 | 3377227 | 2678933 | 1635461 |
| x86-64-bmi2 | 4578298 | 3377227 | 1906829 | 1411884 |

Edit: since 'x86-64-modern' on master enables both sse3 and popcnt already, we can easily adjust that one (actually, incorrect, needs ssse3 not sse3, the target x86-64-sse3 enables both, increased modern to require ssse3).

vondele on 25 Jul 2020

Fix for debug build: PR #2827.

dorzechowski on 25 Jul 2020

👍1

Some more tasks as seen from the CI (patches welcome):

[DONE] appveyor testing needs to add '/std:c++17' somewhere in the right place (needs solution for aligned_alloc with MSCV)
clang: nnue/evaluate_nnue.cpp:60:45: error: no member named 'aligned_alloc' in namespace 'std' (on osx)
[DONE] position.cpp:199:39: error: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘struct StateInfo’; use assignment or value-initialization instead [-Werror=class-memaccess]
[DONE] position.cpp:714:51: error: ‘void* memcpy(void, const void, size_t)’ writing to an object of a non-trivial type ‘struct StateInfo’ leaves 1272 bytes unchanged [-Werror=class-memaccess]
[DONE] unfreed memory: in use at exit: 72,704 bytes in 1 blocks (possibly only with older version of toolchain)
[DONE] valgrind: uninit memory access: at 0x426526: Position::set(std::__cxx11::basic_string, std::allocator > const&, bool, bool, StateInfo, Thread) (position.cpp:222)

vondele on 25 Jul 2020

* clang: nnue/evaluate_nnue.cpp:60:45: error: no member named 'aligned_alloc' in namespace 'std'

Compiler bug. std::aligned_alloc is standard C++17.

sf-x on 25 Jul 2020

or header missing?

vondele on 25 Jul 2020

hmm. std::aligned_alloc should exist after #include <cstdlib> which exists in "types.h", which is included from "position.h", which is included from nnue/evaluate_nnue.cpp...

sf-x on 25 Jul 2020

OK, needs clang 7.0 (https://godbolt.org/z/rE3joM), CI is still based on 6.0. .travis.yml needs updating
gcc needs to be >= 7.4

vondele on 25 Jul 2020

I have add a commit to upgrade the travis settings on Linux... probably somebody needs to come with a similar fix for the osx of our CI.

Edit: seems like this upgrade didn't work. I'll revert that patch and we'll address the issue later

vondele on 25 Jul 2020

@vondele

I'm afraid that assemblies for the x86_64 and x86_32 architecture (without sse3 or popcnt) are useless for testing in the framework, since they are much slower than the classic Stockfish assemblies for the same architecture and are much weaker. Testing them can significantly affect the quality of the tests.

JavaMast on 25 Jul 2020

almost certainly all machines on fishtest support x86-64-modern (which is right now equivalent to x86-64-sse3-popcnt) on the branch. The majority of the machines support avx2 or bmi2. We'll have to deal with that eventually, but right now it is not too much of a concern.

vondele on 25 Jul 2020

On compiling for macOS

On macOS 10.15.6 (Catalina) make build ARCH=x86-64-bmi2 COMP=clang has two issues:

The first one: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/malloc/_malloc.h:50:10: note: 'aligned_alloc' has been marked as being introduced in macOS 10.15 here, but the deployment target is macOS 10.9.0

So we need to change -mmacosx-version-min=10.9 in the Makefile to -mmacosx-version-min=10.15 but according to statcounter only ~52% of Mac users are on 10.15.

If we _do_ change the min version, second issue:
nnue/evaluate_nnue.cpp:60:40: error: no member named 'aligned_alloc' in namespace 'std'; did you mean simply 'aligned_alloc'?
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/malloc/_malloc.h:50:10: note: 'aligned_alloc' declared here

It seems like the C++ library that ships with Xcode dev tools has aligned_alloc in the global namespace. Could maybe get around this with a #if __APPLE__.

daylen on 25 Jul 2020

Fix for memset/memcpy warnings and valgrind issue: PR #2830.

I can't see unfreed memory issue. Could you check again after applying these fixes?

dorzechowski on 26 Jul 2020

👍1

error on build

$ make  build ARCH=x86-64 COMP=mingw

Config:
debug: 'no'
sanitize: 'no'
optimize: 'yes'
arch: 'x86_64'
bits: '64'
kernel: 'MINGW64_NT-10.0-18363'
os: 'Windows_NT'
prefetch: 'yes'
popcnt: 'no'
sse: 'yes'
sse3: 'no'
ssse3: 'no'
sse41: 'no'
sse42: 'no'
avx2: 'no'
pext: 'no'
avx512: 'no'

Flags:
CXX: g++
CXXFLAGS: -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2
LDFLAGS:  -static

Testing config sanity. If this fails, try 'make help' ...

C:/Program Files/Git/mingw64/bin/make.exe ARCH=x86-64 COMP=mingw all
make[1]: Entering directory 'C:/Users/MichaelB7/home/Github/Stockfish/src'
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o benchmark.o benchmark.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o bitbase.o bitbase.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o bitboard.o bitboard.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o endgame.o endgame.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o evaluate.o evaluate.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o main.o main.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o material.o material.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o misc.o misc.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o movegen.o movegen.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o movepick.o movepick.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o pawns.o pawns.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o position.o position.cpp
position.cpp: In member function 'Position& Position::set(const string&, bool, bool, StateInfo*, Thread*)':
position.cpp:199:39: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct StateInfo'; use assignment or value-initialization instead [-Wclass-memaccess]
  199 |   std::memset(si, 0, sizeof(StateInfo));
      |                                       ^
In file included from position.cpp:31:
position.h:40:8: note: 'struct StateInfo' declared here
   40 | struct StateInfo {
      |        ^~~~~~~~~
position.cpp: In member function 'void Position::do_move(Move, StateInfo&, bool)':
position.cpp:714:51: warning: 'void* memcpy(void*, const void*, size_t)' writing to an object of a non-trivial type 'struct StateInfo' leaves 1272 bytes unchanged [-Wclass-memaccess]
  714 |   std::memcpy(&newSt, st, offsetof(StateInfo, key));
      |                                                   ^
In file included from position.cpp:31:
position.h:40:8: note: 'struct StateInfo' declared here
   40 | struct StateInfo {
      |        ^~~~~~~~~
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o psqt.o psqt.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o search.o search.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o thread.o thread.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o timeman.o timeman.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o tt.o tt.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o uci.o uci.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o ucioption.o ucioption.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o tune.o tune.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o syzygy/tbprobe.o syzygy/tbprobe.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o nnue/evaluate_nnue.o nnue/evaluate_nnue.cpp
nnue/evaluate_nnue.cpp: In function 'void Eval::NNUE::Detail::Initialize(Eval::NNUE::AlignedPtr<T>&)':
nnue/evaluate_nnue.cpp:60:45: error: 'aligned_alloc' is not a member of 'std'; did you mean 'aligned_union'?
   60 |     pointer.reset(reinterpret_cast<T*>(std::aligned_alloc(alignof(T), sizeof(T))));
      |                                             ^~~~~~~~~~~~~
      |                                             aligned_union
make[1]: *** [<builtin>: nnue/evaluate_nnue.o] Error 1
make[1]: Leaving directory 'C:/Users/MichaelB7/home/Github/Stockfish/src'
make: *** [Makefile:595: build] Error 2

MichaelB7@VM-894787 MINGW64 ~/home/Github/Stockfish/src (nnue-player-wip)

MichaelB7 on 26 Jul 2020

@MichaelB7 Please provide more details, such as what version of g++ you have and your OS/platform. Also, you could try https://github.com/official-stockfish/Stockfish/pull/2831 and see if that fixes your issue—I think the PR makes the aligned_alloc issue more compatible in general for platforms other than macOS.

daylen on 26 Jul 2020

$ stockfish
Stockfish 250720 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
compiler

Compiled by g++ (GNUC) 10.1.0 on MinGW64
__VERSION__ macro expands to: 10.1.0

Windows 10 Pro
Version 10.0.18363 Build 18363
will try your patch

MichaelB7 on 26 Jul 2020

Update: The patch did not resolve the issue:

$ make  build ARCH=x86-64 COMP=mingw

Config:
debug: 'no'
sanitize: 'no'
optimize: 'yes'
arch: 'x86_64'
bits: '64'
kernel: 'MINGW64_NT-10.0-18363'
os: 'Windows_NT'
prefetch: 'yes'
popcnt: 'no'
sse: 'yes'
sse3: 'no'
ssse3: 'no'
sse41: 'no'
sse42: 'no'
avx2: 'no'
pext: 'no'
avx512: 'no'

Flags:
CXX: g++
CXXFLAGS: -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2
LDFLAGS:  -static

Testing config sanity. If this fails, try 'make help' ...

C:/Program Files/Git/mingw64/bin/make.exe ARCH=x86-64 COMP=mingw all
make[1]: Entering directory 'C:/Users/MichaelB7/home/Github/Stockfish/src'
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2   -c -o nnue/evaluate_nnue.o nnue/evaluate_nnue.cpp
nnue/evaluate_nnue.cpp: In function 'void Eval::NNUE::Detail::Initialize(Eval::NNUE::AlignedPtr<T>&)':
nnue/evaluate_nnue.cpp:60:40: error: there are no arguments to 'aligned_malloc' that depend on a template parameter, so a declaration of 'aligned_malloc' must be available [-fpermissive]
   60 |     pointer.reset(reinterpret_cast<T*>(aligned_malloc(alignof(T), sizeof(T))));
      |                                        ^~~~~~~~~~~~~~
nnue/evaluate_nnue.cpp:60:40: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
nnue/evaluate_nnue.cpp: In instantiation of 'void Eval::NNUE::Detail::Initialize(Eval::NNUE::AlignedPtr<T>&) [with T = Eval::NNUE::FeatureTransformer; Eval::NNUE::AlignedPtr<T> = std::unique_ptr<Eval::NNUE::FeatureTransformer, Eval::NNUE::AlignedDeleter<Eval::NNUE::FeatureTransformer> >]':
nnue/evaluate_nnue.cpp:79:43:   required from here
nnue/evaluate_nnue.cpp:60:54: error: 'aligned_malloc' was not declared in this scope; did you mean '_aligned_malloc'?
   60 |     pointer.reset(reinterpret_cast<T*>(aligned_malloc(alignof(T), sizeof(T))));
      |                                        ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
      |                                        _aligned_malloc
nnue/evaluate_nnue.cpp: In instantiation of 'void Eval::NNUE::Detail::Initialize(Eval::NNUE::AlignedPtr<T>&) [with T = Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::InputSlice<512>, 32> >, 32> >, 1>; Eval::NNUE::AlignedPtr<T> = std::unique_ptr<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::InputSlice<512>, 32> >, 32> >, 1>, Eval::NNUE::AlignedDeleter<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::InputSlice<512>, 32> >, 32> >, 1> > >]':
nnue/evaluate_nnue.cpp:80:31:   required from here
nnue/evaluate_nnue.cpp:60:54: error: 'aligned_malloc' was not declared in this scope; did you mean '_aligned_malloc'?
   60 |     pointer.reset(reinterpret_cast<T*>(aligned_malloc(alignof(T), sizeof(T))));
      |                                        ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
      |                                        _aligned_malloc
make[1]: *** [<builtin>: nnue/evaluate_nnue.o] Error 1
make[1]: Leaving directory 'C:/Users/MichaelB7/home/Github/Stockfish/src'
make: *** [Makefile:595: build] Error 2

But if I make the suggested change from the error message above , it worked.

replacing this on line 60:

  pointer.reset(reinterpret_cast<T*>(aligned_alloc(alignof(T), sizeof(T))));

with this:

  pointer.reset(reinterpret_cast<T*>(_aligned_malloc(alignof(T), sizeof(T))));

MichaelB7 on 26 Jul 2020

@dorzechowski one more valgrind warning triggers when loading TB files:

setoption name SyzygyPath value ../tests/syzygy/
==122921== Conditional jump or move depends on uninitialised value(s)
==122921==    at 0x13DF43: Position::set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool, StateInfo*, Thread*) (position.cpp:222)
==122921==    by 0x13EFDE: Position::set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Color, StateInfo*) (position.cpp:407)
==122921==    by 0x16D412: (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>::TBTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (tbprobe.cpp:366)
==122921==    by 0x173AB9: void __gnu_cxx::new_allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> >::construct<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>((anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (new_allocator.h:147)
==122921==    by 0x171CFC: void std::allocator_traits<std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::construct<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> >&, (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (alloc_traits.h:484)
==122921==    by 0x171D97: void std::deque<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::_M_push_back_aux<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (deque.tcc:496)
==122921==    by 0x170C2F: (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>& std::deque<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::emplace_back<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (deque.tcc:174)
==122921==    by 0x16DAFB: (anonymous namespace)::TBTables::add(std::vector<PieceType, std::allocator<PieceType> > const&) (tbprobe.cpp:489)
==122921==    by 0x16F2A6: Tablebases::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (tbprobe.cpp:1367)
==122921==    by 0x167A8C: UCI::on_tb_path(UCI::Option const&) (ucioption.cpp:44)
==122921==    by 0x1699F0: UCI::Option::operator=(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (ucioption.cpp:202)
==122921==    by 0x163837: (anonymous namespace)::setoption(std::__cxx11::basic_istringstream<char, std::char_traits<char>, std::allocator<char> >&) (uci.cpp:100)
==122921== 
==122921== Conditional jump or move depends on uninitialised value(s)
==122921==    at 0x13DF43: Position::set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool, StateInfo*, Thread*) (position.cpp:222)
==122921==    by 0x13EFDE: Position::set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Color, StateInfo*) (position.cpp:407)
==122921==    by 0x16D62F: (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>::TBTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (tbprobe.cpp:385)
==122921==    by 0x173AB9: void __gnu_cxx::new_allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> >::construct<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>((anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (new_allocator.h:147)
==122921==    by 0x171CFC: void std::allocator_traits<std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::construct<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> >&, (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (alloc_traits.h:484)
==122921==    by 0x171D97: void std::deque<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::_M_push_back_aux<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (deque.tcc:496)
==122921==    by 0x170C2F: (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>& std::deque<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::emplace_back<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (deque.tcc:174)
==122921==    by 0x16DAFB: (anonymous namespace)::TBTables::add(std::vector<PieceType, std::allocator<PieceType> > const&) (tbprobe.cpp:489)
==122921==    by 0x16F2A6: Tablebases::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (tbprobe.cpp:1367)
==122921==    by 0x167A8C: UCI::on_tb_path(UCI::Option const&) (ucioption.cpp:44)
==122921==    by 0x1699F0: UCI::Option::operator=(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (ucioption.cpp:202)
==122921==    by 0x163837: (anonymous namespace)::setoption(std::__cxx11::basic_istringstream<char, std::char_traits<char>, std::allocator<char> >&) (uci.cpp:100)
==122921== 
info string Found 35 tablebases

Edit: this should be fixed by:

diff --git a/src/position.cpp b/src/position.cpp
index 52ba710ae..8f2c4b029 100644
--- a/src/position.cpp
+++ b/src/position.cpp
@@ -404,7 +404,7 @@ Position& Position::set(const string& code, Color c, StateInfo* si) {
   string fenStr = "8/" + sides[0] + char(8 - sides[0].length() + '0') + "/8/8/8/8/"
                        + sides[1] + char(8 - sides[1].length() + '0') + "/8 w - - 0 10";

-  return set(fenStr, false, use_nnue(), si, nullptr);
+  return set(fenStr, false, false, si, nullptr);
 }

This set() method is basically just there to create a materials key, and starts from a freshly created Position() object.

vondele on 26 Jul 2020

@MichaelB7 Thank you. I can now at least compile. However, I believe the parameters for aligned_alloc and _aligned_malloc should be reversed. Also we should probably match it with _aligned_free(ptr) instead of std::free(ptr).

mstembera on 26 Jul 2020

first green CI badge on Linux/gcc
https://travis-ci.org/github/official-stockfish/Stockfish/jobs/711914367

vondele on 26 Jul 2020

🚀4

@vondele Another thing to ascertain before we can use NNUE in SF is a license to use it. As nodchip's implementation doesn't have any license attached, I'm not sure we can even use it legally now. We assume so but it's not enough I think. This should be solved together with @nodchip and those files should have either GPLv3 or compatible license attached.

dorzechowski on 26 Jul 2020

@dorzechowski would be great @nodchip confirms this, but as the code (binaries and sources) are being distributed already, the license is GPL, this is also what the repository Copying file in the repository mentions.

vondele on 26 Jul 2020

Meanwhile also green on Linux/clang : https://travis-ci.org/github/official-stockfish/Stockfish/jobs/711961757

vondele on 26 Jul 2020

@dorzechowski would be great @nodchip confirms this, but as the code (binaries and sources) are being distributed already, the license is GPL, this is also what the repository Copying file in the repository mentions.

That's great. All this licensing can be very confusing.

dorzechowski on 26 Jul 2020

I can confirm the x86-64-avx512 gives the same 3377227 bench as above.

mstembera on 27 Jul 2020

👍1

@vondele I believe task "remove non-player related code" is done already. I see nothing that could be removed without breaking it. Of course possible optimizations is another thing.

dorzechowski on 27 Jul 2020

@dorzechowski OK, I'll tick that box, assuming you had a careful look.

vondele on 27 Jul 2020

I have also verified bench on linux-rhel7-ppc64le using gcc version 9.1.0 (GCC)

There are warnings however:

nnue/evaluate_nnue.cpp: In function ‘Value Eval::NNUE::ComputeScore(const Position&, bool)’:
nnue/evaluate_nnue.cpp:124:61: warning: requested alignment 64 is larger than 16 [-Wattributes]
  124 |         transformed_features[FeatureTransformer::kBufferSize];
      |                                                             ^
nnue/evaluate_nnue.cpp:126:61: warning: requested alignment 64 is larger than 16 [-Wattributes]
  126 |     alignas(kCacheLineSize) char buffer[Network::kBufferSize];
      |                                                             ^

seemingly that's not a bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68267) and gcc provides a __BIGGEST_ALIGNMENT__ macro which is arch specific (https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html). It is however different from 64 for most arch I tried (except knl).

Maybe there is an easy solution?

vondele on 27 Jul 2020

@vondele Another thing to ascertain before we can use NNUE in SF is a license to use it. As nodchip's implementation doesn't have any license attached, I'm not sure we can even use it legally now. We assume so but it's not enough I think. This should be solved together with @nodchip and those files should have either GPLv3 or compatible license attached.

AFAIK YaneuraOu is the original author and wrote about it with his full support:

http://yaneuraou.yaneu.com/2020/06/21/stockfish-nnue-subjective-review/

https://twitter.com/nodchip/status/1286947373374087168/retweets/with_comments

gekkehenker on 28 Jul 2020

YaneuraOu are distributed under GPLv3.
https://github.com/yaneurao/YaneuraOu
I also intended that the ported code and code that I wrote are distributed under GPLv3. If there are additional works to express that those code are distributed under GPLv3, please let me know.

nodchip on 28 Jul 2020

Do we know why the bench of the @nodchip branch with the default net is 3909820 but our nnue-player-wip is 3377227?
The reason I ask is because I played a local STC match between SF and SFNNUE and after 100 games SF won them all. Obviously something seems broken. @dorzechowski Did you do any play testing w/ your branch and what were the results?

mstembera on 28 Jul 2020

@nodchip thanks for confirming the license. There is nothing particular you need to do, I might add a copyright header to the added files, that makes things clear.

vondele on 28 Jul 2020

@mstembera I think I did play with the branch, and had good results. Obviously that's something to clarify. I'm not sure both branches have the same version of master merged.

vondele on 28 Jul 2020

@mstembera quick check of the current status of master vs branch looks good for me:

Score of master vs nnue: 17 - 49 - 84  [0.393] 150
Elo difference: -75.3 +/- 36.7, LOS: 0.0 %, DrawRatio: 56.0 %

vondele on 28 Jul 2020

@vondele Thanks. I tried testing using the chessbase gui and all is fine there. Seems I only have the issue when using cutechess-cli. If I figure out what I'm doing wrong I'll report.

mstembera on 28 Jul 2020

I have used:

./cutechess-cli -repeat -rounds 10000 -games 2 -tournament gauntlet -resign movecount=3 score=400 -draw movenumber=34 movecount=8 score=20 -concurrency 15 -openings file=noob_3moves.epd format=epd order=random plies=16  -engine name=master cmd=stockfish.master -engine name=nnue cmd=stockfish.nnue option.EvalFile=/home/vondele/chess/match/nn-c157e0a5755b.bin option."Use NNUE"=true -ratinginterval 1 -each tc=10.0+0.1 proto=uci option.Threads=1 -pgnout nnue.pgn

If I have to guess, most likely missing the full path to EvalFile ?

Meanwhile results with a few more games:

Score of master vs nnue: 498 - 1019 - 1879  [0.423] 3396
Elo difference: -53.7 +/- 7.8, LOS: 0.0 %, DrawRatio: 55.3 %

vondele on 28 Jul 2020

Those are encouraging results.
I wasn't using the EvalFile path because I have the net in the same directory as the .exe Adding it anyway seems to make no difference. When I test nnue vs nnue things seem normal except the draw ration is only about 10%. Most wins are by adjudication as normal.

mstembera on 28 Jul 2020

@mstembera I think cutechess has the current working directory independent of where the engine binary is, and as a result the binary doesn't find the EvalFile, unless given explicitly with the full path. Adding -debug should show that in the output. After printing the error message the code continues, with a zero-ed net.

vondele on 28 Jul 2020

@vondele Yes that was it. Thanks! I would have never guessed cutechess was overriding the current directory. I wonder if we should just fail when "Use NNUE"=true but the network fails to load?

mstembera on 28 Jul 2020

you are touching on a difficult point, and that is how to deal with the net in general.

For example, do we expect people who build the code to download the default net to the src directory ? This would make it relatively easy to enable profile-build. However, to figure out the name of the default net, one should first build the binary, execute uci and look at EvalFile. So, require a build, afterward a profile-build.

Similar question for the bench command. One could test both NNUE and handcrafted eval in a single bench, if we assume the developer has the default net in the src directory. Similar question of course if 'Use NNUE' would default to true.

Probably that's something we could expect for developers, but I'm sure we will all be bitten by this requirement at some point. Other opinions?

vondele on 28 Jul 2020

Yes this isn't straight forward. I think it's important than when inevitably something goes wrong it doesn't happen silently. This makes it much harder for a person to track down. The current net is about 21MB and compresses to about 8MB. Is it out of the question to simply add it to the source directory along with the source files?

mstembera on 28 Jul 2020

👍1

@vondele Here is a commit by @nodchip to solve the issue of allowing the net path to be relative to the binary.
https://github.com/nodchip/Stockfish/commit/fbdb373b6482db2b462b30d7399c3c7fed1d4f26

mstembera on 28 Jul 2020

will profile build without net in directory give same optimizer compiler flags as with net in directory? does it matter if use nnue is false default?
agree with @mstembera - good to include default net in source directory and have net in same directory as exe by default

stockchess on 28 Jul 2020

@mstembera the problem with adding to the source directory is that the git repo will become very large to clone 20Mb with each new net. We might also run into github bandwidth limits if there are too many downloads. However, I don't think we should add the code you referenced for 'finding' the binary. This is OS/GUI specific, and right now, the problem is not very different from specifying the syzygy directory. Also in that case a relative path doesn't work.

@stockchess there is some impact of having NNUE false by default for the profile-build ... if the code is not executed the profiling will not gather statistics for it, and thus optimize less. The difference is likely not spectacular, but will be visible.

vondele on 28 Jul 2020

👍2

Meanwhile two tests with using nnue=false show some overhead, wrt to master.

https://tests.stockfishchess.org/tests/view/5f1fddb7c09435d870cb9f15
https://tests.stockfishchess.org/tests/view/5f1ff3bdc09435d870cb9f25

it is just below the self-put target of 5 Elo, but would be nice to improve upon this.

vondele on 28 Jul 2020

@vondele Can't the old net be purged from the repo each time we add a new one so that it never has more than one and doesn't grow?

mstembera on 28 Jul 2020

technically, yes, but that's kind of opposite of what a version control system like git wants, and certainly not good style
https://docs.github.com/en/github/managing-large-files/removing-files-from-a-repositorys-history

vondele on 28 Jul 2020

@mstembera Yes, I played a lot of test matches in cutechess-cli, LittleBlitzer and Arena and the results were as good as reported by other people. I don't have any Chessbase GUI to test. I also carefully checked that bench didn't change after I removed/refactored some code. I also updated it to latest master. It's still possible that I've missed something but I'm pretty confident I haven't.

dorzechowski on 28 Jul 2020

yes, I checked that the nodchip branch and our branch are just diverged a little due to patches on master being present on our branch. The difference in bench is thus likely not important.

vondele on 28 Jul 2020

@vondele Maybe we could create a very small net that may be not any good for playing but would verify the nnue bench signature? For example, if in first layer we would use only 8x2 parameters instead of current 256x2, it would shrink the size to below 700kB if I calculated correctly. For sure it could be even smaller. Then we could just add it to the sources and always use it for bench.
Networks will be distinguished by sha and we would always have an info which one is the latest so also running bench on it would be redundant.

dorzechowski on 28 Jul 2020

for profile-build it would be nice to have the same network structure as the actual run. For bench/pgo One could even use a zero or better pseudo-random network.... I wonder if somebody can derive an algorithm to prefill a network such that it gives material count only?
Edit: shouldn't it be quite doable to get material + psqt in the first layer of the net?

vondele on 28 Jul 2020

I'll merge the proposed change of extension of network data https://github.com/official-stockfish/Stockfish/issues/2849 (any extension will still work, but it is a meaningful change).

vondele on 28 Jul 2020

OK, but I thought we just want to avoid having big net in the sources. Another idea: add a small tool that will produce a reference network on user side filled perhaps with random data (but obviously always the same). Then compile and run it before compiling Stockfish and running bench. Then downloading an official net would be done after that passes.

dorzechowski on 28 Jul 2020

@dorzechowski yes, this tool could be inside the binary. See comment above, I think it should even be possible to have better than random, e.g. material (+psqt?). It would be actually quite useful to have. @nodchip do you have an idea if it would be possible to initialize a net such that it gives material difference as an output ?

vondele on 28 Jul 2020

That would also require re-adding some code used for net creation that is not there now.

dorzechowski on 28 Jul 2020

well some code to initialize the net data-structure (instead of reading the file) not creating/optimizing nets/files etc. I suspect one can write the network coefficients just analytically for material difference only, but I don't know if this is true, or how difficult that would be.

vondele on 28 Jul 2020

I may take a look at it later. Of course we don't need whole net generation machinery but I think at least a function to write header and parameters needs to be re-included (now we have only read functions there). I also think that material only net is not very good for bench as it's not sensitive enough imo. Such eval relatively rarely changes and many changes in the code could produce the same bench. Pseudo-random net would probably be better and much easier to produce taking also into account possible different net architectures down the way..

dorzechowski on 28 Jul 2020

@dorzechowski actually material only eval is already surprisingly strong (>2400 Elo, https://github.com/vondele/Stockfish/tree/simpleEval, https://lichess.org/@/simpleEval), and would almost certainly be good enough for a bench. In fact, the main advantage over random is that it would exercise search in a more or less realistic way (good for PGO). Of course, one could super-impose a small random component (as the simpleEval is doing actually), or if feasible PSQT. Anyway, I think it would be very valuable to have a synthetic internal net.

vondele on 28 Jul 2020

Yes, I'm aware of simpleEval bot. Material + random component would certainly be good but I think it's an overkill for the following reason: if (or more likely when) we change net architecture or size, the whole concept of how to make the net a material counter would need to be revisited for every such change. This is not a problem for random nets as we wouldn't care about inputs interpretation and we would generate a 'bench' net always in the same way. Or maybe you think that this synthetic net should be fixed once for all? I might be confused there.

dorzechowski on 28 Jul 2020

I agree that having this 'material net' adapt to architecture and size will be a lot (and possibly too much) developer overhead. In that case random might be good enough. I'm just slightly concerned about random being good enough to do PGO. Actually, one could presumably just add the random net output (hoping it is small) to e.g. pos.psq_score() (in Eval::evaluate()) and have something good enough for bench and PGO ? What do you think?

vondele on 28 Jul 2020

I didn’t find any mention nor code about training data for NNUE! Did I miss? Will we have that part? If yes, when? What the plan?

nguyenpham on 28 Jul 2020

The plan is to have the player, the training code will remain https://github.com/nodchip/Stockfish for the time being.

vondele on 28 Jul 2020

I think that it is hard to generate network parameters which outputs material differences. Because the activation function is Clipped ReLU. It will clip big values to 127. I don't come up with any ideas to output material differences with halfkp_256x2-32-32.

nodchip on 28 Jul 2020

uh... It might be possible... We could count the number of the friend's pieces for each piece type with the network parameters between the input layer and the first hidden layer, propagate the numbers to the last hidden layer, and multiply the material value with the network parameter between the last hidden layer and the output layer. Though implementation would be hard...

nodchip on 28 Jul 2020

Alternatively, (if material values are 15 bits) sum 5 most significant bits in one "neuron", 5 next in another and 5 LSB in a third, propagate it to the final layer and sum them there.

sf-x on 28 Jul 2020

IMO, having training and playing code in different repos seems be a trouble in long term. It’s harder to maintain and optimise. Sometimes there will be misunderstandings, delays between teams, some works may become double jobs

nguyenpham on 28 Jul 2020

Actually, one could presumably just add the random net output (hoping it is small) to e.g. pos.psq_score() (in Eval::evaluate()) and have something good enough for bench and PGO ? What do you think?

But psq is not used in bench with Use NNUE=true. We will have different execution paths hence different pgo builds depending of default Use NNUE value. Or you want to always build it with Use NNUE=false only?

dorzechowski on 28 Jul 2020

Actually, one could presumably just add the random net output (hoping it is small) to e.g. pos.psq_score() (in Eval::evaluate()) and have something good enough for bench and PGO ? What do you think?

But psq is not used in bench with Use NNUE=true. We will have different execution paths hence different pgo builds depending of default Use NNUE value. Or you want to always build it with Use NNUE=false only?

the bench could switch (automatically) between Use NNUE=trueand Use NNUE=false if we have the internal ('random') net. PGO would thus always result in the same binary. As far as I know, pos.score_psq() is always available also if NNUE (it is part of put_piece and updated via do_move).

vondele on 28 Jul 2020

But profile-build exercise should resemble normal use as much as possible. Adding stuff to NNUEval only during profiling would be Wrong™.

sf-x on 28 Jul 2020

Alternatively, (if material values are 15 bits) sum 5 most significant bits in one "neuron", 5 next in another and 5 LSB in a third, propagate it to the final layer and sum them there.

material values are small (currently at most 2682).

vondele on 28 Jul 2020

Alternatively, (if material values are 15 bits) sum 5 most significant bits in one "neuron", 5 next in another and 5 LSB in a third, propagate it to the final layer and sum them there.

material values are small (currently at most 2682).

OK, 4 bits per "neuron". It would help to avoid overflow, too.

sf-x on 28 Jul 2020

But profile-build exercise should resemble normal use as much as possible. Adding stuff to NNUEval only during profiling would be Wrong™.

that's actually why adding something would be reasonable, as a surrogate for having a more accurate net. It would add 1 branch difference to the full code.

vondele on 28 Jul 2020

Overall, I simply don't understand the point of a complicated surrogate. Why does one need fast NNUE in PGO build without a good set of eval parameters? And if one is at hand, why not simply use it for PGO build?

sf-x on 28 Jul 2020

@sf-x because we want to simplify the build process, and have it reproducible among users and covering as much as possible the code base (for e.g. bench output). Ideally one can build from sources without any additional download.

vondele on 28 Jul 2020

Why would one want to build and not use?

sf-x on 28 Jul 2020

cause the code will have both handcrafted and NNUE eval at the same time. Maybe one just wants to use the handcrafted part...

vondele on 28 Jul 2020

Change profile command to stockfish <input.txt >/dev/null
where input.txt is something like

setoption name Use NNUE value true
bench

If the user doesn't care for NNUE, let him build it with input.txt containing just bench, or all-zero eval params...

sf-x on 28 Jul 2020

In Eval::evaluate() we could have a third option that returns

  return (NNUE::evaluate(pos) + Evaluation<NO_TRACE>(pos).value()) / 2;

This would solve a couple of issues.
1) We could produce a single bench that depends on both the hand crafted eval and NNUE.
2) We could PGO optimize both code paths in a single build.

This could be used only internally for bench or we could expose all 3 eval options to UCI as something like
Classic, NNUE, Blend.

mstembera on 29 Jul 2020

Any opinions on how we should refer to the existing eval 'handcrafted' or 'classic' or '...' ?

vondele on 29 Jul 2020

'legacy'?

nodchip on 29 Jul 2020

I would go for a term with positive connotation ... the condensed knowledge in the eval is impressive and succinctly expressed.

vondele on 29 Jul 2020

@nodchip more technical question, have people tried 'delta learning' in this context, i.e. learn the difference between 'classical eval' and a search based eval ? I know some other fields where this is successful, but don't know for shogi or chess.

vondele on 29 Jul 2020

Any opinions on how we should refer to the existing eval 'handcrafted' or 'classic' or '...' ?

I called it 'standard' because this type of eval is de facto standard used for decades. I don't particularly like 'classic' or 'legacy' as they hint that it's something old, that maybe was cool one day but not anymore. If NNUE type of eval prevails and becomes new standard, we can refer to old one as 'classic' but not before that imo.

dorzechowski on 29 Jul 2020

If 'delta learning' means 'Temporal difference learning', it is half true. The trainer uses a TDLeaf(λ)-ish method called 'elmo method' to adjust network parameters. In detail, the trainer adjusts network parameters to approximate a search based eval and the game result by a quiescence search base eval.

The detail is written in this slide.
https://www.slideshare.net/HisayoriNoda/in2017-nodchip-2017-1223-nodchip

In formal, the problem is modeled as:

argmin_{ω} L(ω)

where ω and L are the network parameters and an object function, respectively. For simplicity, we assume the evaluation function is a liner function, and ω is the coefficient vector. L(ω) can be expressed as:

L(ω) = Σ_{i} l(ω, φ_{i}, ω', φ'_{i}, w_{i})

where l, i, φ_{i}, ω', φ'_{i} and w_{i} are a loss function, the index of a position in the training data, the feature vector of the leaf node of the quiescence search in the i-th position, the coefficient vector used in the training data generation, the feature vector of the leaf node of the search in the i-th position, the result of the game containing the i-th position.

The trainer uses the liner sum of two cross entropies as the loss function. The first one is cross entropy between the winning probability distribution calculated from a quiescence search base eval and the winning probability distribution calculated from a search base eval. The second one is cross entropy between the winning probability distribution calculated from a quiescence search base eval and the winning probability distribution calculated from a game result (which must be 0.0 or 1.0).

We define p and q using the approximate relationship between the winning probability and the pawn advantage (https://www.chessprogramming.org/Pawn_Advantage,_Win_Percentage,_and_Elo):

p = σ(α w'^{t} φ')
q = σ(α w^{t} φ)

Where α is the coefficient in the wiki page above. We also define t:

t = 1.0 (if the player who moved in the i-th position won the game.)
0.0 (if the player who moved in the i-th position lost the game.)

l can be expressed as:

l = λ H(p, q) + (1.0 - λ) H(t, q)
= λ E_{p}[-log q] + (1.0 - λ) E_{t}[-log q]
= -λ Σ_{x} p(x) log q(x) - (1.0 - λ) Σ_{x} t(x) log q(x)

x takes the two values, win or lose:

= - λ (p log q + (1 - p) log (1 - q)) - (1.0 - λ) (t log q - (1 - t) log (1 - q))

The gradient of l is:
l' = ∂ l / ∂ ω_{i} = - λ (q - p) - (1.0 - λ) (q - t)

The trainer uses this gradient in back propagation to adjust network parameters.

nodchip on 29 Jul 2020

👍3

I would just refer to the old eval as non-NN eval or non-neural evaluation. Then just refer to NNUE’s evaluation as either neural eval, NNUE eval, or NN eval.

neuralEvaluation and nonNeuralEvaluation are both clear and descriptive names. “Use neural evaluation” is a straightforward UCI option.

gonzalezjo on 29 Jul 2020

I think legacy implies something is on the way out more so than classic does. Standard or original are good names too.

@nodchip I wonder about a network trained on top of our normal eval (that is the nn eval added to sf normal eval both during training and play) It would presumably have less information to store and could focus more on nuances rather than basics like material values etc. I'm sure we'll try eventually.

mstembera on 29 Jul 2020

@vondele I'm sorry but I misunderstood your question. The short answer should be yes. When a net author creates the initial net, he or she creates training data by self-play using the classical eval. Each training datum contains a position, the search based eval value, and so on. The trainer uses those information to adjust network parameters.
Could you please correct me if I still don't understand your question?

nodchip on 30 Jul 2020

@mstembera It sounds interesting. We could try it.

nodchip on 30 Jul 2020

@nodchip thanks for the answer on learning (and sorry for the late reply, I was away for two days). What I meant was also what @mstembera suggested, i.e. train a network that is additive to the handcrafted/classic eval, for the same reasons. This would become something quite easy to test after the merge.

vondele on 31 Jul 2020

I don't see how additive eval could work. With additive effect it becomes "try to guess how bad is stockfish for exactly this position" instead of "try to maximize chance of winning this game". Is there any reasoning behind additive effects why would it be better? Additive effect would also make network file totally useless if an eval hole is fixed in next version of stockfish because the thing that it learned (how bad stockfish is for specific position) would change but weights stay same

liamlim on 31 Jul 2020

@vondele I realized that I still misunderstood your question. I understand that your question is if we can create a net which outputs the difference between a classic eval value and a search based eval value. It is yes, technically.
https://github.com/nodchip/Stockfish/blob/master/src/learn/learner.cpp#L719
If we change the code above to
psv.score = evaluate_leaf(pos, pv1) - Eval::evaluate(pos);
The differences between the classic eval values and the search based eval values are recorded to training data. The trainer will train a net which outputs the differences. When we use the net, we need to change the eval to sum up both the values of the classic eval and the net output. (Though I have not tested...)

My concern is that we could need to implement three evals, "classic eval", "NNUE", and "classic eval + NNUE". It might make users confused.

nodchip on 31 Jul 2020

thanks for the info! Obviously as a first step we would need to see if it works at all, it could be it is just always weaker, or just much stronger. It was just an idea for experimenting. One thing for example that I think the current net can't capture is the castling rights, that's something that is used in eval, but not in NNUE, is that correct?

vondele on 31 Jul 2020

Castling rights are not used in halfkp_256x2-32-32 network architecture which is a standard architecture of NNUE. But I already created the input feature implementation of castling rights (and enpassant), and a network architecture which uses it.
https://github.com/nodchip/Stockfish/blob/master/src/eval/nnue/features/castling_right.cpp
https://github.com/nodchip/Stockfish/blob/master/src/eval/nnue/architectures/halfkp-cr-ep_256x2-32-32.h
We could try it and compare the elo between halfkp_256x2-32-32 and halfkp-cr-ep_256x2-32-32.

nodchip on 31 Jul 2020

Nice... after the merge we should try these things!

vondele on 31 Jul 2020

Tttak has also created a lot of possible features for NNUE:

One of them has mobility encoded into it, other has the game divided into multiple stages (every 40 ply new stage), another switches based on pieces left, etc.

Halfkpe4 (The E stands for Effect.)
It is a subdivision of HalfKP, taking into account the number of our and our opponent's effects to the square with each piece.
halfkpe4 is 4 times as big, halfkpe9 would be 9 times as big.

KK is King-King, a combination of both white and black King positions.
HalfKP-KK is the standard HalfKP plus KK as input features.

They can also all be combined so you can get some sort of chimera: HalfKP-CastlingRight-EnPassant-Pawn-KK...

Basically an infinite amount of possibilities.

https://github.com/tttak/Stockfish/tree/features_20200712/src/eval/nnue/features
https://github.com/tttak/Stockfish/tree/features_20200712/src/eval/nnue/architectures

gekkehenker on 31 Jul 2020

armv8 (aws g6c.large) is dead slow with NNUE while as fast as x86_64 with SF

Total time (ms) : 26378
Nodes searched : 4366686
Nodes/second : 165542

Built from nodchip master branch with:
make profile-nnue ARCH=armv8 COMP=clang COMPCXX=clang++ CXXFLAGS='-DIS_ARM'

dmytroleonenko on 31 Jul 2020

I guess that will require somebody to add neon intrinsic code to the branch (after the merge). Thanks for testing, this means the branch compiles on armv8. However, the bench is different from what I would expect. Did you have the net nn-c157e0a5755b.nnue loaded? Also, looks like we need to add the -DIS_ARM somewhere in the makefile.

Edit: would be nice if you could test our nnue-player-wip branch :-)

vondele on 31 Jul 2020

I can test your branch sure. Net used was sergio-v 20200728-2138. But it is orders of magnitude slower than x86_64 while on the later SF and NNUE are much closer NNUE being maybe two times slower. So I don't see a lot of sense benchmarking it now until the NNUE code is optimized for armv8. (BTW c6g.16xlarge gives me up to 100M n/s from SF master)

dmytroleonenko on 31 Jul 2020

benchmarking for speed is not needed, but reporting the bench for our current testing net (nn-c157e0a5755b.nnue see https://github.com/official-stockfish/networks) would confirm the code is working on armv8. It should be 3377227 with the network loaded and Use NNUE true. (see https://github.com/official-stockfish/Stockfish/issues/2823#issuecomment-663863116)

vondele on 31 Jul 2020

make profile-nnue ARCH=armv8 COMP=clang COMPCXX=clang++ CXXFLAGS='-DIS_ARM'

Of course it's slow, you disabled all optimizations!

sf-x on 31 Jul 2020

Ah, well spotted. Yes, try:
CXXFLAGS='-DIS_ARM' make profile-nnue ARCH=armv8 COMP=clang COMPCXX=clang++
to just add the additional flag

vondele on 31 Jul 2020

The code in the nnue-player-branch is getting closer to completion. I'd like to ask some help with two remaining tick boxes in particular:

Check coding style in the core eninge (@snicolet or others). Basically, have a look at git diff master src/evaluate.cpp src/evaluate.h src/misc.cpp src/misc.h src/position.cpp src/position.h src/types.h src/uci.cpp src/uci.h src/ucioption.cpp (where master is still at d89730d5c8dcf10eb9e1d91a81f903d9fc3c949a)
Have a look at the code comments in the nnue/ subdirectory and fix small mistakes as appropriate (some leftover characters, or strange auto-translations)

vondele on 1 Aug 2020

👍1

This may be off-topic, but the README is suggesting that the network can be downloaded from tests.stockfish.org, which doesn't seem to be correct.

How about letting the Makefile check for the presence of a network file and downloading one if it is missing (with wget, if available on the system).

syzygy1 on 1 Aug 2020

@syzygy1 that's work in progress, should work once the production server has the patches merged. See also https://github.com/glinscott/fishtest/pull/736

vondele on 1 Aug 2020

Naming like Human mode and AI mode would more interesting.

lingamj on 1 Aug 2020

after commit Use a global instead of a variable in pos it looks something in eval doesn't work as expected?

e.g. this position isn't solved anymore:
r1b1qr1k/2p3pp/4p3/1pb1PpN1/pn3N1P/8/PPP1QPP1/2KR3R w - - 0 1 bm Rd8

stuwph on 1 Aug 2020

do you have the exact sequence of UCI commands that worked before the commit, but failed after the commit? In principle the commit was non-functional, except if the sequence triggered some bug in the old implementation.

vondele on 1 Aug 2020

for now I can only reproduce it in the Fritz17 GUI
will futher investigate

stuwph on 1 Aug 2020

* Have a look at the code comments in the `nnue/` subdirectory and fix small mistakes as appropriate 
(some leftover characters, or strange auto-translations)

@vondele See PR #2883.

dorzechowski on 1 Aug 2020

👍2

@stuwph Make sure you compare behaviour on 1 core. On many cores search is not deterministic and different results in different runs are normal.

dorzechowski on 1 Aug 2020

branch still looking good for me:

Score of master vs nnue: 348 - 743 - 1409  [0.421] 2500
Elo difference: -55.4 +/- 8.9, LOS: 0.0 %, DrawRatio: 56.4 %

vondele on 1 Aug 2020

after some more research i can verify this happens in fritz 17 gui when compiling (msys2 10.2.0) the nnue master with additional nnue preset:

diff --git a/src/ucioption.cpp b/src/ucioption.cpp

o["Use NNUE"] << Option(false, on_use_nnue);
o["EvalFile"] << Option("nn-c157e0a5755b.nnue", on_eval_file);

->

o["Use NNUE"] << Option(true, on_use_nnue);
o["EvalFile"] << Option("eval/nn.bin", on_eval_file);

since commit Use a global instead of a variable in pos

tested using this batch (but also with some other more optimized ones) :

@ECHO OFF
@cd src\
@del *.exe *.s *.o *.gcda *.gcno .depend
@set ts=%date:~-10,2%%date:~-7,2%%date:~-2,2%
make -f MakeFile profile-build ARCH=x86-64-bmi2 COMP=mingw
strip stockfish.exe
move Stockfish.exe ..\SF_NN_%ts%.exe
@del *.s *.o *.gcda *.gcno _stock+profiled.exe .depend
@PAUSE

stuwph on 1 Aug 2020

still after latest [NNUE] Remove not used network architecture

maybe an issue of the gui or my setup, wonder if someone else is having this?

stuwph on 1 Aug 2020

@stuwph Make sure you compare behaviour on 1 core. On many cores search is not deterministic and different results in different runs are normal.

happens on 1 core too

stuwph on 1 Aug 2020

@stuwph I think I have an idea what could it be. Could you go to UCI options in Fritz, unmark and then mark again "Use NNUE" option? Perhaps you use standard Stockfish eval there as you changed the sources and never actually set NNUE in GUI.

dorzechowski on 1 Aug 2020

already tried it, also deleting the net path and reinsert it

stuwph on 1 Aug 2020

both in single and together

stuwph on 1 Aug 2020

@stuwph I think I have an idea what could it be. Could you go to UCI options in Fritz, unmark and then mark again "Use NNUE" option? Perhaps you use standard Stockfish eval there as you changed the sources and never actually set NNUE in GUI.

pr #2890 does the job :)

stuwph on 2 Aug 2020

@stuwph did you have a modification to the sources that did set the default value of 'Use NNUE' to 'true' ?

vondele on 2 Aug 2020

do you mean if i have changed something else in the source?

for testing i used the plain NNUE branch with only changing

o["Use NNUE"] << Option(true, on_use_nnue);
o["EvalFile"] << Option("eval/nn.nnue", on_eval_file);

stuwph on 2 Aug 2020

which was equivalent to introducing a bug :-)

But of course, the underlying reason needs fixing.

vondele on 2 Aug 2020

so #2890 doesn't really fix that underlying reason?

stuwph on 2 Aug 2020

yes, or at least one possible approach, i.e. make sure that the flag used in the code to signal the Use of NNUE is equivalent to the default value of the option.

vondele on 2 Aug 2020

I have a question on how GUIs behave.. would it make sense to output a

info string NNUE evaluation using nn-[foo].nnue

prior to any search, or are there GUIs where this leads to a popup window or some annoying thing?

vondele on 2 Aug 2020

We already have info string about tablebases found and large pages and no complaints that it has broken anything so I believe it's safe to go ahead. I think it's nice to have printed confirmation in GUI that NNUE is being used.

dorzechowski on 2 Aug 2020

However, those two strings are output only once, the one I propose would be output for ever search.

I agree it would be nice to have a hint in the log which eval is used. This will be useful.

Let me add it to PR #2893, we can always drop it if not OK.

vondele on 2 Aug 2020

👍1

Does anybody know if the pext instruction on AMD bdver4 (gcc arch) is also slow (as it is on znver1 znver2 ), and thus one should avoid building the x86-64-bmi2 target on this architecture ?

vondele on 2 Aug 2020

I have never read about pext being slow on pre-Zen AMD cpus.

syzygy1 on 2 Aug 2020

👍1

I have never read about pext being slow on pre-Zen AMD cpus.

~Because they don't support it.~ Hm, Excavator might, but it's slow on Ryzen obviously due to a patent issue. Why would they add it then remove?

sf-x on 2 Aug 2020

I thought it was just a case of reducing transistor count by implementing pext in (extremely slow) microcode. AMD figured nobody was using pext/pdep anyway.

syzygy1 on 2 Aug 2020

👍1

The branch https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip looks increasingly close to being final. This is a good point to test it, and report issues if they are found. Anybody wanting to have a look at the issues https://github.com/official-stockfish/Stockfish/issues?q=is%3Aissue+is%3Aopen+label%3ANNUE is more than welcome, I believe the Makefile one is probably the most pressing one.

Meanwhile, fishtest development work is progressing well.

vondele on 3 Aug 2020

Note that the fishtest server has now functionality to upload .nnue nets. Authors of nets can upload them (needs login to fishtest) already now. Feel free to upload nets.

Testing nets is not yet possible, this needs fishtest worker changes that are work in progress.

vondele on 3 Aug 2020

👍2

first test of the upgraded fishtest with the @gekkehenker net running here:
https://tests.stockfishchess.org/tests/view/5f2878a0a5abc164f05e4c1d
the testing infrastructure seems to work.

Many thanks to @tomtor @ppigazzini @noobpwnftw @daylen for getting fishtest to the new state.

I'll continue some work on the branch before doing the real performance tests.

vondele on 3 Aug 2020

👍10 🎉3

Another problem:
If we do PGO compiles as it is now with fishtest, then the NNUE code path is not profiled because it does not have a net to load during bench, which results in less optimizations for the path when NNUE eval is active in the final compiles.

This probably affects every PGO compiles so far unless they have a default net to load and it is used during profiling stage. The performance difference can be measured and see if this matter is worth to be addressed.

noobpwnftw on 4 Aug 2020

👍1

@noobpwnftw I have this PR https://github.com/official-stockfish/Stockfish/pull/2879 to address profiling of both code paths in one PGO compile. However, you are right there has to be a net available to load.

mstembera on 4 Aug 2020

yes, PGO builds / default bench is an open issue, but I might as well postpone that one. It does require to have a net available at all times, and maybe it is better to wait a little till we get a feel for that requirement. However, it is quite easy on most systems with the new Makefile target make net. It would require somebody to come up with something similar in Microsoft PowerShell to do this for our appveyor CI (any experts?)

I haven't seen measurements yet of the performance benefit of PGO for NNUE.

vondele on 4 Aug 2020

Meanwhile merged master and updated to a new default net by Sergio Vieri, with impressive testing results:

https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c
https://tests.stockfishchess.org/tests/view/5f290229a5abc164f05e4c58

vondele on 4 Aug 2020

👍6 👀1 🚀1

Stockfish 12 will surely become the Rybka 3 of Stockfish releases.

gekkehenker on 4 Aug 2020

@vondele: BanksiaGUi displays info string in a special panel, not a popup as the bellow image, thus users won't be interrupted. If they don't open the panel, they don't see that info either.

Screen Shot 2020-08-04 at 6 21 57 pm

However, BanksiaGUi keeps displaying all info string until the engine is reloaded. Thus it is not nice if some strings are displayed repeatedly (such as info about Syzygy loading) thus users may see them all the time.

Below is another confusion for users when SF NNUE load data twice when starting.

Screen Shot 2020-08-04 at 6 25 15 pm

nguyenpham on 4 Aug 2020

@nguyenpham that's presumably not with the latest branch (at least for the found & loaded, which is not on the branch anymore).

vondele on 4 Aug 2020

Is there currently a plan for how testing will work for patches that may have different impacts on different nets? Is the current "best" net going to be used for testing new patches to search?

cshahmirzadi on 4 Aug 2020

Is the current "best" net going to be used for testing new patches to search?

IMO, yes.

vondele on 4 Aug 2020

👍3

I have updated the code and done a quick check, there is no redundant info string when starting. Well done!

Screen Shot 2020-08-04 at 9 16 32 pm

nguyenpham on 4 Aug 2020

A few thoughts about EvalFile: setting up EvalFile seems to be one of the new steps which may confuse users since they have to set it correctly before they can play: aware to set up, know where to find eval files, try-error with short or full path forms...

We can help users by doing some auto works to setup EvalFile. To do that we just scan all data files (.nnue) in the SF folder as well as subfolders to get the list of all data files. Now there are two ways to show the list:

EvalFile, type as a string, the default value is . If users use the default value, we will use the latest file from the list. Note that when implementing this way, we may not need the real list. Just compare files and keep the latest one. Advantages: users always have a network to run, they can also set manually to use any networks they want. Disadvantages: if users want to set up manually, they have to find/select it from the disk as before
EvalFile, type as a combo: sort the list by descent time and put their file names only (for being short) as the combo choices. The first item of the combo as well as the default value are to auto pick the latest network. Advantages: users can simply select any networks from the combo without worrying about files/folders/paths. Disadvantages: they can’t select networks outside the main folder.

Your thoughts?

nguyenpham on 4 Aug 2020

How about this (which is similar to how Lc0 solves this problem, but adapted to the possibility of not needing a net):

Scan the Stockfish folder for .nnue files.
If there aren't any:
- Use NNUE defaults to false
- EvalFile defaults to <empty>
else:
- Use NNUE defaults to true
- EvalFile defaults to the most recent .nnue file present.

This would have to be done before Stockfish responds to the initial uci command, so that the defaults sent to the GUI (via option UCI lines) are correct.

(I don't speak C++ so I can't offer to actually write this, sorry...)

fohristiwhirl on 4 Aug 2020

Having said all that, the fact that 2 options are needed is potentially confusing, and will probably lead to some users using classical eval by accident. How about just removing the Use NNUE option and:

Scan the Stockfish folder for .nnue files.
If there aren't any:
- EvalFile defaults to <empty>
else:
- EvalFile defaults to the most recent .nnue file present.

(Again, doing all this before replying to the initial uci command.)

The GUI can still override the choice with an explicit setoption command, of course. When the time comes to actually evaluate positions, use classical eval only if EvalFile is <empty>.

fohristiwhirl on 4 Aug 2020

I feel it is better to clearly indicate which net is default/recommended with the given binary, and would encourage against running the binary with a non-default net, except for experts/enthusiasts. I'm hopeful that we will have process that basically makes sure that the best net available will be the default.

Adding too much logic to the code is not good IMO, and if so best done by the GUI. If the GUI makes sure the engines working directory contains the default net, there user just needs to click the 'Use NNUE' option. The way we have a unique name for the net (with the sha) makes finding this net relatively easy for the GUI.

It is clear that .nnue files will not remain of the same format. They will evolve in features, content, and what exactly is part of the evaluation. Unless the user is an expert, picking non-default nets will lead to surprising behavior.

vondele on 4 Aug 2020

👍5

I haven't seen measurements yet of the performance benefit of PGO for NNUE.

On my build (x86-64-avx2), the effect is actually quite significant, around 10% speedup. Anyway... not urgent IMO.

vondele on 4 Aug 2020

I have decided to postpone those items that require running with a net during build / CI to after a merge, it is better to get some feeling for the download/build process first. The step is likely not very difficult, but if done incorrectly would be annoying.

Similarly we don't really need to worry on older hardware on fishtest, performance is good even if included.

The mostly leaves the Makefile issue for fixing.

vondele on 4 Aug 2020

Makefile issues fixed. SPRT NNUE vs NNUE testing OK.

Small note for classical vs NNUE tests (e.g. regression tests, FYI @Alayan-stk-2), since the precise result of such a match depends quite strongly on the hardware used, we might expect some variability in future RT (SF11 vs SFdev). It is not worth trying to fix this with time odds. It does however mean that also our statistical tests to decide if a worker is bad will fail for RT, so it is important to switch 'auto-purge' to off for those tests. (see also https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c)

vondele on 5 Aug 2020

@vondele I see you changed the default net in the code for this test
https://tests.stockfishchess.org/tests/view/5f2a4d35a5abc164f05e4cf6
Are the options on the test page not working properly if you use
EvalFile=nn-c157e0a5755b.nnue
?

mstembera on 5 Aug 2020

@mstembera indeed, on fishtest using EvalFile as an option on test submission is not supported, at least not right now. It would need some code to download this net, and translate it to the proper cutechess option (add path). Not impossible, but not there. There is some WIP documentation on testing nets:
https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#nnue-net-tests

vondele on 5 Aug 2020

Some bench results for armv8 (Raspberry Pi 4, 64bit), armv7 (Raspberry Pi 3, 32bit) and x86, node count consistency check with apple silicon:

target | nodes | nodes NNUE | nps | nps NNUE
-- | -- | -- | -- | --
X86-64-bmi2 | 4746616 | 4254913 | 2139078 | 1444301
apple-silicon | 4746616 | 4254913 | |
armv8 | 4746616 | 4254913 | 648444 | 64236
armv8-neon | 4746616 | 4254913 | 648444 | 183916
armv7 | 4746616 | 4254913 | 174456 | 16878

edit: armv7 added.

domschl on 5 Aug 2020

👍1

@domschl see https://github.com/official-stockfish/Stockfish/pull/2899#issuecomment-669331705

vondele on 5 Aug 2020

The code has been merged, I'll close this issue. Thanks for the contributions!

vondele on 6 Aug 2020

🎉5

Stockfish: NNUE merge

Most helpful comment

All 157 comments

Related issues