The NNUE branch maintained by @nodchip has demonstrated strong results and offers great potential, and we will proceed to merge it into master. This will assure that Stockfish remains a reference engine based on the top computer chess technology for the foreseeable future. This merge will introduce machine learning based coding to the engine, thus enlarging the community of developers, bringing in new skills. We are eager to keep everybody on board, including all developers and users of diverse hardware, aiming to be an inclusive community. We will keep and evolve the handcrafted evaluation, not only for the beauty and insight it offers, but for additional reasons, including its value on diverse hardware platforms, and for the potential of hybridization with NNUE, and use in search. The initial goal for the merge is the NNUE playing capability only, this focus makes sense given the complexity of the project, and the aim of Stockfish as a production quality chess engine. We will try to make changes such that the @nodchip repo and ours can be kept in sync as easily as possible. Written by knowledgeable people, NNUE is already of good quality, we will make small modifications needed to make it pass our standard CI, and integrate nicely in the existing engine. Rigorous testing will continue to be a cornerstone of our development, and accordingly integration in fishtest needs to be properly implemented. Afterwards, we will guarantee continuous progression in playing strength testing core engine and networks with our usual SPRT process. Some initial steps of the merge will be outlined below, the precise steps needed will become clearer as we proceed, I look forward to working with the community to make this happen!
some tasks needed for integration:
debug=yes build init_nnue() (startup, EvalFile, Use NNUE, is_ready, cmdline, ucinewgame, ....), add assert.custom_make.txtLet's use this issue specifically for synchronizing work on these or related tasks. General discussion use https://github.com/official-stockfish/Stockfish/issues/2728
debug=yes build is broken it seems.
A quick verification / benchmark (single bench run) on the targets I can test easily (on zen2 architecture), gives a feeling for the order of magnitudes. Good thing is all benches match. If people can test x86-64-avx512 armv7 armv8 ppc-32 ppc-64 that would be appreciated.
| target | nodes | nodes NNUE | nps | nps NNUE |
| --: | --: | --: | --: | --: |
| general-32 | 4578298 | 3377227 | 1725705 | 134486 |
| x86-32-old | 4578298 | 3377227 | 1783520 | 133497 |
| x86-32 | 4578298 | 3377227 | 1758178 | 133635 |
| general-64 | 4578298 | 3377227 | 2423662 | 308000 |
| x86-64 | 4578298 | 3377227 | 2485503 | 436108 |
| x86-64-modern | 4578298 | 3377227 | 2611693 | 424648 |
| x86-64-sse3 | 4578298 | 3377227 | 2470749 | 1131778 |
| x86-64-sse3-popcnt | 4578298 | 3377227 | 2675802 | 1199299 |
| x86-64-ssse3 | 4578298 | 3377227 | 2577870 | 1166975 |
| x86-64-sse41 | 4578298 | 3377227 | 2445672 | 1243914 |
| x86-64-sse42 | 4578298 | 3377227 | 2549163 | 1113126 |
| x86-64-avx2 | 4578298 | 3377227 | 2678933 | 1635461 |
| x86-64-bmi2 | 4578298 | 3377227 | 1906829 | 1411884 |
Edit: since 'x86-64-modern' on master enables both sse3 and popcnt already, we can easily adjust that one (actually, incorrect, needs ssse3 not sse3, the target x86-64-sse3 enables both, increased modern to require ssse3).
Fix for debug build: PR #2827.
Some more tasks as seen from the CI (patches welcome):
* clang: nnue/evaluate_nnue.cpp:60:45: error: no member named 'aligned_alloc' in namespace 'std'
Compiler bug. std::aligned_alloc is standard C++17.
or header missing?
hmm. std::aligned_alloc should exist after #include <cstdlib> which exists in "types.h", which is included from "position.h", which is included from nnue/evaluate_nnue.cpp...
OK, needs clang 7.0 (https://godbolt.org/z/rE3joM), CI is still based on 6.0. .travis.yml needs updating
gcc needs to be >= 7.4
I have add a commit to upgrade the travis settings on Linux... probably somebody needs to come with a similar fix for the osx of our CI.
Edit: seems like this upgrade didn't work. I'll revert that patch and we'll address the issue later
@vondele
I'm afraid that assemblies for the x86_64 and x86_32 architecture (without sse3 or popcnt) are useless for testing in the framework, since they are much slower than the classic Stockfish assemblies for the same architecture and are much weaker. Testing them can significantly affect the quality of the tests.
almost certainly all machines on fishtest support x86-64-modern (which is right now equivalent to x86-64-sse3-popcnt) on the branch. The majority of the machines support avx2 or bmi2. We'll have to deal with that eventually, but right now it is not too much of a concern.
On compiling for macOS
On macOS 10.15.6 (Catalina) make build ARCH=x86-64-bmi2 COMP=clang has two issues:
The first one: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/malloc/_malloc.h:50:10: note: 'aligned_alloc' has been marked as being introduced in macOS 10.15 here, but the deployment target is macOS 10.9.0
So we need to change -mmacosx-version-min=10.9 in the Makefile to -mmacosx-version-min=10.15 but according to statcounter only ~52% of Mac users are on 10.15.
If we _do_ change the min version, second issue:
nnue/evaluate_nnue.cpp:60:40: error: no member named 'aligned_alloc' in namespace 'std'; did you mean simply 'aligned_alloc'?
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/malloc/_malloc.h:50:10: note: 'aligned_alloc' declared here
It seems like the C++ library that ships with Xcode dev tools has aligned_alloc in the global namespace. Could maybe get around this with a #if __APPLE__.
Fix for memset/memcpy warnings and valgrind issue: PR #2830.
I can't see unfreed memory issue. Could you check again after applying these fixes?
error on build
$ make build ARCH=x86-64 COMP=mingw
Config:
debug: 'no'
sanitize: 'no'
optimize: 'yes'
arch: 'x86_64'
bits: '64'
kernel: 'MINGW64_NT-10.0-18363'
os: 'Windows_NT'
prefetch: 'yes'
popcnt: 'no'
sse: 'yes'
sse3: 'no'
ssse3: 'no'
sse41: 'no'
sse42: 'no'
avx2: 'no'
pext: 'no'
avx512: 'no'
Flags:
CXX: g++
CXXFLAGS: -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2
LDFLAGS: -static
Testing config sanity. If this fails, try 'make help' ...
C:/Program Files/Git/mingw64/bin/make.exe ARCH=x86-64 COMP=mingw all
make[1]: Entering directory 'C:/Users/MichaelB7/home/Github/Stockfish/src'
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o benchmark.o benchmark.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o bitbase.o bitbase.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o bitboard.o bitboard.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o endgame.o endgame.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o evaluate.o evaluate.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o main.o main.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o material.o material.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o misc.o misc.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o movegen.o movegen.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o movepick.o movepick.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o pawns.o pawns.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o position.o position.cpp
position.cpp: In member function 'Position& Position::set(const string&, bool, bool, StateInfo*, Thread*)':
position.cpp:199:39: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct StateInfo'; use assignment or value-initialization instead [-Wclass-memaccess]
199 | std::memset(si, 0, sizeof(StateInfo));
| ^
In file included from position.cpp:31:
position.h:40:8: note: 'struct StateInfo' declared here
40 | struct StateInfo {
| ^~~~~~~~~
position.cpp: In member function 'void Position::do_move(Move, StateInfo&, bool)':
position.cpp:714:51: warning: 'void* memcpy(void*, const void*, size_t)' writing to an object of a non-trivial type 'struct StateInfo' leaves 1272 bytes unchanged [-Wclass-memaccess]
714 | std::memcpy(&newSt, st, offsetof(StateInfo, key));
| ^
In file included from position.cpp:31:
position.h:40:8: note: 'struct StateInfo' declared here
40 | struct StateInfo {
| ^~~~~~~~~
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o psqt.o psqt.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o search.o search.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o thread.o thread.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o timeman.o timeman.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o tt.o tt.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o uci.o uci.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o ucioption.o ucioption.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o tune.o tune.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o syzygy/tbprobe.o syzygy/tbprobe.cpp
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o nnue/evaluate_nnue.o nnue/evaluate_nnue.cpp
nnue/evaluate_nnue.cpp: In function 'void Eval::NNUE::Detail::Initialize(Eval::NNUE::AlignedPtr<T>&)':
nnue/evaluate_nnue.cpp:60:45: error: 'aligned_alloc' is not a member of 'std'; did you mean 'aligned_union'?
60 | pointer.reset(reinterpret_cast<T*>(std::aligned_alloc(alignof(T), sizeof(T))));
| ^~~~~~~~~~~~~
| aligned_union
make[1]: *** [<builtin>: nnue/evaluate_nnue.o] Error 1
make[1]: Leaving directory 'C:/Users/MichaelB7/home/Github/Stockfish/src'
make: *** [Makefile:595: build] Error 2
MichaelB7@VM-894787 MINGW64 ~/home/Github/Stockfish/src (nnue-player-wip)
@MichaelB7 Please provide more details, such as what version of g++ you have and your OS/platform. Also, you could try https://github.com/official-stockfish/Stockfish/pull/2831 and see if that fixes your issue—I think the PR makes the aligned_alloc issue more compatible in general for platforms other than macOS.
$ stockfish
Stockfish 250720 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
compiler
Compiled by g++ (GNUC) 10.1.0 on MinGW64
__VERSION__ macro expands to: 10.1.0
Windows 10 Pro
Version 10.0.18363 Build 18363
will try your patch
Update: The patch did not resolve the issue:
$ make build ARCH=x86-64 COMP=mingw
Config:
debug: 'no'
sanitize: 'no'
optimize: 'yes'
arch: 'x86_64'
bits: '64'
kernel: 'MINGW64_NT-10.0-18363'
os: 'Windows_NT'
prefetch: 'yes'
popcnt: 'no'
sse: 'yes'
sse3: 'no'
ssse3: 'no'
sse41: 'no'
sse42: 'no'
avx2: 'no'
pext: 'no'
avx512: 'no'
Flags:
CXX: g++
CXXFLAGS: -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2
LDFLAGS: -static
Testing config sanity. If this fails, try 'make help' ...
C:/Program Files/Git/mingw64/bin/make.exe ARCH=x86-64 COMP=mingw all
make[1]: Entering directory 'C:/Users/MichaelB7/home/Github/Stockfish/src'
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_SSE2 -c -o nnue/evaluate_nnue.o nnue/evaluate_nnue.cpp
nnue/evaluate_nnue.cpp: In function 'void Eval::NNUE::Detail::Initialize(Eval::NNUE::AlignedPtr<T>&)':
nnue/evaluate_nnue.cpp:60:40: error: there are no arguments to 'aligned_malloc' that depend on a template parameter, so a declaration of 'aligned_malloc' must be available [-fpermissive]
60 | pointer.reset(reinterpret_cast<T*>(aligned_malloc(alignof(T), sizeof(T))));
| ^~~~~~~~~~~~~~
nnue/evaluate_nnue.cpp:60:40: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
nnue/evaluate_nnue.cpp: In instantiation of 'void Eval::NNUE::Detail::Initialize(Eval::NNUE::AlignedPtr<T>&) [with T = Eval::NNUE::FeatureTransformer; Eval::NNUE::AlignedPtr<T> = std::unique_ptr<Eval::NNUE::FeatureTransformer, Eval::NNUE::AlignedDeleter<Eval::NNUE::FeatureTransformer> >]':
nnue/evaluate_nnue.cpp:79:43: required from here
nnue/evaluate_nnue.cpp:60:54: error: 'aligned_malloc' was not declared in this scope; did you mean '_aligned_malloc'?
60 | pointer.reset(reinterpret_cast<T*>(aligned_malloc(alignof(T), sizeof(T))));
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
| _aligned_malloc
nnue/evaluate_nnue.cpp: In instantiation of 'void Eval::NNUE::Detail::Initialize(Eval::NNUE::AlignedPtr<T>&) [with T = Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::InputSlice<512>, 32> >, 32> >, 1>; Eval::NNUE::AlignedPtr<T> = std::unique_ptr<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::InputSlice<512>, 32> >, 32> >, 1>, Eval::NNUE::AlignedDeleter<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::ClippedReLU<Eval::NNUE::Layers::AffineTransform<Eval::NNUE::Layers::InputSlice<512>, 32> >, 32> >, 1> > >]':
nnue/evaluate_nnue.cpp:80:31: required from here
nnue/evaluate_nnue.cpp:60:54: error: 'aligned_malloc' was not declared in this scope; did you mean '_aligned_malloc'?
60 | pointer.reset(reinterpret_cast<T*>(aligned_malloc(alignof(T), sizeof(T))));
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
| _aligned_malloc
make[1]: *** [<builtin>: nnue/evaluate_nnue.o] Error 1
make[1]: Leaving directory 'C:/Users/MichaelB7/home/Github/Stockfish/src'
make: *** [Makefile:595: build] Error 2
But if I make the suggested change from the error message above , it worked.
replacing this on line 60:
pointer.reset(reinterpret_cast<T*>(aligned_alloc(alignof(T), sizeof(T))));
with this:
pointer.reset(reinterpret_cast<T*>(_aligned_malloc(alignof(T), sizeof(T))));
@dorzechowski one more valgrind warning triggers when loading TB files:
setoption name SyzygyPath value ../tests/syzygy/
==122921== Conditional jump or move depends on uninitialised value(s)
==122921== at 0x13DF43: Position::set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool, StateInfo*, Thread*) (position.cpp:222)
==122921== by 0x13EFDE: Position::set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Color, StateInfo*) (position.cpp:407)
==122921== by 0x16D412: (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>::TBTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (tbprobe.cpp:366)
==122921== by 0x173AB9: void __gnu_cxx::new_allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> >::construct<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>((anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (new_allocator.h:147)
==122921== by 0x171CFC: void std::allocator_traits<std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::construct<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> >&, (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (alloc_traits.h:484)
==122921== by 0x171D97: void std::deque<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::_M_push_back_aux<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (deque.tcc:496)
==122921== by 0x170C2F: (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>& std::deque<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::emplace_back<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (deque.tcc:174)
==122921== by 0x16DAFB: (anonymous namespace)::TBTables::add(std::vector<PieceType, std::allocator<PieceType> > const&) (tbprobe.cpp:489)
==122921== by 0x16F2A6: Tablebases::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (tbprobe.cpp:1367)
==122921== by 0x167A8C: UCI::on_tb_path(UCI::Option const&) (ucioption.cpp:44)
==122921== by 0x1699F0: UCI::Option::operator=(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (ucioption.cpp:202)
==122921== by 0x163837: (anonymous namespace)::setoption(std::__cxx11::basic_istringstream<char, std::char_traits<char>, std::allocator<char> >&) (uci.cpp:100)
==122921==
==122921== Conditional jump or move depends on uninitialised value(s)
==122921== at 0x13DF43: Position::set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool, StateInfo*, Thread*) (position.cpp:222)
==122921== by 0x13EFDE: Position::set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Color, StateInfo*) (position.cpp:407)
==122921== by 0x16D62F: (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>::TBTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (tbprobe.cpp:385)
==122921== by 0x173AB9: void __gnu_cxx::new_allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> >::construct<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>((anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (new_allocator.h:147)
==122921== by 0x171CFC: void std::allocator_traits<std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::construct<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> >&, (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (alloc_traits.h:484)
==122921== by 0x171D97: void std::deque<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::_M_push_back_aux<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (deque.tcc:496)
==122921== by 0x170C2F: (anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>& std::deque<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0>, std::allocator<(anonymous namespace)::TBTable<((anonymous namespace)::TBType)0> > >::emplace_back<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (deque.tcc:174)
==122921== by 0x16DAFB: (anonymous namespace)::TBTables::add(std::vector<PieceType, std::allocator<PieceType> > const&) (tbprobe.cpp:489)
==122921== by 0x16F2A6: Tablebases::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (tbprobe.cpp:1367)
==122921== by 0x167A8C: UCI::on_tb_path(UCI::Option const&) (ucioption.cpp:44)
==122921== by 0x1699F0: UCI::Option::operator=(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (ucioption.cpp:202)
==122921== by 0x163837: (anonymous namespace)::setoption(std::__cxx11::basic_istringstream<char, std::char_traits<char>, std::allocator<char> >&) (uci.cpp:100)
==122921==
info string Found 35 tablebases
Edit: this should be fixed by:
diff --git a/src/position.cpp b/src/position.cpp
index 52ba710ae..8f2c4b029 100644
--- a/src/position.cpp
+++ b/src/position.cpp
@@ -404,7 +404,7 @@ Position& Position::set(const string& code, Color c, StateInfo* si) {
string fenStr = "8/" + sides[0] + char(8 - sides[0].length() + '0') + "/8/8/8/8/"
+ sides[1] + char(8 - sides[1].length() + '0') + "/8 w - - 0 10";
- return set(fenStr, false, use_nnue(), si, nullptr);
+ return set(fenStr, false, false, si, nullptr);
}
This set() method is basically just there to create a materials key, and starts from a freshly created Position() object.
@MichaelB7 Thank you. I can now at least compile. However, I believe the parameters for aligned_alloc and _aligned_malloc should be reversed. Also we should probably match it with _aligned_free(ptr) instead of std::free(ptr).
first green CI badge on Linux/gcc
https://travis-ci.org/github/official-stockfish/Stockfish/jobs/711914367
@vondele Another thing to ascertain before we can use NNUE in SF is a license to use it. As nodchip's implementation doesn't have any license attached, I'm not sure we can even use it legally now. We assume so but it's not enough I think. This should be solved together with @nodchip and those files should have either GPLv3 or compatible license attached.
@dorzechowski would be great @nodchip confirms this, but as the code (binaries and sources) are being distributed already, the license is GPL, this is also what the repository Copying file in the repository mentions.
Meanwhile also green on Linux/clang : https://travis-ci.org/github/official-stockfish/Stockfish/jobs/711961757
@dorzechowski would be great @nodchip confirms this, but as the code (binaries and sources) are being distributed already, the license is GPL, this is also what the repository Copying file in the repository mentions.
That's great. All this licensing can be very confusing.
I can confirm the x86-64-avx512 gives the same 3377227 bench as above.
@vondele I believe task "remove non-player related code" is done already. I see nothing that could be removed without breaking it. Of course possible optimizations is another thing.
@dorzechowski OK, I'll tick that box, assuming you had a careful look.
I have also verified bench on linux-rhel7-ppc64le using gcc version 9.1.0 (GCC)
There are warnings however:
nnue/evaluate_nnue.cpp: In function ‘Value Eval::NNUE::ComputeScore(const Position&, bool)’:
nnue/evaluate_nnue.cpp:124:61: warning: requested alignment 64 is larger than 16 [-Wattributes]
124 | transformed_features[FeatureTransformer::kBufferSize];
| ^
nnue/evaluate_nnue.cpp:126:61: warning: requested alignment 64 is larger than 16 [-Wattributes]
126 | alignas(kCacheLineSize) char buffer[Network::kBufferSize];
| ^
seemingly that's not a bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68267) and gcc provides a __BIGGEST_ALIGNMENT__ macro which is arch specific (https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html). It is however different from 64 for most arch I tried (except knl).
Maybe there is an easy solution?
@vondele Another thing to ascertain before we can use NNUE in SF is a license to use it. As nodchip's implementation doesn't have any license attached, I'm not sure we can even use it legally now. We assume so but it's not enough I think. This should be solved together with @nodchip and those files should have either GPLv3 or compatible license attached.
AFAIK YaneuraOu is the original author and wrote about it with his full support:
http://yaneuraou.yaneu.com/2020/06/21/stockfish-nnue-subjective-review/
https://twitter.com/nodchip/status/1286947373374087168/retweets/with_comments
YaneuraOu are distributed under GPLv3.
https://github.com/yaneurao/YaneuraOu
I also intended that the ported code and code that I wrote are distributed under GPLv3. If there are additional works to express that those code are distributed under GPLv3, please let me know.
Do we know why the bench of the @nodchip branch with the default net is 3909820 but our nnue-player-wip is 3377227?
The reason I ask is because I played a local STC match between SF and SFNNUE and after 100 games SF won them all. Obviously something seems broken. @dorzechowski Did you do any play testing w/ your branch and what were the results?
@nodchip thanks for confirming the license. There is nothing particular you need to do, I might add a copyright header to the added files, that makes things clear.
@mstembera I think I did play with the branch, and had good results. Obviously that's something to clarify. I'm not sure both branches have the same version of master merged.
@mstembera quick check of the current status of master vs branch looks good for me:
Score of master vs nnue: 17 - 49 - 84 [0.393] 150
Elo difference: -75.3 +/- 36.7, LOS: 0.0 %, DrawRatio: 56.0 %
@vondele Thanks. I tried testing using the chessbase gui and all is fine there. Seems I only have the issue when using cutechess-cli. If I figure out what I'm doing wrong I'll report.
I have used:
./cutechess-cli -repeat -rounds 10000 -games 2 -tournament gauntlet -resign movecount=3 score=400 -draw movenumber=34 movecount=8 score=20 -concurrency 15 -openings file=noob_3moves.epd format=epd order=random plies=16 -engine name=master cmd=stockfish.master -engine name=nnue cmd=stockfish.nnue option.EvalFile=/home/vondele/chess/match/nn-c157e0a5755b.bin option."Use NNUE"=true -ratinginterval 1 -each tc=10.0+0.1 proto=uci option.Threads=1 -pgnout nnue.pgn
If I have to guess, most likely missing the full path to EvalFile ?
Meanwhile results with a few more games:
Score of master vs nnue: 498 - 1019 - 1879 [0.423] 3396
Elo difference: -53.7 +/- 7.8, LOS: 0.0 %, DrawRatio: 55.3 %
Those are encouraging results.
I wasn't using the EvalFile path because I have the net in the same directory as the .exe Adding it anyway seems to make no difference. When I test nnue vs nnue things seem normal except the draw ration is only about 10%. Most wins are by adjudication as normal.
@mstembera I think cutechess has the current working directory independent of where the engine binary is, and as a result the binary doesn't find the EvalFile, unless given explicitly with the full path. Adding -debug should show that in the output. After printing the error message the code continues, with a zero-ed net.
@vondele Yes that was it. Thanks! I would have never guessed cutechess was overriding the current directory. I wonder if we should just fail when "Use NNUE"=true but the network fails to load?
you are touching on a difficult point, and that is how to deal with the net in general.
For example, do we expect people who build the code to download the default net to the src directory ? This would make it relatively easy to enable profile-build. However, to figure out the name of the default net, one should first build the binary, execute uci and look at EvalFile. So, require a build, afterward a profile-build.
Similar question for the bench command. One could test both NNUE and handcrafted eval in a single bench, if we assume the developer has the default net in the src directory. Similar question of course if 'Use NNUE' would default to true.
Probably that's something we could expect for developers, but I'm sure we will all be bitten by this requirement at some point. Other opinions?
Yes this isn't straight forward. I think it's important than when inevitably something goes wrong it doesn't happen silently. This makes it much harder for a person to track down. The current net is about 21MB and compresses to about 8MB. Is it out of the question to simply add it to the source directory along with the source files?
@vondele Here is a commit by @nodchip to solve the issue of allowing the net path to be relative to the binary.
https://github.com/nodchip/Stockfish/commit/fbdb373b6482db2b462b30d7399c3c7fed1d4f26
will profile build without net in directory give same optimizer compiler flags as with net in directory? does it matter if use nnue is false default?
agree with @mstembera - good to include default net in source directory and have net in same directory as exe by default
@mstembera the problem with adding to the source directory is that the git repo will become very large to clone 20Mb with each new net. We might also run into github bandwidth limits if there are too many downloads. However, I don't think we should add the code you referenced for 'finding' the binary. This is OS/GUI specific, and right now, the problem is not very different from specifying the syzygy directory. Also in that case a relative path doesn't work.
@stockchess there is some impact of having NNUE false by default for the profile-build ... if the code is not executed the profiling will not gather statistics for it, and thus optimize less. The difference is likely not spectacular, but will be visible.
Meanwhile two tests with using nnue=false show some overhead, wrt to master.
https://tests.stockfishchess.org/tests/view/5f1fddb7c09435d870cb9f15
https://tests.stockfishchess.org/tests/view/5f1ff3bdc09435d870cb9f25
it is just below the self-put target of 5 Elo, but would be nice to improve upon this.
@vondele Can't the old net be purged from the repo each time we add a new one so that it never has more than one and doesn't grow?
technically, yes, but that's kind of opposite of what a version control system like git wants, and certainly not good style
https://docs.github.com/en/github/managing-large-files/removing-files-from-a-repositorys-history
@mstembera Yes, I played a lot of test matches in cutechess-cli, LittleBlitzer and Arena and the results were as good as reported by other people. I don't have any Chessbase GUI to test. I also carefully checked that bench didn't change after I removed/refactored some code. I also updated it to latest master. It's still possible that I've missed something but I'm pretty confident I haven't.
yes, I checked that the nodchip branch and our branch are just diverged a little due to patches on master being present on our branch. The difference in bench is thus likely not important.
@vondele Maybe we could create a very small net that may be not any good for playing but would verify the nnue bench signature? For example, if in first layer we would use only 8x2 parameters instead of current 256x2, it would shrink the size to below 700kB if I calculated correctly. For sure it could be even smaller. Then we could just add it to the sources and always use it for bench.
Networks will be distinguished by sha and we would always have an info which one is the latest so also running bench on it would be redundant.
for profile-build it would be nice to have the same network structure as the actual run. For bench/pgo One could even use a zero or better pseudo-random network.... I wonder if somebody can derive an algorithm to prefill a network such that it gives material count only?
Edit: shouldn't it be quite doable to get material + psqt in the first layer of the net?
I'll merge the proposed change of extension of network data https://github.com/official-stockfish/Stockfish/issues/2849 (any extension will still work, but it is a meaningful change).
OK, but I thought we just want to avoid having big net in the sources. Another idea: add a small tool that will produce a reference network on user side filled perhaps with random data (but obviously always the same). Then compile and run it before compiling Stockfish and running bench. Then downloading an official net would be done after that passes.
@dorzechowski yes, this tool could be inside the binary. See comment above, I think it should even be possible to have better than random, e.g. material (+psqt?). It would be actually quite useful to have. @nodchip do you have an idea if it would be possible to initialize a net such that it gives material difference as an output ?
That would also require re-adding some code used for net creation that is not there now.
well some code to initialize the net data-structure (instead of reading the file) not creating/optimizing nets/files etc. I suspect one can write the network coefficients just analytically for material difference only, but I don't know if this is true, or how difficult that would be.
I may take a look at it later. Of course we don't need whole net generation machinery but I think at least a function to write header and parameters needs to be re-included (now we have only read functions there). I also think that material only net is not very good for bench as it's not sensitive enough imo. Such eval relatively rarely changes and many changes in the code could produce the same bench. Pseudo-random net would probably be better and much easier to produce taking also into account possible different net architectures down the way..
@dorzechowski actually material only eval is already surprisingly strong (>2400 Elo, https://github.com/vondele/Stockfish/tree/simpleEval, https://lichess.org/@/simpleEval), and would almost certainly be good enough for a bench. In fact, the main advantage over random is that it would exercise search in a more or less realistic way (good for PGO). Of course, one could super-impose a small random component (as the simpleEval is doing actually), or if feasible PSQT. Anyway, I think it would be very valuable to have a synthetic internal net.
Yes, I'm aware of simpleEval bot. Material + random component would certainly be good but I think it's an overkill for the following reason: if (or more likely when) we change net architecture or size, the whole concept of how to make the net a material counter would need to be revisited for every such change. This is not a problem for random nets as we wouldn't care about inputs interpretation and we would generate a 'bench' net always in the same way. Or maybe you think that this synthetic net should be fixed once for all? I might be confused there.
I agree that having this 'material net' adapt to architecture and size will be a lot (and possibly too much) developer overhead. In that case random might be good enough. I'm just slightly concerned about random being good enough to do PGO. Actually, one could presumably just add the random net output (hoping it is small) to e.g. pos.psq_score() (in Eval::evaluate()) and have something good enough for bench and PGO ? What do you think?
I didn’t find any mention nor code about training data for NNUE! Did I miss? Will we have that part? If yes, when? What the plan?
The plan is to have the player, the training code will remain https://github.com/nodchip/Stockfish for the time being.
I think that it is hard to generate network parameters which outputs material differences. Because the activation function is Clipped ReLU. It will clip big values to 127. I don't come up with any ideas to output material differences with halfkp_256x2-32-32.
uh... It might be possible... We could count the number of the friend's pieces for each piece type with the network parameters between the input layer and the first hidden layer, propagate the numbers to the last hidden layer, and multiply the material value with the network parameter between the last hidden layer and the output layer. Though implementation would be hard...
Alternatively, (if material values are 15 bits) sum 5 most significant bits in one "neuron", 5 next in another and 5 LSB in a third, propagate it to the final layer and sum them there.
IMO, having training and playing code in different repos seems be a trouble in long term. It’s harder to maintain and optimise. Sometimes there will be misunderstandings, delays between teams, some works may become double jobs
Actually, one could presumably just add the random net output (hoping it is small) to e.g. pos.psq_score() (in Eval::evaluate()) and have something good enough for bench and PGO ? What do you think?
But psq is not used in bench with Use NNUE=true. We will have different execution paths hence different pgo builds depending of default Use NNUE value. Or you want to always build it with Use NNUE=false only?
Actually, one could presumably just add the random net output (hoping it is small) to e.g. pos.psq_score() (in Eval::evaluate()) and have something good enough for bench and PGO ? What do you think?
But psq is not used in bench with
Use NNUE=true. We will have different execution paths hence different pgo builds depending of defaultUse NNUEvalue. Or you want to always build it withUse NNUE=falseonly?
the bench could switch (automatically) between Use NNUE=trueand Use NNUE=false if we have the internal ('random') net. PGO would thus always result in the same binary. As far as I know, pos.score_psq() is always available also if NNUE (it is part of put_piece and updated via do_move).
But profile-build exercise should resemble normal use as much as possible. Adding stuff to NNUEval only during profiling would be Wrong™.
Alternatively, (if material values are 15 bits) sum 5 most significant bits in one "neuron", 5 next in another and 5 LSB in a third, propagate it to the final layer and sum them there.
material values are small (currently at most 2682).
Alternatively, (if material values are 15 bits) sum 5 most significant bits in one "neuron", 5 next in another and 5 LSB in a third, propagate it to the final layer and sum them there.
material values are small (currently at most 2682).
OK, 4 bits per "neuron". It would help to avoid overflow, too.
But profile-build exercise should resemble normal use as much as possible. Adding stuff to NNUEval only during profiling would be Wrong™.
that's actually why adding something would be reasonable, as a surrogate for having a more accurate net. It would add 1 branch difference to the full code.
Overall, I simply don't understand the point of a complicated surrogate. Why does one need fast NNUE in PGO build without a good set of eval parameters? And if one is at hand, why not simply use it for PGO build?
@sf-x because we want to simplify the build process, and have it reproducible among users and covering as much as possible the code base (for e.g. bench output). Ideally one can build from sources without any additional download.
Why would one want to build and not use?
cause the code will have both handcrafted and NNUE eval at the same time. Maybe one just wants to use the handcrafted part...
Change profile command to stockfish <input.txt >/dev/null
where input.txt is something like
setoption name Use NNUE value true
bench
If the user doesn't care for NNUE, let him build it with input.txt containing just bench, or all-zero eval params...
In Eval::evaluate() we could have a third option that returns
return (NNUE::evaluate(pos) + Evaluation<NO_TRACE>(pos).value()) / 2;
This would solve a couple of issues.
1) We could produce a single bench that depends on both the hand crafted eval and NNUE.
2) We could PGO optimize both code paths in a single build.
This could be used only internally for bench or we could expose all 3 eval options to UCI as something like
Classic, NNUE, Blend.
Any opinions on how we should refer to the existing eval 'handcrafted' or 'classic' or '...' ?
'legacy'?
I would go for a term with positive connotation ... the condensed knowledge in the eval is impressive and succinctly expressed.
@nodchip more technical question, have people tried 'delta learning' in this context, i.e. learn the difference between 'classical eval' and a search based eval ? I know some other fields where this is successful, but don't know for shogi or chess.
Any opinions on how we should refer to the existing eval 'handcrafted' or 'classic' or '...' ?
I called it 'standard' because this type of eval is de facto standard used for decades. I don't particularly like 'classic' or 'legacy' as they hint that it's something old, that maybe was cool one day but not anymore. If NNUE type of eval prevails and becomes new standard, we can refer to old one as 'classic' but not before that imo.
If 'delta learning' means 'Temporal difference learning', it is half true. The trainer uses a TDLeaf(λ)-ish method called 'elmo method' to adjust network parameters. In detail, the trainer adjusts network parameters to approximate a search based eval and the game result by a quiescence search base eval.
The detail is written in this slide.
https://www.slideshare.net/HisayoriNoda/in2017-nodchip-2017-1223-nodchip
In formal, the problem is modeled as:
argmin_{ω} L(ω)
where ω and L are the network parameters and an object function, respectively. For simplicity, we assume the evaluation function is a liner function, and ω is the coefficient vector. L(ω) can be expressed as:
L(ω) = Σ_{i} l(ω, φ_{i}, ω', φ'_{i}, w_{i})
where l, i, φ_{i}, ω', φ'_{i} and w_{i} are a loss function, the index of a position in the training data, the feature vector of the leaf node of the quiescence search in the i-th position, the coefficient vector used in the training data generation, the feature vector of the leaf node of the search in the i-th position, the result of the game containing the i-th position.
The trainer uses the liner sum of two cross entropies as the loss function. The first one is cross entropy between the winning probability distribution calculated from a quiescence search base eval and the winning probability distribution calculated from a search base eval. The second one is cross entropy between the winning probability distribution calculated from a quiescence search base eval and the winning probability distribution calculated from a game result (which must be 0.0 or 1.0).
We define p and q using the approximate relationship between the winning probability and the pawn advantage (https://www.chessprogramming.org/Pawn_Advantage,_Win_Percentage,_and_Elo):
p = σ(α w'^{t} φ')
q = σ(α w^{t} φ)
Where α is the coefficient in the wiki page above. We also define t:
t = 1.0 (if the player who moved in the i-th position won the game.)
0.0 (if the player who moved in the i-th position lost the game.)
l can be expressed as:
l = λ H(p, q) + (1.0 - λ) H(t, q)
= λ E_{p}[-log q] + (1.0 - λ) E_{t}[-log q]
= -λ Σ_{x} p(x) log q(x) - (1.0 - λ) Σ_{x} t(x) log q(x)
x takes the two values, win or lose:
= - λ (p log q + (1 - p) log (1 - q)) - (1.0 - λ) (t log q - (1 - t) log (1 - q))
The gradient of l is:
l' = ∂ l / ∂ ω_{i} = - λ (q - p) - (1.0 - λ) (q - t)
The trainer uses this gradient in back propagation to adjust network parameters.
I would just refer to the old eval as non-NN eval or non-neural evaluation. Then just refer to NNUE’s evaluation as either neural eval, NNUE eval, or NN eval.
neuralEvaluation and nonNeuralEvaluation are both clear and descriptive names. “Use neural evaluation” is a straightforward UCI option.
I think legacy implies something is on the way out more so than classic does. Standard or original are good names too.
@nodchip I wonder about a network trained on top of our normal eval (that is the nn eval added to sf normal eval both during training and play) It would presumably have less information to store and could focus more on nuances rather than basics like material values etc. I'm sure we'll try eventually.
@vondele I'm sorry but I misunderstood your question. The short answer should be yes. When a net author creates the initial net, he or she creates training data by self-play using the classical eval. Each training datum contains a position, the search based eval value, and so on. The trainer uses those information to adjust network parameters.
Could you please correct me if I still don't understand your question?
@mstembera It sounds interesting. We could try it.
@nodchip thanks for the answer on learning (and sorry for the late reply, I was away for two days). What I meant was also what @mstembera suggested, i.e. train a network that is additive to the handcrafted/classic eval, for the same reasons. This would become something quite easy to test after the merge.
I don't see how additive eval could work. With additive effect it becomes "try to guess how bad is stockfish for exactly this position" instead of "try to maximize chance of winning this game". Is there any reasoning behind additive effects why would it be better? Additive effect would also make network file totally useless if an eval hole is fixed in next version of stockfish because the thing that it learned (how bad stockfish is for specific position) would change but weights stay same
@vondele I realized that I still misunderstood your question. I understand that your question is if we can create a net which outputs the difference between a classic eval value and a search based eval value. It is yes, technically.
https://github.com/nodchip/Stockfish/blob/master/src/learn/learner.cpp#L719
If we change the code above to
psv.score = evaluate_leaf(pos, pv1) - Eval::evaluate(pos);
The differences between the classic eval values and the search based eval values are recorded to training data. The trainer will train a net which outputs the differences. When we use the net, we need to change the eval to sum up both the values of the classic eval and the net output. (Though I have not tested...)
My concern is that we could need to implement three evals, "classic eval", "NNUE", and "classic eval + NNUE". It might make users confused.
thanks for the info! Obviously as a first step we would need to see if it works at all, it could be it is just always weaker, or just much stronger. It was just an idea for experimenting. One thing for example that I think the current net can't capture is the castling rights, that's something that is used in eval, but not in NNUE, is that correct?
Castling rights are not used in halfkp_256x2-32-32 network architecture which is a standard architecture of NNUE. But I already created the input feature implementation of castling rights (and enpassant), and a network architecture which uses it.
https://github.com/nodchip/Stockfish/blob/master/src/eval/nnue/features/castling_right.cpp
https://github.com/nodchip/Stockfish/blob/master/src/eval/nnue/architectures/halfkp-cr-ep_256x2-32-32.h
We could try it and compare the elo between halfkp_256x2-32-32 and halfkp-cr-ep_256x2-32-32.
Nice... after the merge we should try these things!
Tttak has also created a lot of possible features for NNUE:
One of them has mobility encoded into it, other has the game divided into multiple stages (every 40 ply new stage), another switches based on pieces left, etc.
Halfkpe4 (The E stands for Effect.)
It is a subdivision of HalfKP, taking into account the number of our and our opponent's effects to the square with each piece.
halfkpe4 is 4 times as big, halfkpe9 would be 9 times as big.
KK is King-King, a combination of both white and black King positions.
HalfKP-KK is the standard HalfKP plus KK as input features.
They can also all be combined so you can get some sort of chimera: HalfKP-CastlingRight-EnPassant-Pawn-KK...
Basically an infinite amount of possibilities.
https://github.com/tttak/Stockfish/tree/features_20200712/src/eval/nnue/features
https://github.com/tttak/Stockfish/tree/features_20200712/src/eval/nnue/architectures
armv8 (aws g6c.large) is dead slow with NNUE while as fast as x86_64 with SF
Total time (ms) : 26378
Nodes searched : 4366686
Nodes/second : 165542
Built from nodchip master branch with:
make profile-nnue ARCH=armv8 COMP=clang COMPCXX=clang++ CXXFLAGS='-DIS_ARM'
I guess that will require somebody to add neon intrinsic code to the branch (after the merge). Thanks for testing, this means the branch compiles on armv8. However, the bench is different from what I would expect. Did you have the net nn-c157e0a5755b.nnue loaded? Also, looks like we need to add the -DIS_ARM somewhere in the makefile.
Edit: would be nice if you could test our nnue-player-wip branch :-)
I can test your branch sure. Net used was sergio-v 20200728-2138. But it is orders of magnitude slower than x86_64 while on the later SF and NNUE are much closer NNUE being maybe two times slower. So I don't see a lot of sense benchmarking it now until the NNUE code is optimized for armv8. (BTW c6g.16xlarge gives me up to 100M n/s from SF master)
benchmarking for speed is not needed, but reporting the bench for our current testing net (nn-c157e0a5755b.nnue see https://github.com/official-stockfish/networks) would confirm the code is working on armv8. It should be 3377227 with the network loaded and Use NNUE true. (see https://github.com/official-stockfish/Stockfish/issues/2823#issuecomment-663863116)
make profile-nnue ARCH=armv8 COMP=clang COMPCXX=clang++ CXXFLAGS='-DIS_ARM'
Of course it's slow, you disabled all optimizations!
Ah, well spotted. Yes, try:
CXXFLAGS='-DIS_ARM' make profile-nnue ARCH=armv8 COMP=clang COMPCXX=clang++
to just add the additional flag
The code in the nnue-player-branch is getting closer to completion. I'd like to ask some help with two remaining tick boxes in particular:
git diff master src/evaluate.cpp src/evaluate.h src/misc.cpp src/misc.h src/position.cpp src/position.h src/types.h src/uci.cpp src/uci.h src/ucioption.cpp (where master is still at d89730d5c8dcf10eb9e1d91a81f903d9fc3c949a)nnue/ subdirectory and fix small mistakes as appropriate (some leftover characters, or strange auto-translations)This may be off-topic, but the README is suggesting that the network can be downloaded from tests.stockfish.org, which doesn't seem to be correct.
How about letting the Makefile check for the presence of a network file and downloading one if it is missing (with wget, if available on the system).
@syzygy1 that's work in progress, should work once the production server has the patches merged. See also https://github.com/glinscott/fishtest/pull/736
Naming like Human mode and AI mode would more interesting.
after commit Use a global instead of a variable in pos it looks something in eval doesn't work as expected?
e.g. this position isn't solved anymore:
r1b1qr1k/2p3pp/4p3/1pb1PpN1/pn3N1P/8/PPP1QPP1/2KR3R w - - 0 1 bm Rd8
do you have the exact sequence of UCI commands that worked before the commit, but failed after the commit? In principle the commit was non-functional, except if the sequence triggered some bug in the old implementation.
for now I can only reproduce it in the Fritz17 GUI
will futher investigate
* Have a look at the code comments in the `nnue/` subdirectory and fix small mistakes as appropriate (some leftover characters, or strange auto-translations)@vondele See PR #2883.
@stuwph Make sure you compare behaviour on 1 core. On many cores search is not deterministic and different results in different runs are normal.
branch still looking good for me:
Score of master vs nnue: 348 - 743 - 1409 [0.421] 2500
Elo difference: -55.4 +/- 8.9, LOS: 0.0 %, DrawRatio: 56.4 %
after some more research i can verify this happens in fritz 17 gui when compiling (msys2 10.2.0) the nnue master with additional nnue preset:
diff --git a/src/ucioption.cpp b/src/ucioption.cpp
->
since commit Use a global instead of a variable in pos
tested using this batch (but also with some other more optimized ones) :
@ECHO OFF
@cd src\
@del *.exe *.s *.o *.gcda *.gcno .depend
@set ts=%date:~-10,2%%date:~-7,2%%date:~-2,2%
make -f MakeFile profile-build ARCH=x86-64-bmi2 COMP=mingw
strip stockfish.exe
move Stockfish.exe ..\SF_NN_%ts%.exe
@del *.s *.o *.gcda *.gcno _stock+profiled.exe .depend
@PAUSE
still after latest [NNUE] Remove not used network architecture
maybe an issue of the gui or my setup, wonder if someone else is having this?
@stuwph Make sure you compare behaviour on 1 core. On many cores search is not deterministic and different results in different runs are normal.
happens on 1 core too
@stuwph I think I have an idea what could it be. Could you go to UCI options in Fritz, unmark and then mark again "Use NNUE" option? Perhaps you use standard Stockfish eval there as you changed the sources and never actually set NNUE in GUI.
already tried it, also deleting the net path and reinsert it
both in single and together
@stuwph I think I have an idea what could it be. Could you go to UCI options in Fritz, unmark and then mark again "Use NNUE" option? Perhaps you use standard Stockfish eval there as you changed the sources and never actually set NNUE in GUI.
pr #2890 does the job :)
@stuwph did you have a modification to the sources that did set the default value of 'Use NNUE' to 'true' ?
do you mean if i have changed something else in the source?
for testing i used the plain NNUE branch with only changing
o["Use NNUE"] << Option(true, on_use_nnue);
o["EvalFile"] << Option("eval/nn.nnue", on_eval_file);
which was equivalent to introducing a bug :-)
But of course, the underlying reason needs fixing.
so #2890 doesn't really fix that underlying reason?
yes, or at least one possible approach, i.e. make sure that the flag used in the code to signal the Use of NNUE is equivalent to the default value of the option.
I have a question on how GUIs behave.. would it make sense to output a
info string NNUE evaluation using nn-[foo].nnue
prior to any search, or are there GUIs where this leads to a popup window or some annoying thing?
We already have info string about tablebases found and large pages and no complaints that it has broken anything so I believe it's safe to go ahead. I think it's nice to have printed confirmation in GUI that NNUE is being used.
However, those two strings are output only once, the one I propose would be output for ever search.
I agree it would be nice to have a hint in the log which eval is used. This will be useful.
Let me add it to PR #2893, we can always drop it if not OK.
Does anybody know if the pext instruction on AMD bdver4 (gcc arch) is also slow (as it is on znver1 znver2 ), and thus one should avoid building the x86-64-bmi2 target on this architecture ?
I have never read about pext being slow on pre-Zen AMD cpus.
I have never read about pext being slow on pre-Zen AMD cpus.
~Because they don't support it.~ Hm, Excavator might, but it's slow on Ryzen obviously due to a patent issue. Why would they add it then remove?
I thought it was just a case of reducing transistor count by implementing pext in (extremely slow) microcode. AMD figured nobody was using pext/pdep anyway.
The branch https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip looks increasingly close to being final. This is a good point to test it, and report issues if they are found. Anybody wanting to have a look at the issues https://github.com/official-stockfish/Stockfish/issues?q=is%3Aissue+is%3Aopen+label%3ANNUE is more than welcome, I believe the Makefile one is probably the most pressing one.
Meanwhile, fishtest development work is progressing well.
Note that the fishtest server has now functionality to upload .nnue nets. Authors of nets can upload them (needs login to fishtest) already now. Feel free to upload nets.
Testing nets is not yet possible, this needs fishtest worker changes that are work in progress.
first test of the upgraded fishtest with the @gekkehenker net running here:
https://tests.stockfishchess.org/tests/view/5f2878a0a5abc164f05e4c1d
the testing infrastructure seems to work.
Many thanks to @tomtor @ppigazzini @noobpwnftw @daylen for getting fishtest to the new state.
I'll continue some work on the branch before doing the real performance tests.
Another problem:
If we do PGO compiles as it is now with fishtest, then the NNUE code path is not profiled because it does not have a net to load during bench, which results in less optimizations for the path when NNUE eval is active in the final compiles.
This probably affects every PGO compiles so far unless they have a default net to load and it is used during profiling stage. The performance difference can be measured and see if this matter is worth to be addressed.
@noobpwnftw I have this PR https://github.com/official-stockfish/Stockfish/pull/2879 to address profiling of both code paths in one PGO compile. However, you are right there has to be a net available to load.
yes, PGO builds / default bench is an open issue, but I might as well postpone that one. It does require to have a net available at all times, and maybe it is better to wait a little till we get a feel for that requirement. However, it is quite easy on most systems with the new Makefile target make net. It would require somebody to come up with something similar in Microsoft PowerShell to do this for our appveyor CI (any experts?)
I haven't seen measurements yet of the performance benefit of PGO for NNUE.
Meanwhile merged master and updated to a new default net by Sergio Vieri, with impressive testing results:
https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c
https://tests.stockfishchess.org/tests/view/5f290229a5abc164f05e4c58
Stockfish 12 will surely become the Rybka 3 of Stockfish releases.
@vondele: BanksiaGUi displays info string in a special panel, not a popup as the bellow image, thus users won't be interrupted. If they don't open the panel, they don't see that info either.

However, BanksiaGUi keeps displaying all info string until the engine is reloaded. Thus it is not nice if some strings are displayed repeatedly (such as info about Syzygy loading) thus users may see them all the time.
Below is another confusion for users when SF NNUE load data twice when starting.

@nguyenpham that's presumably not with the latest branch (at least for the found & loaded, which is not on the branch anymore).
Is there currently a plan for how testing will work for patches that may have different impacts on different nets? Is the current "best" net going to be used for testing new patches to search?
Is the current "best" net going to be used for testing new patches to search?
IMO, yes.
I have updated the code and done a quick check, there is no redundant info string when starting. Well done!

A few thoughts about EvalFile: setting up EvalFile seems to be one of the new steps which may confuse users since they have to set it correctly before they can play: aware to set up, know where to find eval files, try-error with short or full path forms...
We can help users by doing some auto works to setup EvalFile. To do that we just scan all data files (.nnue) in the SF folder as well as subfolders to get the list of all data files. Now there are two ways to show the list:
EvalFile, type as a string, the default value is
EvalFile, type as a combo: sort the list by descent time and put their file names only (for being short) as the combo choices. The first item of the combo as well as the default value are
Your thoughts?
How about this (which is similar to how Lc0 solves this problem, but adapted to the possibility of not needing a net):
Use NNUE defaults to falseEvalFile defaults to <empty>Use NNUE defaults to trueEvalFile defaults to the most recent .nnue file present.This would have to be done before Stockfish responds to the initial uci command, so that the defaults sent to the GUI (via option UCI lines) are correct.
(I don't speak C++ so I can't offer to actually write this, sorry...)
Having said all that, the fact that 2 options are needed is potentially confusing, and will probably lead to some users using classical eval by accident. How about just removing the Use NNUE option and:
EvalFile defaults to <empty>EvalFile defaults to the most recent .nnue file present.(Again, doing all this before replying to the initial uci command.)
The GUI can still override the choice with an explicit setoption command, of course. When the time comes to actually evaluate positions, use classical eval only if EvalFile is <empty>.
I feel it is better to clearly indicate which net is default/recommended with the given binary, and would encourage against running the binary with a non-default net, except for experts/enthusiasts. I'm hopeful that we will have process that basically makes sure that the best net available will be the default.
Adding too much logic to the code is not good IMO, and if so best done by the GUI. If the GUI makes sure the engines working directory contains the default net, there user just needs to click the 'Use NNUE' option. The way we have a unique name for the net (with the sha) makes finding this net relatively easy for the GUI.
It is clear that .nnue files will not remain of the same format. They will evolve in features, content, and what exactly is part of the evaluation. Unless the user is an expert, picking non-default nets will lead to surprising behavior.
I haven't seen measurements yet of the performance benefit of PGO for NNUE.
On my build (x86-64-avx2), the effect is actually quite significant, around 10% speedup. Anyway... not urgent IMO.
I have decided to postpone those items that require running with a net during build / CI to after a merge, it is better to get some feeling for the download/build process first. The step is likely not very difficult, but if done incorrectly would be annoying.
Similarly we don't really need to worry on older hardware on fishtest, performance is good even if included.
The mostly leaves the Makefile issue for fixing.
Makefile issues fixed. SPRT NNUE vs NNUE testing OK.
Small note for classical vs NNUE tests (e.g. regression tests, FYI @Alayan-stk-2), since the precise result of such a match depends quite strongly on the hardware used, we might expect some variability in future RT (SF11 vs SFdev). It is not worth trying to fix this with time odds. It does however mean that also our statistical tests to decide if a worker is bad will fail for RT, so it is important to switch 'auto-purge' to off for those tests. (see also https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c)
@vondele I see you changed the default net in the code for this test
https://tests.stockfishchess.org/tests/view/5f2a4d35a5abc164f05e4cf6
Are the options on the test page not working properly if you use
EvalFile=nn-c157e0a5755b.nnue
?
@mstembera indeed, on fishtest using EvalFile as an option on test submission is not supported, at least not right now. It would need some code to download this net, and translate it to the proper cutechess option (add path). Not impossible, but not there. There is some WIP documentation on testing nets:
https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#nnue-net-tests
Some bench results for armv8 (Raspberry Pi 4, 64bit), armv7 (Raspberry Pi 3, 32bit) and x86, node count consistency check with apple silicon:
target | nodes | nodes NNUE | nps | nps NNUE
-- | -- | -- | -- | --
X86-64-bmi2 | 4746616 | 4254913 | 2139078 | 1444301
apple-silicon | 4746616 | 4254913 | |
armv8 | 4746616 | 4254913 | 648444 | 64236
armv8-neon | 4746616 | 4254913 | 648444 | 183916
armv7 | 4746616 | 4254913 | 174456 | 16878
edit: armv7 added.
@domschl see https://github.com/official-stockfish/Stockfish/pull/2899#issuecomment-669331705
The code has been merged, I'll close this issue. Thanks for the contributions!
Most helpful comment
first test of the upgraded fishtest with the @gekkehenker net running here:
https://tests.stockfishchess.org/tests/view/5f2878a0a5abc164f05e4c1d
the testing infrastructure seems to work.
Many thanks to @tomtor @ppigazzini @noobpwnftw @daylen for getting fishtest to the new state.
I'll continue some work on the branch before doing the real performance tests.