Stockfish 🚀 - [NNUE] Android building issue

Hi again! As I mentioned above, I built the Android engine by changing the line to "std_aligned_alloc".
The engine works, but only without the network; if I enable the network option, the engine crashes instantly. Any ideas how to fix this?
Screenshot_20200729-231443

Btw, why must the network be named "nn-c157e0a5755b.nnue"? Wouldn't it be much easier to call it "nn.nnue"?

AlexB123 on 30 Jul 2020

You can create a PR here https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip adding an ARM section to std_aligned_alloc() and std_aligned_free() in misc.cpp. The net is named with the first 12 characters of the SHA256 hash. This is so that when the default nets change we can uniquely tell them apart. You can use any name you like for your custom net and specify it as a UCI option.
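
To illustrate the naming convention described above, here is a minimal C++ sketch that checks whether a file name has the shape nn-<12 hex characters>.nnue. The helper name looksLikeDefaultNetName is hypothetical and not part of Stockfish, and the sketch assumes the hash prefix is written in lowercase hex. It only checks the shape of the name; it does not compute or verify the SHA256 hash.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not Stockfish code): returns true if `name` has the
// shape "nn-" + 12 lowercase hex digits + ".nnue", e.g. "nn-c157e0a5755b.nnue".
// It does not hash the file, so it cannot confirm that the prefix matches
// the file content.
bool looksLikeDefaultNetName(const std::string& name) {
    const std::string prefix = "nn-";
    const std::string suffix = ".nnue";
    const std::size_t hashLen = 12;

    if (name.size() != prefix.size() + hashLen + suffix.size())
        return false;
    if (name.compare(0, prefix.size(), prefix) != 0)
        return false;
    if (name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0)
        return false;

    // Every character between the prefix and the suffix must be a hex digit.
    for (std::size_t i = prefix.size(); i < prefix.size() + hashLen; ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        const bool isHex = std::isdigit(c) || (c >= 'a' && c <= 'f');
        if (!isHex)
            return false;
    }
    return true;
}
```

A custom net such as "nn.nnue" fails this check, which is fine: as noted above, a custom name can be passed as a UCI option instead.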

mstembera on 30 Jul 2020

I've tried this change but it didn't work..
Sf error1
Sf error2

AlexB123 on 30 Jul 2020

Hello, thank you for the feedback! I don't know how to create a PR.
I'm not a programmer, so could you give an example of how to add "std_aligned_alloc() and std_aligned_free()" in misc.cpp (I mean, in what order)? I tried it like this and it didn't work; apparently I did something wrong.
SF1
SF2
Thank you!

AlexB123 on 30 Jul 2020

Sorry, forgot to mention. I built SF NNUE from nodchip's source code https://github.com/nodchip/Stockfish The engine works fine and reads nn.bin, although it is much slower than normal SF, and without nn.bin the engine plays horribly; look at the analysis.
813388b5cee6484969028ffeadf81efd

AlexB123 on 30 Jul 2020

Yes - his fork was designed for NNUE only - will not play well without bin.

MichaelB7 on 30 Jul 2020

Thanks, I didn't know that. I thought that without nn.bin it was supposed to play like normal SF; anyway, now I know. :-)

AlexB123 on 30 Jul 2020

Yes - he made it two different executables; the Stockfish team is making it a UCI option, one exe that can play both.

MichaelB7 on 31 Jul 2020

@AlexB123 I created the PR for you here https://github.com/official-stockfish/Stockfish/pull/2872 Thanks.

Edit: Actually I may have done it wrong. Is this an ARM thing or an Android thing? We can use either
defined(IS_ARM)
or
defined(__ANDROID__)
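
The two guards are not equivalent, which a small C++ sketch can make visible. It assumes that __ANDROID__ is predefined by the compiler when targeting Android, while IS_ARM only exists if the build passes it explicitly (for example as a -D flag). The function name activeGuards is hypothetical.

```cpp
#include <string>

// Reports which of the two guards is active in this build.
// Assumption: __ANDROID__ comes from the toolchain when targeting Android;
// IS_ARM is only defined if the build defines it itself.
std::string activeGuards() {
    std::string result;
#if defined(__ANDROID__)
    result += "__ANDROID__ ";
#endif
#if defined(IS_ARM)
    result += "IS_ARM ";
#endif
    if (result.empty())
        result = "none";
    return result;
}
```

On a desktop Linux build without extra flags this should report "none"; an ARM board that is not running Android would only match the IS_ARM guard, and only if that flag is set.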

mstembera on 31 Jul 2020

Hello guys, just wanted to let you know that the Android version is still crashing. :(
These are the flags I use to build the engine:
set "compiler_options=-m64 -march=armv8-a -DIS_64BIT -fPIE -Wl,-pie -lm -DUSE_POPCNT -DNO_PREFETCH -DUSE_NEON -O3 -flto -static-libstdc++ -std=c++17 -fno-strict-aliasing -fno-strict-overflow -ffunction-sections -fdata-sections -Wl,--gc-sections -Wl,-s"

Btw, with nodchip's old source code https://github.com/nodchip/Stockfish , the engine builds and works with the following flags ->
set "compiler_options=-m64 -march=armv8-a -DIS_64BIT -fPIE -Wl,-pie -lm -DUSE_POPCNT -DEVAL_NNUE -DENABLE_TEST_CMD -fopenmp -O3 -flto -static-libstdc++ -std=c++17 -fno-strict-aliasing -fno-strict-overflow -ffunction-sections -fdata-sections -Wl,--gc-sections -Wl,-s"
Maybe this can help you somehow to solve the issue.

Thank you!

AlexB123 on 6 Aug 2020

So, it compiles but crashes at runtime?

Edit: at which point does the crash happen, i.e. do you have any output, and which UCI commands do you send?

vondele on 6 Aug 2020

Hi vondele! The engine compiles with a small correction in misc.cpp line 329, changing it to "return std_aligned_alloc(alignment, size);", or by using these changes https://github.com/official-stockfish/Stockfish/pull/2872/commits/af6473aa5bc5f7adbc912658aa8c3671ce9ad967
The engine works fine in DroidFish as normal Stockfish. But when I tick the "Use NNUE" option in the engine's settings, it crashes instantly with the message "engine terminated".
Screenshot_20200807-170521
Screenshot_20200807-170408

I have a feeling that some flag from the Makefile, the one responsible for enabling NNUE in the engine, is missing. Since I don't know which flag that is, I don't know what to write in the batch file, so the compiler generates a normal engine that is not able to use NNUE. Or the NDK's Clang is unable to cooperate with NNUE.
I don't know how else to explain these crashes.

AlexB123 on 7 Aug 2020

@AlexB123 that change you made (i.e. calling std_aligned_alloc) is not OK. It will compile but crash. Instead of your change, can you try the change proposed in https://github.com/official-stockfish/Stockfish/pull/2927
i.e. https://github.com/official-stockfish/Stockfish/pull/2927/files

vondele on 7 Aug 2020

Commands execution in SManager for Android.
bench
Screenshot_20200807-180309

uci
Screenshot_20200807-180404

setoption name Use NNUE value true
Screenshot_20200807-180647

AlexB123 on 7 Aug 2020

Ok, i'll try it later, i have to go now. :)

AlexB123 on 7 Aug 2020

Hello! With the new changes, the compiler gives a new error.
SF 1
SF 2

AlexB123 on 8 Aug 2020

can you try to #include <stdlib.h> in the file?

vondele on 8 Aug 2020

Not sure if I did it correctly -> misc.cpp, +line 52 "#include <stdlib.h>", didn't work, same error.
Also -> misc.cpp, +line 56 "#include <stdlib.h>", didn't work either, same error.

AlexB123 on 8 Aug 2020

So can you instead try to use this:

void* std_aligned_alloc(size_t alignment, size_t size) {
    // alignment must be >= sizeof(void*)
    if(alignment < sizeof(void*))
    {
        alignment = sizeof(void*);
    }
    void *pointer;
    if(posix_memalign(&pointer, alignment, size) == 0)
        return pointer;
    return nullptr;
}

leave #include <stdlib.h> in the file near line 56.
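
For anyone who wants to try the idea outside of misc.cpp, here is a self-contained sketch of the same approach together with a matching free and a small alignment check. The names alignedAllocSketch, alignedFreeSketch and isAligned are made up for this example and are not the names used in Stockfish.

```cpp
#include <stdlib.h>   // posix_memalign, free
#include <cstddef>
#include <cstdint>

// Sketch of an aligned allocator built on posix_memalign.
void* alignedAllocSketch(std::size_t alignment, std::size_t size) {
    // posix_memalign requires the alignment to be a power of two and a
    // multiple of sizeof(void*); raise too-small values to that minimum.
    if (alignment < sizeof(void*))
        alignment = sizeof(void*);

    void* pointer = nullptr;
    if (posix_memalign(&pointer, alignment, size) == 0)
        return pointer;
    return nullptr;  // allocation failed or the alignment was invalid
}

// Memory obtained from posix_memalign is released with plain free().
void alignedFreeSketch(void* pointer) {
    free(pointer);
}

// Helper for checking a result: true if `pointer` is a multiple of `alignment`.
bool isAligned(const void* pointer, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(pointer) % alignment == 0;
}
```

Calling alignedAllocSketch(64, 1000) and then isAligned(p, 64) is a quick way to confirm the allocation behaves as expected before looking for other causes of a crash.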

vondele on 8 Aug 2020

With these changes the engine compiles without errors or warnings, but again, it crashes when I tick the "Use NNUE" box. It only works as a normal engine.
Flags

AlexB123 on 8 Aug 2020

That code looks right, so we probably have a different reason for the crash (unless the code returns a nullptr). I assume you have the right 'ARCH=...' option for the make command?

To move on we need to be able to understand where it crashes. Usually that would mean compiling (after make clean) with the debug=yes optimize=no flags to make, and afterwards running it under gdb like

gdb ./stockfish
run
setoption name Use NNUE value true
bench
[crash]
bt

vondele on 8 Aug 2020

I use the flag -march=armv8-a in my batch file, for armv8 64-bit engines. Since the engine works, but only as normal SF, the flag / ARCH is correct. I'll try to build the engine without -flto and -DUSE_POPCNT, and let you know later if something changes.
Regards.
Alex.

AlexB123 on 8 Aug 2020

Well team, I give up. I used the latest source code; the first compile issue still remains.
SF NNUE

With all the changes to misc.cpp mentioned above, the engine compiles but is not 100% functional.
It can execute commands like "uci" and "bench", but it fails to execute "setoption name Use NNUE value true"; simply put, it only works without the "Use NNUE" option. I've tried several flags ("-DNDEBUG", "-DUSE_NEON", "-O3"), and also building without them; nothing works.
There must be a flag (or flags) in the Makefile or misc.cpp responsible for enabling the NNUE functions in the engine, but I don't know which flag that is. Maybe Peter Österlund can help?
http://talkchess.com/forum3/viewtopic.php?p=853010#p853010

AlexB123 on 9 Aug 2020

Thank you, vondele. Your patch in this thread (as it appears in AlexB123's screenshot) allowed the compile to finish. I think the binary is actually working too.

I can build for both aarch64 and armv7, but I can only test armv7 binaries right now.

@AlexB123 DroidFish doesn't seem to like the Use NNUE checkbox option. It crashes and/or wouldn't start. My binary appears to be working alright in Chess for Android, and also in a terminal emulator app. Maybe you might want to try your build in those apps instead, although it sounds like yours was crashing in the terminal emulator too?

You may have already noticed this: the current official branch wants the .nnue file to be in the same folder as the engine, not in a sub-folder any more. Chess for Android requires the .nnue file to be installed the same way as an engine, so I assume they're being put in the same dir.

I'll upload my aarch64 build, in case you want to test it. Let me know how it works. The only change is vondele's patch applied to misc.cpp, and I used my usual build flags (somewhat different from what you've posted above).

sf-armv8.zip

It is based on this commit iirc : https://github.com/official-stockfish/Stockfish/commit/ad2ad4c65706c18a5383506d361f1f23fc6a26ab

notruck on 10 Aug 2020

In the terminal emulator, I first ran a bench and the speed was on par with what I usually get for regular non-NNUE Stockfish.

Then, without bringing the .nnue file into the terminal emulator yet, I did a setoption name Use NNUE value true followed by another bench ... This time, I get a warning text Use of NNUE evaluation, but the file ____.nnue was not loaded successfully. and so on. The benchmark didn't run.

After I have the correct .nnue file, I set Use NNUE again and this time the benchmark did run, and at a significantly lower speed than before. So I assumed my armv7 build was actually using NNUE.

notruck on 10 Aug 2020

@notruck so for you, on android, the current master (i.e. calling aligned_alloc) does not build?

However, if you use the code based on posix_memalign, it does work?

What are your usual build flags, i.e. is there anything we can do to make building on Android easier?

vondele on 10 Aug 2020

for armv7, I used

CXXFLAGS += --target=armv7a-linux-androideabi16 -fno-addrsig -stdlib=libc++ -O3 -Ofast -mfpu=neon-vfpv4 -mthumb -march=armv7-a -mtune=cortex-a53 -mfloat-abi=softfp -Wall -Wcast-qual -fno-exceptions -std=c++17 $(EXTRACXXFLAGS)
DEPENDFLAGS += -std=c++17
LDFLAGS += -static-libstdc++ -latomic $(EXTRALDFLAGS) # -fuse-ld=lld

notruck on 10 Aug 2020

The current master is still failing to build, both for clang 9.0.8 included with Google's NDK r21, and clang 10.0.0 provided by Termux. It leads to the same exact problem AlexB described in his first post.

What worked for me was your patch above, exactly as it appears in AlexB123's screenshot, along with the #include <stdlib.h>

notruck on 10 Aug 2020

@notruck I'll try to make a PR that includes that code snippet, and would appreciate it if you could test it once it is there.

vondele on 10 Aug 2020

We need to keep armv7 flags for android and RPI separate...

these are RPI flags that will work with most RPI

"-mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits -mtune=cortex-a53" otherwise Pi uses the normal GCC 10 flags - 32 bitis still standard for the RPI- but users can now add a 64 bit kernel and use 64 bit exe's on a 32 bit RPI OS - best to leave 32 bit default.

MichaelB7 on 10 Aug 2020

@vondele Thanks! So the cross-compilation with NDK went smoothly for both armv7 and armv8-a. I haven't yet tested the binaries themselves however.

Currently I don't have the hardware to test armv8-a. Let me attach them here, in case someone wants to help with testing that:

(edit: Oops, I forgot to run make clean for the armv8a build, let me correct that)
Fixed: aligned_alloc_changes.zip

Also, the armv8 build explicitly uses -fPIE and -pie flags now.

notruck on 10 Aug 2020

@notruck thanks for compiling, please let me know if they test OK.

This PR #2973 would also need testing for OSX and old Linux. @TonHaver @ddugovic does this PR still work on your systems?

vondele on 10 Aug 2020

Works on my old macbook using MacOS 10.14
Don't have anything newer

TonHaver on 10 Aug 2020

Nodes searched : 4094850 (without NNUE)

Nodes searched : 3314442 (with NNUE)

Is this discrepancy something to be expected? I used the default nn-112bb1c8cdb5.nnue net

notruck on 10 Aug 2020

yes that's correct, completely different eval function. (Bench matches x86)
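
If someone wants to automate this comparison, a rough C++ sketch could parse the two bench summary lines and check that the node counts differ. The function names are hypothetical, and treating "the counts differ" as a hint that a different evaluation was used is an assumption for this sketch, not an official test.

```cpp
#include <cstdint>
#include <string>

// Extracts the node count from a bench summary line such as
// "Nodes searched : 4094850". Returns -1 if the line is not in that form.
std::int64_t parseNodesSearched(const std::string& line) {
    const std::string key = "Nodes searched";
    const std::size_t keyPos = line.find(key);
    if (keyPos == std::string::npos)
        return -1;
    const std::size_t colon = line.find(':', keyPos + key.size());
    if (colon == std::string::npos)
        return -1;
    const std::size_t digits = line.find_first_of("0123456789", colon + 1);
    if (digits == std::string::npos)
        return -1;
    return std::stoll(line.substr(digits));
}

// Assumption: a bench with and without NNUE should report different node
// counts, because the two evaluation functions steer the search differently.
bool benchSignaturesDiffer(const std::string& classicalLine,
                           const std::string& nnueLine) {
    const std::int64_t a = parseNodesSearched(classicalLine);
    const std::int64_t b = parseNodesSearched(nnueLine);
    return a >= 0 && b >= 0 && a != b;
}
```

With the two lines posted above, benchSignaturesDiffer returns true.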

vondele on 10 Aug 2020

@vondele and @notruck Hello guys!
So let me get it right. :) I have the latest source code "Cleanup and optimize SSE/AVX code". To build a fully functional engine, I have to make the changes in the picture below plus the changes mentioned here https://github.com/official-stockfish/Stockfish/pull/2973/commits/74bb29abd1967a9e47dd8913e58d9fbd6efc308d , right?
89720175

P.s. @notruck are you using this Termux (with Clang 10.0.0), and which flags are you using for armv8?
https://play.google.com/store/apps/details?id=com.termux

AlexB123 on 10 Aug 2020

@AlexB123 probably best to test the code I have as a pull request https://github.com/official-stockfish/Stockfish/pull/2973 I'll merge this in master on a next round.

vondele on 10 Aug 2020

Sorry, bad news. The same issue (on different lines) appears when using the "original" source code.
error
With the changes pointed to here https://github.com/official-stockfish/Stockfish/pull/2973/commits/74bb29abd1967a9e47dd8913e58d9fbd6efc308d , the engine compiles, but it is still unable to use NNUE. Without NNUE it works fine.

@notruck, I've tried your armv8 build aligned_alloc_changes.zip; it also crashes, same as mine. :( I have installed the Google NDK toolchains from r21, r21b and r21d; none of them can compile the engine from the source code.

AlexB123 on 10 Aug 2020

That is a bit strange. In Chess for Android the engine does not crash when using NNUE in analysis mode, and it shows the message "classical evaluation enabled"; does that mean it's using NNUE?
Screenshot_20200810-222241

AlexB123 on 10 Aug 2020

No, classical evaluation is not NNUE.

vondele on 10 Aug 2020

Ok, some good news. The engine built with the corrections mentioned in my post above works (so does notruck's aligned_alloc_changes.zip). The trick is that I first need to enter the path of the network in the engine's settings, and only afterwards tick the "Use NNUE" option. This way the engine works, but for some reason it does not actually use the network. E.g. it can't solve these positions at all

rn1qrnk1/p4pp1/1p1pp3/6P1/2Pp1PN1/2PQ4/P5P1/2KR3R w - - 0 1
4q1kr/p6p/1prQPppB/4n3/4P3/2P5/PP2B2P/R5K1 w - - 0 1

, while Peter's engine from talkchess (mentioned above) finds the solutions in seconds.
So, something is still missing in the code. NNUE functions are not applied on armv engines.

AlexB123 on 10 Aug 2020

@AlexB123 Peter's engine is from late July, and based on an earlier version, before this commit happened about 2 weeks ago.

Is his engine working OK with the current .nnue nets? Could you please confirm where you are putting the .nnue files? They no longer go inside the eval folder. The engine now expects them to be in the same directory. If the issue isn't with the path and/or filenames, we can try following Peter's methods on the current code.

Looking at his post on TalkChess, it seems Peter also uses the NDK (r20b). He made only minimal changes to the official Makefile. He disables the lpthread, as is supposed to be done for Android. He uses the -static LD flag. He doesn't include most other flags we use.

He made sure to include the header with a D IS_ARM compile flag. The current master already includes that header if you set neon = yes , or by setting a slightly different D USE_NEON flag. They both should do the same thing (as seen here ).

Next, I'll try to build the latest version by following his post. I'll remove all other/extra build flags. I'll use the latest Stockfish-master, and compile it with NDK r21d.

P.S. I normally use the NDK on Linux. I had the r21 (earliest r21 without any letters) before. When it failed, I downloaded r21d. They both have Clang 9.0.8, so I tried Termux next (same app you found/linked above). Termux provides a Clang 10.0.0 but that didn't work either on (untouched) Stockfish-master. I didn't try the patches with Termux, I returned to NDK for those. Termux is a good terminal emulator, but I didn't get far enough with getting it to compile successfully.

notruck on 11 Aug 2020

Following Peter's changes to the Makefile wouldn't build the current-master. This is probably to be expected.

His Makefile builds this without any problems:
https://github.com/vondele/Stockfish/tree/74bb29abd1967a9e47dd8913e58d9fbd6efc308d

The -static binary is larger than my previous one, and may or may not work better at loading the .nnue. It expects the https://tests.stockfishchess.org/api/nn/nn-112bb1c8cdb5.nnue as the default net.

followed-petero-static-build.zip

notruck on 11 Aug 2020

OK, so we'll go in steps. First thing, I'll make the commit of PR https://github.com/official-stockfish/Stockfish/pull/2973 so the source builds on Android without modifications of the src. Later, we should revisit the Makefile, and it is great if we have people able to test it. So I'll leave this issue open after the commit.

vondele on 11 Aug 2020

@AlexB123 Which network file do you use with Peter's engine? Using the current default nn-112bb1c8cdb5.nnue my devices are not finding the solutions. Not just my Android phone, but my x86-64 laptop also seems to be missing the solutions. I tried abrok binaries too, without any luck.

@vondele Sorry to bother you with this, could you please test those positions above and comment a little on them? Are they supposed to be reliable indicators whether the NNUE is loaded correctly?

I'd think running bench twice (before and after setoption name Use NNUE value true ) should sufficiently indicate if the NNUE network is in use, but AlexB's apparent success on armv8 with Peter's engine is making me wonder what I might be missing.

notruck on 11 Aug 2020

@notruck I use this net; I don't remember the release date, but it is one of Sergio's networks from earlier releases (below). Peter's engine solves those two positions in seconds. To use the network with Peter's engine, you need to create a folder named "eval" and put the network inside. The usual path to Android's storage is /storage/emulated/0, so my path for the network is
/storage/emulated/0/eval/nn.bin ; enter the same path in the engine's settings, done.
eval.zip

AlexB123 on 11 Aug 2020

@vondele and the rest of the team :) , Congrats!! Using the current master "Tweak castling extension" without any changes, the engine compiles without errors and it can use NNUE!! :D

Screenshot_20200811-185315

However, it is not able to solve the two positions mentioned, using the same network from my post above. I guess it's because of the patch mentioned by @Joachim26.

Try this armv8 build (NDK r21); don't forget to enter the correct path of the network in the engine's settings.
SF-NNUE-r21.zip

AlexB123 on 11 Aug 2020

Hi, is this something new, or is it already included in the current master?
https://github.com/lucabrivio/Stockfish/commit/3e9562123f62d406500c69bc5193f238ece350b9

AlexB123 on 11 Aug 2020

@notruck can I ask you something: for armv7 engines, did you use NDK r17? As far as I know, this is the latest NDK that supports the armv7 architecture, and it does not cooperate with -DUSE_NEON.

Also, which flags are you using for "static" builds? I made a static armv8 build; it is bigger in size and it's not working, so apparently I messed up the flags.

And a last question: do you use a batch file for compiling the engines? I mean, do you first generate the standalone toolchain from the NDK and then compile the engines using a batch file?

P.s. sorry for off-topic.

AlexB123 on 12 Aug 2020

@AlexB123 I use NDK r21 on Linux for both armv7 and armv8 builds. Their latest revision r21d should also work. I still haven't figured out how to cross-compile PGO using the NDK.

I target Android API Level 16 (JellyBean 4.1.x) for armv7, and Android API Level 21 (Lollipop 5.0) for armv8, for maximum compatibility. API 16 because it's the earliest target still supported by r21, and API 21 because it's when 64-bit Android was first introduced.

My Makefile is a total mess, but its net result is to pass these CXXFLAGS for armv8 at build-time:

aarch64-linux-android21-clang++ --target=aarch64-linux-androideabi21 -stdlib=libc++ -O3 -Ofast -fPIE -march=armv8-a -fno-addrsig -stdlib=libc++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -DNDEBUG -O3 -DIS_64BIT -DUSE_POPCNT -DUSE_NEON -flto -c -o benchmark.o benchmark.cpp

(same flags for each .cpp file)

The LDFLAGS are:

aarch64-linux-android21-clang++ -o stockfish benchmark.o bitbase.o bitboard.o endgame.o evaluate.o main.o material.o misc.o movegen.o movepick.o pawns.o position.o psqt.o search.o thread.o timeman.o tt.o uci.o ucioption.o tune.o tbprobe.o evaluate_nnue.o half_kp.o -mfpu=neon-vfpv4 -mthumb -mfloat-abi=softfp -static-libstdc++ -latomic -fPIE -pie --target=aarch64-linux-androideabi21 -stdlib=libc++ -O3 -Ofast -fPIE -march=armv8-a -fno-addrsig -stdlib=libc++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -DNDEBUG -O3 -DIS_64BIT -DUSE_POPCNT -DUSE_NEON -flto

Some of them may be redundant/extraneous but I hope they are sound at least. If these flags work out for you, please let me know.

notruck on 12 Aug 2020

@notruck
cross-compiling from linux to windows works by specifying a custom PGOBENCH which runs the binary under wine:

make -j ARCH=x86-64-avx2 COMP=mingw profile-build PGOBENCH="wine ./stockfish.exe bench"

maybe something similar is what you need to do a PGO cross to Android.

vondele on 12 Aug 2020

In another issue https://github.com/official-stockfish/Stockfish/issues/2979 some Makefile changes were suggested. I'm centralizing all open issues here, in the hope that eventually one can make a good pull request to fix the Makefile for Android.

vondele on 12 Aug 2020

Another good suggestion is in https://github.com/official-stockfish/Stockfish/issues/1457#issue-302075109 i.e. if we get this to work reasonably well, we should improve our CI so we can have automatic testing on Android in place.

vondele on 12 Aug 2020

@vondele Thanks, I'm aware of the Wine PGO method to cross-compile for Windows from a Linux host. Unfortunately, Android PGO may be more involved than that. To run the android binary on the host machine during profiling, I suspect you could need something like qemu, or some other android emulator, as well as an Android image?

The other route would be to use a terminal emulator app inside Android itself, and try to build PGO from within that environment.
This approach is less CI friendly, but probably good enough for those of us only looking for a usable engine.

Termux is a good candidate, they have Clang 10.0.0 available in their official repo, as an optional install. But that Clang doesn't come with the runtime profiler libclang_rt.profile-arm-android.a necessary for determining code coverage during the first run for profiling.

There is also a community repo with GCC, which looks promising for a PGO build. I have considered it, and looks like @MisesEnForce has actually used it.

Personally, I do appreciate the community repo maintaining/offering termux_gcc. That said, GCC could have potential issues down the road, as Google has already deprecated it in favor of Clang in their NDK. The Termux devs have also dropped GCC support from their official repo for this reason.

Because Google's NDK isn't enough for PGO, the Mozilla devs have resorted to making their own Android cross-toolchain to enable it. It doesn't seem to be easily available unless someone is willing to go through their CI environment and/or take up nearly the whole onboarding process to build Firefox. That option makes a lot of sense for a big project like Mozilla, but it is way beyond my ability or scope. More info here. Even they have to be very careful with the toolchain maintenance when migrating from Clang 8 to Clang 9 and so on.

I think the patch you made to address the aligned_alloc issue is definitely helpful. In #2979 they needed to specify Android API Level 28 (Android 9). Many users are still on much earlier versions of Android.
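
To show why the API level matters here, the following C++ sketch picks the allocation call at compile time. It assumes that Android only declares aligned_alloc from API level 28 on, and that __ANDROID_API__ carries the targeted level; portableAlignedAlloc is a hypothetical name, and the sketch mirrors the idea of the fix rather than its exact code.

```cpp
#include <stdlib.h>   // aligned_alloc, posix_memalign, free
#include <cstddef>

// Sketch: choose the allocator at compile time. Targets below Android API 28
// fall back to posix_memalign; everything else uses aligned_alloc.
// In both cases the memory is released with free().
void* portableAlignedAlloc(std::size_t alignment, std::size_t size) {
#if defined(__ANDROID__) && defined(__ANDROID_API__) && __ANDROID_API__ < 28
    // posix_memalign needs an alignment of at least sizeof(void*).
    if (alignment < sizeof(void*))
        alignment = sizeof(void*);
    void* pointer = nullptr;
    return posix_memalign(&pointer, alignment, size) == 0 ? pointer : nullptr;
#else
    // aligned_alloc expects size to be a multiple of alignment, so round up.
    const std::size_t rounded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
#endif
}
```

With a fallback like this, a build that targets API 16 or 21 would not need to raise its minimum Android version just to get an aligned allocation.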

notruck on 12 Aug 2020

@vondele Thx for pointing this thread to me.
@notruck You're right about gcc vs clang. I really insisted in using gcc as I am more used to it, but I'll give a shot to clang.
@AlexB123 I am quite new to android sdk/ndk but I would like to give it a try to cross compile stockfish as you did --> could you help ?

MisesEnForce on 12 Aug 2020

Ok, I see you have Linux; I have Windows 7, so things are more complicated (for me), plus I only know how to build engines via a toolchain. I also use API 16 for armv7 and API 24 for armv8, and the same flags, except -mfpu=neon-vfpv4 -fno-addrsig -stdlib=libc++ -Wall -Wcast-qual -fno-exceptions
I'll try to use them, they might come in handy.
@MisesEnForce Perhaps I can only help you if you have Windows.

AlexB123 on 13 Aug 2020

Among those flags, -Wall -Wcast-qual -fno-exceptions come from the official Makefile.

-mfpu=neon-vfpv4 -mfloat-abi=softfp flags are necessary for armv7 VFP and the hybrid softfp float. Armv8 should have full NEON support, and should no longer need VFP. Most likely, the compiler will just ignore those flags even if included, but you might also want to leave them out. Same thing with -mthumb

I think it's a good idea to keep using -fno-addrsig for the time being, regardless of the version of NDK you have.

Clang passes -faddrsig by default, incompatible with GNU binutils. https://github.com/android/ndk/issues/884

Google is trying to migrate to LLVM eventually, but they are not fully there yet. For example, they retain GNU binutils as default even in r21.

For that, they wanted the users to pass -fno-addrsig to work around the issue. But then in r21, they actually WENT BACK and set that flag as the new default.

So -fno-addrsig is probably necessary in r19 and r20, and superfluous/harmless in r21.

notruck on 13 Aug 2020

Ok, thanks! Regarding "static" builds, i assume you use all the mentioned flags, plus "-static" instead of "-static-libstdc++", right?

AlexB123 on 13 Aug 2020

For Android versions earlier than Lollipop 5.0, there is no libc++ in the system, and it must be included as a static-link. For other libraries, hopefully the end users will already have those on their systems.

Another difference I noticed was the -lm flag you're using. I think it applies to Cfish but not so much to Stockfish.

On the other hand, I still have the -latomic in there, which may no longer be necessary. It still compiled without either, but I haven't tried to measure their effect (if any) on the speed of the executable.

Removing those extra libraries might help reduce the binary size when static linking everything.

notruck on 13 Aug 2020

@AlexB123 I use NDK r21 on Linux for both armv7 and armv8 builds. Their latest revision r21d should also work. I still haven't figured out how to cross-compile PGO using the NDK.
I target Android API Level 16 (JellyBean 4.1.x) for armv7, and Android API Level 21 (Lollipop 5.0) for armv8, for maximum compatibility. API 16 because it's the earliest target still supported by r21, and API 21 because it's when 64-bit Android was first introduced.
My Makefile is a total mess, but its net result is to pass these CXXFLAGS for armv8 at build-time:
aarch64-linux-android21-clang++ --target=aarch64-linux-androideabi21 -stdlib=libc++ -O3 -Ofast -fPIE -march=armv8-a -fno-addrsig -stdlib=libc++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -DNDEBUG -O3 -DIS_64BIT -DUSE_POPCNT -DUSE_NEON -flto -c -o benchmark.o benchmark.cpp
(same flags for each .cpp file)
The LDFLAGS are:
aarch64-linux-android21-clang++ -o stockfish benchmark.o bitbase.o bitboard.o endgame.o evaluate.o main.o material.o misc.o movegen.o movepick.o pawns.o position.o psqt.o search.o thread.o timeman.o tt.o uci.o ucioption.o tune.o tbprobe.o evaluate_nnue.o half_kp.o -mfpu=neon-vfpv4 -mthumb -mfloat-abi=softfp -static-libstdc++ -latomic -fPIE -pie --target=aarch64-linux-androideabi21 -stdlib=libc++ -O3 -Ofast -fPIE -march=armv8-a -fno-addrsig -stdlib=libc++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -DNDEBUG -O3 -DIS_64BIT -DUSE_POPCNT -DUSE_NEON -flto
Some of them may be redundant/extraneous, but I hope they are sound at least. If these flags work out for you, please let me know.
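
The choice of compiler wrapper per architecture could be sketched like this. The aarch64 name is the one used in the command above; the armv7 name is an assumption based on the NDK's naming scheme for its clang wrappers and does not appear in this thread. `ndk_cxx` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: map an architecture to the NDK clang++ wrapper name,
# using the API levels mentioned above (16 for armv7, 21 for armv8).
ndk_cxx() {
    case "$1" in
        armv7) echo "armv7a-linux-androideabi16-clang++" ;;  # assumed name
        armv8) echo "aarch64-linux-android21-clang++" ;;     # as in the command above
        *)     echo "unknown arch: $1" >&2; return 1 ;;
    esac
}

ndk_cxx armv8
```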

@AlexB123 I have windows (the linux I have is WSL :))

MisesEnForce on 13 Aug 2020

@notruck for compilation of raspberry pi based on armv6/armv7 the following PR would be needed https://github.com/official-stockfish/Stockfish/pull/3006 does that conflict with anything discussed here?

Is the plan to turn the discussion in this issue in a PR for the Makefile that would allow compilation on Android? This would be great.

vondele on 15 Aug 2020

@vondele Regarding #3006, omitting the -msse from armv7 is a good call.

-latomic is most likely an unused flag on NDK r21 Clang 9.0.8. The NDK compiles (both armv7 and armv8) with or without -latomic and the checksums are identical for each pair of the resulting (stripped) binaries. But in the case of the raspberry pi, Dantist noted in that thread a possible performance drop when -latomic was used. So from my perspective, I have no problems with #3006 if the other raspberry pi users don't object to it.

On the other hand, I've found that the -lm flag leads to a different binary with a different checksum.
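
For anyone wanting to repeat that comparison, here is a minimal sketch of the check. The helper name `same_checksum` and the file names in the usage comment are made up; it relies only on the POSIX `cksum` tool, and both binaries should be stripped first.

```shell
#!/bin/sh
# Hypothetical helper: succeed if two files have the same checksum and size.
same_checksum() {
    # Reading from stdin makes cksum print "<crc> <size>" without a file name.
    a=$(cksum < "$1")
    b=$(cksum < "$2")
    [ "$a" = "$b" ]
}

# Usage, with hypothetical file names:
#   same_checksum stockfish-latomic stockfish-no-latomic && echo "identical"
```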

̶Edit: it seems -lm only produces a different binary for armv7. It might be another unused flag for armv8. Still, it would be nice if you can please confirm that, though. @̶A̶l̶e̶x̶B̶1̶2̶3̶ ̶C̶o̶u̶l̶d̶ ̶y̶o̶u̶ ̶p̶l̶e̶a̶s̶e̶ ̶b̶u̶i̶l̶d̶ ̶a̶n̶d̶ ̶t̶e̶s̶t̶ ̶i̶t̶s̶ ̶e̶f̶f̶e̶c̶t̶s̶ ̶o̶n̶ ̶s̶p̶e̶e̶d̶ ̶(̶i̶f̶ ̶a̶n̶y̶)̶ ̶f̶o̶r̶ ̶a̶r̶m̶v̶8̶?̶ Thanks!

Because of things like that, we may need to test/justify the Android flags some more, before making a PR. With personal builds, I'm the only one who "suffers" if they're unsound or sub-optimal, but the Makefile can potentially affect many other people.

As MichaelB pointed out above, we should probably keep Android flags separate from other armv7 targets such as the raspberry pi. For that, we may need new target(s) like android-armv7 or ndk-armv8. Would that be alright?

notruck on 15 Aug 2020

I can't test on any of the arm targets, so I'm guessing, but it could well be that -lm is added implicitly by gcc at link time for most architectures, so it makes no difference to add it manually. We do need the functionality provided by libm (some exp in the code), so it is there one way or another. That could probably be seen when adding '-v' to the compile and link commands. I can confirm '-lm' is there on linux. Possibly the 'soft-float' or embedded targets need it explicitly?
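
A small sketch of that '-v' check: capture the verbose link output and look for -lm as a separate argument. `links_libm` is a hypothetical helper; the pattern also accepts quoted arguments, since some drivers print them that way.

```shell
#!/bin/sh
# Hypothetical helper: read a verbose link log on stdin and succeed
# if -lm appears as a separate (optionally quoted) argument on any line.
links_libm() {
    grep -Eq -- '(^|[[:space:]"])-lm([[:space:]"]|$)'
}

# Usage (needs a compiler and the object files, so it is left commented out):
#   g++ -v -o stockfish *.o 2>&1 | links_libm && echo "libm is linked"
```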

Testing the flags a little more is a good thing, but I'm eager to make progress with it. The state we have right now is bad for most users, and now we have a few people looking at this who seem to have some understanding of it, or can test.

Concerning targets like android-armv7 or ndk-armv8: we can do checks in the Makefile based on ifeq ($(OS),Android), but if that would not be enough, we could indeed consider targets like android-armv7. In some sense the apple-silicon target is already similar.

vondele on 15 Aug 2020

̶Edit: it seems -lm only produces a different binary for armv7. It might be another unused flag for armv8. Still, it would be nice if you can please confirm that, though. @̶A̶l̶e̶x̶B̶1̶2̶3̶ ̶C̶o̶u̶l̶d̶ ̶y̶o̶u̶ ̶p̶l̶e̶a̶s̶e̶ ̶b̶u̶i̶l̶d̶ ̶a̶n̶d̶ ̶t̶e̶s̶t̶ ̶i̶t̶s̶ ̶e̶f̶f̶e̶c̶t̶s̶ ̶o̶n̶ ̶s̶p̶e̶e̶d̶ ̶(̶i̶f̶ ̶a̶n̶y̶)̶ ̶f̶o̶r̶ ̶a̶r̶m̶v̶8̶?̶ Thanks!

Hello guys! @notruck I'm a bit busy these days. Here are arm8 and arm8-no-lm (made without the -lm flag). Source code from https://github.com/official-stockfish/Stockfish/commit/6eb186c97e9d808970d0b1369bcd7aca60612e26 My flags ->
set "compiler_options=-m64 -march=armv8-a -DIS_64BIT -fPIE -Wl,-pie -DNDEBUG -DUSE_NEON -DUSE_POPCNT -Ofast -flto -static-libstdc++ -std=c++17 -fno-strict-aliasing -fno-strict-overflow -ffunction-sections -fdata-sections -Wl,--gc-sections -Wl,-s
SF-NNUE-140820-arm8.zip
SF-NNUE-140820-arm8-no-lm.zip

AlexB123 on 15 Aug 2020

Both binaries have identical speed, work fine, and I think, they are byte-identical. However, one was built 20 h earlier than the other, thus some bytes are different.

Joachim26 on 15 Aug 2020

Thanks Alex and Joachim!

Testings for armv7 are also underway. I'll work on opening a PR soon.

notruck on 16 Aug 2020

@notruck great. Should I wait to merge #3006 or it doesn't matter for you.
Edit: alternatively, just include it in your PR.

vondele on 16 Aug 2020

Sure, I'll include #3006 as well. I think that one will be helpful for those wishing to compile armv7 with clang from inside termux.

My modifications of the Makefile will only focus on cross-compiling with the NDK. I took some liberties to "support a new compiler" ndk.

They may look ugly, but they should work, and perhaps more importantly they display all the necessary flags together in one section for possible refactoring later. It will now be very easy to compile Stockfish for Android from the command line. Simply issuing make build ARCH=armv7 COMP=ndk or make build ARCH=armv8 COMP=ndk will do the trick.

Besides, it shouldn't conflict with any other target, or the existing workflow of the people already compiling Stockfish for Android in their own preferred way.
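
As a sketch, both builds can be scripted in one go. This assumes the NDK toolchain's bin directory is already on PATH and that the script is started from Stockfish's src directory; `ndk_build` and the `DRY_RUN` variable are hypothetical names. With the default `DRY_RUN=1` it only prints the commands.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default here) or run the two NDK builds named above.
DRY_RUN=${DRY_RUN:-1}

ndk_build() {
    arch=$1
    if [ "$DRY_RUN" = "1" ]; then
        echo "make clean"
        echo "make build ARCH=$arch COMP=ndk"
    else
        # Clean first so object files from one architecture are not reused.
        make clean &&
        make build ARCH="$arch" COMP=ndk &&
        cp stockfish "stockfish-$arch"   # keep each binary under its own name
    fi
}

for arch in armv7 armv8; do
    ndk_build "$arch"
done
```

Run it with `DRY_RUN=0` to actually build.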

notruck on 16 Aug 2020

I think that make build ARCH=armv8 COMP=ndk is a good choice, but we can look in detail when the PR is there.

vondele on 16 Aug 2020

PR created, forgot to mention this issue.

I also included the easiest way I know to use the NDK in command line. You won't need Android Studio nor the SDK.

notruck on 16 Aug 2020

yes, looks good. I left a few questions/remarks there.

vondele on 16 Aug 2020

@notruck The armv7 binary "sf-vondele-armv7" posted in "aligned_alloc_changes.zip" on 08/10 is on an armv8 device nearly as fast as "sf-vondele-armv8a" (both with NNUE on).
Short question: I think this means that NEON is used in the 64- as well as the 32-bit engine? Is this correct?

Joachim26 on 16 Aug 2020

Thanks for letting me know!

Yeah, NEON is explicitly enabled for both architectures. I'm not sure if armv7 processors are fully able to benefit from it, but in my tests there wasn't any adverse effect, and the code sped up by at least 30% even on armv7. So I opted to use it.

Your armv8 device has full hardware NEON support, which might be why it ran the 32-bit binary almost as fast as armv8!

notruck on 16 Aug 2020

I think not all armv7 chips support NEON, but all armv7 used together with Android do. So, we can enable NEON for android on armv7, but not for armv7 in general (raspberry pi 1 for example). At least that's my understanding.

vondele on 16 Aug 2020

Thank you both for your fast answers!
"Not all ARMv7-based Android devices support Neon, but devices that do may benefit significantly from its support for scalar/vector instructions."
it's from: https://developer.android.com/ndk/guides/cpu-arm-neon
I will repeat the tests, but I'm quite sure that sf-vondele-armv7 was also relatively fast on an old armv7 phone.
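
One way to see whether a given device reports NEON is to look at the Features line of its /proc/cpuinfo. The sketch below assumes the usual Linux layout of that file, where 32-bit ARM kernels list "neon" and 64-bit ARM kernels list "asimd" instead; `cpu_has_neon` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the Features line of a cpuinfo file lists
# "neon" (32-bit ARM kernels) or "asimd" (the name 64-bit ARM kernels use).
cpu_has_neon() {
    cpuinfo=${1:-/proc/cpuinfo}
    grep -Eiq '^Features[[:space:]]*:.*[[:space:]](neon|asimd)([[:space:]]|$)' "$cpuinfo"
}

# Usage with a copy of the file pulled from the phone (hypothetical file name):
#   cpu_has_neon cpuinfo.txt && echo "NEON reported"
```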

Joachim26 on 16 Aug 2020

In the meantime I have made some more speed tests on an old Android 4 device with binaries I got from @AlexB123, since I think the situation on armv8 is quite clear: NEON is hardware-supported, and armv7 binaries can also call these NEON instructions (with 128-bit data?). Therefore, the speed difference between a v7 and a v8 SF-NNUE binary is quite small. I measured 83% with NNUE on (and 69% with NNUE off).

On the Android 4 device it's a bit more complicated. I measured "more or less" 3 speed "levels" with NNUE on (always in the starting position, 1 core):
10 knps (no Neon support), 20 knps (Neon emulated with flag -mfpu=neon-vfpv4 (?)), 40 knps (hardware-supported Neon).
The last value is surprisingly close to the 50 to 60 knps on the 64 bit CPU with the v7 SF-NNUE binary! Both mentioned devices have comparable clock rates of 1.3 GHz (v7) and 1.5 GHz (v8).

Joachim26 on 17 Aug 2020

So, before this comment, I tended to think we should go with 3 arch values armv8, armv7-neon and armv7 so we could enable the corresponding flags (i.e. armv8 and armv7-neon enable both neon and popcnt). That would give the users choice, and what to distribute would be up to the person distributing.

However, if we can always (linux and android) enable -mfpu=neon-vfpv4 it seems like we could unify things more?

vondele on 17 Aug 2020

Was -mfpu=neon-vfpv4 enabled building sf-vondele-armv7?
For Alex's comparable fast binary I also don't know!
I only know, that it was enabled for his medium fast binary.

Joachim26 on 17 Aug 2020

In the past I have used -mfpu=neon-vfpv4 on everything.

But in my tests the other day, before opening the pending PR, I found that -mfpu=neon-vfpv4 didn't seem to do anything (flag ignored?). The NDK documents mention ABI support for vfpv3-d16. Accordingly, I didn't include that -mfpu=neon-vfpv4 flag in my PR.

Clang doesn't like the vfpv3-d16 as described in the GCC documents. The flag that finally produced a different binary (different checksum anyway, still similar speeds) was -march=armv7-a+fp so you might want to test with that flag instead @Joachim26

The NDK document you linked above about NEON also mentions that starting with NDK r21 they're now enabling NEON by default for all API levels? Confusingly, we're still supposed to use a flag for that(?!) Looking at their CMake example, that flag is -mfpu=neon so we should test that as well.

notruck on 17 Aug 2020

https://www.dropbox.com/s/byd00aklic2pftn/sfndk.zip?dl=0

vondele on 17 Aug 2020

@Joachim26 the above are three binaries... can you test them?

That would be the status of the branch: https://github.com/official-stockfish/Stockfish/compare/master...vondele:notruck-master

vondele on 17 Aug 2020

v8 phone:
vondele_v8 : 60 knps
sfndk.armv7 : 26.5 knps
sfndk.armv7-neon: 52 knps
sfndk.armv8-neon: 61 knps

error +/- 1 knps => 1) and 4) same speed

v7 phone:
vondele_v7 : 35 knps
sfndk.armv7 : 15.5 knps
sfndk.armv7-neon: 35 knps

error +/- 1 knps => 1) and 3) same speed

Startposition, 1 core, measured after several seconds, hash cleared before measurement, GUI=Droidfish
If something is not clear, let me know.
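
To put the numbers above in relative terms, a small sketch that prints one speed as a rounded percentage of another. `speed_pct` is a hypothetical helper; the inputs are the knps values listed above.

```shell
#!/bin/sh
# Print speed a as a rounded percentage of speed b (both in knps).
speed_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * a / b }'
}

speed_pct 52 61     # armv7-neon vs armv8-neon on the v8 phone
speed_pct 15.5 35   # armv7 without NEON vs armv7-neon on the v7 phone
```

With these inputs it prints 85 and 44: the NEON armv7 binary reaches about 85% of the armv8 speed on the v8 phone, and the non-NEON binary about 44% of the NEON one on the v7 phone.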

Joachim26 on 17 Aug 2020

cool, so I can compile for android and it works... that's foolproof. I'll just have to learn how to copy the binaries to my phone ;-).

@notruck do you like the state of this branch https://github.com/vondele/Stockfish/tree/notruck-master (see also https://github.com/official-stockfish/Stockfish/compare/master...vondele:notruck-master). I might just want to adjust the strip target to pick the right binary.

Now we need to verify that this still works with linux on arm. @Dantist could you test that this branch compiles on RP ?

vondele on 17 Aug 2020

Hello, just want to let you know, guys. As @Joachim26 already mentioned, I've made 6 versions of armv7, using NDK r21 and r21d. The fastest armv7 was compiled with r21, including the Makefile changes from https://github.com/official-stockfish/Stockfish/pull/3006/commits/a251ef54debbb42b9cbc2277285eeb6d2efc0938 , and with the following flags:
set "compiler_options=-m32 -march=armv7-a -fPIE -Wl,-pie -mfloat-abi=softfp -mfpu=vfpv3-d16 -mfpu=neon-vfpv4 -DUSE_NEON -mthumb -Wl,--fix-cortex-a8 -latomic -DNDEBUG -DUSE_POPCNT -Ofast -flto -static-libstdc++ -std=c++17 -fno-strict-aliasing -fno-strict-overflow -ffunction-sections -fdata-sections -Wl,--gc-sections -Wl,-s
All 6 armv7 here (in case if you want to try).
ARM7 speed test.zip

AlexB123 on 17 Aug 2020

For the RPI 4 64 bit OS (still in beta), these are the proper flags:

    ifeq ($(ARCH),armv7)
        CXXFLAGS += -mcpu=cortex-a72 -march=armv8-a+crypto+simd -mtune=cortex-a72

for the RPI 4 or lower in a 32 bit OS
these are the proper flags

    ifeq ($(ARCH),armv7)
            CXXFLAGS += -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits -mtune=cortex-a53

The 64 bit OS will produce executables that output (in classical eval mode) 750k nps at the standard arm_freq of 1500, up to nearly 900k nps at 2100 MHz. The 32 bit OS is about 30 to 40% slower. NNUE mode is about 38% of Classical mode.

MichaelB7 on 17 Aug 2020

@MichaelB7 which of those flags are non-essential, i.e. can be left out, and still results in a reasonable executable. I assume in the 64 bits case all of the flags can be left out, but possibly not in the 32 bit case?

Edit: Also, isn't RPI4 an armv8, why modify the flags under armv7?

The challenge with the arm target, at least for me, is the diversity, and I'd like to have a minimal working input first. For example, will the flags you post work with the RPI 1?

vondele on 17 Aug 2020

The active community base of Pi users which I am involved with are those using engines with PicoChess. PicoChess runs on Pi 3 and higher. The PicoChess community uses a DGT-PI, a modified DGT clock 3000, or something similar that may be handcrafted or modified, which lets one use a wooden chessboard, primarily a DGT board, to make moves and play against various chess engines. There is no active group of chess users using RPI 1 or RPI 2 devices. The second set of flags will suffice.

https://groups.google.com/g/picochess

The armv8 flag did not compile with the RPI-4, and ARMv7 did, at least for me. This is the Raspi 64 bit OS, which is still in beta, and the user base is probably very small. The flags for the 32 bit RPI work for the model 3 and above, including the 4 if it is running the 32 bit OS. I'm not familiar with anyone using the RP1 or RP2 since they are not supported for PicoChess.

MichaelB7 on 17 Aug 2020

still trying to understand. Also RPI3 is armv8, so my question is, can you compile&run on that hardware with
https://github.com/vondele/Stockfish/tree/notruck-master
using make -j ARCH=armv8 build ?

vondele on 17 Aug 2020

no that does not work:

nnue/evaluate_nnue.cpp: In function ‘Value Eval::NNUE::ComputeScore(const Position&, bool)’:
nnue/evaluate_nnue.cpp:135:61: warning: requested alignment 64 is larger than 8 [-Wattributes]
         transformed_features[FeatureTransformer::kBufferSize];
                                                             ^
nnue/evaluate_nnue.cpp:137:61: warning: requested alignment 64 is larger than 8 [-Wattributes]
     alignas(kCacheLineSize) char buffer[Network::kBufferSize];
                                                             ^
In file included from nnue/../nnue/architectures/../features/../nnue_common.h:40,
                 from nnue/../nnue/architectures/../features/features_common.h:25,
                 from nnue/../nnue/architectures/../features/feature_set.h:24,
                 from nnue/../nnue/architectures/halfkp_256x2-32-32.h:24,
                 from nnue/../nnue/nnue_architecture.h:25,
                 from nnue/../nnue/nnue_accumulator.h:24,
                 from nnue/../position.h:31,
                 from nnue/evaluate_nnue.cpp:26:
/usr/lib/gcc/arm-linux-gnueabihf/8/include/arm_neon.h: In member function ‘void Eval::NNUE::FeatureTransformer::RefreshAccumulator(const Position&) const’:
/usr/lib/gcc/arm-linux-gnueabihf/8/include/arm_neon.h:589:1: error: inlining failed in call to always_inline ‘int16x8_t vaddq_s16(int16x8_t, int16x8_t)’: target specific option mismatch
 vaddq_s16 (int16x8_t __a, int16x8_t __b)
 ^~~~~~~~~
In file included from nnue/evaluate_nnue.h:24,
                 from nnue/evaluate_nnue.cpp:30:
nnue/nnue_feature_transformer.h:229:40: note: called from here
             accumulation[j] = vaddq_s16(accumulation[j], column[j]);
                               ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from nnue/../nnue/architectures/../features/../nnue_common.h:40,
                 from nnue/../nnue/architectures/../features/features_common.h:25,
                 from nnue/../nnue/architectures/../features/feature_set.h:24,
                 from nnue/../nnue/architectures/halfkp_256x2-32-32.h:24,
                 from nnue/../nnue/nnue_architecture.h:25,
                 from nnue/../nnue/nnue_accumulator.h:24,
                 from nnue/../position.h:31,
                 from nnue/evaluate_nnue.cpp:26:
/usr/lib/gcc/arm-linux-gnueabihf/8/include/arm_neon.h:589:1: error: inlining failed in call to always_inline ‘int16x8_t vaddq_s16(int16x8_t, int16x8_t)’: target specific option mismatch
 vaddq_s16 (int16x8_t __a, int16x8_t __b)

looking now to see what works and keeps it simple

MichaelB7 on 17 Aug 2020

is that the 32 bit OS? in that case maybe try ARCH=armv7-neon ?

Anyway, I've force-pushed the branch (https://github.com/vondele/Stockfish/tree/notruck-master) once more, I think the ndk changes are final. @notruck do you want your full name added to the AUTHORS file, right now I've used your github handle.

vondele on 17 Aug 2020

so far , I need at least these two

   CXXFLAGS +=  -mcpu=cortex-a53 -mfpu=neon-fp-armv8

using armv7-neon in lieu of above

g++: error: unrecognized -march target: armv7-neon
g++: note: valid arguments are: armv2 armv2a armv3 armv3m armv4 armv4t armv5 armv5t armv5e armv5te armv5tej armv6 armv6j armv6k armv6z armv6kz armv6zk armv6t2 armv6-m armv6s-m armv7 armv7-a armv7ve armv7-r armv7-m armv7e-m armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8-m.base armv8-m.main armv8-r iwmmxt iwmmxt2 native; did you mean ‘armv7-a’?

MichaelB7 on 17 Aug 2020

I'm surprised by that gcc error, we don't pass -march=armv7-neon to gcc, as far as I can see. Do you have anywhere in the Makefile other local changes?

vondele on 17 Aug 2020

You still get this error using just -march=armv7-a:

In file included from nnue/../nnue/architectures/../features/../nnue_common.h:40,
                 from nnue/../nnue/architectures/../features/features_common.h:25,
                 from nnue/../nnue/architectures/../features/feature_set.h:24,
                 from nnue/../nnue/architectures/halfkp_256x2-32-32.h:24,
                 from nnue/../nnue/nnue_architecture.h:25,
                 from nnue/../nnue/nnue_accumulator.h:24,
                 from nnue/../position.h:31,
                 from nnue/evaluate_nnue.cpp:26:
nnue/nnue_feature_transformer.h: In member function ‘Eval::NNUE::FeatureTransformer::RefreshAccumulator(Position const&) const’:
/usr/lib/gcc/arm-linux-gnueabihf/8/include/arm_neon.h:589:1: error: inlining failed in call to always_inline ‘vaddq_s16’: target specific option mismatch
 vaddq_s16 (int16x8_t __a, int16x8_t __b)
 ^~~~~~~~~
In file included from nnue/evaluate_nnue.h:24,
                 from nnue/evaluate_nnue.cpp:30:
nnue/nnue_feature_transformer.h:229:40: note: called from here
             accumulation[j] = vaddq_s16(accumulation[j], column[j]);
                               ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from nnue/../nnue/architectures/../features/../nnue_common.h:40,
                 from nnue/../nnue/architectures/../features/features_common.h:25,
                 from nnue/../nnue/architectures/../features/feature_set.h:24,
                 from nnue/../nnue/architectures/halfkp_256x2-32-32.h:24,
                 from nnue/../nnue/nnue_architecture.h:25,
                 from nnue/../nnue/nnue_accumulator.h:24,
                 from nnue/../position.h:31,
                 from nnue/evaluate_nnue.cpp:26:
/usr/lib/gcc/arm-linux-gnueabihf/8/include/arm_neon.h:589:1: error: inlining failed in call to always_inline ‘vaddq_s16’: target specific option mismatch
 vaddq_s16 (int16x8_t __a, int16x8_t __b)
 ^~~~~~~~~
In file included from nnue/evaluate_nnue.h:24,
                 from nnue/evaluate_nnue.cpp:30:
nnue/nnue_feature_transformer.h:229:40: note: called from here
             accumulation[j] = vaddq_s16(accumulation[j], column[j]);
                               ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[1]: *** [<builtin>: evaluate_nnue.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/Al/github/Stockfish/src'
make: *** [Makefile:812: build] Error 2

MichaelB7 on 17 Aug 2020

So I found the minimum flags with which it still compiles; they are below. This works on the 32 bit kernel and OS:

g++ -Wall -Wcast-qual -fno-exceptions -std=c++17  -pedantic -Wextra -Wshadow -march=armv7-a  -mfpu=neon -DNDEBUG -O3 -DIS_64BIT -DUSE_NEON

and it works on the RPI-4 running 32 bit kernel OS, which is the current standard. It currently does not work on the RPI-4 with 64 bit kernel/OS which is still in beta - but for those who need that, it is not hard to figure out.

MichaelB7 on 18 Aug 2020

I'm surprised by that gcc error, we don't pass -march=armv7-neon to gcc, as far as I can see. Do you have anywhere in the Makefile other local changes?

I thought that is what you were asking me to pass.

MichaelB7 on 18 Aug 2020

Similar, I meant make -j ARCH=armv7-neon build.

What I'm still a bit confused about is why we need to pass the -mfpu=neon flag. It is reasonable, but somehow various people seem to be able to build for raspberry PI with this option. Let me try to add it specific to linux only: https://github.com/vondele/Stockfish/tree/notruck-master can you give it a try?
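
To make the distinction explicit: the ARCH= value is a Makefile-level name, not something handed to -march. A purely illustrative mapping, using only flags that appear in this thread, might look like the sketch below. `arch_flags` is a hypothetical helper and is not the actual Makefile logic.

```shell
#!/bin/sh
# Hypothetical mapping from make-level ARCH names to compiler flags.
# Note that "armv7-neon" never reaches the compiler as a -march value.
arch_flags() {
    case "$1" in
        armv7)      echo "-march=armv7-a" ;;
        armv7-neon) echo "-march=armv7-a -mfpu=neon -DUSE_NEON" ;;
        armv8)      echo "-march=armv8-a -DUSE_NEON" ;;
        *)          echo "unknown ARCH: $1" >&2; return 1 ;;
    esac
}

arch_flags armv7-neon
```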

vondele on 18 Aug 2020

@vondele Github username is fine. Thank you!

I tried to digest the information from the NDK NEON page a little more. If I understood them correctly, they didn't have a -mfpu=neon as the default until r21. I have used it explicitly before, and doing so was indeed beneficial.

Now that r21 uses it on everything, it has become optional (redundant but harmless). Other compilers and older NDKs will still benefit from an explicit -mfpu=neon or a slight variation of that base flag with extra suffixes.

While -mfpu=neon-x-y-z will benefit the runtime speeds in general, it still doesn't seem to help with NNUE network usage.

The neon = yes in the Makefile ensures the passing of -DUSE_NEON flag, which ultimately leads to the inclusion of a header in the nnue/nnue_common.h file:
https://github.com/official-stockfish/Stockfish/blob/master/src/nnue/nnue_common.h#L42

So when compiling with GCC for raspberry pi armv7, using both -mfpu=neon-x-y-z and -DUSE_NEON may be producing the best results.
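
A quick way to check whether a given set of flags actually turns NEON on in the compiler is to dump its predefined macros and look for `__ARM_NEON`. This is a sketch: `neon_enabled` is a hypothetical helper, and the usage line assumes a GCC-style driver on an ARM toolchain. If the macro is missing while -DUSE_NEON is passed, that would be consistent with the "target specific option mismatch" errors reported above.

```shell
#!/bin/sh
# Hypothetical helper: read the compiler's predefined macros on stdin and
# succeed if __ARM_NEON is defined, i.e. NEON code generation is enabled.
neon_enabled() {
    grep -Eq '^#define __ARM_NEON([[:space:]]|$)'
}

# Usage on an ARM toolchain (commented out because it needs that compiler):
#   g++ -march=armv7-a -mfpu=neon -dM -E -x c++ /dev/null | neon_enabled && echo "NEON on"
```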

notruck on 18 Aug 2020

I have updated master with what I believe is the best patch so far. There might/will still be issues, let's try to improve as a follow up. Thanks for the feedback and testing.

vondele on 18 Aug 2020
