Stockfish: Windows large pages commit - modern version instantly crashing

Created on 13 May 2020  Â·  22Comments  Â·  Source: official-stockfish/Stockfish

I just downloaded the latest Stockfish build ("Add support for Windows large pages" commit) from the abrok.eu site.

I run the engine exe-file in the command prompt (just open exe-file without adding it to any GUI) and see the following:

Stockfish 130520 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
info string Hash table allocation: Windows large pages not used.

Of course I did not change the system policy to use page locking.

Bench command is working well.

But when I try to close the command prompt window, the engine stops working (with Windows appropriate message and log information in the event viewer). This happens only when the latest modern build is launched (my processor is quite old and is not supports BMI command), all the previous builds until Lock pages commit work without any problems.

System: Win 10 Pro x64 (November 2019 update).

Most helpful comment

All 22 comments

Updated: I checked the latest build on another, more modern computer (BMI2 - Haswell CPU - Stockfish version). The problem is exactly the same.

At the moment, I did not add the newest build to any GUI, as it is obvious that the problem after Lock pages commit exists.

What do you mean exactly with "But when I try to close the command prompt window, the engine stops working"?

@CoffeeOne I mean that if I try to close the running engine window as normal way - engine crashing with Windows application error report is following. It is not good. Such was never happened before the latest commit.

I tried to duplicate, with and without Large Pages from abrok , to no avail.

Can you post system information ( search "system information: in search bar and run app.)
Output will look something like this:

OS Name Microsoft Windows 10 Pro
Version 10.0.18363 Build 18363
Other OS Description    Not Available
OS Manufacturer Microsoft Corporation
System Name VM-894787
System Manufacturer System manufacturer
System Model    System Product Name
System Type x64-based PC
System SKU  SKU
Processor   AMD Ryzen Threadripper 3970X 32-Core Processor, 3693 Mhz, 32 Core(s), 64 Logical Processor(s)
BIOS Version/Date   American Megatrends Inc. 0807, 1/30/2020
SMBIOS Version  3.2
Embedded Controller Version 255.255
BIOS Mode   UEFI
BaseBoard Manufacturer  ASUSTeK COMPUTER INC.
BaseBoard Product   PRIME TRX40-PRO
BaseBoard Version   Rev 1.xx
Platform Role   Desktop
Secure Boot State   Off
PCR7 Configuration  Binding Not Possible
Windows Directory   C:\Windows
System Directory    C:\Windows\system32
Boot Device \Device\HarddiskVolume7
Locale  United States
Hardware Abstraction Layer  Version = "10.0.18362.752"
User Name   VM-894787\MichaelB7
Time Zone   Eastern Daylight Time
Installed Physical Memory (RAM) 128 GB
Total Physical Memory   128 GB
Available Physical Memory   116 GB
Total Virtual Memory    136 GB
Available Virtual Memory    114 GB
Page File Space 8.00 GB
Page File   C:\pagefile.sys
Kernel DMA Protection   Off
Virtualization-based security   Not enabled
Device Encryption Support   Reasons for failed automatic device encryption: TPM is not usable, PCR7 binding is not supported, Hardware Security Test Interface failed and device is not Modern Standby, Un-allowed DMA capable bus/device(s) detected, TPM is not usable
Hyper-V - VM Monitor Mode Extensions    Yes
Hyper-V - Second Level Address Translation Extensions   Yes
Hyper-V - Virtualization Enabled in Firmware    Yes
Hyper-V - Data Execution Protection Yes

One thing I would try, is to run wndows updates until it does not update anymore. Standard procedure when an app that always worked before stops working.

Not sure if this is related but I compiled the latest version (large pages), on Windows, (with large pages off) and while I'm not getting any crashes or errors, the executable takes noticeably longer to close (around 2 to 3 seconds) while all prior executables close pretty much instantaneously when the "X" button is clicked. This is repeatable too, and I'm not doing anything differently than I normally would.

I checked and this behavior isn't present/hasn't been present in compiles that go only up to the TBTables::Entry commit, so I'm fairly certain it has something to do with the latest Large Pages commit. I also find it strange that this effect is even occurring given that I'm not even using the functionality

Not sure if this is related but I compiled the latest version (large pages), on Windows, (with large pages off) and while I'm not getting any crashes or errors, the executable takes noticeably longer to close (around 2 to 3 seconds) while all prior executables close pretty much instantaneously when the "X" button is clicked. This is repeatable too, and I'm not doing anything differently than I normally would.

@silversolver1 Can you please check Event log - Application. I assume you will find out Stockfish crashes after those 2-3 seconds of freezing. This is exactly behavior that I got on the second machine. On the first one there were Windows crashes messages after seconds of freezing, but on the second machine only freezing was visible, but I guessed to look into the event log and found those same crashes.

@MichaelB7 The reason is definitely not in the hardware, as this problem occurs on two different computers at once. And I did not update Windows before the crash issue occurred. It appeared exclusively in the latest Stockfish build with Large pages commit.

Right now I am running any previous Stockfish builds on both PCs and none of them have a similar crashes problem.

It could be the other commit regarding table bases - I will post a link to my complied version of Stockfish that does not have the tbtables commit since that commit was problematic for me and it wasn’t an elo gainer so I just ignored it.

edit: here's the file, functionally it's the same stockfish as SF, but it does have other bells and whistles. Let me know if this has the same behavior or not, This has LP, slightly modified, but it does not have commit 86ee4eb84d54 wich was also committed today

Stockfish-dev-051320.zip

On May 13, 2020, at 9:24 PM, Ivan Panitevsky notifications@github.com wrote:


@MichaelB7 The reason is definitely not in the hardware, as this problem occurs on two different computers at once. And I did not update Windows before the crash issue occurred. It appeared exclusively in the latest Stockfish build with Large pages commit.

Right now I am running any previous Stockfish builds on both PCs and none of them have a similar crashes problem.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.

@Coolchessguykevin Yes, I checked and you were right it shows up with event name APPCRASH.

Some of the seemingly relevant details in the crash log are:

P4: ntdll.dll
P7: c0000005
P8: 00000000000407d3

https://stackoverflow.com/questions/9497384/how-to-interpret-windows-appcrash-mysterious-log

From there,
P4 = Module name
P7 = Exception code
P8 = Exception offset from the start of the module

According to google, exception code c0000005 indicates an access violation

The only real thing I could notice about this is that in tt.cpp's TranspositionTable::resize, Here, variable mem is not declared, defined or initialized in any way that I can tell. It does not seem to have a type, is not passed in as a parameter in any way, is not a global variable and does not point to anything either (so I am not sure how it can be used. although i could of course be missing something ;) )

@skiminki could you have a look ?

From the description is seems that the TT.resize(0) at the end might be causing troubles.

@silversolver1 mem is a variable of the TranspositionTable class (see tt.h line 95).

This is certainly not related to the TBTables::Entry commit

@silversolver1 could you try the following, it is a bit a shot in the dark, but it would be useful to exclude it as a cause:

diff --git a/src/main.cpp b/src/main.cpp
index c7cf2c6f2..f0b7f240b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -49,6 +49,7 @@ int main(int argc, char* argv[]) {

   UCI::loop(argc, argv);

+  Threads.main()->wait_for_search_finished();
   TT.resize(0);
   Threads.set(0);
   return 0;

@vondele Sure, I compiled a version with just the line Threads.main()->wait_for_search_finished(); added in that particular spot on top of the Large Pages commit. It seems the problem persists with the same error log, including the same exception code and exception offset.

OK, thanks. So that's not an issue. The windows binary (including the abrok one) runs fine under wine on Linux. @silversolver1 could you reproduce this crash in a debugger and get the location of the crash confirmed?

Also as asked previously, I guess this will be Windows version specific, so can you provide that as well.

It happens only on SIGINT sent to the process (so Ctrl-C or just closing terminal window with SF running). Command 'quit' works with no problem.

I think I have reproduced this. I get this behavior + entry in the application log if I close the stockfish terminal window with the "close window" button.

However, I do not see this issue if I use command 'quit'.

My reproduction steps precisely:

  1. Click 'stockfish.exe' icon to open a terminal Window
  2. Click 'X' button in top-right to close the window
  3. It takes maybe 2 secs to close the window, and there are 3 new events in the event log

However, the following works without issues

  1. Click 'stockfish.exe' icon to open a terminal Window
  2. type 'quit' to close the engine gracefully
  3. Terminal window closes instantly

Yeah, I think this has to do with SIGINT and possibly about that we didn't get a chance to close VirtualAlloc's. I'll post the events in a follow-up comment.

This is about NTDLL.DLL crashing because we didn't close VirtualAlloc's in ctrl+c exit path. I get the same behavior with 'quit' exit with the following patch:

diff --git a/src/tt.cpp b/src/tt.cpp
index 6ee63138..4b2c45d0 100644
--- a/src/tt.cpp
+++ b/src/tt.cpp
@@ -63,6 +63,12 @@ void TranspositionTable::resize(size_t mbSize) {

   Threads.main()->wait_for_search_finished();

+  if (!mbSize)
+  {
+      // aligned_ttmem_free(mem); mem = nullptr; -- no crash if this line commented out
+      exit(0);
+  }
+
   if (mem)
       aligned_ttmem_free(mem);

With this patch, the following gives a crash:

  1. Click 'stockfish.exe' icon to open a terminal Window
  2. type 'quit'
  3. It takes maybe 2 secs to close the window, and there are 2 new events in the event log (error + info)

Actually, I don't think we need the crash events here. I'll investigate a bit further.

Ok, the crash is most likely because of a left-over. That is, TranspositionTable destructor frees mem if it's non-null. The fix is to delete it, because we free transposition table now manually--or rather, maybe we should delete that resize(0) stuff and replace free(mem) with aligned_ttmem_free(mem) call. This was a bug in my patch. I'll send a PR soon.

diff --git a/src/tt.h b/src/tt.h
index 8b70f797..94fb9f61 100644
--- a/src/tt.h
+++ b/src/tt.h
@@ -75,7 +75,7 @@ class TranspositionTable {
   static_assert(sizeof(Cluster) == 32, "Unexpected Cluster size");

 public:
- ~TranspositionTable() { free(mem); }
+  ~TranspositionTable() { }
   void new_search() { generation8 += 8; } // Lower 3 bits are used by PV flag and Bound
   TTEntry* probe(const Key key, bool& found) const;
   int hashfull() const;

So with this patch, I don't see crashes. TBH, I'm a bit surprised that Windows calls the static destructors on window close after aborting the program.

I haven't yet tried anything with a debugger but as I commented in 2679, the PR seems to work to remedy the problem and I no longer get any application crashes or errors in the Windows Event Viewer log (running Windows 10). The hang on closing using the Ctrl-C and "X" methods also seems to be gone now, so it looks like this worked

yes, will be fixed after merging.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

BKSpurgeon picture BKSpurgeon  Â·  6Comments

nguyenpham picture nguyenpham  Â·  4Comments

bftjoe picture bftjoe  Â·  5Comments

anonymous7002 picture anonymous7002  Â·  3Comments

Technologov picture Technologov  Â·  3Comments