Our team (https://github.com/dotnet/machinelearning) is using lightgbm (version 2.2.3) and sometimes lightgbm crashes the test process to LGBM_BoosterUpdateOneIter method call when running on CI, mainly on windows 10 dotnet core 2.1.
I tried to capture memory dump but seems lightgbm dll don't have symbol built (Binary was not built with debug information), so I'm wondering how to build lightgbm dll with debug information so I can investigate further, also will upgrade to the latest version resolve the issue?
This issue is happening randomly so I'm not able to reproduce a stable repro.
Exception looks like below:

The error call stack looks like below:

Any help will be appreciated, thanks
Hi @frank-dong-ms,
Firstly, you can try the latest master branch, and see what happens.
In addition, I can create a PR to easily compile the DLL with debug flags, so that we can get meaningful information.
@guolinke Thanks for your reply, I tried to use latest nuget version and the issue still exists.
Can you please provide dll with debug flags so I can get more information, thanks
@guolinke thanks, is this dll you provided debug version? This crash only appears in Release version, can you provide Release version as well?
@guolinke Thanks so much for quick response. Yes, the crash only appear in Release build, I tried to build in Debug build there is no crash..
I got several repro with the dll you provided and I paste the call stack below, they seems different issues:
Issue1:
Exception:
Unhandled exception at 0x00007FFEDBC9C0DD (lib_lightgbm.DLL) in dotnet.exe.13236.dmp: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF. occurred
Call Stack:

Issue2:
Exception:
Unhandled exception at 0x00007FFF3878B049 (ntdll.dll) in dotnet.exe.7588.dmp: 0xC0000374: A heap has been corrupted (parameters: 0x00007FFF387F27F0). occurred
Call Stack:

Issue3:
Exception:
Unhandled exception at 0x00007FFEE9223B8D (coreclr.dll) in dotnet.exe.3768.dmp: 0xC0000005: Access violation. occurred
Call Stack:

All these exception happens in multi-thread scenario, so are these API thread safe?
did you mean the lib_lightgbm.dll is called by multi-thread in C#?
I check the lines that throw the error, and don't think there are problems.
The SplitInfo is a simple C struct, no contains any pointers.
One possible reason is, the num_threads of LightGBM is changed (become larger) during training. Can you check this?
BTW, for the Debug mode, did you set Both ml.net and LightGBM to debug mode? or only LightGBM?
I made update to #1 issue, I think I pasted wrong call stack.
For #1 issue, LGBM_DatasetPushRowsByCSR will be called multi-thread in C# by different test case, will this likely to cause the issue?
How to set num_threads of LightGBM? I don't find a place ml.net set this.
And yes, I set both ml.net and lightGBM to Debug mode. We only observing crash on Release build from our past CI builds.
Could you try to use the release mode for ml.net, but debug mode for LightGBM?
the error in LGBM_DatasetPushRowsByCSR seems related to the num_threads changed in traning as well.
I check the code of LGBM_DatasetPushRowsByCSR, I think it is not thread-safe, for one Dataset.
And most Dataset related APIs are not thread-safe.
Also refer to https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L956-L958 ,
which has a lock outside.
Num_threads parameter: https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L150
You can also check whether the error code is called by multi-threading outside or not.
Could you try to use the release mode for ml.net, but debug mode for LightGBM?
Yes, I tried, no repro.
the error in
LGBM_DatasetPushRowsByCSRseems related to thenum_threadschanged in traning as well.
num of threads is set at start of tests as 1 and not changed during test running.
@frank-dong-ms are these tests run by multi-threading? And can you ensure num_threads is 1 in all test cases? (the default is all cores, not 1)
If there are many tests run at the same time, and some of them use the different num_threads, the num_threads could be changed, and cause the error.
Yes, I tried, no repro
Do you mean that it can pass?
If there are many tests run at the same time, and some of them use the different
num_threads, the num_threads could be changed, and cause the error.
I tried to hard code num of threads to 1 in ml.net code and rerun tests, the #1 issue is also reproduced with slightly different call stack:

Do you mean that it can pass?
Yes, pass as long as I use debug version of lightgbm dll
This seems like various issue exists here, I feel like memory is corrupted during test running and cause issue when lightgbm try to access some internal data structure
@frank-dong-ms did you lock the PushRows in dataset like https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L956-L958 ?
And set num of threads to 1 eliminates other issues, right?
@frank-dong-ms did you lock the PushRows in dataset like https://github.com/dotnet/machinelearning/blob/d849ba4c7831586018d5b2cca6bdd2b1fa228ddf/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs#L956-L958 ?
Yes, I checked the current codebase seems all PushRows has lock.
And set num of threads to 1 eliminates other issues, right?
No, I tried to hard code num of threads to 1 in our source code and crash still exists.
Can I know which commit you are building lightgbm dll from (the one you give me) or how can I build lightgbm dll myself? I'm using application verifier to troubleshoot the crash and I need correct version of lightgbm source code to view the correct call stack
@frank-dong-ms refer to https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#with-gui
for the dll, you can choose DLL for Release, and Debug_DLL for debug.
@frank-dong-ms refer to https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#with-gui
Can I get dll from visual studio build? I can only see exe generated
use DLL or Debug_DLL config, and then build, the dll is in the corresponding folder.
@guolinke I have below finding, please see whether this is bug in lightgbm.
We are setting num of threads to 1 when running test, so share_state_->num_threads should be 1 in https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L414.
In this https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L424 the number of threads should be number of core on vm that running (in our case number of core is 2) so tid can be 0 or 1 from https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L430.
when tid equals 1, we are accessing index that is out of boundary at line https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L441.
I have tried to add below assertion at line 438 and this assertion fails during test run:
assert(tid < share_state_->num_threads);
Could you please take a look or I'm misunderstanding anything, thanks.
@frank-dong-ms I don't think this is a bug in LightGBM, as we never saw it before.
The num_threads is got by omp_get_num_threads(), it should be correct if there are no changes during tests.
https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/utils/openmp_wrapper.h#L22
did you use other modules that relies on open_mp too?
I am not sure what happens, but maybe the omp_num_threads is set by other modules.
Maybe you can try only run tests for LightGBM, without the effect of other modules.
The
num_threadsis got byomp_get_num_threads(), it should be correct if there are no changes during tests.
https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/utils/openmp_wrapper.h#L22
Yes, I can see num_threads is 1 and is correct, but how many threads will be use in below line? And is it possible we get 1 for tid (const int tid = omp_get_thread_num())? This is what I'm seeing during my test.
https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L424
did you use other modules that relies on open_mp too?
I am not sure what happens, but maybe theomp_num_threadsis set by other modules.
Yes, I believe OnnxRuntime native library that ML.NET is referencing also use open_mp during their training. I'm not familiar with open_mp lib use this lib inside OnnxRuntime will affect light gbm?
Maybe you can try only run tests for LightGBM, without the effect of other modules.
Yes, I will try as you suggested as well.
If omp number of threads is set, it will always use that number of threads globally.
If omp number of threads is set, it will always use that number of threads globally.
Yes, you are right, when this crash happens, omp_get_num_threads() return 2 and cause vector access index out of range, thanks
@guolinke it is possible LightGBM not to rely on number of threads to do indexing?
If this setting of OpenMP (number of threads) is global then it is likely this setting will be override in other native library and cause issue in LightGBM code.
Also CC @harishsk
not, for fasterst speed, we use many per-thread buffers in multi-threading.
I see, at lease some protection can be added like to https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L355, change to something like below:
does it make sense?
@frank-dong-ms good suggestion! I think it can work
@frank-dong-ms , sorry, I benchmark for that solution, it seems will be much slower.
I guess there are some overheads when specific num_threads.
So we cannot adopt that fix.
updated:
never mind, it seems to keeping num_threads with the same value in all loops solve the problem.
@guolinke Do you know when new version of lightgbm will be released with this fix?
I think it needs about one or two months. we have many breaking changes recently, needs more time to ensure there are no bugs.
Got it, thanks!