Lightgbm: [mmlspark] seeing memory corruption error on LGBM_BoosterFree with latest version

Created on 7 Apr 2020 · 13 comments · Source: microsoft/LightGBM

When trying to upgrade LightGBM to the latest version on master, I am seeing a random memory corruption error in LGBM_BoosterFree inside SerialTreeLearner; stack trace pasted below:

[LightGBM] [Info] Finished linking network in 0.006790 seconds

*** Error in `java': double free or corruption (!prev): 0x00007fb0ac9342a0 ***

======= Backtrace: =========

/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fb0c91467e5]

/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fb0c914f37a]

/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fb0c915353c]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM23DataParallelTreeLearnerINS_17SerialTreeLearnerEED2Ev+0x78)[0x7fb05c3915f8]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM23DataParallelTreeLearnerINS_17SerialTreeLearnerEED0Ev+0x9)[0x7fb05c391639]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM4GBDTD1Ev+0xa30)[0x7fb05c1c8c50]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM4GBDTD0Ev+0x9)[0x7fb05c1c9219]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(_ZN8LightGBM7BoosterD2Ev+0x5f9)[0x7fb05c4401c9]

/tmp/mml-natives484132555220552210/lib_lightgbm.so(LGBM_BoosterFree+0xe)[0x7fb05c42ec1e]


All 13 comments

@imatiach-msft is it possible to locate which function/code block causes the memory corruption?

I'm currently trying to locate the function/code block. Removing the calls to LGBM_BoosterFree, StringArrayHandle_free and LGBM_DatasetFree still results in segfault or SIGABRT errors in other places, and they vary from run to run. Using a single partition (instead of two) seems to fix all issues, so I suspect it's something that only happens in the distributed case.

Interestingly, out of all the tests that are run, I am only seeing this issue on a single test, which uses the flare dataset.

Tried to look at this again this evening. This time I modified the code to use LGBM_DatasetCreateFromCSR instead of the mmlspark-optimized method LGBM_DatasetCreateFromCSRSpark. I had my hopes high that this might fix it, since I had isolated the issue to training-only APIs (not scoring) and to what seems to be sparse-only code, which made me suspect the relatively new method added for sparse data. Unfortunately, I still ran into the issue, albeit with a different stack trace, in the SplitInner method:

# C  [lib_lightgbm.so+0x258491]  LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0x581

Specifically with the stack trace:

Stack: [0x00007f7592dee000,0x00007f7592eef000],  sp=0x00007f7592eebff0,  free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [lib_lightgbm.so+0x258491]  LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0x581
C  [lib_lightgbm.so+0x2b0e60]  LightGBM::DataParallelTreeLearner<LightGBM::SerialTreeLearner>::Split(LightGBM::Tree*, int, int*, int*)+0x20

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 9131  com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_BoosterUpdateOneIter(JJ)I (0 bytes) @ 0x00007f761dd41884 [0x00007f761dd41840+0x44]
J 9130 C1 com.microsoft.ml.lightgbm.lightgbmlib.LGBM_BoosterUpdateOneIter(Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_int;)I (12 bytes) @ 0x00007f761d47b044 [0x00007f761d47aea0+0x1a4]
j  com.microsoft.ml.spark.lightgbm.TrainUtils$.trainCore(Lcom/microsoft/ml/spark/lightgbm/TrainParams;Lscala/Option;Lorg/slf4j/Logger;Z)V+785
...

Interestingly, as I was looking through source and various commits, I stumbled on a recent issue that may be related to the memory corruption errors I am seeing:
https://github.com/microsoft/LightGBM/issues/2820

However, the fact that the memory corruption is random, and that the test sometimes passes when I run it, makes me wonder if it is a different issue. I've also seen the test get stuck.

maybe we can try to add #pragma omp parallel for schedule(static) num_threads(share_state_->num_threads) in the omp loop in parallel tree learners?

As our CI didn't catch such an error in the serial version, I think the bug is very likely in the parallel tree learner: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/data_parallel_tree_learner.cpp . Maybe some edge case causes an out-of-bounds array access.
@imatiach-msft could you help to add some assert/check (like check the index range) to locate the bugs?

"could you help to add some assert/check"
yes definitely
" try to add #pragma omp parallel for schedule(static) num_threads(share_state_->num_threads) in the omp loop"
added in both omp loops:
[screenshot: both omp loops updated with the num_threads clause]
unfortunately it did not seem to fix the issue yet

@guolinke I think I've made some progress on this, the issue seems to occur much earlier, as early as in the memset calls here:
https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L1241
https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L1342
I think it is related to constructing the histogram for sparse data in the parallel case.
Although the error can change from run to run, that is the earliest I've seen it occur so far.

@imatiach-msft, yeah, when std::memset is called, the memory state gets checked, and an exception is thrown if memory errors are found.

BTW, maybe using a DLL built with debug flags would make it easier to locate the problem?

@guolinke I did add a lot of debug in this branch to track down the error:
https://github.com/imatiach-msft/LightGBM/tree/ilmat/debug-free-err
I am currently building with USE_DEBUG flag:

mkdir build ; cd build
cmake -DUSE_SWIG=ON -DUSE_DEBUG=ON ..
make -j4

I tried to modify the CMakeLists.txt by adding:
-D_GLIBCXX_ASSERTIONS
and I also tried:
add_compile_definitions(_GLIBCXX_ASSERTIONS)
but it doesn't seem to be working or it's just not catching the errors.
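For reference, `_GLIBCXX_ASSERTIONS` is a preprocessor define for libstdc++, so it only takes effect if it reaches the compiler command line as `-D_GLIBCXX_ASSERTIONS`. One way to pass it without editing CMakeLists.txt (a sketch, assuming GCC/libstdc++ and the build commands above):

```shell
mkdir build ; cd build
# Pass the libstdc++ hardening define through the CXX flags; it turns
# unchecked accesses like vector::operator[] into checked ones.
cmake -DUSE_SWIG=ON -DUSE_DEBUG=ON \
      -DCMAKE_CXX_FLAGS="-D_GLIBCXX_ASSERTIONS" ..
make -j4
```

Note it only catches out-of-range container accesses, not writes through raw pointers such as memcpy into a too-small buffer, which may be why it appeared not to be working.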
I tried to run valgrind on sbt (scala build/test tool) but I couldn't enter the test command due to the output from valgrind, will try to refactor the test code to a java file and run valgrind on that.
I've attached 3 sample runs in case it helps - note that the errors vary greatly from run to run and are almost never the same. The earliest I've seen them occur is in the histogram code for sparse data, so I assume the corruption must originate somewhere around there, even before any training, but I'm not sure.

LightGBMRun1.txt
LightGBMRun2.txt
LightGBMRun3.txt

@imatiach-msft
I found the out-of-range error in your log:

[LightGBM] [Warning] vector::_M_range_check: __n (which is 1) >= this->size() (which is 0)
[LightGBM] [Warning] vector::_M_range_check: __n (which is 2) >= this->size() (which is 0)
[LightGBM] [Warning] vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 1) >= this->size() (which is 0)

where did this happen?

@guolinke yes, that is the first time I've seen that specific error; the errors change every time I run the tests. I suspect this is caused by memory corruption further upstream. Unfortunately I don't have the stack for that error anymore, but I can send one when/if I see a similar error again.

@guolinke I have some great news! I was able to resolve this issue, but only after pivoting in my debugging strategy several times.

I had a lot of trouble narrowing down the cause of the issue - I added a lot of std::cout debug but it did not help, partly because the errors that appeared were always very random. After struggling with getting valgrind to work with Java (and spark), I gave up on this approach. I then struggled to use address sanitizer with mmlspark to narrow down the problem, but still could not get it to work.

I came up with a new strategy to reproduce the issue in native-only mode, without using mmlspark at all. After a lot of parameter tweaking, exporting data from the test in mmlspark and intense debugging, I was able to reproduce the issue! Not only that, I was able to run valgrind (without the issues I had with running it against Java), and I was able to see the following invalid write:

IM%% input_buffer_ size before SyncUpGlobalBestSplit: 96
%%IM SplitInfo size: 222
==4422== Invalid write of size 4
==4422==    at 0x63B1E2: memcpy (string3.h:53)
==4422==    by 0x63B1E2: CopyTo (split_info.hpp:56)
==4422==    by 0x63B1E2: SyncUpGlobalBestSplit (parallel_tree_learner.h:195)
==4422==    by 0x63B1E2: LightGBM::DataParallelTreeLearner<LightGBM::SerialTreeLearner>::FindBestSplitsFromHistograms(std::vector<signed char, std::allocator<signed char> > const&, bool) (data_parallel_tree_learner.cpp:242)
==4422==    by 0x631AD4: LightGBM::DataParallelTreeLearner<LightGBM::SerialTreeLearner>::FindBestSplits() (data_parallel_tree_learner.cpp:175)
==4422==    by 0x6543A2: LightGBM::SerialTreeLearner::Train(float const*, float const*) (serial_tree_learner.cpp:184)
==4422==    by 0x493247: LightGBM::GBDT::TrainOneIter(float const*, float const*) (gbdt.cpp:373)
==4422==    by 0x48D6CF: LightGBM::GBDT::Train(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (gbdt.cpp:252)
==4422==    by 0x45C572: LightGBM::Application::Train() (application.cpp:203)
==4422==    by 0x45B9B0: Run (application.h:86)
==4422==    by 0x45B9B0: main (main.cpp:15)

I found out that input_buffer_ was too small before calling SyncUpGlobalBestSplit, which caused the memory overwrite. Inserting the following code prior to calling SyncUpGlobalBestSplit fixed the memory issues for me:

  // Grow input_buffer_ so it can hold two serialized SplitInfo records
  // (the smaller-leaf and larger-leaf best splits) before syncing.
  auto max_cat_threshold = this->config_->max_cat_threshold;
  int splitInfoSize = SplitInfo::Size(max_cat_threshold);
  if (input_buffer_.size() < static_cast<size_t>(splitInfoSize) * 2) {
    input_buffer_.resize(splitInfoSize * 2);
  }
  SyncUpGlobalBestSplit(input_buffer_.data(), input_buffer_.data(),
                        &smaller_best_split, &larger_best_split,
                        this->config_->max_cat_threshold);

I validated that this fixes both the native example and, after exporting the code to mmlspark, the mmlspark test. It is very satisfying to finally resolve this memory corruption issue!

You can find the branch here:
https://github.com/imatiach-msft/LightGBM/commits/ilmat/debug-free-err

The native LightGBM example that reproduced the issue is here:
https://github.com/imatiach-msft/LightGBM/tree/ilmat/debug-free-err/examples/memory_debug

To reproduce, I ran the following commands on two terminals within the memory_debug folder:
valgrind ./lightgbm config=train.conf1
valgrind ./lightgbm config=train.conf2

Make sure to remove the fix I added (mentioned above) and recompile in order to reproduce the error.

Closed via #3110.

Thank you very much, @imatiach-msft!
