With the release of XGBoost 1.0.x (e.g. xgboost-1.0.1-py3-none-manylinux1_x86_64.whl), installing TVM from scratch (rebuilding the Docker containers) causes tests/python/unittest/test_autotvm_xgboost_model.py to fail with a segfault.
Investigating a bit further: if I manually revert to xgboost-0.90, it works fine. With xgboost-1.0.1, this is the message I see:
tests/python/unittest/test_autotvm_xgboost_model.py::test_fit Fatal Python error: Segmentation fault
Thread 0x00007f4f98de4700 (most recent call first):
File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
File "/usr/lib/python3.6/multiprocessing/pool.py", line 463 in _handle_results
File "/usr/lib/python3.6/threading.py", line 864 in run
File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
Thread 0x00007f4f905e3700 (most recent call first):
File "/usr/lib/python3.6/threading.py", line 295 in wait
File "/usr/lib/python3.6/queue.py", line 164 in get
File "/usr/lib/python3.6/multiprocessing/pool.py", line 415 in _handle_tasks
File "/usr/lib/python3.6/threading.py", line 864 in run
File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
Thread 0x00007f4f8fde2700 (most recent call first):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 406 in _handle_workers
File "/usr/lib/python3.6/threading.py", line 864 in run
File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
Current thread 0x00007f4fb514c700 (most recent call first):
File "/usr/local/lib/python3.6/dist-packages/xgboost/core.py", line 1248 in update
File "/usr/local/lib/python3.6/dist-packages/xgboost/training.py", line 74 in _train_internal
File "/usr/local/lib/python3.6/dist-packages/xgboost/training.py", line 209 in train
File "/workspace/python/tvm/autotvm/tuner/xgboost_cost_model.py", line 272 in fit_log
File "/workspace/tests/python/unittest/test_autotvm_xgboost_model.py", line 35 in test_fit
File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 167 in pytest_pyfunc_call
File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 1445 in runtest
File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 134 in pytest_runtest_call
File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in <lambda>
File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 237 in from_call
File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in call_runtest_hook
File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 185 in call_and_report
File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 99 in runtestprotocol
File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 84 in pytest_runtest_protocol
File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 271 in pytest_runtestloop
File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 247 in _main
File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 197 in wrap_session
File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 240 in pytest_cmdline_main
File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
File "/usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py", line 93 in main
File "/usr/local/lib/python3.6/dist-packages/pytest/__main__.py", line 7 in <module>
File "/usr/lib/python3.6/runpy.py", line 85 in _run_code
File "/usr/lib/python3.6/runpy.py", line 193 in _run_module_as_main
./tests/scripts/task_python_unittest.sh: line 27: 24582 Segmentation fault (core dumped) TVM_FFI=ctypes python3 -m pytest -v tests/python/unittest
@tqchen, I didn't see any PR or discussion about this. Are you aware of any ongoing initiative to move TVM to XGBoost 1.0.x, or shall we pin xgboost to 0.90 to prevent the error? (Note: I'm happy to send a patch to pin the version.)
@leandron can you create a minimal reproducible example, e.g. by pickling the data that causes the segfault in XGBoost? Then we can bring it to the attention of the XGBoost dev community. In the meantime, we can pin xgboost to 0.90.
cc @merrymercy @hcho3, who might be interested in this issue
It would be nice if we can get a reproducible example. We are currently working on the patch release 1.0.2 and I want to get a patch to fix this issue.
@leandron @hcho3 please follow up :)
@leandron can you please comment on the current state?
Hi, I only managed to investigate this further today. XGBoost is now at version 1.0.2, and I can still reproduce the issue.
To give some context, this is the function call that triggers the issue:
https://github.com/apache/incubator-tvm/blob/54975a3fd24fa45b815be39075f4614e53009444/python/tvm/autotvm/tuner/xgboost_cost_model.py#L262-L272
I tried just creating pickle files of some inputs (self.xgb_params and dtrain) and simplifying the function call, but this is not enough to reproduce the issue. The issue seems to be in the context custom_callback, below:
Now, something that could help me narrow down where the problem is would be running XGBoost in debug mode. @hcho3 what is the simplest way to do that?
@leandron Thanks for pointing out which part of the TVM test is failing. I'm not sure running in debug mode would help, since XGBoost is crashing with a segfault here. I will take a look sometime this week.
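For context, the "Fatal Python error: Segmentation fault" dump in the first report has the format produced by Python's standard `faulthandler` module. It lists Python frames only, so it shows that the crash happens under `xgboost/core.py` `update` but not which native function faulted. A minimal sketch of enabling it outside pytest, assuming a hypothetical log file name:

```python
import faulthandler
import os
import tempfile


def enable_crash_trace(log_path):
    """Ask CPython to dump Python stacks to log_path if a fatal signal arrives."""
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The file must stay open for as long as the handler is installed.
    return log_file


# Hypothetical location for the trace; any writable path works.
trace_path = os.path.join(tempfile.mkdtemp(), "crash_trace.log")
log_file = enable_crash_trace(trace_path)

# Write the current stack once, to show the format without crashing.
faulthandler.dump_traceback(file=log_file, all_threads=True)
log_file.flush()
```

Native frames still need a debugger such as gdb.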
I compiled TVM from source and tried running the test tests/python/unittest/test_autotvm_xgboost_model.py::test_fit and I cannot reproduce the issue. Do I need a specific Docker container to reproduce the problem?
I also tried building ci-cpu Docker image from scratch and running the unit test inside the container. The test runs without crashing.
We applied a workaround, pinning the xgboost version to 0.90. Which XGBoost version do you see in the image you created from scratch?
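A pin like this can also be enforced at test time. The sketch below is an assumption, not TVM code: the helper name `is_affected` is hypothetical, and the "affected" range is only what has been reported in this thread so far (the 1.0.1 and 1.0.2 pip wheels crash, 0.90 does not).

```python
import re


def is_affected(version: str) -> bool:
    """Return True for the XGBoost 1.0.x series reported to crash in this thread."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (1, 0)


for candidate in ("0.90", "1.0.1", "1.0.2"):
    status = "affected" if is_affected(candidate) else "not reported as affected"
    print(candidate, status)
```

On Python 3.8+, the installed version string can be obtained with `importlib.metadata.version("xgboost")` and passed to the helper.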
@leandron I checked out commit 8502691b5b7ca152da9eb626529070db53d479c8 so that XGBoost 1.0.x is used.
@leandron can you also provide a bit more detail?
E.g. does running tests/python/unittest/test_autotvm_xgboost_model.py directly fail, or do we need to run the entire unit test suite? It would also be nice if you could share a CI binary hash tag (perhaps on Docker Hub) to confirm the issue.
I tried to build a Docker image with xgboost==1.0.2 and cannot seem to reproduce the issue.
I see this reliably with a virtualenv on bionic on AWS.
environment:
ami: 099720109477/ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408
using tvm revision: 72f2aea2dd219bf55c15b3cf4cfc21491f1f60dd
command: TVM_FFI=ctypes python3 -m pytest -s -v tests/python/unittest -k 'test_autotvm_xgboost_model'
python version
```
$ python --version
Python 3.7.5
installed python packages:
antlr4-python3-runtime==4.8
Cython==0.29.16
decorator==4.4.2
psutil==5.7.0
pylint==2.4.4
backtrace:
~/ws/tvm$ gdb python
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) run tests/python/unittest/test_autotvm_xgboost_model.py
Starting program: /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/bin/python tests/python/unittest/test_autotvm_xgboost_model.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4104700 (LWP 11905)]
[New Thread 0x7ffff3903700 (LWP 11906)]
[New Thread 0x7fffef102700 (LWP 11907)]
[New Thread 0x7fffec901700 (LWP 11908)]
[New Thread 0x7fffea100700 (LWP 11909)]
[New Thread 0x7fffe78ff700 (LWP 11910)]
[New Thread 0x7fffe50fe700 (LWP 11911)]
[New Thread 0x7fffe28fd700 (LWP 11912)]
[New Thread 0x7fffe00fc700 (LWP 11913)]
[New Thread 0x7fffdd8fb700 (LWP 11914)]
[New Thread 0x7fffdb0fa700 (LWP 11915)]
[New Thread 0x7fffd88f9700 (LWP 11916)]
[New Thread 0x7fffd60f8700 (LWP 11917)]
[New Thread 0x7fffd38f7700 (LWP 11918)]
[New Thread 0x7fffd10f6700 (LWP 11919)]
[Thread 0x7fffe50fe700 (LWP 11911) exited]
[Thread 0x7fffd88f9700 (LWP 11916) exited]
[Thread 0x7fffd10f6700 (LWP 11919) exited]
[Thread 0x7fffd60f8700 (LWP 11917) exited]
[Thread 0x7fffdb0fa700 (LWP 11915) exited]
[Thread 0x7fffdd8fb700 (LWP 11914) exited]
[Thread 0x7fffe00fc700 (LWP 11913) exited]
[Thread 0x7fffe28fd700 (LWP 11912) exited]
[Thread 0x7fffe78ff700 (LWP 11910) exited]
[Thread 0x7fffea100700 (LWP 11909) exited]
[Thread 0x7fffec901700 (LWP 11908) exited]
[Thread 0x7fffef102700 (LWP 11907) exited]
[Thread 0x7ffff3903700 (LWP 11906) exited]
[Thread 0x7ffff4104700 (LWP 11905) exited]
[Thread 0x7fffd38f7700 (LWP 11918) exited]
[New Thread 0x7fffd10f6700 (LWP 11936)]
[New Thread 0x7fffd38f7700 (LWP 11937)]
[New Thread 0x7fffd60f8700 (LWP 11938)]
[Thread 0x7fffd10f6700 (LWP 11936) exited]
[Thread 0x7fffd60f8700 (LWP 11938) exited]
[Thread 0x7fffd38f7700 (LWP 11937) exited]
[New Thread 0x7fffd38f7700 (LWP 11955)]
[New Thread 0x7fffd60f8700 (LWP 11956)]
[New Thread 0x7fffd10f6700 (LWP 11957)]
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffb5ba9e37 in std::vector
from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
(gdb) bt
from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
rtld_fini=<optimized out>, stack_end=0x7fffffffdfb8) at ../csu/libc-start.c:310
(gdb)
pytest log:
~/ws/tvm$ TVM_FFI=ctypes python3 -m pytest -s -v tests/python/unittest -k 'test_autotvm_xgboost_model'
================================================================ test session starts =================================================================
platform linux -- Python 3.7.5, pytest-5.4.1, py-1.8.1, pluggy-0.13.1 -- /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/bin/python3
cachedir: .pytest_cache
rootdir: /home/ubuntu/ws/tvm
collecting 88 items Testing using contexts: [cpu(0)]
collected 562 items / 560 deselected / 2 selected
tests/python/unittest/test_autotvm_xgboost_model.py::test_fit Fatal Python error: Segmentation fault
Thread 0x00007f6c59dd9700 (most recent call first):
File "/usr/lib/python3.7/multiprocessing/connection.py", line 379 in _recv
File "/usr/lib/python3.7/multiprocessing/connection.py", line 407 in _recv_bytes
File "/usr/lib/python3.7/multiprocessing/connection.py", line 250 in recv
File "/usr/lib/python3.7/multiprocessing/pool.py", line 470 in _handle_results
File "/usr/lib/python3.7/threading.py", line 870 in run
File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f6c5cddb700 (most recent call first):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 422 in _handle_tasks
File "/usr/lib/python3.7/threading.py", line 870 in run
File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f6c5c5da700 (most recent call first):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 413 in _handle_workers
File "/usr/lib/python3.7/threading.py", line 870 in run
File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap
Current thread 0x00007f6c7f460740 (most recent call first):
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/core.py", line 1249 in update
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/training.py", line 74 in _train_internal
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/training.py", line 209 in train
File "/home/ubuntu/ws/tvm/python/tvm/autotvm/tuner/xgboost_cost_model.py", line 272 in fit_log
File "/home/ubuntu/ws/tvm/tests/python/unittest/test_autotvm_xgboost_model.py", line 35 in test_fit
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/python.py", line 1479 in runtest
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 135 in pytest_runtest_call
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 217 in <lambda>
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 244 in from_call
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 217 in call_runtest_hook
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 186 in call_and_report
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 100 in runtestprotocol
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/main.py", line 247 in _main
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/main.py", line 191 in wrap_session
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/config/__init__.py", line 125 in main
File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pytest/__main__.py", line 7 in <module>
File "/usr/lib/python3.7/runpy.py", line 85 in _run_code
File "/usr/lib/python3.7/runpy.py", line 193 in _run_module_as_main
Segmentation fault (core dumped)
```
Let me try again with TVM_FFI=ctypes environment variable set. What does this do?
The trace might offer some insights. @hcho3, could it be caused by ConfigureGpuId? Also cc @trivialfis, since it seems to relate to https://github.com/dmlc/xgboost/pull/4961.
0x00007fffb5ba9e37 in std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > xgboost::XGBoostParameter<xgboost::GenericParameter>::UpdateAllowUnknown<std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > >(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, bool*) ()
from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
(gdb) bt
#0 0x00007fffb5ba9e37 in std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > xgboost::XGBoostParameter<xgboost::GenericParameter>::UpdateAllowUnknown<std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > >(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, bool*) ()
from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
#1 0x00007fffb5b970b7 in xgboost::GenericParameter::ConfigureGpuId(bool) ()
Is it possible that somehow XGBoost linked a wrong dmlc static library?
i don't know unless we go and dig deeper, but if the bug is reproducible, then it should not be hard to find the cause
tried to reproduce with xgboost built from source (at HEAD/e4f5b6c8 and v1.0.2), no luck (test_autotvm_xgboost_model passes). if I reinstall the pip package (1.0.2), I can get it to reproduce again.
I built xgboost with this config:
cmake -GNinja .. -DUSE_CUDA=ON -DUSE_NCCL=ON -DOPEN_MP:BOOL=ON
any other suggestions to get it to build or install like the pypi package? might it be related to building the package on centos?
@areusch Are you still running bionic when installing from pip?
yes
There is something I wanted to point out, which is an insight after @areusch's comment (thanks for that!).
The VM I'm running this test on does not have a GPU. However, the same test used to pass on this very same machine with xgboost<1. Is that the case for you too, @areusch?
@hcho3 do you think this could be caused by a change in behaviour in xgboost>=1.0?
no GPU on my instance (it is c5.4xlarge)
I can reproduce it on bionic. Here is what I have found so far:
@hcho3 It would be of great help if I could obtain a debug build or a RelWithDebInfo build.
I am still unable to reproduce it on my Bionic machine. @areusch Can you share the content of your config.cmake?
@trivialfis I'll try to build a wheel using CentOS Docker image.
@hcho3 You have to install the binary package on pip to reproduce it. Building from source works fine.
@hcho3 Here is config.cmake from tvm.
@trivialfis I rebuilt the 1.0.2 wheel with debug symbol enabled: https://drive.google.com/file/d/1cELaBb_rnmb9y8irSwEBQ5AIvaDQgDcF/view?usp=sharing. Hope it helps.
@areusch Since you are using AWS, can you make AMI from your EC2 instance and share it with me?
@hcho3 I found the cause:
The dmlc::Error comes from libtvm.so instead of libxgboost.so, so it cannot be caught by XGBoost. I believe there is some mix-up in the built binary.
#0 0x00007fffc821853f in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1 0x00007fffc821a098 in _Unwind_Backtrace () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2 0x00007ffff7b15168 in __GI___backtrace (array=<optimized out>, size=<optimized out>) at ../sysdeps/x86_64/backtrace.c:111
#3 0x00007fffc8b324fd in dmlc::StackTrace[abi:cxx11](unsigned long, unsigned long) () from /home/fis/Workspace/XGBoost/incubator-tvm/build/libtvm.so
#4 0x00007fffc8b32dbc in dmlc::LogMessageFatal::~LogMessageFatal() () from /home/fis/Workspace/XGBoost/incubator-tvm/build/libtvm.so
#5 0x00007fffb971b4c5 in dh::ThrowOnCudaError (code=cudaErrorInsufficientDriver, file=0x7fffb9a82278 "/workspace/src/common/common.cu", line=14) at /workspace/src/c_api/../data/../common/common.h:41
#6 0x00007fffb972ac75 in xgboost::common::AllVisibleGPUs () at /workspace/src/common/common.cu:14
But I don't understand how it happens.
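The observation above can be checked mechanically by grouping backtrace frames by the shared object gdb attributes them to. This is a minimal sketch, not part of either project; the helper name `frames_with_library` is hypothetical, and the sample text is a subset of the frames quoted above. It only handles frames that gdb prints on a single line ending in `from <library>`.

```python
import os
import re

# Matches frames of the form: "#3 0x... in <function text> from /path/to/lib.so"
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+ in (.+) from (\S+)\s*$")


def frames_with_library(backtrace):
    """Yield (frame number, function text, library file name) for frames naming a library."""
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            number, function, library = match.groups()
            yield int(number), function.strip(), os.path.basename(library)


BACKTRACE = """\
#2 0x00007ffff7b15168 in __GI___backtrace (array=<optimized out>, size=<optimized out>) at ../sysdeps/x86_64/backtrace.c:111
#3 0x00007fffc8b324fd in dmlc::StackTrace[abi:cxx11](unsigned long, unsigned long) () from /home/fis/Workspace/XGBoost/incubator-tvm/build/libtvm.so
#4 0x00007fffc8b32dbc in dmlc::LogMessageFatal::~LogMessageFatal() () from /home/fis/Workspace/XGBoost/incubator-tvm/build/libtvm.so
#5 0x00007fffb971b4c5 in dh::ThrowOnCudaError (code=cudaErrorInsufficientDriver, file=0x7fffb9a82278 "/workspace/src/common/common.cu", line=14) at /workspace/src/c_api/../data/../common/common.h:41
"""

# Which libraries provide the dmlc:: frames reached from XGBoost code?
dmlc_libraries = {
    library
    for _, function, library in frames_with_library(BACKTRACE)
    if function.startswith("dmlc::")
}
print(dmlc_libraries)  # {'libtvm.so'}
```

Both dmlc frames resolve to libtvm.so even though the caller, dh::ThrowOnCudaError, is XGBoost code.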
Somehow the dmlc::LogMessageFatal constructor is a PLT function in XGBoost (LogMessageFatal@plt), but the one in TVM is not, and the call made from XGBoost got resolved to the one in TVM.
Thread 1 "python" hit Breakpoint 1, 0x00007fffb952ac60 in dmlc::LogMessageFatal::LogMessageFatal(char const*, int)@plt ()
from /home/fis/Workspace/XGBoost/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so
(gdb) bt
#0 0x00007fffb952ac60 in dmlc::LogMessageFatal::LogMessageFatal(char const*, int)@plt ()
from /home/fis/Workspace/XGBoost/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so
#1 0x00007fffb971b49c in dh::ThrowOnCudaError (code=cudaErrorInsufficientDriver, file=0x7fffb9a82278 "/workspace/src/common/common.cu", line=14)
at /workspace/src/c_api/../data/../common/common.h:41
#2 0x00007fffb972ac75 in xgboost::common::AllVisibleGPUs () at /workspace/src/common/common.cu:14
#3 0x00007fffb961496e in xgboost::gbm::GBTree::Configure (this=0xc41b70, cfg=std::vector of length 12, capacity 12 = {...})
at /workspace/src/gbm/gbtree.cc:54
#4 0x00007fffb9643985 in xgboost::LearnerImpl::ConfigureGBM (this=0x1397db0, old=..., args=std::vector of length 12, capacity 12 = {...})
at /workspace/src/learner.cc:925
#5 0x00007fffb963aeaf in xgboost::LearnerImpl::Configure (this=0x1397db0) at /workspace/src/learner.cc:252
#6 0x00007fffb9641649 in xgboost::LearnerImpl::UpdateOneIter (this=0x1397db0, iter=0, train=0x12282e0) at /workspace/src/learner.cc:722
#7 0x00007fffb9561414 in XGBoosterUpdateOneIter (handle=0x1397db0, iter=0, dtrain=0x13856e0) at /workspace/src/c_api/c_api.cc:501
#8 0x00007ffff5092dae in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#9 0x00007ffff509271f in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#10 0x00007ffff52a65c4 in _ctypes_callproc () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
#11 0x00007ffff52a6c33 in ?? () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
Looks promising!
Naive question: based on what you see here, is it possible to correlate with the original stack trace/error described in the first report? https://github.com/apache/incubator-tvm/issues/4953#issue-572158142
@leandron I believe it's the root cause of this issue. I just don't know how to fix it. Maybe it's a bug in the system linker on 18.04 or in the compiler used to build the pip package? Or a convention in the hairy C++ ABI that I'm not aware of? Or a wrong CMake flag that somehow makes the function go through the PLT when it's not supposed to?
@hcho3
Could you please create another build with position independent code of dmlc-core disabled?
set_target_properties(dmlc PROPERTIES
CXX_STANDARD 11
CXX_STANDARD_REQUIRED ON
POSITION_INDEPENDENT_CODE OFF) # ON -> OFF
list(APPEND LINKED_LIBRARIES_PRIVATE dmlc)
I wonder if that was due to inconsistency between dmlc-core of tvm and xgb. https://github.com/apache/incubator-tvm/pull/5401 updated the logging to latest, please check again.
@trivialfis Here it is: https://drive.google.com/file/d/13WZRRaUPKil4rwH2avgUix_xO5LpIYs5/view?usp=sharing
@hcho3 do you still need an AMI from me? I think you can repro by using the AMI I mentioned earlier
ami: 099720109477/ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408
it should be enough to just try and run the tvm test using a pip installed xgboost. I can build you another if it would help.
@hcho3 One last request, otherwise I'm running out of ideas.
Patch both xgboost and rabit's C API macro:
For xgboost:
diff --git a/include/xgboost/c_api.h b/include/xgboost/c_api.h
index f9c0a0ff..baaaeb43 100644
--- a/include/xgboost/c_api.h
+++ b/include/xgboost/c_api.h
@@ -20,7 +20,7 @@
#if defined(_MSC_VER) || defined(_WIN32)
#define XGB_DLL XGB_EXTERN_C __declspec(dllexport)
#else
-#define XGB_DLL XGB_EXTERN_C
+#define XGB_DLL XGB_EXTERN_C __attribute__ ((visibility ("default")))
#endif // defined(_MSC_VER) || defined(_WIN32)
// manually define unsigned long
For rabit:
diff --git a/include/rabit/c_api.h b/include/rabit/c_api.h
index 0a96ef7..47c5735 100644
--- a/include/rabit/c_api.h
+++ b/include/rabit/c_api.h
@@ -18,7 +18,7 @@
#if defined(_MSC_VER) || defined(_WIN32)
#define RABIT_DLL RABIT_EXTERN_C __declspec(dllexport)
#else
-#define RABIT_DLL RABIT_EXTERN_C
+#define RABIT_DLL RABIT_EXTERN_C __attribute__ ((visibility ("default")))
#endif // defined(_MSC_VER) || defined(_WIN32)
/*! \brief rabit unsigned long type */
Build XGBoost with the following flags appended:
-DCMAKE_CXX_FLAGS='-fvisibility=hidden' -DCMAKE_C_FLAGS='-fvisibility=hidden'
@trivialfis Did you update TVM to latest?
@hcho3 I tried master branch and the commit before:
I wonder if that was due to inconsistency between dmlc-core of tvm and xgb. #5401 updated the logging to latest, please check again.
@trivialfis And you ran git submodule update --init --recursive?
@hcho3 Yes. Currently detached at the commit before above linked PR.
fis@fis-Standard-PC-Q35-ICH9-2009:~/Workspace/XGBoost/incubator-tvm$ git status
HEAD detached at 56941fb9d
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gdb_history
nothing added to commit but untracked files present (use "git add" to track)
@areusch Can you try out the latest TVM master on your end? I'm still having trouble reproducing the original issue.
Finally, I reproduced it. Yes! Note: I used the latest TVM as of today. The crash still occurred.
@hcho3 Could you try applying the patches I posted above and use the corresponding cmake flags?
@trivialfis I applied your patch and changed CMake flags. And now the unit test does not crash any more.
You should try it too. Get the wheel at https://xgboost-wheels.s3-us-west-2.amazonaws.com/xgboost-1.0.2-py3-none-manylinux1_x86_64.whl.
@hcho3 Yup. It works fine on my machine too
@hcho3 tested with your new wheel on my aws instance and the test now passes!
@trivialfis Can you elaborate on what your patch does? Does it hide certain symbols?
It hides all the symbols except for the C API. So if anyone's using the C++ headers, it might generate lots of linker errors.
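One way to check that a rebuilt library exports only the C API is to list its dynamic symbols with `nm -D --defined-only libxgboost.so` (without `-C`, so names stay mangled and contain no spaces) and look for names outside the C API. The sketch below makes several assumptions: the helper name is hypothetical, the `XG` and `Rabit` prefixes are a guess at how the C entry points are named, and the sample text is a made-up illustration with invented addresses, not real output from any wheel.

```python
def non_c_api_exports(nm_output, allowed_prefixes=("XG", "Rabit")):
    """Return exported symbol names that do not look like C API entry points."""
    leaked = []
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # blank lines and undefined symbols (no address column)
        _address, _kind, name = fields
        if not name.startswith(allowed_prefixes):
            leaked.append(name)
    return leaked


# Made-up sample in the layout of `nm -D` output: address, type, name.
SAMPLE_NM_OUTPUT = """\
0000000000001000 T XGBoosterUpdateOneIter
0000000000002000 T RabitInit
0000000000003000 W _ZN4dmlc15LogMessageFatalC1EPKci
                 U malloc
"""

print(non_c_api_exports(SAMPLE_NM_OUTPUT))
```

In the sample, the mangled dmlc::LogMessageFatal constructor is the only name reported, which is the kind of exported C++ symbol the `-fvisibility=hidden` build is meant to remove.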
@trivialfis I think we can hide all C++ symbols when building Python wheels. I don't think anyone using C++ headers would use the Pip wheel. WDYT?
@hcho3 Yup. Good idea. We can make it a CMake option:
@trivialfis Let me file a pull request. We'll include the fix as part of the upcoming 1.1.0 release.
We need to be careful about this. Rabit and dmlc-core are built independently; I'm not sure what will happen if they throw an error, as hiding symbols means exceptions cannot be propagated out.
Not entirely sure in the context of static linking.
Got it. How about compiling the wheel on the latest Ubuntu (not CentOS) and putting it in an S3 bucket? The TVM CI can pull from this bucket instead of PyPI.
Current build environment for the Pip wheel is quite old: CentOS 6 + devtoolset-4 (GCC 5.x). When we drop CUDA 9.0 support, we can upgrade the build environment to CentOS 6 + devtoolset-6 (GCC 7.x). The upgrade may fix the issue.
Before upgrading the libc dependency, we can try adding a test that forces rabit to throw an error and see whether it crashes XGBoost with a segfault. An uneven allreduce would do.
Or a test with dmlc-core; a file-does-not-exist error seems simple.
Just make sure the error is not thrown in a header.
I believe if it works it will be a net gain for XGBoost; hiding symbols is good practice for shared libraries.
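A test of this kind has to tell "the error surfaced as a Python exception" apart from "the process was killed by a signal", without taking the test runner down. A minimal POSIX-only sketch, assuming the risky call is run in a child interpreter; the helper name `run_isolated` is hypothetical, and the snippets passed to it here are stand-ins, not real XGBoost or rabit calls:

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a child interpreter and classify how it ended."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        # On POSIX, a negative return code is the number of the fatal signal.
        return "killed by " + signal.Signals(-result.returncode).name
    return "python exception or non-zero exit"


# Stand-in for an error that is caught and reported as an exception.
print(run_isolated("raise RuntimeError('reported as an exception')"))
# Stand-in for a crash: the child sends SIGSEGV to itself.
print(run_isolated("import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"))
```

A real test would replace the stand-in snippets with a call that makes dmlc-core or rabit raise, and assert that the result is an exception rather than a signal.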
With https://github.com/dmlc/xgboost/pull/5590, I can now run tests/python/unittest/test_autotvm_xgboost_model.py::test_fit without crashing.
@leandron @areusch @tqchen I've put up RC1 for the upcoming XGBoost 1.1.0 release. Feel free to try it:
python3 -m pip install xgboost==1.1.0rc1
The unit test should not crash.
Thanks @hcho3 @areusch @leandron @trivialfis for resolving this problem.