Hello, I've run into an error trying to get XGBoost GPU working on WSL2. My usage of XGBoost GPU works on a Linux install with CUDA, and I'm fairly confident this is a CUDA driver issue. Someone recommended that I open this issue here so it's kept track of.
Environment:
Windows Build 20161
Ubuntu Windows Subsystem Linux 2 - Linux Version 4.19.121-microsoft-standard
NVIDIA Driver 455.41 / CUDA 11.0
GPU : GTX 1070
Running Python XGBoost (XGBRegressor) with tree_method=gpu_hist and gpu_id=0 causes the following error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): device free failed: unknown error
Aborted
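For anyone trying to reproduce this, a minimal sketch of the kind of call described above might look like the following. The random data, sizes, and n_estimators value are illustrative assumptions and not taken from the original report; only tree_method=gpu_hist and gpu_id=0 come from it. These parameter names match XGBoost releases from around the time of this thread and may be deprecated in newer versions.

```python
def gpu_params(gpu_id=0):
    """Parameters from the report: GPU histogram tree method on one device."""
    # n_estimators is an assumed small value to keep the run short.
    return {"tree_method": "gpu_hist", "gpu_id": gpu_id, "n_estimators": 10}


def run_repro(n_rows=1000, n_cols=10, seed=0):
    """Fit XGBRegressor on random data (hypothetical stand-in for real data)."""
    # Imported lazily so gpu_params() can be used without these packages.
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(seed)
    X = rng.random((n_rows, n_cols))
    y = rng.random(n_rows)

    model = xgb.XGBRegressor(**gpu_params())
    model.fit(X, y)  # on the affected setup, the process aborts around here
    return model.predict(X[:5])
```

Calling run_repro() on a machine with a CUDA-enabled XGBoost build should either return five predictions or reproduce the abort.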
Does WSL2 support GPU already? Microsoft has not released such a version; GPU support in WSL2 still needs development on Microsoft's side.
I am out of time
It would be great if we can test XGBoost in WSL2 + CUDA and see whether various functionalities will work.
Note. We cannot use multi-GPU training because NCCL does not support WSL2 yet.
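As a rough illustration of how a test script could respect this limitation, the sketch below guesses whether it is running under WSL from the kernel release string and falls back to a single GPU. The helper names is_wsl and choose_gpu_ids are hypothetical, and the "microsoft" substring check is an assumption based on the kernel strings quoted in this thread; it does not distinguish WSL 1 from WSL 2.

```python
import platform


def is_wsl(release=None):
    """Heuristic: WSL kernel release strings contain 'microsoft'."""
    if release is None:
        release = platform.release()
    return "microsoft" in release.lower()


def choose_gpu_ids(requested, release=None):
    """Keep only the first GPU under WSL, where NCCL multi-GPU is unavailable."""
    requested = list(requested)
    if is_wsl(release) and len(requested) > 1:
        return requested[:1]
    return requested
```

For example, choose_gpu_ids([0, 1], "4.19.121-microsoft-standard") keeps only device 0, while the same request on an ordinary Linux kernel string is returned unchanged.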
To new contributors: Post a comment here if you'd like to test XGBoost with WSL2. I am available for help and guidance. After testing, submit a pull request updating the doc to record which features of XGBoost are currently functional with WSL2. This way you can claim credit for Hacktoberfest 2020.
Hi,
I could run some tests on a GTX 1050 Ti. Can you point out which tests to begin with?
*edit: on a GTX 1650
Hi @otivedani, you can start with Google Test: https://xgboost.readthedocs.io/en/latest/contrib/unit_tests.html#running-gtest
I have run Google Test, and these are my logs: https://github.com/otivedani/xgboost/tree/wsl-test-logs/build/wsl2-ubuntu2004/logs
I encountered this error when running make test
(https://github.com/otivedani/xgboost/blob/wsl-test-logs/build/wsl2-ubuntu2004/logs/02_maketest_gtest.log):
83% tests passed, 1 tests failed out of 6
Total Test time (real) = 70.39 sec
The following tests FAILED:
1 - TestXGBoostLib (Child aborted)
Errors while running CTest
make: *** [Makefile:130: test] Error 8
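When comparing several such runs, it can help to pull the failing test names out of the log automatically. The sketch below is one way to do that, assuming the summary layout shown above (a "The following tests FAILED:" header followed by "N - Name (Reason)" lines); parse_failed_tests is a hypothetical helper, not part of CTest or XGBoost.

```python
import re

# Matches lines such as "1 - TestXGBoostLib (Child aborted)".
_FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((.+)\)\s*$")


def parse_failed_tests(log_text):
    """Return (index, name, reason) tuples listed after the FAILED header."""
    failed = []
    in_failed_section = False
    for line in log_text.splitlines():
        if "The following tests FAILED" in line:
            in_failed_section = True
            continue
        if in_failed_section:
            match = _FAILED_LINE.match(line)
            if match:
                failed.append((int(match.group(1)), match.group(2), match.group(3)))
            elif line.strip():
                # First non-empty line that is not a test entry ends the list.
                in_failed_section = False
    return failed
```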
The detailed ctest -VV output is here: https://github.com/otivedani/xgboost/blob/wsl-test-logs/build/wsl2-ubuntu2004/logs/03_ctest_gtest.log
System info:
Windows Version 2004 Build 20236.1005
WSL 2 Ubuntu 20.04 (Linux PC 4.19.128-microsoft-standard)
NVIDIA GTX 1650, CUDA Toolkit 11.1
NCCL 2.7.8
Please let me know what you think.
Great! Can you also try running the unit tests for Python? We'd like to document how well XGBoost works in WSL2.
Sure!
After running pytest with and without GPU, these are my logs: https://github.com/otivedani/xgboost/tree/wsl-test-logs/build/wsl2-ubuntu2004/logs/pytest
if ret != 0:
> raise XGBoostError(py_str(_LIB.XGBGetLastError()))
E xgboost.core.XGBoostError: [22:19:17] /home/otivedani/xgboost/src/gbm/../common/common.h:156: XGBoost version not compiled with GPU support.
E Stack trace:
E [bt] (0) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x79) [0x7fdf55f41f79]
E [bt] (1) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::ConfigureUpdaters()+0x105) [0x7fdf560328c5]
E [bt] (2) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::Configure(std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x238) [0x7fdf56037658]
E [bt] (3) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerConfiguration::Configure()+0x87f) [0x7fdf56074cff]
E [bt] (4) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x7e) [0x7fdf56062a0e]
E [bt] (5) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x69) [0x7fdf55f38549]
E [bt] (6) /lib/x86_64-linux-gnu/libffi.so.7(+0x6ff5) [0x7fdf7795aff5]
E [bt] (7) /lib/x86_64-linux-gnu/libffi.so.7(+0x640a) [0x7fdf7795a40a]
E [bt] (8) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x58c) [0x7fdf7797328c]
venv/lib/python3.8/site-packages/xgboost/core.py:186: XGBoostError
=============================== warnings summary ===============================
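The stack trace above loads libxgboost.so from venv/lib/python3.8/site-packages, so one thing worth checking (a guess, not a confirmed diagnosis) is whether Python is importing the CUDA build or a different copy of the package. The sketch below uses only the standard library; package_location and is_under are hypothetical helper names.

```python
import importlib.util
import os


def package_location(name="xgboost"):
    """Return the directory a package would be imported from, or None if absent."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(os.path.abspath(spec.origin))


def is_under(path, root):
    """True if `path` lies inside the directory `root`."""
    path, root = os.path.abspath(path), os.path.abspath(root)
    return os.path.commonpath([path, root]) == root
```

If package_location() reports a directory under site-packages rather than under the source tree's python-package directory, the tests may be running against a build other than the one compiled with CUDA.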
Steps to reproduce:
# using virtualenv
python3 -m venv venv
source venv/bin/activate
# using last build
python setup.py develop --use-cuda --use-nccl
# install dependencies (latest)
pip install -r ./doc/requirements.txt
pip install numpy scikit-learn
sudo apt install graphviz
# tests
export PYTHONPATH=./venv/lib/python3.8/site-packages:./python-package
pytest -v -s --fulltrace tests/python
pytest -v -s --fulltrace tests/python-gpu
Without using python setup.py develop:
E xgboost.core.XGBoostError: [05:18:19] /home/otivedani/xgboost/src/tree/updater_gpu_hist.cu:786: Exception in gpu_hist: NCCL failure :unhandled system error /home/otivedani/xgboost/src/common/device_helpers.cu(71)
I have tried the build I made earlier for gtest as well as a new build without GOOGLE_TEST=ON, but the result is the same.
Is there a step I missed?
Note: this warning appears after installing graphviz (and libcuda, IIRC) from apt:
/sbin/ldconfig.real: /usr/lib/wsl/lib/libcuda.so.1 is not a symbolic link
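To see what ldconfig is complaining about, one can check whether the file is a symbolic link or an ordinary file. The sketch below is a small standard-library helper (describe_link is a hypothetical name); it only reports the state of the path and does not change anything.

```python
import os


def describe_link(path):
    """Report whether `path` is a symlink, some other existing entry, or missing."""
    if os.path.islink(path):
        return "symlink -> " + os.readlink(path)
    if os.path.exists(path):
        return "not a symlink"
    return "missing"
```

Calling describe_link("/usr/lib/wsl/lib/libcuda.so.1") inside the WSL2 distribution shows which case applies on a given machine.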
Possibly related issue: WSL/issues#5548