LightGBM: CI CUDA job

Created on 21 Sep 2020 · 16 comments · Source: microsoft/LightGBM

Opening a separate issue to discuss enabling an on-demand CUDA CI job, as the original PR with initial CUDA support has 400+ comments. Refer to https://github.com/microsoft/LightGBM/pull/3160#issuecomment-659105695.

@guolinke Will linux-gpu-pool be used exclusively for LightGBM (CUDA) CI jobs? Or is this machine used for other purposes as well?

All 16 comments

It is exclusive, feel free to use it.

I think we can go the following way.

  1. Create a separate pipeline for the CUDA job (https://sethreid.co.nz/using-multiple-yaml-build-definitions-azure-devops/).
  2. Mark it as non-required and disable auto-builds (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#run-pull-request-validation-only-when-authorized-by-your-team).
  3. Set up comment triggers (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#comment-triggers). Collaborators will then be able to run CUDA builds only when really needed, by commenting something like `/azp run cuda-builds`.
  4. Use NVIDIA Docker containers, similar to how we use an Ubuntu 14.04 container for compatibility purposes right now (see the sketch after this list).
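
For step 4, a minimal sketch of what the job could execute on the agent (the image tag and build flags are illustrative; it assumes NVIDIA drivers plus the NVIDIA Container Toolkit on the host and Docker 19.03+):

```bash
# Illustrative CI step: build LightGBM with CUDA support inside an NVIDIA container.
# Requires NVIDIA drivers + NVIDIA Container Toolkit on the host (Docker 19.03+).
docker run --rm --gpus all -v "$PWD":/LightGBM nvidia/cuda:10.0-devel-ubuntu18.04 \
    /bin/bash -c "cd /LightGBM && mkdir build && cd build && cmake -DUSE_CUDA=1 .. && make -j4"
```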

I've made some progress with this in .vsts-ci.yml of the test branch: https://github.com/microsoft/LightGBM/blob/test/.vsts-ci.yml, but it looks like there are some issues with NVIDIA drivers:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.

@guolinke Could you please help install NVIDIA drivers on the machine? I'm not sure, but this might help automate the process: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux.
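
For reference, a quick way to check whether the driver actually got installed on the agent (`nvidia-smi` ships with the driver):

```bash
# Verify the NVIDIA kernel module is loaded and the driver responds on the host.
lsmod | grep nvidia
nvidia-smi
# Then check that Docker can pass the GPU through (Docker 19.03+):
docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi
```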

@guolinke

I think it will be enough to have 1 machine.

[screenshot]

@StrikerRUS these VMs are allocated on the fly; I am not sure whether we can install the driver on them or not.

I just installed the GPU driver extension.

[screenshot]

Let's set the max workers to 2, in case of concurrent jobs.

Looks like the driver extension didn't help: there is no nvidia-smi utility, which is normally installed with NVIDIA drivers.

Also, I found an experimental option that allows using driver containers instead of installing the driver on the host machine.

Alternatively, and as a technology preview, the NVIDIA driver can be deployed through a container.
https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver

https://github.com/NVIDIA/nvidia-docker/wiki/Driver-containers#ubuntu-1804
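
The approach from that page would look roughly like this (a sketch; the driver image tag is illustrative, and it assumes the nouveau module is already blacklisted and the NVIDIA Container Toolkit is installed):

```bash
# Sketch of the driver-container approach: the driver lives in a privileged
# container instead of being installed on the host.
sudo docker run --name nvidia-driver -d --privileged --pid=host \
    -v /run/nvidia:/run/nvidia:shared \
    -v /var/log:/var/log \
    --restart=unless-stopped \
    nvidia/driver:450.80.02-ubuntu18.04
```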

Unfortunately, driver containers also require rebooting:

sudo reboot

So I have no idea how to configure CUDA jobs other than renting a regular permanent GPU Azure machine.

Thanks @StrikerRUS. Maybe we can use self-hosted GitHub Actions agents. I have used them before; they can use a permanent VM for CI jobs.
I will try to set one up next week.

Just created a runner:
[screenshot]

You can give it a try; the driver and Docker are installed. Also, I fixed setup-python according to https://github.com/actions/setup-python#linux.
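
For reference, the setup is roughly the following, per GitHub's self-hosted runner docs (the runner version and token below are placeholders):

```bash
# Download, configure, and start a self-hosted GitHub Actions runner.
mkdir actions-runner && cd actions-runner
curl -O -L https://github.com/actions/runner/releases/download/v2.273.5/actions-runner-linux-x64-2.273.5.tar.gz
tar xzf actions-runner-linux-x64-2.273.5.tar.gz
# <TOKEN> comes from the repo's Settings -> Actions -> Add runner page.
./config.sh --url https://github.com/microsoft/LightGBM --token <TOKEN>
./run.sh   # or install as a service: sudo ./svc.sh install && sudo ./svc.sh start
```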

Amazing! Just got it to work!

Will read more about GitHub Actions self-hosted runners and get back with new proposals in a few days.

https://github.com/microsoft/LightGBM/runs/1173237869?check_suite_focus=true
[screenshot]

Hmm, it seems that it is possible to use on-demand allocation of new VMs with each trigger:
https://github.com/AcademySoftwareFoundation/tac/issues/156
https://github.com/jfpanisset/cloud_gpu_build_agent

Or it would probably be good (at least easier) to have one permanent VM with drivers and Docker installed, but turn it on and off automatically with new builds.
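
For that second option, something like the Azure CLI could wrap each build (a sketch; the resource group and VM name are placeholders):

```bash
# Start the permanent GPU VM before the build, deallocate it afterwards so we
# only pay for compute while a CUDA job is actually running.
az vm start      --resource-group lightgbm-ci --name gpu-ci-vm
# ... trigger the CUDA build on the self-hosted agent ...
az vm deallocate --resource-group lightgbm-ci --name gpu-ci-vm
```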

@guolinke I'm afraid we cannot run tests on the NVIDIA Tesla M60 (compute capability 5.2):

[LightGBM] [Fatal] [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

/LightGBM/docker-script.sh: line 12:  1861 Aborted                 (core dumped) python /LightGBM/examples/python-guide/simple_example.py

https://en.wikipedia.org/wiki/CUDA#GPUs_supported

[screenshot of the compute capability table]

https://github.com/microsoft/LightGBM/blob/79d288a32db3b124c39cbe40c1ab0c18647595d1/CMakeLists.txt#L159

I'm adding all architectures from 6.0 onward. 6.0 is needed because of the way atomics are handled.
https://github.com/microsoft/LightGBM/pull/3160#discussion_r470572587
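
In other words, device code is generated only for compute capability 6.0 and newer, roughly like this (an illustrative nvcc invocation; `histogram.cu` is a placeholder for the actual kernel sources):

```bash
# Only CC >= 6.0 architectures get device code, so a CC 5.2 GPU (Tesla M60)
# fails at runtime with "invalid device function".
nvcc -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_70,code=sm_70 \
     -c histogram.cu -o histogram.o
```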

I see. I will change it to p100 or p40

Now it is p100

@guolinke

Now it is p100

Thank you!

Is there anything similar to AWS G4 machines in Azure? It would probably cost less:
https://github.com/dmlc/xgboost/issues/4881#issuecomment-534322162
https://github.com/dmlc/xgboost/issues/4921#issuecomment-540244581

The only other option is the P40, which provides more GPU memory but is slightly slower. The cost is the same, so I chose the P100.
