LightGBM: CI CUDA job

Created on 21 Sep 2020 · 16 comments · Source: microsoft/LightGBM

Opening a separate issue to discuss enabling an on-demand CUDA CI job, as the original PR with initial CUDA support has 400+ comments. Refer to https://github.com/microsoft/LightGBM/pull/3160#issuecomment-659105695.

@guolinke Will linux-gpu-pool be used exclusively for LightGBM (CUDA) CI jobs? Or is this machine used for other purposes as well?

All 16 comments

It is exclusive, feel free to use it.

I think we can go the following way.

  1. Create a separate pipeline for the CUDA job (https://sethreid.co.nz/using-multiple-yaml-build-definitions-azure-devops/).
  2. Mark it as non-required and disable auto-builds (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#run-pull-request-validation-only-when-authorized-by-your-team).
  3. Set up comment triggers (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#comment-triggers). Collaborators will then be able to run CUDA builds only when really needed, by commenting something like `/azp run cuda-builds`.
  4. Use NVIDIA Docker containers, similar to how we use an Ubuntu 14.04 container for compatibility purposes right now (see the sketch after this list).
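
For step 4, a minimal sketch of what the job could execute on the agent (the image tag and build flags are illustrative; it assumes NVIDIA drivers plus the NVIDIA Container Toolkit on the host and Docker 19.03+):

```bash
# Illustrative CI step: build LightGBM with CUDA support inside an NVIDIA container.
# Requires NVIDIA drivers + NVIDIA Container Toolkit on the host (Docker 19.03+).
docker run --rm --gpus all -v "$PWD":/LightGBM nvidia/cuda:10.0-devel-ubuntu18.04 \
    /bin/bash -c "cd /LightGBM && mkdir build && cd build && cmake -DUSE_CUDA=1 .. && make -j4"
```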

I've made some progress with this in .vsts-ci.yml of the test branch: https://github.com/microsoft/LightGBM/blob/test/.vsts-ci.yml, but it looks like there are some issues with NVIDIA drivers:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.

@guolinke Could you please help install NVIDIA drivers on the machine? I'm not sure, but this might help automate the process: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux.
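
For reference, a quick way to check whether the driver actually got installed on the agent (`nvidia-smi` ships with the driver):

```bash
# Verify the NVIDIA kernel module is loaded and the driver responds on the host.
lsmod | grep nvidia
nvidia-smi
# Then check that Docker can pass the GPU through (Docker 19.03+):
docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi
```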

@guolinke

I think it will be enough to have 1 machine.

[screenshot]

@StrikerRUS these VMs are allocated on the fly; I am not sure whether we can install the driver on them or not.

I just installed the GPU driver extension.

[screenshot]

Let's set the max workers to 2, in case of concurrent jobs.

Looks like the driver extension didn't help: there is no nvidia-smi utility, which is normally installed with NVIDIA drivers.

Also, I found an experimental option that allows using driver containers instead of installing the driver on the host machine.

Alternatively, and as a technology preview, the NVIDIA driver can be deployed through a container.
https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver

https://github.com/NVIDIA/nvidia-docker/wiki/Driver-containers#ubuntu-1804
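
The approach from that page would look roughly like this (a sketch; the driver image tag is illustrative, and it assumes the nouveau module is already blacklisted and the NVIDIA Container Toolkit is installed):

```bash
# Sketch of the driver-container approach: the driver lives in a privileged
# container instead of being installed on the host.
sudo docker run --name nvidia-driver -d --privileged --pid=host \
    -v /run/nvidia:/run/nvidia:shared \
    -v /var/log:/var/log \
    --restart=unless-stopped \
    nvidia/driver:450.80.02-ubuntu18.04
```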

Unfortunately, driver containers also require rebooting:

sudo reboot

So I have no idea how to configure CUDA jobs other than renting a regular permanent GPU Azure machine.

Thanks @StrikerRUS. Maybe we can use self-hosted GitHub Actions agents. I have used them before; they can use a permanent VM for CI jobs.
I will try to set one up next week.

Just created a runner:
[screenshot]

You can give it a try; the driver and Docker are installed. Also, I fixed setup-python according to https://github.com/actions/setup-python#linux.
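
For reference, the setup is roughly the following, per GitHub's self-hosted runner docs (the runner version and token below are placeholders):

```bash
# Download, configure, and start a self-hosted GitHub Actions runner.
mkdir actions-runner && cd actions-runner
curl -O -L https://github.com/actions/runner/releases/download/v2.273.5/actions-runner-linux-x64-2.273.5.tar.gz
tar xzf actions-runner-linux-x64-2.273.5.tar.gz
# <TOKEN> comes from the repo's Settings -> Actions -> Add runner page.
./config.sh --url https://github.com/microsoft/LightGBM --token <TOKEN>
./run.sh   # or install as a service: sudo ./svc.sh install && sudo ./svc.sh start
```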

Amazing! Just got it to work!

Will read more about GitHub Actions self-hosted runners and get back with new proposals in a few days.

https://github.com/microsoft/LightGBM/runs/1173237869?check_suite_focus=true
[screenshot]

Hmm, it seems that it is possible to use on-demand allocation of new VMs with each trigger:
https://github.com/AcademySoftwareFoundation/tac/issues/156
https://github.com/jfpanisset/cloud_gpu_build_agent

Or it would probably be good (at least easier) to have one permanent VM with drivers and Docker installed, but turn it on and off automatically with new builds.
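
For that second option, something like the Azure CLI could wrap each build (a sketch; the resource group and VM name are placeholders):

```bash
# Start the permanent GPU VM before the build, deallocate it afterwards so we
# only pay for compute while a CUDA job is actually running.
az vm start      --resource-group lightgbm-ci --name gpu-ci-vm
# ... trigger the CUDA build on the self-hosted agent ...
az vm deallocate --resource-group lightgbm-ci --name gpu-ci-vm
```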

@guolinke I'm afraid we cannot run tests on the NVIDIA Tesla M60 (compute capability 5.2):

[LightGBM] [Fatal] [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

/LightGBM/docker-script.sh: line 12:  1861 Aborted                 (core dumped) python /LightGBM/examples/python-guide/simple_example.py

https://en.wikipedia.org/wiki/CUDA#GPUs_supported

[screenshot of the compute capability table]

https://github.com/microsoft/LightGBM/blob/79d288a32db3b124c39cbe40c1ab0c18647595d1/CMakeLists.txt#L159

I'm adding all architectures from 6.0 onward. 6.0 is needed because of the way atomics are handled.
https://github.com/microsoft/LightGBM/pull/3160#discussion_r470572587
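
In other words, device code is generated only for compute capability 6.0 and newer, roughly like this (an illustrative nvcc invocation; `histogram.cu` is a placeholder for the actual kernel sources):

```bash
# Only CC >= 6.0 architectures get device code, so a CC 5.2 GPU (Tesla M60)
# fails at runtime with "invalid device function".
nvcc -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_70,code=sm_70 \
     -c histogram.cu -o histogram.o
```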

I see. I will change it to p100 or p40

Now it is p100

@guolinke

Now it is p100

Thank you!

Is there anything similar to AWS G4 machines in Azure? It would probably cost less:
https://github.com/dmlc/xgboost/issues/4881#issuecomment-534322162
https://github.com/dmlc/xgboost/issues/4921#issuecomment-540244581

The only other option is the P40, which provides more GPU memory but is slightly slower. The cost is the same, so I chose the P100.
