Related to #773
I have a pipeline that uses GPUs. On my university's cluster, which uses PBS Pro, I can request GPUs in nextflow with clusterOptions. But now I'm trying to run my pipeline on a Kubernetes cluster and I don't see how to request a GPU from the nextflow script. What I would like in my pod config file is something like this:
resources:
  limits:
    cpu: 4
    memory: "8Gi"
    nvidia.com/gpu: 1
I see that I can use cpus and memory for the first two, but what about the GPU resource? Could we expand the pod directive to also add custom resources like this? Something like pod resource "nvidia.com/gpu" "1".
Alternatively, we could add a gpu directive. That should have some use for other executors (such as PBS). I just want to be able to use GPUs on a Kubernetes cluster from nextflow.
I think it should be possible to do the same using a pod nodeSelector, see #955. Would that work in your case?
Unfortunately not, while nodeSelector can land me on a physical node with a GPU, it can't provision a GPU for my pod, so I don't see any GPUs in my container. Just verified this on the cluster I use. Nextflow will need to add the GPU resources to the resources section one way or another.
Yes, I agree that NF should allow specifying GPU resources, though I'm still not sure how. Are there other settings that may be required besides nvidia.com/gpu: 1 when using a GPU?
Related: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
Also Google Pipelines API requires the specification of acceleratorType and acceleratorCount attributes, see: https://cloud.google.com/genomics/reference/rest/v1alpha2/pipelines
@KevinSayers have you any experience running GPU workloads on K8s ?
As far as I know the resources object is the only way to request GPUs.
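To make the target concrete, the resources stanza from the first comment can be sketched as a plain data structure. This is a minimal Python sketch, not Nextflow code: the function name `container_resources` and its parameters are hypothetical, and it assumes the GPU is requested under limits, as in the pod config quoted at the top of the thread.

```python
def container_resources(cpus, memory, gpus=0, gpu_resource="nvidia.com/gpu"):
    """Build the `resources` section of a Kubernetes container spec.

    `gpu_resource` is the extended resource name; "nvidia.com/gpu" is the
    one used in this thread. All names here are illustrative only.
    """
    limits = {"cpu": cpus, "memory": memory}
    if gpus > 0:
        # The GPU entry is only added when at least one GPU is requested.
        limits[gpu_resource] = gpus
    return {"resources": {"limits": limits}}


spec = container_resources(cpus=4, memory="8Gi", gpus=1)
```

Keeping the resource name as a parameter leaves room for other extended resources without changing the structure.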
@pditommaso No I have not tried a GPU with K8s yet.
What about adding key-value pairs for both limits and requests? Having those available would allow the GPU to be defined and would leave room for other resource types that become available in the future. Having requests would also enable finer tuning of where pods go.
e.g.
pod limits: 'nvidia.com/gpu', value: '1'
pod requests: 'nvidia.com/gpu', value: '1'
I think the Pipelines API and k8s will probably have to be handled differently. From my understanding, Pipelines expects an actual accelerator value, e.g. nvidia-tesla-t4, whereas k8s just needs to know about a class of resources.
This doesn't look bad, but I would prefer not to have a setting only for K8s. I was thinking more of introducing a gpus definition, e.g.
gpus 1
or
gpus 1, type: 'nvidia-tesla-t4'
The type could be interpreted differently depending on the executor: as the accelerator type for Google Pipelines and as the class of resources for K8s (defaulting to nvidia.com/gpu).
@KevinSayers @pditommaso The methods you proposed are pretty much the two alternatives I had in my head. I think the gpus directive would be a better long-term solution, especially with the type option. I know the PBS executor (which I use on another system) should also be able to use that directive right off the bat, and I'm sure other executors could as well.
As for specifying the GPU type, yes, for k8s it is sufficient to say nvidia.com/gpu, but at least on the k8s cluster I use, they use labels to specify the GPU type. So that functionality is already available through the pod directive. For PBS there is the gpu_model directive, which currently must be provided through clusterOptions but could also be given through a gpus directive. So it seems that specifying the type will look different for each executor:
# pbs / pbspro
gpus 1, type: "p100" # maps to -l select=1:ngpus=1:gpu_model=p100
# k8s
gpus 1, type: "nvidia.com/gpu" # or just hardcode the type part for now
pod label: "gpu-type", value: "1080Ti" # the convention I would use for my k8s cluster
After thinking about it more, gpus would improve portability between executors compared to requests and limits.
slurm:
sbatch ... --gres=gpu:1 --constraint="tesla,t4"
or:
sbatch --gres=gpu:nvidia-tesla-t4:1
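The per-executor examples above could be collected into a single translation step. The following is a minimal Python sketch under stated assumptions: `gpu_request` is a hypothetical helper, not part of Nextflow, and it only reproduces the option strings quoted in this thread (PBS `-l select=...`, SLURM `--gres=...`, and a k8s limits entry).

```python
def gpu_request(executor, count, gpu_type=None):
    """Translate a hypothetical `gpus <count>, type: <gpu_type>` request
    into the executor-specific form quoted in this thread.

    Returns a list of command-line arguments for batch schedulers and a
    dict fragment for Kubernetes.
    """
    if executor in ("pbs", "pbspro"):
        select = f"select=1:ngpus={count}"
        if gpu_type:
            select += f":gpu_model={gpu_type}"
        return ["-l", select]
    if executor == "slurm":
        gres = f"gpu:{gpu_type}:{count}" if gpu_type else f"gpu:{count}"
        return [f"--gres={gres}"]
    if executor == "k8s":
        # For k8s the "type" is the resource class, defaulting to nvidia.com/gpu.
        return {"limits": {gpu_type or "nvidia.com/gpu": count}}
    raise ValueError(f"no GPU mapping for executor: {executor}")


print(gpu_request("pbspro", 1, "p100"))  # ['-l', 'select=1:ngpus=1:gpu_model=p100']
```

The point of the sketch is that the same two inputs (count and type) are enough for every executor mentioned so far, even though the meaning of type differs between them.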
Another interesting use case for gpus directive is on a local machine with several GPUs, such as a DGX-1 / DGX-2. In that case it would be helpful if nextflow could map processes to specific GPUs as it already does with CPUs. For NVIDIA GPUs this can be done by setting the CUDA_VISIBLE_DEVICES environment variable:
# assign process to first four GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3
For GCE, it'd also be nice if TPU support was possible.
My understanding is that CUDA_VISIBLE_DEVICES should be set according to the number of available GPUs and the number requested. For example, if there are 16 GPUs and the process requires 2, then each task should get a different pair, e.g. CUDA_VISIBLE_DEVICES=0,1, CUDA_VISIBLE_DEVICES=2,3, etc.
Is that right?
Yes that's correct.
OK. That being so, the CUDA_VISIBLE_DEVICES variable is usually managed by the underlying job scheduler, e.g. SLURM. NF should only take care of passing it to the container environment.
Unless the workflow is executed with the local executor, in that case it should be managed by NF itself.
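For the local executor case, the pairing described above (16 GPUs, 2 per task) can be illustrated with a small Python sketch. The helper name `partition_gpus` is hypothetical; it only shows how disjoint CUDA_VISIBLE_DEVICES values could be derived, not how Nextflow schedules tasks.

```python
def partition_gpus(available, per_task):
    """Split GPU ids into disjoint groups of `per_task` ids each.

    Returns one CUDA_VISIBLE_DEVICES value per group. Any leftover ids
    that cannot fill a complete group are dropped.
    """
    if per_task <= 0:
        raise ValueError("per_task must be positive")
    groups = [available[i:i + per_task] for i in range(0, len(available), per_task)]
    return [",".join(str(g) for g in group) for group in groups if len(group) == per_task]


slots = partition_gpus(list(range(16)), 2)
print(slots[:2])  # ['0,1', '2,3']
```

Each string in `slots` is a value that could be exported as CUDA_VISIBLE_DEVICES for one concurrently running task.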
Then again, to my knowledge not every scheduler uses this variable. I think SLURM does but PBS Pro does not (at least not on our cluster). I suppose nextflow can handle CUDA_VISIBLE_DEVICES differently for each scheduler though.
It's not so trivial, because NF is not aware of the node(s) where jobs are allocated by the batch scheduler, and that information is essential to set the variable correctly.
Just googling, not sure it can help.
Ah you're right, NF does not know which GPUs on a node are given to a batch job. Actually the admins of my cluster are trying to implement CUDA_VISIBLE_DEVICES through PBS Pro so everything should work fine as you described.
Since the conversation has increased in scope somewhat, I'll try to summarize what we've determined so far:
- Implement a gpus directive that allows the user to request GPUs as a "first-class" resource, which will make using GPUs easier for most executors and make it possible in the first place for K8S.
- The gpus directive should also have a type option, as some executors have different options for accelerator types. Examples have been given above for several executors including PBS, SLURM, and K8S.
- The only environment variable that nextflow may need to deal with is CUDA_VISIBLE_DEVICES, which controls GPU visibility. For the local executor, nextflow should use this variable to schedule GPUs. For HPC schedulers, nextflow should only pass this variable to the container environment if it exists, as it is up to the scheduler to set this variable in the first place. For K8S, this variable is not used, as GPU visibility is handled properly by K8S.
- Possibly extend this directive to support other types of accelerators such as TPUs. Maybe this directive should be called accelerator instead? Or it might be better to implement a separate directive for things like that.
Are there any other technical considerations that we need to discuss, in terms of implementing this feature in nextflow?
@bentsherman It seems like a good summary to me. I suggest the most immediate work should be to implement limits in the k8s model in Nextflow. This would enable your workflow to immediately run, and I think the gpu directive could also then rely on this model. This would likewise also support TPUs and any other extended resource. @pditommaso what do you think?
Thanks a lot for your comprehensive feedback. What is still not completely clear to me is which properties the command script needs to access other than the number of GPUs. For example, should the type property be accessible in the script as a task property, for example task.gpus.type?
Also, GPUs have their own memory. Does it need to be specified independently from the CPU memory? Does K8s manage this?
@KevinSayers that sounds great to me.
@pditommaso I don't think I'll need those properties in any of my workflows... CUDA applications are typically designed to be agnostic to the particular GPU model. But I wouldn't be surprised if some day somebody asks for that feature :)
As for GPU memory, it is not typically managed in any way from the OS. Once an application has access to a GPU it can allocate the GPU's entire global memory (tensorflow does this by default). So I don't think we need a separate directive for GPU memory; provisioning the GPU is sufficient.
I see, good. Now the biggest issue: should it be gpus, for consistency with cpus, or gpu? :D
gpus for consistency!
Relevant to this context https://aws.amazon.com/batch/faqs/?#GPU_Scheduling_
Not to complicate things, but there might be a slight difference for containers (at least as far as nvidia-docker is concerned) vs. cluster resource managers.
nvidia-docker uses NVIDIA_VISIBLE_DEVICES passed as an environmental variable to the container on the docker run line, rather than CUDA_VISIBLE_DEVICES (at least according to their documentation). If you've isolated GPUs this way, then you don't really need to set CUDA_VISIBLE_DEVICES. Also, by default, all GPUs are exposed to the container if you don't bother setting this variable.
No clue how the other containerization solutions (like Singularity) handle this, though.
Ah yes, although I don't think they conflict with each other, so it would be safe to set both of them if necessary. Singularity has CUDA support built in with the --nv option, and it uses CUDA_VISIBLE_DEVICES and not NVIDIA_VISIBLE_DEVICES.
With kubernetes neither environment variable is needed because pods are automatically given access to only the GPUs that are requested.
I don't think they conflict per se, but assuming they can be set the same (at least with docker) is not a good assumption.
For example, if you want the first and third GPU of a 3 or 4 GPU machine (hypothetically), then you set NVIDIA_VISIBLE_DEVICES=0,2, but CUDA_VISIBLE_DEVICES=0,1 is correct inside the container.
A safe default is to use CUDA_VISIBLE_DEVICES=all as the nvidia-docker runtime is only exposing the wanted GPUs to the container (numbered contiguously from 0). Official NVIDIA CUDA images come with this variable set this way in the Dockerfiles to sidestep this issue, assuming you are using their Nvidia-docker runtime.
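The renumbering described above can be sketched in a few lines of Python. The function name `container_cuda_devices` is hypothetical, and the sketch assumes what this comment states: that the runtime exposes only the selected GPUs and numbers them contiguously from 0 inside the container. It handles only comma-separated lists, not special values such as all.

```python
def container_cuda_devices(nvidia_visible_devices):
    """Given the host GPU list passed as NVIDIA_VISIBLE_DEVICES (for example
    "0,2"), return the CUDA_VISIBLE_DEVICES value that is valid inside the
    container, where the exposed GPUs are renumbered from 0.
    """
    host_ids = [d for d in nvidia_visible_devices.split(",") if d]
    return ",".join(str(i) for i in range(len(host_ids)))


print(container_cuda_devices("0,2"))  # 0,1
```

This is why copying the host-side value into CUDA_VISIBLE_DEVICES is not safe in general: host index 2 does not exist inside a container that only sees two GPUs.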
Quite complicated stuff. Is it realistic to say that docker uses NVIDIA_xxx variables and Singularity CUDA_xxx ones?
@pditommaso Correct.
There is probably one fine point to keep in mind. The NVIDIA_xxx stuff applies if you have Nvidia-docker installed and --runtime=nvidia specified on your docker run line (aka containerOptions in NF).
Nvidia-docker is the most mature/supported way to use the GPU with docker, but there are some hacks used elsewhere and prior to the development of nvidia-docker. Probably safe to assume that docker + Nvidia GPUs requires Nvidia-docker, but maybe document just in case?
The hard thing is to have a comprehensive list of docker/singularity command lines with different possible options relative to GPU execution. Could any of you try to summarize them below, something like:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=<value> -e NVIDIA_DRIVER_CAPABILITIES=<value> <container> <command>
and the possible variations?
tl;dr: For nvidia-docker, just specify NVIDIA_VISIBLE_DEVICES=<comma separated list of gpus> to manage GPUs for a process in a container.
I'll take a stab for nvidia-docker since that's what I'm familiar with. I am going to assume that you are working from a well-provisioned CUDA base image. If you haven't set up your image properly, you have to do more, but that seems outside of the scope of Nextflow. I'm deferring on Singularity.
# Restricting GPUs visible to the container
# Get a list of the GPUs in the machine
nvidia-smi
# Use all GPUs. This is the default, no environment variable required to be passed to the container
docker run --runtime=nvidia <container> <command>
# also equivalent.
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all <container> <command>
# Use a specific GPU (3rd GPU as listed by nvidia-smi)
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=2 <container> <command>
# Use a combination of GPUs (1st and 3rd as listed by nvidia-smi)
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,2 <container> <command>
# Specify a GPU by hardware UUID/GUID, probably overkill and not really necessary
# Scrape the UUID from the output of nvidia-smi -L
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPU-04546190-b68d-65ac-101b-035f8faed77d <container> <command>
The above is all that I think Nextflow needs to specify in order to manage GPUs with nvidia-docker in the local executor. Everything else (not listed) consists of ways to mess with or override containers that are either not completely configured or misconfigured. You can manipulate these with containerOptions for users who really know what they are doing.
I think it's safe to assume that any GPU-enabled docker container will use the nvidia runtime; I don't know of any other way to use NVIDIA GPUs in a container.
Docker:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=<value> <container> <command>
Singularity:
CUDA_VISIBLE_DEVICES=<value> singularity exec --nv <container> <command>
# with --cleanenv
SINGULARITYENV_CUDA_VISIBLE_DEVICES=<value> singularity exec --cleanenv --nv <container> <command>
All of the variations listed by @acerjanic are also valid in the case of Singularity.
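The Docker and Singularity forms listed above differ only in where the device list goes: a `-e` flag for Docker, and an environment variable on the calling process for Singularity. A minimal Python sketch of that difference follows; `gpu_container_command` is a hypothetical helper, and it uses only the flags and variable names quoted in this thread.

```python
def gpu_container_command(engine, devices, image, command, cleanenv=False):
    """Return (extra_env, argv) for running `command` in `image` with the
    given GPU device list.

    extra_env holds variables to set in the calling environment;
    argv is the command line as a list of arguments.
    """
    if engine == "docker":
        argv = ["docker", "run", "--runtime=nvidia",
                "-e", f"NVIDIA_VISIBLE_DEVICES={devices}", image, *command]
        return {}, argv
    if engine == "singularity":
        # With --cleanenv the variable must carry the SINGULARITYENV_ prefix
        # to be passed into the container.
        var = "SINGULARITYENV_CUDA_VISIBLE_DEVICES" if cleanenv else "CUDA_VISIBLE_DEVICES"
        argv = ["singularity", "exec"]
        if cleanenv:
            argv.append("--cleanenv")
        argv += ["--nv", image, *command]
        return {var: devices}, argv
    raise ValueError(f"unsupported container engine: {engine}")


env, argv = gpu_container_command("singularity", "0,1", "image.sif", ["nvidia-smi"], cleanenv=True)
```

A caller could pass `env` merged into the process environment and `argv` to something like `subprocess.run`; the image name image.sif is a placeholder.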
I've created a first implementation for K8s https://github.com/nextflow-io/nextflow/pull/1125
Quite easy to extend this to AWS Batch and Google Pipelines.
Update: it should be noted that Batch does not allow the specification of the GPU type; it likely depends on the instance type. Google Pipelines, by contrast, allows the selection of the GPU type as listed here.
I'm starting to think we should call this directive accelerator, following the Amazon and Google Pipelines approach and, above all, so that we can support future accelerator types as those APIs suggest.
Seems reasonable to me, and it would support TPUs as previously mentioned in this thread.
Agreed, and by the way, the experimental gpu directive is working beautifully on my k8s cluster.
I've renamed gpu to accelerator. The gpu directive still works, with a deprecation message.
Now accelerator also supports AWS Batch and Google Pipelines. Support for legacy schedulers will be implemented in a future release.
Hi, I hope it's appropriate to continue the discussion here about supporting the accelerator directive for the local executor.
I'm interested in using nextflow on a local computer with multiple GPUs. After some Googling I think there's no existing solution to schedule the GPUs properly. So I attempted to implement this for myself.
I got something working in https://github.com/yqshao/nextflow/tree/local_accelerator, basically what I did was to:
- Add availAcc to LocalPollingMonitor as a List<String> to keep track of the available GPUs; the available GPUs are inferred from the CUDA_VISIBLE_DEVICES variable.
- Set handler.task.processor.session.config.env in LocalPollingMonitor when submitting, to specify the GPUs to use. (I somehow had to remove the @Memoized in TaskProcessor for this to work.)
- Add an acceleratorIds field in TaskHandler to keep track of the GPUs used by each task and add them back when the task finishes.
I feel that it works for me for now (with python & bash scripts), but I'm sure there are more cases to deal with, e.g. getting the list of GPUs from nvidia-smi when CUDA_VISIBLE_DEVICES=all. Since I'm quite new to Groovy and Java, I'm mainly trying to debug and understand the code by brute force, so I'm sure it's not even close to being ready for merging.
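The bookkeeping described in this comment (a list of free GPU ids, ids handed to a task on submission and returned when it finishes) can be illustrated independently of the Nextflow code base. This is a minimal Python sketch, not the Groovy implementation in the linked branch; the class `GpuPool` and its methods are hypothetical names.

```python
import os


class GpuPool:
    """Tracks which GPU ids are free, seeded from CUDA_VISIBLE_DEVICES."""

    def __init__(self, visible=None):
        if visible is None:
            # Assumption: the variable holds a comma-separated id list.
            visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
        self.free = [d for d in visible.split(",") if d]

    def acquire(self, count):
        """Return `count` GPU ids for a task, or None if too few are free."""
        if count > len(self.free):
            return None
        taken, self.free = self.free[:count], self.free[count:]
        return taken

    def release(self, ids):
        """Return a finished task's GPU ids to the pool."""
        self.free.extend(ids)


pool = GpuPool("0,1,2,3")
first = pool.acquire(2)    # ids handed to the first task
blocked = pool.acquire(3)  # only two ids remain, so this task must wait
pool.release(first)        # the first task finished
```

A task that receives None from `acquire` would stay queued until another task releases its ids, which corresponds to the local executor holding the task back until enough GPUs are free.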
I'd like to ask whether it's of interest to include this?
And if so,