As discussed with @pditommaso and @fmorency on Gitter, we have found an issue when using a Singularity container with Nextflow: the CUDA_VISIBLE_DEVICES environment variable is not passed to the container.
Here are the files to reproduce the behavior.
Here are the commands, run from the main node:
nextflow main.nf run -process.executor slurm -with-singularity /singularity/gpu_image.img -profile gpu
nextflow main.nf run -process.executor slurm -profile gpu
For both commands we want the output to be something like this:
0
1
0
1
0
The first command crashes with:
line 2: CUDA_VISIBLE_DEVICES: unbound variable
Alternatively, if we add env.CUDA_VISIBLE_DEVICES='\$CUDA_VISIBLE_DEVICES' to nextflow.config, it just echoes an empty string.
The second command echoes the expected output.
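For context, the "unbound variable" message is what bash reports when a script running with the nounset option (`set -u`) references a variable that is not set. Below is a minimal sketch of that behavior outside Nextflow, assuming the task script runs under such a shell. The function names are made up for illustration.

```shell
#!/bin/bash
# Assumption: the task script runs with nounset (set -u) enabled.
# strict_gpus and tolerant_gpus are hypothetical names for illustration.

# Strict lookup: the subshell aborts with "unbound variable" when
# CUDA_VISIBLE_DEVICES is not set, similar to the crash reported here.
strict_gpus() {
    ( set -u; echo "$CUDA_VISIBLE_DEVICES" )
}

# Tolerant lookup: falls back to an empty string instead of aborting.
tolerant_gpus() {
    ( set -u; echo "${CUDA_VISIBLE_DEVICES:-}" )
}
```

With the variable unset, `strict_gpus` exits with a non-zero status and `tolerant_gpus` prints an empty line. Neither helps if the variable never reaches the container in the first place, which is the actual problem here.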
Quick question: is this variable assigned by SLURM?
Yes, CUDA_VISIBLE_DEVICES is assigned by SLURM.
I've uploaded a new snapshot that should solve the problem. You can try it using this command:
NXF_VER=0.31.0-SNAPSHOT nextflow run .. etc
Note: the variable must be defined in the config file without escaping the dollar sign, i.e.:
env.CUDA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES'
It works perfectly, and the definition in the nextflow.config of env.CUDA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES' is not needed.
Thanks a lot for the fast fix.
I will keep this open because I'm not sure this is going to be the final solution. Ideally this should be transparent for the user.
Sorry, I was wrong. I forgot to add the -with-singularity option when I ran my tests.
It still does not work. The output is still empty with env.CUDA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES' and it still crashes with the same error message when I remove it.
Commands run:
NXF_VER=0.31.0-SNAPSHOT nextflow main.nf run -process.executor slurm -profile gpu -with-singularity /containers/gpu.img
NXF_VER=0.31.0-SNAPSHOT nextflow main.nf run -process.executor slurm -profile gpu
I've made another snapshot. Update your version with this command:
CAPSULE_RESET=1 NXF_VER=0.31.0-SNAPSHOT nextflow info
then run as
NXF_VER=0.31.0-SNAPSHOT nextflow run .. etc
Note: the variable definition in the config file is no longer required, so remove this line:
env.CUDA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES'
Thanks, I cannot test it right now, I'll try as soon as I can.
OK, better. I also need to make other changes, so please hold off on testing until further notice.
I tested it this morning and CUDA_VISIBLE_DEVICES is still empty when I use a Singularity container, even after running:
CAPSULE_RESET=1 NXF_VER=0.31.0-SNAPSHOT nextflow info
I'm closing this in favour of #803.
This is solved by #803. Add the following setting to the config file:
singularity {
envWhitelist = 'CUDA_VISIBLE_DEVICES'
}
It requires 0.31.0-SNAPSHOT build 4894. You can update to it using this command:
CAPSULE_RESET=1 NXF_VER=0.31.0-SNAPSHOT nextflow info
Then use NF as usual:
NXF_VER=0.31.0-SNAPSHOT nextflow run .. etc
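For anyone curious what whitelisting a variable amounts to in practice: Singularity injects host variables carrying the `SINGULARITYENV_` prefix into the container, so one way to forward CUDA_VISIBLE_DEVICES by hand is sketched below. This only illustrates the idea and is not necessarily how Nextflow implements `envWhitelist`. The helper name `gpu_env_prefix` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of Nextflow or Singularity.
# Prints a SINGULARITYENV_ assignment when CUDA_VISIBLE_DEVICES is set and
# non-empty on the host, and an empty line otherwise.
gpu_env_prefix() {
    echo "${CUDA_VISIBLE_DEVICES:+SINGULARITYENV_CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES}"
}

# Usage sketch (assumes singularity and the image exist on the compute node):
#   env $(gpu_env_prefix) singularity exec /singularity/gpu_image.img \
#       sh -c 'echo $CUDA_VISIBLE_DEVICES'
```

The `${VAR:+...}` form expands to nothing when the variable is unset, so the command still runs cleanly on nodes where SLURM did not allocate a GPU.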
Thanks, I'll try.
It's working like a charm, thanks a lot.