Hi, sorry if this is not the correct place to ask but I have been unable to find a solution anywhere.
When using pytorch with docker it's useful to add --ipc=host in order to have multithreaded loaders. More info here. Is there any way to add such a flag with kubeflow pipelines?
@patrickpoirson which KFP component are you using to launch pytorch? You should probably ask it there
or if you are writing your own component, you are free to add any argument to it
Thanks for your reply. I am writing my own component. I store my pytorch training code in a docker container and use kfp.dsl.ContainerOp to wrap the docker container. Is there a way to pass --ipc=host through the kfp.dsl.ContainerOp?
@patrickpoirson Yes, you can pass command and arguments there just like in docker: https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#kfp.dsl.ContainerOp
Sorry, I may be confused but the command and arguments can be passed to docker but what I need is to pass a flag that affects the actual docker run command e.g. docker run image [command] [arguments] changed to docker run --ipc=host image [command] [arguments]
I believe in standard kubernetes this is done by setting hostIPC to true, which looks to be not possible in kfp?
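For reference, a minimal sketch of the plain Kubernetes setting mentioned above, written as a Python dict that mirrors a Pod manifest. The pod name, container name, image, and command are hypothetical placeholders; only the `hostIPC` field is the point here, and this is not something the KFP SDK is confirmed to expose.

```python
import json


def pod_manifest_with_host_ipc(image, command):
    """Build a Pod manifest dict with hostIPC enabled.

    hostIPC is the pod-level field that shares the node's IPC namespace,
    roughly what `docker run --ipc=host` does for a single container.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "pytorch-train"},  # hypothetical name
        "spec": {
            "hostIPC": True,
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",  # hypothetical name
                    "image": image,
                    "command": list(command),
                }
            ],
        },
    }


# Hypothetical image and command, for illustration only.
manifest = pod_manifest_with_host_ipc(
    "my-registry/pytorch-train:latest", ["python", "train.py"]
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl outside of KFP to compare behaviour with and without the flag.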
> Sorry, I may be confused but the command and arguments can be passed to docker but what I need is to pass a flag that affects the actual docker run command e.g. docker run image [command] [arguments] changed to docker run --ipc=host image [command] [arguments]
> I believe in standard kubernetes this is done by setting hostIPC to true, which looks to be not possible in kfp?
KFP runs on top of a Kubernetes cluster. KFP does not create a cluster or run docker.
> which looks to be not possible in kfp?
This option does not seem to be supported by Argo which we use for orchestration. You can open an issue in their repo asking to support this feature.
> When using pytorch with docker it's useful to add --ipc=host in order to have multithreaded loaders.
Interesting. Have you verified that you're going to benefit from that option when running the loaders?
> I am writing my own component. I store my pytorch training code in a docker container and use kfp.dsl.ContainerOp to wrap the docker container.
Please write a reusable component instead of creating ContainerOp objects yourself. Please read the documentation and check our library of components to get inspiration. https://www.kubeflow.org/docs/pipelines/sdk/component-development/ https://github.com/kubeflow/pipelines/tree/master/components/XGBoost/Train/from_ApacheParquet https://github.com/kubeflow/pipelines/blob/master/components/datasets/Chicago_Taxi_Trips/component.yaml
> When using pytorch with docker it's useful to add --ipc=host in order to have multithreaded loaders.
> Interesting. Have you verified that you're going to benefit from that option when running the loaders?
@Ark-kun From pytorch "Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run."
Training speed without multithreaded data loading will be significantly slower.
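One way to check whether a container is affected is to look at the size of its shared-memory mount from inside the container. The sketch below uses only the Python standard library and assumes the mount lives at /dev/shm, as in the PyTorch note quoted above. Docker's default is 64 MB unless overridden, so a value in that range suggests that data loader workers may run out of shared memory.

```python
import os
import shutil


def shm_size_bytes(path="/dev/shm"):
    """Return the total size of the shared-memory mount in bytes.

    Returns None when the path does not exist (for example on a
    non-Linux host).
    """
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


size = shm_size_bytes()
if size is None:
    print("/dev/shm not found")
else:
    print(f"/dev/shm total: {size / 2**20:.0f} MiB")
```

Running this as the first step of the training container, before creating any data loaders, shows what the pod was actually given.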
As a workaround, you might be able to use https://kubernetes.io/docs/concepts/workloads/pods/podpreset/
@patrickpoirson Did you find a fix for this? I just switched over to PyTorch's DataLoader and am running into the same issue
@mmwebster Yes, we found the following workaround.
volume = kfp.dsl.PipelineVolume(name='shm-vol', empty_dir={'medium': 'Memory'})
kfp.dsl.ContainerOp(...., pvolumes={'/dev/shm': volume})
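To make clearer what this workaround amounts to at the Kubernetes level, here is a rough sketch that applies the same idea to a pod spec held as a plain Python dict: a memory-backed emptyDir volume mounted over /dev/shm. The helper name `add_memory_shm` and the sample spec are hypothetical, and this illustrates the expected shape of the resulting pod spec rather than what KFP literally generates.

```python
import copy


def add_memory_shm(pod_spec, volume_name="shm-vol", mount_path="/dev/shm"):
    """Return a copy of pod_spec with a memory-backed emptyDir volume
    mounted at mount_path in every container. The input is not modified.
    """
    spec = copy.deepcopy(pod_spec)
    # emptyDir with medium "Memory" is backed by tmpfs on the node.
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "emptyDir": {"medium": "Memory"}}
    )
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": volume_name, "mountPath": mount_path}
        )
    return spec


# Hypothetical pod spec, for illustration only.
base_spec = {
    "containers": [{"name": "trainer", "image": "my-registry/pytorch-train:latest"}]
}
patched = add_memory_shm(base_spec)
print(patched["volumes"])
print(patched["containers"][0]["volumeMounts"])
```

Unlike hostIPC, this keeps the pod in its own IPC namespace and only enlarges /dev/shm. Data written to a memory-backed emptyDir counts toward the container's memory usage, so memory limits may need adjusting.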