Detectron2: Distributed Training on Kubernetes

Created on 2 Jul 2020 · 13 Comments · Source: facebookresearch/detectron2

❓ How to do distributed training via Kubernetes using detectron2

Describe what you want to do, including:

Training Mask R-CNN with detectron2 through a Kubernetes service, using one of the following:

The usual distributed PyTorch training on Kubernetes via an etcd server + torchelastic.distributed.launch
OR
The usual distributed PyTorch training on Kubernetes via a PyTorchJob in Kubeflow
OR
The usual distributed PyTorch training on Kubernetes via an ElasticJob

I believe all these methods are what the author @rajdeeppalrajdeep from this detectron2 request meant.

❓ What does an API do and how to use it?

From ./train_net.py --help
Run on multiple machines:
(machine0)$ ./train_net.py --machine-rank 0 --num-machines 2 --dist-url <URL> [--other-flags]
(machine1)$ ./train_net.py --machine-rank 1 --num-machines 2 --dist-url <URL> [--other-flags]
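
The two commands differ only in --machine-rank, so they can be generated per machine. What follows is a minimal sketch, not part of detectron2: `build_train_cmd` is a hypothetical helper, `tcp://10.0.0.1:29500` is a placeholder address for machine 0, and only the flags shown in the help text above are used.

```python
import shlex


def build_train_cmd(machine_rank, num_machines, dist_url, other_flags=()):
    """Return the argv list for one machine of a multi-machine run."""
    if not 0 <= machine_rank < num_machines:
        raise ValueError("machine_rank must be in [0, num_machines)")
    return [
        "./train_net.py",
        "--machine-rank", str(machine_rank),
        "--num-machines", str(num_machines),
        "--dist-url", dist_url,  # must be identical on every machine
        *other_flags,
    ]


if __name__ == "__main__":
    url = "tcp://10.0.0.1:29500"  # placeholder: an address of machine 0
    for rank in range(2):
        print(f"(machine{rank})$ " + shlex.join(build_train_cmd(rank, 2, url)))
```

Every machine gets the same dist-url, and only the rank changes. This is why a single stable address for machine 0 is the first thing to arrange on any cluster.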

NOTE:
I have successfully run distributed training of detectron2 on an on-prem cluster using the commands given in ./train_net.py --help.
I have successfully run distributed training of maskrcnn-benchmark on an EKS cluster via PyTorch Elastic, mainly because it relies on a rendezvous endpoint provided by an etcd image.

Things I have tried

Setting up a TCP Service (URL) as a ClusterIP and as a LoadBalancer for the --machine-rank 0 deployment, and pointing all other deployments to this service in the dist-url. I got the expected result:
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:514
This is expected because of how torch.distributed.launch and Kubernetes play with each other, hence the need for torchelastic or a rendezvous endpoint.

Setting up a rendezvous endpoint for the dist-url as an etcd service, as torchelastic.distributed.launch does.
Result: all machines lock up.
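
Both attempts come down to what is passed as the dist-url. For the first attempt, the URL could be composed from the Service's in-cluster DNS name, as in the minimal sketch below. The service name, namespace and port are hypothetical, and `service_dist_url` is not a detectron2 function. It only formats the string and does not by itself avoid the NCCL error shown above.

```python
def service_dist_url(service, namespace, port, cluster_domain="cluster.local"):
    """Format a tcp:// init URL that targets a Kubernetes Service by DNS name."""
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    # Standard in-cluster DNS form: <service>.<namespace>.svc.<cluster-domain>
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    return f"tcp://{host}:{port}"


# Hypothetical names: a Service "d2-rank0" in namespace "training".
print(service_dist_url("d2-rank0", "training", 29500))
# prints tcp://d2-rank0.training.svc.cluster.local:29500
```

One possible factor to check, not confirmed in this thread: the dist-url only covers the initial rendezvous. NCCL then opens its own connections between workers, so the pods also need to reach each other directly and not just through the Service port.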

I feel that your distributed training examples are limited, as they give no hints on how to train at large scale.

I am just seeking guidance on what I can do to contribute to making large-scale training via Kubernetes a reality.

mckunkel

Most helpful comment

@mckunkel were you able to do distributed training with detectron2 on K8s?
I am also interested in the possibility of distributed training of detectron2 models on Kubeflow and would appreciate if you could share your findings.

Greetings,
My apologies for the late reply. I have been collecting my notes on this project so that I can share my findings properly.
Please allow me a few more days to collect everything.
Long story short:
Yes, I got it to work on K8s. It was fast!
I had to modify the launch module and the comm package.
I used torchelastic with a custom "pod config" to get Detectron2 running.

If you would allow me a few more days, I will get everything to you.

mckunkel on 13 Oct 2020

👍3

All 13 comments

As mentioned in https://github.com/facebookresearch/detectron2/issues/1605#issuecomment-644559259, we do not provide support for other third-party service/frameworks.

As you have pointed out, detectron2 can do distributed training, as much as what torch.distributed can do. If there is anything it cannot do and should improve, we don't consider them within the scope of the detectron2 project to fix.

ppwwyyxx on 2 Jul 2020

I'm a bit speechless at the position you are taking, because it's known that torch.distributed does not run on Kubernetes, but torchelastic.distributed.launch does.
It is a bit disappointing to see the lack of foresight in not implementing torchelastic.distributed.launch alongside torch.distributed.

mckunkel on 2 Jul 2020

torch.distributed does not run on kubernetes, but torchelastic.distributed.launch does.

This is not known to me because we do not work with third-party services/frameworks like Kubernetes. Thanks for explaining.

From my position, Kubernetes is just one of the 1000 other great libraries/services/languages that may work nicely together with this project. I don't think it's within the scope of this project to support integration with all of them unless one appears extremely important (torchelastic may be at some point). I hope that's understandable.

However, changes that make this project easier to integrate with other projects can be a reasonable feature request. As an example, none of our models depend on torch.distributed, so I would expect all our models to work out of the box in torchelastic. Much of our evaluation code has a distributed=True option, so it can be used without torch.distributed.
If there is anything in detectron2 that does not need to depend on torch.distributed but does depend on it now, you can bring that up as a feature request so that more parts of detectron2 can be integrated with torchelastic.distributed. But the integration itself is not within our scope.

ppwwyyxx on 2 Jul 2020

I've tried to make this package work with torchelastic, but out of the box there is too much of your custom launching code to dismantle. So I've come back here to respond in the hope of a more scientific discussion, and not an immediate close of the issue.

I understand your position, but what I don't understand is why you made the distribution so convoluted. What was wrong with the PyTorch implementation of
python -m torch.distributed.launch <args> tools.train_net.py <args>
that you felt the need to lock it down to your own spawning?
Don't get me wrong, detectron2 is a good package, but torch.distributed.launch and torchelastic.distributed.launch work so well hand in hand that even the page states

If your train script works with torch.distributed.launch it will continue working with torchelastic.distributed.launch with these differences:

where the differences are only four small changes.

I'm not sure if there are

1000 other great libraries/services/languages that may work nicely together with this project

but I know one industry-leading cloud-based library it won't work well with, and that is any library whose network setup is not congruent with the current structure of detectron2.

So would you please point me in the direction of an example of how to train a model without torch.distributed, so that I can try again to implement torchelastic?

Many thanks

mckunkel on 17 Jul 2020

What was wrong with the PyTorch implementation of
python -m torch.distributed.launch <args> tools.train_net.py <args>
that you felt the need to lock it down to your own spawning?

I don't think we have "locked it down" in any way. As mentioned above, no models assume they use torch.distributed and in fact they don't assume they are launched in any specific way. Therefore by construction they can be used when jobs are launched in different ways.

The training script train_net.py uses our own launch mechanism because it has to use some mechanism, and ours has some benefits over python -m torch.distributed.launch (there might be disadvantages as well, but it's just a choice anyway).

So would you please point me in the direction/example of how to train a model without torch.distributed so that I can try again to implement torchelastic.

Replace https://github.com/facebookresearch/detectron2/blob/f41099aa6f31aa0b7b82937f0da3ec7d5124656f/tools/train_net.py#L163-L170 by main(args), then the model will train without torch.distributed.
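
Calling main(args) directly, as suggested, leaves all process setup to whatever started the script. Below is a minimal sketch of what such an entry point could look like under an external launcher. It assumes the launcher exports RANK, LOCAL_RANK and WORLD_SIZE, as torchelastic does. `world_info` and `run` are hypothetical names, not detectron2 functions, and `main` stands in for the main function in train_net.py.

```python
import os


def world_info(environ=os.environ):
    """Read the rank layout exported by an external launcher.

    Assumption: RANK, LOCAL_RANK and WORLD_SIZE are set by the launcher.
    Missing variables fall back to a single-process run.
    """
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


def run(main, args, environ=os.environ):
    """Call main(args) directly; join a process group first if one is implied."""
    info = world_info(environ)
    if info["world_size"] > 1:
        # Imported lazily so the single-process path has no torch dependency.
        import torch
        import torch.distributed as dist

        torch.cuda.set_device(info["local_rank"])
        # env:// reads MASTER_ADDR / MASTER_PORT, also set by the launcher.
        dist.init_process_group(backend="nccl", init_method="env://")
    return main(args)


if __name__ == "__main__":
    # Stand-in for train_net.py's main; with no launcher this runs one process.
    print(run(lambda args: f"training with {args}", {"config": "placeholder"}))
```

This sketch alone may not be enough for detectron2. Its built-in launch also prepares state that the comm package expects, which is consistent with the report in this thread that both the launch module and the comm package had to be modified.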

ppwwyyxx on 17 Jul 2020

In order to write a distributed mechanism, I must dismantle the comm package, because even the dataset loader relies on it and on the world size. The world size is set in the launch command, which in turn relies on "locked down" assumptions about how the code should be distributed.

As I stated before

I've tried to make this package work with torchelastic...

The launch mechanism was the first approach I took, but I soon realized how locked down it was.

mckunkel on 20 Jul 2020

Just like how train_net.py or some classes in PyTorch itself use torch.distributed, the data loader uses torch.distributed because it needs to use "some" mechanism and torch.distributed is the standard one.

Being able to configure the comm package and let it utilize other backends seems like a useful improvement. At the same time, a custom data loader can be used to achieve similar behavior without using torch.distributed.
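
To illustrate the custom data loader idea: the part of a loader that needs the rank and world size is the split of the dataset across workers, and that split can take both values as plain arguments instead of querying torch.distributed. The sketch below is a minimal, hypothetical example (`shard_indices` is not a detectron2 function), assuming rank and world size are obtained elsewhere, for instance from the launcher's environment.

```python
import random


def shard_indices(dataset_size, rank, world_size, seed=0, shuffle=True):
    """Return the dataset indices owned by `rank` out of `world_size` workers."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    indices = list(range(dataset_size))
    if shuffle:
        # The same seed on every worker gives the same permutation,
        # so the strided slices below are disjoint and cover the dataset.
        random.Random(seed).shuffle(indices)
    return indices[rank::world_size]


# Without shuffling, worker 1 of 2 gets every second index.
print(shard_indices(6, rank=1, world_size=2, shuffle=False))  # prints [1, 3, 5]
```

The only coordination the workers need is an agreed seed, so no communication backend is involved in the sampling step.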

ppwwyyxx on 22 Jul 2020

@mckunkel were you able to do distributed training with detectron2 on K8s?
I am also interested in the possibility of distributed training of detectron2 models on Kubeflow and would appreciate if you could share your findings.

danishsamad on 6 Oct 2020

If you would allow me a few more days, I will get everything to you.

Sure thing, looking forward to your findings!

danishsamad on 14 Oct 2020

@mckunkel thanks. Looking forward to it. :)

cognitiveRobot on 14 Oct 2020

If changing comm is all it takes, I'm adding a separate issue, https://github.com/facebookresearch/detectron2/issues/2218, in order to allow switching the implementation of the comm package.

ppwwyyxx on 1 Nov 2020

@mckunkel have you deployed it in Kubeflow? Also very curious how you got around the issue!