Describe what you want to do, including:
I believe all these methods are what the author @rajdeeppalrajdeep from this detectron2 request meant.
From ./train_net.py --help
Run on multiple machines:
(machine0)$ ./train_net.py --machine-rank 0 --num-machines 2 --dist-url
(machine1)$ ./train_net.py --machine-rank 1 --num-machines 2 --dist-url
NOTE:
I have been successful in distributed training of detectron2 on an "on-prem" cluster via the commands given in ./train_net.py --help
I have been successful in distributed training of maskrcnn-benchmark on an EKS cluster via PyTorch Elastic, mainly because it relies on a rendezvous endpoint provided by an etcd image
Things I have tried
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:514
I feel that your distributed training examples are limited, as there are no hints on how to train at large scale.
I am just seeking guidance on what I can do to contribute to making large-scale training via Kubernetes a reality.
As mentioned in https://github.com/facebookresearch/detectron2/issues/1605#issuecomment-644559259, we do not provide support for other third-party service/frameworks.
As you have pointed out, detectron2 can do distributed training, as much as what torch.distributed can do. If there is anything torch.distributed cannot do and should improve, we don't consider fixing it to be within the scope of the detectron2 project.
I'm a bit speechless at the position you are taking, because it's known that torch.distributed does not run on kubernetes, but torchelastic.distributed.launch does.
The lack of foresight in not implementing torchelastic.distributed.launch alongside torch.distributed is a bit disappointing.
torch.distributed does not run on kubernetes, but torchelastic.distributed.launch does.
This is not known to me because we do not work with third-party services/frameworks like Kubernetes. Thanks for explaining.
From my position, Kubernetes is just one of the 1000 other great libraries/services/languages that may work nicely together with this project. I don't think it's within the scope of this project to support integration with all of them unless some appears extremely important (torchelastic may be at some point). I hope that's understandable.
However, changes that can make this project easier to integrate with other projects can be a reasonable feature request. As an example, none of our models depend on torch.distributed, so I would expect all our models to work out of the box in torchelastic. Much of our evaluation code has a distributed=True option that can be turned off, so it can be used without torch.distributed.
If there is anything in detectron2 that does not need to depend on torch.distributed but does depend on it now, you can bring that up as a feature request so that more parts of detectron2 can be integrated with torchelastic.distributed. But the integration itself is not within our scope.
I've tried to make this package work with torchelastic, but out of the box there is too much of your custom launching code to dismantle. So I come back here to respond in hopes of a more scientific discussion, and not an immediate close of the issue.
I understand your position, but what I don't understand is why make the distribution so convoluted? What was wrong with pytorch implementation of
python -m torch.distributed.launch <args> tools.train_net.py <args> that you felt the need to lock it down to your own spawning?
Don't get me wrong, detectron2 is a good package, but torch.distributed.launch and torchelastic.distributed.launch work so well hand in hand that even the page states
If your train script works with torch.distributed.launch it will continue working with torchelastic.distributed.launch with these differences:
where the differences are only four small changes.
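For context, a training script is usually launcher-agnostic in this sense when it takes its rank and world size from the environment instead of from its own spawning code. The following is only a minimal sketch, assuming the launcher exports the conventional RANK, LOCAL_RANK, and WORLD_SIZE environment variables; the names DistEnv and read_dist_env are hypothetical and are not part of detectron2, torch.distributed, or torchelastic.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    """Rank information for one worker process (hypothetical container)."""

    rank: int        # global rank across all machines
    local_rank: int  # rank of this process on its own machine
    world_size: int  # total number of worker processes


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Hypothetical helper: read rank info exported by the launcher.

    Assumes the launcher sets RANK, LOCAL_RANK and WORLD_SIZE. When a
    variable is missing, fall back to a single-process configuration.
    """
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    # Prints the single-process defaults when run without a launcher.
    print(read_dist_env())
```

A script written this way does not care which launcher started it, as long as that launcher sets the same variables.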
I'm not sure if there are
1000 other great libraries/services/languages that may work nicely together with this project
but I know one industry-leading cloud-based library it won't work well with, and that is any library whose network setup is incongruent with the current structure of detectron2.
So would you please point me in the direction/example of how to train a model without torch.distributed so that I can try again to implement torchelastic.
Many thanks
What was wrong with pytorch implementation of
python -m torch.distributed.launch <args> tools.train_net.py <args> that you felt the need to lock it down to your own spawning
I don't think we have "locked it down" in any way. As mentioned above, no models assume they use torch.distributed and in fact they don't assume they are launched in any specific way. Therefore by construction they can be used when jobs are launched in different ways.
The training script train_net.py uses our own launch mechanism, because it has to use __some__ mechanism, and ours has some benefits over python -m torch.distributed.launch (though there might be disadvantages as well, but it's just a choice anyway).
So would you please point me in the direction/example of how to train a model without torch.distributed so that I can try again to implement torchelastic.
Replace https://github.com/facebookresearch/detectron2/blob/f41099aa6f31aa0b7b82937f0da3ec7d5124656f/tools/train_net.py#L163-L170 by main(args), then the model will train without torch.distributed.
In order to write a distributed mechanism, I must dismantle the comm package, because even the dataset loader relies on it and on the world size. The world size is set in the launch command, which relies on "locked down" assumptions about how the code should be distributed.
As I stated before
I've tried to make this package work with torchelastic...
The launch mechanism was the first approach I took, but I soon realized how locked down it was.
Just like how train_net.py or some classes in pytorch itself use torch.distributed, the data loader uses torch.distributed because it needs to use "some" mechanism and torch.distributed is the standard one.
Being able to configure the comm package and let it utilize other backends seems like a useful improvement. At the same time, a custom data loader can be used to achieve similar behavior without using torch.distributed.
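To illustrate the custom data loader idea, here is a minimal sketch of per-worker sharding that takes the rank and world size as plain arguments, so it does not query torch.distributed at all. The function shard_indices is hypothetical and is not part of detectron2; where rank and world_size come from (for example, the launcher's environment) is left as an assumption.

```python
from typing import List


def shard_indices(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Hypothetical helper: pick the dataset indices owned by one worker.

    Worker `rank` takes every `world_size`-th index starting at `rank`,
    so the shards are disjoint and together cover the whole dataset.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    # Show how 10 samples are split across 4 workers.
    for r in range(4):
        print(r, shard_indices(10, r, 4))
```

Shards can differ in length by one sample when num_samples is not a multiple of world_size; a real sampler would usually pad or drop samples so that every worker runs the same number of iterations.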
@mckunkel were you able to do distributed training with detectron2 on K8s?
I am also interested in the possibility of distributed training of detectron2 models on Kubeflow and would appreciate if you could share your findings.
Greetings,
My apologies for the late reply, I have been collecting my notes on this project to properly distribute my findings.
Please allow me a few more days to collect everything,
Long story short:
Yes, I got it to work on K8s. It was fast!
I had to modify the launch module and the comm package.
I used torchelastic with a custom "pod config" to get Detectron running.
If you would allow me a few more days, I will get everything to you.
Sure thing, Looking forward to your findings!
@mckunkel thanks. Looking forward to it. :)
If changing comm is all it takes, I'm adding a separate issue https://github.com/facebookresearch/detectron2/issues/2218 in order to allow switching the implementation of the comm package.
@mckunkel have you deployed it in kubeflow? Also very curious how you got around the issue!