Ignite: How to enable distributed training with ignite

Created on 15 Jul 2020 · 9 Comments · Source: pytorch/ignite

โ“ Questions/Help/Support

Hi @vfdev-5 ,

I saw that you added distributed training support in 0.4, that's cool!
Where can I find an example or tutorial that shows how to use it in ignite?
I developed a distributed training example for MONAI based on native PyTorch APIs, and I am now evaluating ignite workflows for it.

Thanks.

question

All 9 comments

Hi @Nic-Ma

Please see the docs for ignite.distributed and the complete CIFAR10 training example.

You can also explore the online logs on the demo Trains server.

Please do not hesitate to ask other questions.
Thanks

Hi @vfdev-5 ,

Thanks very much for sharing!
I checked the example program, and it seems it didn't run distributed validation/test?
Were the train loss and metrics on TensorBoard reduced from all processes or only from rank 0?

Thanks.

Hi @Nic-Ma

I checked the example program, and it seems it didn't run distributed validation/test?

Normally it should, as the test dataloader also uses a distributed data sampler. Those details are hidden by idist.auto_dataloader. Could you please explain what makes you think that it didn't run distributed validation/test?
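To illustrate what "using a distributed data sampler" means for validation, here is a minimal pure-Python sketch of how such a sampler typically splits dataset indices across processes. It mimics the pad-then-stride scheme used by torch's DistributedSampler without shuffling; the function name shard_indices is hypothetical, and this is not the code that idist.auto_dataloader runs internally.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Return the dataset indices one rank would see (no shuffling).

    Illustrative only: assumes dataset_len >= world_size so that the
    wrap-around padding below is long enough.
    """
    num_per_rank = math.ceil(dataset_len / world_size)
    total = num_per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating the first few indices so every rank gets the same count
    indices += indices[: total - dataset_len]
    # Each rank takes every world_size-th index, starting at its own rank
    return indices[rank:total:world_size]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_indices(10, 4, rank))
```

With 10 samples and 4 processes, each process evaluates 3 samples, and two samples are seen twice because of the padding. That duplication can slightly affect metrics computed over a validation set whose size is not divisible by the number of processes.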

Were the train loss and metrics on TensorBoard reduced from all processes or only from rank 0?

All metrics are reduced across all participating processes. The train loss is wrapped in a RunningAverage metric by common.setup_common_training_handlers and is also reduced.

Hi @vfdev-5 ,

Oh, I see, I misunderstood something.
I think sync_all_reduce can support that.
We are evaluating whether we should use the idist.auto_XXX APIs or add some logic to our existing workflows based on ignite 0.3. Because we already have monai.DataLoader, maybe we can just modify it slightly.
Is there any known issue or bug in ignite v0.3 that would block us from developing distributed training?

Thanks.

Yes, the purpose of sync_all_reduce is to reduce metric values.

We are evaluating whether we should use the idist.auto_XXX APIs or add some logic to our existing workflows based on ignite 0.3. Because we already have monai.DataLoader, maybe we can just modify it slightly.

OK, I see. Methods like idist.auto_* are helpers and are optional. You can still use other parts of the API (such as idist.spawn or idist.Parallel). If you prefer to create your own dataloader with correct data sampling, there is no need to use idist.auto_dataloader. If you would still like to wrap the model automatically in the appropriate distributed wrapper, you can use idist.auto_model without idist.auto_dataloader.

Is there any known issue or bug in ignite v0.3 that would block us from developing distributed training?

Distributed computation of metrics is not done for XLA devices in v0.3. Otherwise, it should work without problems for GPUs and the native torch distributed framework.

Sounds good!
Thanks.

@Nic-Ma just for your information, we are also adding support for the Horovod distributed framework, which will be available starting with the v0.4.2 release.

Glad to see that!
Thanks for your explanation.

@Nic-Ma I am closing this issue as solved. Please feel free to reopen it if needed, or open a new one for other questions.
