Ignite: How to enable distributed training with ignite

Created on 15 Jul 2020 · 9 Comments · Source: pytorch/ignite

❓ Questions/Help/Support

Hi @vfdev-5 ,

I saw that you added distributed training support in 0.4, that's cool!
Where can I find an example or tutorial showing how to use it in ignite?
I developed a distributed training example for MONAI based on native PyTorch APIs, and I am now evaluating ignite workflows for it.

Thanks.

question

Nic-Ma

All 9 comments

Hi @Nic-Ma

Please see the docs for ignite.distributed and the complete CIFAR10 training example.

You can also explore the online logs on the demo Trains server.

Please do not hesitate to ask other questions.
Thanks

vfdev-5 on 15 Jul 2020

Hi @vfdev-5 ,

Thanks very much for sharing!
I checked the example program, and it seems it didn't run distributed validation/test?
Were the train loss and metrics on TensorBoard reduced from all processes, or taken only from rank 0?

Thanks.

Nic-Ma on 3 Aug 2020

Hi @Nic-Ma

I checked the example program, and it seems it didn't run distributed validation/test?

Normally it should, since the test dataloader also uses a distributed data sampler. Those details are hidden inside idist.auto_dataloader. Could you please explain what makes you think that it didn't run distributed validation/test?
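To illustrate why validation is also distributed once the test dataloader uses a distributed sampler, here is a minimal, dependency-free sketch of how such a sampler could split sample indices across ranks. It is meant to mirror the non-shuffled behaviour of torch.utils.data.distributed.DistributedSampler; `shard_indices` and the sizes used are hypothetical and not part of ignite.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Return the sample indices a single rank would iterate over (no shuffling)."""
    # Every rank must receive the same number of samples, so the index list
    # is padded by repeating samples from the start of the dataset.
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices = (indices * math.ceil(total / dataset_len))[:total]
    # Interleaved split: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


# Hypothetical setup: 10 validation samples shared by 4 processes.
world_size = 4
shards = [shard_indices(10, world_size, rank) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Each rank evaluates a different shard, so no single process sees the whole test set. Note that the padded samples are evaluated twice, which can slightly affect metrics when the dataset size is not divisible by the number of processes.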

Were the train loss and metrics on TensorBoard reduced from all processes, or taken only from rank 0?

All metrics are reduced across all participating processes. The train loss is wrapped in a RunningAverage metric by common.setup_common_training_handlers, so it is reduced as well.
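As a rough illustration of what "reduced across all participating processes" means for a metric such as accuracy, the sketch below simulates an all-reduce (sum) in plain Python. It does not call ignite; `all_reduce_sum` and the per-rank numbers are hypothetical and only stand in for the collective operation a real distributed backend would perform.

```python
def all_reduce_sum(per_rank_values):
    """Simulate an all-reduce (sum): every rank ends up with the global total."""
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]


# Hypothetical per-rank accumulators after one validation epoch:
# (number of correct predictions, number of examples seen).
local_state = [(45, 50), (30, 50), (9, 10)]

# The accumulators are reduced first, and the metric is computed afterwards.
correct = all_reduce_sum([c for c, _ in local_state])
seen = all_reduce_sum([n for _, n in local_state])
global_accuracy = [c / n for c, n in zip(correct, seen)]  # identical on every rank

# Averaging the per-rank accuracies instead would weight every rank equally,
# which gives a different answer when the shards have unequal sizes.
naive_accuracy = sum(c / n for c, n in local_state) / len(local_state)

print(global_accuracy[0], naive_accuracy)
```

Because the counters rather than the final ratios are summed, every process reports the same value, and logging from rank 0 alone still reflects the whole dataset.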

vfdev-5 on 3 Aug 2020

Hi @vfdev-5 ,

Oh, I see, I misunderstood something.
I think sync_all_reduce can support that.
We are evaluating whether to use the idist.auto_XXX APIs or to add some logic to our existing workflows based on ignite 0.3. Since we already have monai.DataLoader, maybe we can just modify it slightly.
Is there any known issue or bug in ignite v0.3 that would block us from developing distributed training?

Thanks.

Nic-Ma on 3 Aug 2020

Yes, the purpose of sync_all_reduce is to reduce metric values.

We are evaluating whether to use the idist.auto_XXX APIs or to add some logic to our existing workflows based on ignite 0.3. Since we already have monai.DataLoader, maybe we can just modify it slightly.

OK, I see. Methods like idist.auto_* are helpers and are optional in a sense. You can still use other parts of the API (such as idist.spawn or idist.Parallel). If you prefer to create your own dataloader with the correct data sampling, there is no need to use idist.auto_dataloader. If you would still like the model to be wrapped automatically in the appropriate distributed wrapper, you can use idist.auto_model without idist.auto_dataloader.

Is there any known issue or bug in ignite v0.3 that would block us from developing distributed training?

Distributed computation of metrics is not implemented for XLA devices in v0.3. Otherwise, it should work without problems for GPUs and the native torch distributed framework.

vfdev-5 on 3 Aug 2020

Sounds good!
Thanks.

Nic-Ma on 3 Aug 2020

@Nic-Ma just for your information, we are also adding support for the Horovod distributed framework, which will be available starting with the v0.4.2 release.

vfdev-5 on 3 Aug 2020

Glad to see that!
Thanks for your explanation.

Nic-Ma on 3 Aug 2020

@Nic-Ma I am closing this issue as solved. Please feel free to reopen it if needed, or open a new one for other questions.

vfdev-5 on 15 Aug 2020
