The Ignite quickstart is an exciting guide showing the essentials for defining and training a simple model. But the provided examples that use multi-GPU training do not seem to follow the same simplicity.
Would it be difficult to support something like the code snippet below?
trainer.run(train_loader, max_epochs=100, gpus=[0,1,2,3])
Is there a tutorial that shows how simple it would be to train a model in a multi-GPU environment using Ignite?
@Ceceu thanks for asking! Currently, in the stable v0.3.0 release, we rely only on the native torch distributed API. An example of that can be found here. The user needs to manually set up the distributed process group, wrap the model with nn.parallel.DistributedDataParallel, and execute the script with the torch.distributed.launch tool, or use mp.spawn...
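For readers who have not done the manual setup described above, here is a minimal sketch of those steps using only the native torch API. The function name `train_worker`, the placeholder `nn.Linear` model, the random batch, and the rendezvous address and port are assumptions made for illustration; they are not taken from the linked example. It uses the `gloo` backend on CPU so that it also runs on a machine without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def train_worker(rank, world_size, backend="gloo"):
    """Run a single training step inside one process of the group."""
    # Rendezvous settings (assumed values); launcher tools usually set these.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Step 1: manually set up the distributed process group.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    try:
        torch.manual_seed(0)
        model = nn.Linear(10, 2)  # placeholder model

        # Step 2: wrap the model. On GPUs one would typically move the model
        # to its device first and pass device_ids=[local_rank].
        ddp_model = DistributedDataParallel(model)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

        # One step on a random batch (stand-in for a real data loader).
        x, y = torch.randn(8, 10), torch.randn(8, 2)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()  # gradients are synchronized across processes here
        optimizer.step()
        return loss.item()
    finally:
        # Always release the process group, even if the step fails.
        dist.destroy_process_group()
```

Calling `train_worker(rank=0, world_size=1)` runs the sketch in a single process. To start several processes from Python, `torch.multiprocessing.spawn(train_worker, args=(world_size,), nprocs=world_size)` passes the process index as the first argument. In a real script, the Ignite trainer would be built and run inside the worker function.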
However, we aim to simplify this by providing a helper API that works on GPUs, TPUs, etc.
The API is still experimental and will be available with v0.4.0 (probably to be released next week).
In the nightly version, we provide part of the newer API, idist: https://pytorch.org/ignite/distributed.html#ignite-distributed
For a complete example of the newer API, please check out the same cifar10 example in the parallel_api branch.
HTH
@vfdev-5,
This is great news.