Apex: relation between apex.parallel.DistributedDataParallel and torch.distributed

Created on 3 Nov 2018 · 3 Comments · Source: NVIDIA/apex

I haven't gone through the code yet.
Could anyone give a quick explanation of the relation between apex.parallel.DistributedDataParallel and torch.nn.parallel.DistributedDataParallel, as well as torch.distributed.launch?


xmyqsh

Most helpful comment

Right now, I'd recommend torch.nn.parallel.DistributedDataParallel for all practical purposes. It's pretty darn good (fast and robust).

mcarilli on 17 Jun 2019

👍2

All 3 comments

apex.parallel.DistributedDataParallel and torch.nn.parallel.DistributedDataParallel have the same purpose. They are model wrappers that automatically take care of gradient allreduces during the backward pass. Their usage is almost identical. The Apex version offers some features that the torch version does not, but we plan to merge Apex features into upstream eventually, so for forward compatibility, you may as well just use the torch version.
apex.parallel.DistributedDataParallel example
torch.nn.parallel.DistributedDataParallel example (note the slightly different constructor arguments)
FP16_Optimizer happens to be used in these examples, but its presence is unrelated to the DistributedDataParallel wrappers. You can ignore it.
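To make "gradient allreduces during the backward pass" concrete, here is a minimal pure-Python sketch of the idea. It is a conceptual simulation, not the implementation in either library: the function names `allreduce_mean` and `sgd_step` and the three "workers" are hypothetical, and it assumes gradients are averaged across processes before the optimizer step.

```python
from typing import List


def allreduce_mean(per_worker_grads: List[List[float]]) -> List[float]:
    """Element-wise average of gradients across workers (sum, then divide by world size)."""
    world_size = len(per_worker_grads)
    summed = [sum(values) for values in zip(*per_worker_grads)]
    return [s / world_size for s in summed]


def sgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD update applied to a flat list of parameters."""
    return [p - lr * g for p, g in zip(params, grads)]


# Hypothetical setup: three workers start from identical parameters,
# but each computes a different local gradient from its own data shard.
params = [1.0, 2.0]
local_grads = [[0.5, 1.0], [1.5, 3.0], [1.0, 2.0]]

# After the allreduce, every worker holds the same averaged gradient...
avg = allreduce_mean(local_grads)

# ...so every replica applies the same update and the models stay in sync.
replicas = [sgd_step(params, avg, lr=0.5) for _ in local_grads]
```

In the real wrappers this communication is triggered automatically during `backward()`, which is why the training loop looks the same as in the single-GPU case.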

torch.distributed.launch is a wrapper script intended to spawn multiple processes, and supply them with the arguments and the environment necessary to set up distributed training within each process. torch.distributed.launch can be used with either apex.parallel.DistributedDataParallel or torch.nn.parallel.DistributedDataParallel.
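The launcher is typically invoked as `python -m torch.distributed.launch --nproc_per_node=2 train.py`, where `train.py` stands in for your own script. What it does can be sketched roughly with the standard library alone. This is a simplified illustration, not the actual launcher code: the `launch` helper and `WORKER` command are hypothetical, it covers a single node only (so the local rank equals the global rank), and depending on the PyTorch version the real launcher passes the local rank as a `--local_rank` argument rather than through the environment.

```python
import os
import subprocess
import sys
from typing import List


def launch(nproc: int, worker_cmd: List[str],
           master_addr: str = "127.0.0.1", master_port: int = 29500) -> List[str]:
    """Spawn `nproc` worker processes, each with its own rank in the environment."""
    procs = []
    for rank in range(nproc):
        env = os.environ.copy()
        env.update({
            "MASTER_ADDR": master_addr,       # where rank 0 can be reached
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(nproc),         # total number of processes
            "RANK": str(rank),                # global rank of this process
            "LOCAL_RANK": str(rank),          # single node: same as RANK (assumption)
        })
        procs.append(subprocess.Popen(worker_cmd, env=env,
                                      stdout=subprocess.PIPE, text=True))

    outputs = []
    for proc in procs:
        out, _ = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"worker exited with code {proc.returncode}")
        outputs.append(out.strip())
    return outputs


# Stand-in worker: a real training script would read these variables
# (for example via init_method="env://") to join the process group.
WORKER = [sys.executable, "-c",
          "import os; print(os.environ['RANK'] + '/' + os.environ['WORLD_SIZE'])"]

if __name__ == "__main__":
    print(launch(2, WORKER))
```

Because the launcher only prepares processes and their environment, it is independent of which DistributedDataParallel wrapper the script uses afterwards.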

mcarilli on 5 Nov 2018

👍1

I understand that they both have the same purpose, but are there any potential or theoretical advantages to using the Apex version over the torch version, aside from the extra options? Performance or speed, for example?