Dear apex developers,
I read from the documentation here:
DistributedDataParallel is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during backward() and bucketing smaller gradient transfers to reduce the total number of transfers required.
I also read the source code of DistributedDataParallel, but I still can't understand how the overlapping actually works.
Could you provide a more detailed explanation?
It has been mentioned a couple of times in issues that these days torch's implementation of DDP is preferred. So your question is quite general. Perhaps Stack Overflow is a better place for this kind of question.
@xutianming You should prefer torch.nn.parallel.DistributedDataParallel, which is a drop-in replacement aside from a few constructor arguments, as shown here.
At a high level, both Apex and Torch DDP try to overlap communication with computation during the backward pass by dividing gradients into several buckets. Every time all the gradients for a particular bucket are ready, those gradients are copied into a flat buffer, allreduced, then copied back out to their final destination, the parameters' .grad attributes. Apex and Torch DDP accomplish this by registering hooks on parameters' accumulate_grad functions. Whenever a given parameter receives a gradient, the hook fires, populates that parameter's bucket slot with a reference to the new gradient, checks if the bucket is now full, and if so, kicks off the flatten+allreduce+copy to final .grads for that bucket.
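To make the hook-and-bucket flow above concrete, here is a minimal pure-Python simulation. It is only a sketch and is not the Apex or Torch source: `BucketedReducer`, `on_grad_ready`, and the doubling `allreduce` stand-in (pretending two ranks hold identical gradients and summing them) are hypothetical names and assumptions made for illustration. Gradients are plain lists of floats rather than tensors.

```python
class BucketedReducer:
    """Toy model of per-bucket gradient reduction (hypothetical, not a real API)."""

    def __init__(self, buckets, allreduce):
        self.buckets = [list(b) for b in buckets]           # parameter names per bucket
        self.slot = {name: i for i, b in enumerate(self.buckets) for name in b}
        self.pending = [dict() for _ in self.buckets]       # gradients received so far
        self.allreduce = allreduce                          # stand-in for the real collective
        self.final_grads = {}                               # plays the role of param.grad
        self.launch_log = []                                # order in which buckets were reduced

    def on_grad_ready(self, name, grad):
        # Plays the role of the hook that fires when a parameter receives its gradient.
        i = self.slot[name]
        self.pending[i][name] = grad
        if len(self.pending[i]) == len(self.buckets[i]):    # bucket is now full
            self._reduce_bucket(i)

    def _reduce_bucket(self, i):
        names = self.buckets[i]
        flat = [x for n in names for x in self.pending[i][n]]   # flatten into one buffer
        reduced = self.allreduce(flat)                          # one transfer per bucket
        offset = 0
        for n in names:                                         # copy back out to final grads
            size = len(self.pending[i][n])
            self.final_grads[n] = reduced[offset:offset + size]
            offset += size
        self.launch_log.append(i)
        self.pending[i].clear()


# Assumption: two ranks with identical gradients, so a summing allreduce doubles each value.
reducer = BucketedReducer(
    buckets=[["fc2.weight", "fc2.bias"], ["fc1.weight", "fc1.bias"]],
    allreduce=lambda flat: [2.0 * x for x in flat],
)

# Backward visits layers in reverse, so fc2 gradients arrive before fc1 gradients.
for name, grad in [
    ("fc2.bias", [0.5]),
    ("fc2.weight", [1.0, 2.0]),   # bucket 0 is full here and is reduced immediately
    ("fc1.bias", [0.25]),
    ("fc1.weight", [3.0, 4.0]),   # bucket 1 is full here
]:
    reducer.on_grad_ready(name, grad)

print(reducer.launch_log)
print(reducer.final_grads)
```

In this sketch the reduction for bucket 0 is launched before any fc1 gradient exists. In a real implementation that launch is asynchronous, so the communication runs while backward is still computing the earlier layers, which is the overlap the documentation refers to.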
@mcarilli Thanks for your reply. I also read your reply in issue#544 about recommending torch.nn.parallel.DistributedDataParallel.
I read the source code of both Torch and Apex DDP. I am very interested in the specific optimization of Apex DDP, because I find that Apex DDP performs better than Torch DDP in my use case.
Could you tell me more about it?
@mcarilli Since gradients become ready in the reverse of the order in which the layers are defined, bucketing by neural network layer seems like the better strategy. But currently the parameters appear to be bucketed by tensor type. Am I right?
The allreduces will not let you flatten tensors of different types into the same bucket, so different types must always be in separate buckets. For tensors of the same type, Apex determines its bucketing strategy based on a live backward pass (the first iteration). It records the order in which params receive their gradients and establishes the buckets for all FloatTensor weights based on the order it actually saw them receive their gradients. It then reuses this bucket structure in later iterations, unless the set of parameters that require grad changed. The same strategy is used (separately) for HalfTensor weights.
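The strategy described above (separate buckets per tensor type, filled in the order gradients were observed during the first backward pass) could be sketched roughly as follows. This is an illustration under assumptions, not the Apex code: `build_buckets` and the element-count threshold `bucket_cap` are hypothetical, and the rule for when a bucket is closed is an assumption, since the reply does not say how bucket boundaries are chosen.

```python
from collections import defaultdict


def build_buckets(arrival_order, dtypes, sizes, bucket_cap):
    """Group parameters by dtype, preserving the order their gradients arrived.

    A bucket is closed once it holds at least `bucket_cap` elements
    (hypothetical threshold, chosen for illustration).
    """
    per_type = defaultdict(list)
    for name in arrival_order:                 # order recorded during the first backward pass
        per_type[dtypes[name]].append(name)    # different types never share a bucket

    buckets = []
    for dtype, names in per_type.items():
        current, count = [], 0
        for name in names:
            current.append(name)
            count += sizes[name]
            if count >= bucket_cap:
                buckets.append((dtype, current))
                current, count = [], 0
        if current:                            # leftover parameters form a final bucket
            buckets.append((dtype, current))
    return buckets


# Hypothetical model: names, dtypes, and element counts are made up for the example.
arrival_order = ["head.weight", "norm.weight", "body.weight", "embed.weight", "stem.weight"]
dtypes = {
    "head.weight": "float", "body.weight": "float", "stem.weight": "float",
    "norm.weight": "half", "embed.weight": "half",
}
sizes = {"head.weight": 4, "body.weight": 4, "stem.weight": 2,
         "norm.weight": 2, "embed.weight": 6}

buckets = build_buckets(arrival_order, dtypes, sizes, bucket_cap=6)
for dtype, names in buckets:
    print(dtype, names)
```

Because the buckets follow observed arrival order, parameters that receive gradients close together in time end up in the same bucket, which in practice tends to group them by layer without the bucketing code needing to know the model structure.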