Apex: strange error when distributed training

Created on 21 Apr 2019 · 4 Comments · Source: NVIDIA/apex

I used torch.nn.parallel.DistributedDataParallel for distributed training, and the following error occurred.
After switching to apex.parallel.DistributedDataParallel, the error disappeared.

The error appears when I use a mixed-precision (O1) ImageNet pre-trained model as the backbone.

Traceback (most recent call last):
  File "train_ssd.py", line 199, in <module>
    main()
  File "train_ssd.py", line 190, in main
    model = train(cfg, args)
  File "train_ssd.py", line 113, in train
    return do_train(cfg, model, train_loader, optimizer, scheduler, device, args)
  File "/media/ycg/experiment1/workspace/SSD/ssd/engine/trainer.py", line 88, in do_train
    loss.backward()
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7ff4dd5fe928>, [[tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), None, None, tensor([[[[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

...,

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]]],


        [[[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         ...,

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]]],


        [[[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         ...,

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]]]], device='cuda:0'), tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0'), tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],

I believe this error means that some of the parameters in your network are not used to compute the loss. As a result, their gradients are None, but PyTorch distributed expects every process to provide a gradient for every parameter at every step, so you get an error.

Note that this did not raise an error in PyTorch 0.4, but it does in PyTorch 1.0. In the next PyTorch release they will make the update of each parameter optional, but for now you can hack around it with something like:

    loss = loss + 0 * sum(p.sum() for p in network.parameters())

I do that sometimes. It's not very nice, but it works and barely increases training time.
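
For anyone who wants to see the effect of that one-liner in isolation, here is a minimal single-process sketch. The toy model `Net`, its `used` and `unused` layers, and the helper `grads_after_backward` are made up for illustration and are not from the original training script. It only shows that the extra term turns the None gradients into zero tensors while leaving the other gradients unchanged; it does not exercise DistributedDataParallel itself.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical toy model with a branch that never reaches the loss."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # defined, but never called in forward()

    def forward(self, x):
        return self.used(x)


def grads_after_backward(use_workaround):
    """Run one forward/backward pass and return {parameter name: gradient}."""
    torch.manual_seed(0)  # same weights and inputs for both runs
    network = Net()
    x = torch.randn(8, 4)
    loss = network(x).sum()
    if use_workaround:
        # Touch every parameter with a zero-weighted term so that autograd
        # produces a (zero) gradient even for parameters the loss ignores.
        loss = loss + 0 * sum(p.sum() for p in network.parameters())
    loss.backward()
    return {name: p.grad for name, p in network.named_parameters()}


plain = grads_after_backward(use_workaround=False)
hacked = grads_after_backward(use_workaround=True)

print(plain["unused.weight"])                 # None: no gradient was produced
print(hacked["unused.weight"].abs().sum())    # zero tensor instead of None
print(torch.allclose(plain["used.weight"], hacked["used.weight"]))
```

The gradients of the parameters that are actually used stay the same, since the added term contributes exactly zero to them.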

@glample Thank you. After I got rid of the unused parts, torch.nn.parallel.DistributedDataParallel seems to work all right.
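
If it is not obvious which parts are unused, one way to locate them is to run a single forward/backward pass without any DistributedDataParallel wrapper and list the parameters whose gradient is still None. The sketch below assumes a hypothetical `Detector` model and a helper named `find_unused_parameter_names`; neither comes from the SSD code in the traceback.

```python
import torch
import torch.nn as nn


def find_unused_parameter_names(model, loss):
    """Return names of trainable parameters that get no gradient from `loss`."""
    for p in model.parameters():
        p.grad = None  # clear stale gradients so None really means "unused"
    loss.backward()
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


class Detector(nn.Module):
    """Hypothetical model: `extra_head` exists but forward() never calls it."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.head = nn.Linear(4, 2)
        self.extra_head = nn.Linear(4, 2)

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))


model = Detector()
loss = model(torch.randn(8, 4)).sum()
unused_names = find_unused_parameter_names(model, loss)
print(unused_names)  # ['extra_head.weight', 'extra_head.bias']
```

The reported modules can then be deleted from the model, or frozen with requires_grad = False before wrapping, if they are genuinely not needed.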

This is (or was) a known issue with torch.nn.parallel.DistributedDataParallel. Recently, @pietern implemented some really nice changes that should enable your use case (https://github.com/pytorch/pytorch/pull/18953), so if you update to the latest PyTorch master, torch.nn.parallel.DistributedDataParallel may also work for you.
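
In PyTorch releases after 1.0, torch.nn.parallel.DistributedDataParallel accepts a find_unused_parameters argument for models where some parameters do not take part in the loss. The sketch below is only a smoke test under several assumptions: a single process, the CPU "gloo" backend, a file-based rendezvous in a temporary directory, and a hypothetical `PartlyUsedNet` model. A real job would launch one process per GPU with the NCCL backend, as in the traceback above.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


class PartlyUsedNet(nn.Module):
    """Hypothetical model with one layer that forward() never calls."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x)


def run_single_process_ddp():
    """Wrap the model in DDP with world_size=1 and run one backward pass."""
    # File-based rendezvous; the path is an assumption made for this sketch.
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo",
        init_method="file://" + init_file,
        rank=0,
        world_size=1,
    )
    try:
        model = DistributedDataParallel(
            PartlyUsedNet(),
            find_unused_parameters=True,  # tolerate parameters outside the graph
        )
        loss = model(torch.randn(8, 4)).sum()
        loss.backward()  # completes even though `unused` gets no gradient
        return model.module.used.weight.grad.clone()
    finally:
        dist.destroy_process_group()


used_grad = run_single_process_ddp()
print(used_grad.shape)
```

The flag makes DDP traverse the autograd graph on each forward pass, which adds some overhead, so removing the unused modules remains the cleaner fix when that is possible.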
