Pytorch-lightning: distributed training crashes with dp (list comprehension issue from torch?)

Created on 17 May 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I ran the distributed GPU template and got an error with data parallel, specifically from torch.nn.parallel's scatter_gather.

To Reproduce

Steps to reproduce the behavior:

1. Install packages
2. git clone from master
3. Run the basic example GPU job with distributed training

Validation sanity check: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
    self.dp_train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 277, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 424, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Code sample

Run python3 gpu_template.py --gpus 2 --distributed_backend dp

Expected behavior

The distributed demo job should run without errors.

Environment

  • CUDA:
    - GPU:
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - available: True
    - version: 10.2
  • Packages:
    - numpy: 1.18.4
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.7.6
    - tensorboard: 2.2.1
    - tqdm: 4.46.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.6
    - version: #201812030624 SMP Mon Dec 3 11:25:55 UTC 2018

Additional context

python3 gpu_template.py --gpus 2 --distributed_backend ddp works

Labels: bug / fix, help wanted, won't fix

All 9 comments

Hi! Thanks for your contribution, great first issue!

I was experiencing this problem the other day. It's somewhat related to PyTorch.

if you look at the function that gathers the outputs from GPU devices, https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py#L47

def gather_map(outputs):
        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
        return type(out)(map(gather_map, zip(*outputs)))

You'll see that it only supports tensors, or dictionaries that contain tensors.
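To see where the TypeError in the traceback comes from: a bare Python number falls through every isinstance check above and reaches the final zip(*outputs) line, and a float is not iterable. A minimal illustration of just that failing pattern (not Lightning code):

outputs = [0.5, 0.7]  # e.g. a plain number collected from each GPU replica
out = outputs[0]      # a float: not a Tensor, not None, not a dict
zip(*outputs)         # TypeError: zip argument #1 must support iteration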

The problem for me was that my training_step function returned something like this:

results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}

progress_logs was a dictionary that contained plain numbers, since I wanted the progress bar to show a moving average instead of the exact values. So I came up with a hacky function like the one below to convert the numbers to tensors and move them to the appropriate device.

def _fix_dp_return_type(self, result, device):
    if isinstance(result, torch.Tensor):
        return result.to(device)
    if isinstance(result, dict):
        return {k: self._fix_dp_return_type(v, device) for k, v in result.items()}
    # Must be a number then
    return torch.Tensor([result]).to(device)
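For what it's worth, a hypothetical usage sketch (assuming a training_step that builds the results dict above) would pass the return value through it before handing it back to Lightning:

def training_step(self, batch, batch_idx):
    ...
    results = {
        "loss": loss,
        "log": all_logs,
        "progress_bar": progress_logs,
    }
    # Convert any plain numbers to tensors on the loss's device so that
    # DataParallel's gather() never sees a non-iterable float.
    return self._fix_dp_return_type(results, loss.device)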

I hope there's a better fix for this :)

Hmmm, I am just returning loss and log, so do I have to convert the loss to a tensor and move it to the device?

Feels like this is something that should be covered by the parallel tools in Lightning...

though I guess ddp is the recommended backend

I agree. One way to fix this is to override the default gather function in pytorch_lightning.overrides.data_parallel.LightningDataParallel.
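A rough sketch of that idea (hypothetical subclass and helper names, not the actual patch that landed):

import numbers

import torch
from pytorch_lightning.overrides.data_parallel import LightningDataParallel


class GatherFriendlyDataParallel(LightningDataParallel):
    def gather(self, outputs, output_device):
        # Turn bare Python numbers into tensors before delegating to the
        # stock DataParallel.gather, so gather_map never hits a non-iterable.
        outputs = [_numbers_to_tensors(o, output_device) for o in outputs]
        return super().gather(outputs, output_device)


def _numbers_to_tensors(output, device):
    if isinstance(output, numbers.Number):
        return torch.tensor([output], device=device)
    if isinstance(output, dict):
        return {k: _numbers_to_tensors(v, device) for k, v in output.items()}
    return output

Lightning would then need to wrap the model with something like this instead of the stock LightningDataParallel when dp is selected.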

@nsarang maybe submit a PR with this patch? @ananyahjha93

@williamFalcon Alright. Are you referring to #1895?
I'm not sure how I can work on an existing PR :)

@nsarang you can override the gather function and create a separate PR for it.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
