Pytorch-lightning: distributed training crashes with dp (list comprehension issue from torch?)

Created on 17 May 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I ran the distributed GPU template and got an error with data parallel, specifically from torch.nn.parallel's scatter_gather.

To Reproduce

Steps to reproduce the behavior:

1. Install packages
2. git clone from master
3. Run the basic example GPU job with distributed training

Validation sanity check: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
    self.dp_train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 277, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 424, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Code sample

Run python3 gpu_template.py --gpus 2 --distributed_backend dp

Expected behavior

The distributed demo job should run without errors.

Environment

  • CUDA:
    - GPU:
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - available: True
    - version: 10.2
  • Packages:
    - numpy: 1.18.4
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.7.6
    - tensorboard: 2.2.1
    - tqdm: 4.46.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.6
    - version: #201812030624 SMP Mon Dec 3 11:25:55 UTC 2018

Additional context

python3 gpu_template.py --gpus 2 --distributed_backend ddp works

Labels: bug / fix, help wanted, won't fix

All 9 comments

Hi! Thanks for your contribution, great first issue!

I was experiencing this problem the other day. It's somewhat related to PyTorch.

if you look at the function that gathers the outputs from GPU devices, https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py#L47

def gather_map(outputs):
        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
        return type(out)(map(gather_map, zip(*outputs)))

You'll see that it only supports tensors, or dictionaries that contain tensors.
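To see where the TypeError in the traceback comes from: a bare Python number falls through every isinstance check above and reaches the final zip(*outputs) line, and a float is not iterable. A minimal illustration of just that failing pattern (not Lightning code):

outputs = [0.5, 0.7]  # e.g. a plain number collected from each GPU replica
out = outputs[0]      # a float: not a Tensor, not None, not a dict
zip(*outputs)         # TypeError: zip argument #1 must support iteration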

The problem for me was that my training_step function returned something like this:

results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}

progress_logs was a dictionary that contained plain numbers, since I wanted the progress bar to show a moving average instead of the exact values. So I came up with a hacky function like the one below to convert the numbers to tensors and move them to the appropriate device.

def _fix_dp_return_type(self, result, device):
    if isinstance(result, torch.Tensor):
        return result.to(device)
    if isinstance(result, dict):
        return {k: self._fix_dp_return_type(v, device) for k, v in result.items()}
    # Must be a number then
    return torch.Tensor([result]).to(device)
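For what it's worth, a hypothetical usage sketch (assuming a training_step that builds the results dict above) would pass the return value through it before handing it back to Lightning:

def training_step(self, batch, batch_idx):
    ...
    results = {
        "loss": loss,
        "log": all_logs,
        "progress_bar": progress_logs,
    }
    # Convert any plain numbers to tensors on the loss's device so that
    # DataParallel's gather() never sees a non-iterable float.
    return self._fix_dp_return_type(results, loss.device)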

I hope there's a better fix for this :)

Hmmm, I am just returning loss and log, so do I have to convert the loss to a tensor and move it to the device?

Feels like this is something that should be covered by the parallel tools in Lightning...

though I guess ddp is the recommended backend

I agree. One way to fix this is to override the default gather function in pytorch_lightning.overrides.data_parallel.LightningDataParallel.
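A rough sketch of that idea (hypothetical subclass and helper names, not the actual patch that landed):

import numbers

import torch
from pytorch_lightning.overrides.data_parallel import LightningDataParallel


class GatherFriendlyDataParallel(LightningDataParallel):
    def gather(self, outputs, output_device):
        # Turn bare Python numbers into tensors before delegating to the
        # stock DataParallel.gather, so gather_map never hits a non-iterable.
        outputs = [_numbers_to_tensors(o, output_device) for o in outputs]
        return super().gather(outputs, output_device)


def _numbers_to_tensors(output, device):
    if isinstance(output, numbers.Number):
        return torch.tensor([output], device=device)
    if isinstance(output, dict):
        return {k: _numbers_to_tensors(v, device) for k, v in output.items()}
    return output

Lightning would then need to wrap the model with something like this instead of the stock LightningDataParallel when dp is selected.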

@nsarang maybe submit a PR with this patch? @ananyahjha93

@williamFalcon Alright. Are you referring to #1895?
I'm not sure how I can work on an existing PR :)

@nsarang you can override the gather function and create a separate PR for it.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
