I ran the distributed GPU template and got an error with DataParallel, specifically from scatter_gather in torch.nn.parallel.
Steps to reproduce the behavior:
1. Install the packages.
2. git clone from master.
3. Run the basic GPU example job with the distributed backend.
Validation sanity check: 0it [00:00, ?it/s]Traceback (most recent call last):
File "gpu_template.py", line 80, in <module>
main(hyperparams)
File "gpu_template.py", line 41, in main
trainer.fit(model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
self.dp_train(model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
self.run_pretrain_routine(model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
False)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 277, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 424, in evaluation_forward
output = model(*args)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
return self.gather(outputs, self.output_device)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
for k in out))
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
for k in out))
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
Run: python3 gpu_template.py --gpus 2 --distributed_backend dp
The distributed demo job should run without errors.
Note that python3 gpu_template.py --gpus 2 --distributed_backend ddp works fine.
I was experiencing this problem the other day. It's rooted in how PyTorch gathers outputs across devices.
If you look at the function that gathers the outputs from the GPU devices, https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py#L47
def gather_map(outputs):
    out = outputs[0]
    if isinstance(out, torch.Tensor):
        return Gather.apply(target_device, dim, *outputs)
    if out is None:
        return None
    if isinstance(out, dict):
        if not all((len(out) == len(d) for d in outputs)):
            raise ValueError('All dicts must have the same number of keys')
        return type(out)(((k, gather_map([d[k] for d in outputs]))
                          for k in out))
    return type(out)(map(gather_map, zip(*outputs)))
You'll see that it only supports tensors, or dicts/iterables whose leaves are tensors; a plain Python number falls through to the final zip(*outputs) line, which raises the TypeError above because a number is not iterable.
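For instance, here is a minimal sketch (the dict contents are made up, but the shape mirrors what Lightning passes to gather) that reproduces the same TypeError from plain Python numbers:

from torch.nn.parallel.scatter_gather import gather

# Two per-GPU outputs whose "progress_bar" dict holds plain Python numbers
# instead of tensors (hypothetical values, for illustration only).
outputs = [
    {"progress_bar": {"avg_loss": 0.5}},
    {"progress_bar": {"avg_loss": 0.7}},
]

# gather_map recurses into the dicts, reaches the floats, and falls through
# to zip(*outputs), which fails because a float does not support iteration:
# TypeError: zip argument #1 must support iteration
gather(outputs, target_device=0)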
The problem for me was that my training_step function returned something like this:
results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}
progress_logs was a dictionary that contained plain numbers, since I wanted the progress bar to show moving averages instead of the exact values. So I came up with a hacky function like the one below to convert the numbers to tensors and move them to the appropriate device.
def _fix_dp_return_type(self, result, device):
    if isinstance(result, torch.Tensor):
        return result.to(device)
    if isinstance(result, dict):
        return {k: self._fix_dp_return_type(v, device) for k, v in result.items()}
    # Must be a number then
    return torch.Tensor([result]).to(device)
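Applied to the dict above, the usage in my training step looks roughly like this (a sketch; the variable names come from the snippet earlier):

results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}
# Convert any plain numbers to tensors on the same device as the loss,
# so torch's gather can handle them.
return self._fix_dp_return_type(results, loss.device)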
I hope there's a better fix for this :)
Hmmm, I am just returning loss and log, so do I have to convert the loss to a tensor and move it to the device myself?
Feels like this is something that should be covered by the parallel tools in Lightning...
though I guess ddp is the recommended backend.
I agree. One way to fix this is to override the default gather function in pytorch_lightning.overrides.data_parallel.LightningDataParallel.
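Something along these lines could work (a rough sketch, not the actual patch; PatchedLightningDataParallel and _to_tensors are hypothetical names):

import torch
from pytorch_lightning.overrides.data_parallel import LightningDataParallel


class PatchedLightningDataParallel(LightningDataParallel):
    def gather(self, outputs, output_device):
        # Convert plain Python numbers in each per-GPU output to tensors
        # before delegating to torch's gather.
        outputs = [self._to_tensors(o, output_device) for o in outputs]
        return super().gather(outputs, output_device)

    def _to_tensors(self, result, device):
        if isinstance(result, torch.Tensor):
            return result.to(device)
        if isinstance(result, dict):
            return {k: self._to_tensors(v, device) for k, v in result.items()}
        if isinstance(result, (int, float)):
            return torch.tensor([result], device=device)
        return result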
@nsarang maybe submit a PR with this patch? @ananyahjha93
@williamFalcon Alright. Are you referring to #1895?
I'm not sure how I can work on an existing PR :)
@nsarang you can override the gather function and create a separate PR for it.