I created a NumpyMetric class for an involved metric that requires numpy operations; however, the metric fails when training on multiple GPUs. After some debugging, this appears to be due to the resulting tensor not being mapped back to the appropriate GPU (or any GPU for that matter).
Steps to reproduce the behavior:
class MyNumpyMetric(NumpyMetric):
    def forward(self, y_hat, y):
        # complicated numpy stuff (no calls to .cpu() or .cuda() or .to() or anything like that)
        return metric
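To make that concrete, here is a hypothetical stand-in for the real computation (the actual metric is more involved); per the docs, NumpyMetric converts the tensor inputs to numpy arrays before forward and converts the returned numpy value back to a tensor afterwards:

import numpy as np
from pytorch_lightning.metrics.metric import NumpyMetric  # import path for the 0.8.x-era release used here; may differ later

class MyNumpyMetric(NumpyMetric):
    def forward(self, y_hat, y):
        # y_hat and y arrive here as numpy arrays; any numpy-only computation goes in this method.
        # Hypothetical stand-in for the real metric: median absolute error.
        return np.median(np.abs(y_hat - y))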
This metric is used in the __init__ and validation_step of my PyTorch Lightning module, e.g.,

class MyNetwork(pl.LightningModule):
    def __init__(self, args):
        super().__init__()
        # other init stuff
        self.my_metric = MyNumpyMetric('my_metric')

    def validation_step(self, batch, batch_idx):
        # other validation stuff
        my_metric = self.my_metric(y_hat, y)  # where y_hat and y are tensors; no .cpu(), .cuda(), or .to() called on either
        out_dict = dict(val_my_metric=my_metric)
        return out_dict
model = MyNetwork(args)
trainer = Trainer(
    benchmark=True,
    check_val_every_n_epoch=1,
    accumulate_grad_batches=1,
    min_epochs=n_epochs,
    max_epochs=n_epochs,
    fast_dev_run=False,
    gpus=2,
    distributed_backend='dp'
)
trainer.fit(model)
Traceback (most recent call last):
File "./tiramisu3d.py", line 574, in <module>
trainer.fit(model)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 997, in fit
results = self.dp_train(model)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 270, in dp_train
result = self.run_pretrain_routine(model)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in run_pretrain_routine
eval_results = self._evaluate(model,
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 293, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 444, in evaluation_forward
output = model(*args)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
return self.gather(outputs, self.output_device)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in gather_map
return type(out)(((k, gather_map([d[k] for d in outputs]))
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in <genexpr>
return type(out)(((k, gather_map([d[k] for d in outputs]))
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
assert all(map(lambda i: i.is_cuda, inputs))
I will try to put together a minimal, self-contained code sample soon.
I expected no error to occur. The documentation states: "[NumpyMetric] already handles DDP sync and input/output conversions." However, this doesn't appear to be the case in my implementation.
PyTorch and PyTorch Lightning were installed with conda (along with all of the other packages).
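As far as I can tell, the failing assert is DataParallel's gather step: each GPU replica returns its validation_step dict, those dicts are gathered onto the output device, and Gather requires every tensor it collects to be a CUDA tensor. The tensor NumpyMetric hands back lives on the CPU, which seems to be what trips the assert. A rough sketch of the same failure in isolation (based on my reading of torch/nn/parallel/scatter_gather.py, so take it with a grain of salt):

import torch
from torch.nn.parallel.scatter_gather import gather

# gather() recurses into the per-replica output dicts and calls Gather.apply on
# each tensor leaf, which asserts that the tensor is on a GPU.
outputs = [{'val_my_metric': torch.tensor(0.5)}]  # CPU tensor, like the NumpyMetric result
gather(outputs, target_device=0)  # raises the AssertionError shown in the traceback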
I was able to work around this error by adding the following .to() call to the validation step:
def validation_step(self, batch, batch_idx):
    # other validation stuff
    my_metric = self.my_metric(y_hat, y)
    my_metric = my_metric.to(y_hat.device)
    out_dict = dict(val_my_metric=my_metric)
    return out_dict
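An equivalent spelling of the workaround (just a guess at a slightly more idiomatic form, not something I've confirmed) would be type_as, which moves the metric onto y_hat's device and also matches its dtype:

my_metric = my_metric.type_as(y_hat)  # move to y_hat's device and match its dtype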
I presume, however, that this is not the intended way to use the NumpyMetric class.
FWIW, I briefly looked at the code to see if I could just submit a PR with a fix (assuming this isn't user error), but it wasn't clear to me where best to look. If you point me in the right direction, I might be able to submit a PR.
Hi! Thanks for your contribution, and great first issue!
Hi, good news: I believe I have already fixed this issue in #2657, or at least it looks very similar. The fix is not released yet (but will be soon), so if you need it now, install Lightning from the master branch. (Your workaround is also fine.)
@jcreinhold mind trying master?
and feel free to reopen if needed