Any ideas on how to debug this issue?
It is happening to me in many different models, after I refactored the Result logging in the training_step, validation_step and test_step methods, changing the old dictionary-based return to the new Result scheme, while training on two GPUs at the same time.
The error does not appear if I use distributed_backend='ddp' instead of 'dp' on the Trainer.
When running evaluation or test routines on the Trainer (either through the .fit evaluation at the end of an epoch or by calling .test directly),
it throws ValueError: All dicts must have the same number of keys.
After reading the error log I think it has something to do with the metric logging, but I can't figure out what exactly. The error appears very inconsistently across epochs and runs, so I'm looking for ideas on how I could get more details to get to the root of the issue.
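To illustrate what the error message itself is checking, here is a minimal, hypothetical sketch (not Lightning's actual code) of a DataParallel-style gather failing because the per-GPU result dicts do not have the same number of keys; the metric names and values are made up:

# Hypothetical illustration: two GPU replicas return result dicts with
# different key sets, so the gather step rejects them.
outputs = [
    {"val_loss": 0.52, "val_acc": 0.81},  # replica on GPU 0 logged two metrics
    {"val_loss": 0.47},                   # replica on GPU 1 logged only one
]
key_counts = {len(out) for out in outputs}
if len(key_counts) > 1:
    raise ValueError('All dicts must have the same number of keys')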
Stack Trace:
File "model_manager.py", line 263, in <module>
helper.train()
File "model_manager.py", line 97, in train
self.trainer.fit(self.module)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in fit results = self.accelerator_backend.train()
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_backend.py", line 97, in train
results = self.trainer.run_pretrain_routine(model)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
self.train()
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
self.run_training_epoch()
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
self.run_evaluation(test_mode=False)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 333, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 661, in evaluation_forward
output = model(*args)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 86, in forward
outputs = self.__gather_structured_result(outputs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 101, in __gather_structured_result
outputs = self.gather(outputs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 141, in gather
res = gather_map(outputs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 129, in gather_map
raise ValueError('All dicts must have the same number of keys')
ValueError: All dicts must have the same number of keys
Exception ignored in: <function tqdm.__del__ at 0x7f83fe2ecb80>
Traceback (most recent call last):
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1087, in __del__
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1294, in close
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1472, in display
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1090, in __repr__
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1434, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Steps to reproduce the behavior:
def training_step(self, batch, batch_idx):
    """
    Lightning calls this inside the training loop
    :param batch:
    :return:
    """
    # forward pass
    x, y = batch['x'], batch['y']
    y_pred = self.forward(x)
    # calculate loss
    loss = self.loss(y_pred, y)
    result = ptl.TrainResult(loss)
    result.log('train_loss', loss, prog_bar=True)
    return result

def validation_step(self, batch, batch_idx):
    """
    Lightning calls this inside the validation loop
    :param batch:
    :return:
    """
    x, y = batch['x'], batch['y']
    y_pred = self.forward(x)
    # calculate loss
    loss = self.loss(y_pred, y)
    # calculate accuracy
    labels_hat = torch.argmax(y_pred, dim=1)
    accuracy = torch.sum(y == labels_hat).item() / (len(y) * 1.0)
    accuracy = torch.tensor(accuracy)
    if self.on_gpu:
        accuracy = accuracy.cuda(loss.device.index)
    # Checkpoint model based on validation loss
    result = ptl.EvalResult(early_stop_on=None, checkpoint_on=loss)
    result.log('val_loss', loss, prog_bar=True)
    result.log('val_acc', accuracy, prog_bar=True)
    return result
Run the Trainer so that the training and evaluation steps complete a few times; a reproduction sketch follows below. The error will pop up at some random epoch. For me it usually appears within the first 20 epochs. Also, running Trainer.test() on a crashed run will probably fail with the same error. I think it has something to do with the Result objects, but I cannot easily get more detail, as I'm running the models on a remote server.
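A minimal sketch of the reproduction setup, assuming two GPUs and the DataParallel backend; `module` stands in for the LightningModule containing the steps above, and max_epochs=20 is only illustrative:

import pytorch_lightning as ptl

# `module` is assumed to be an instance of the LightningModule defined above.
trainer = ptl.Trainer(gpus=2, distributed_backend='dp', max_epochs=20)
trainer.fit(module)   # the ValueError appears during validation at a random epoch
trainer.test(module)  # calling .test on a crashed run often fails the same way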
Installation (conda, pip, source): conda

We are currently working on refactoring the accelerator backend, which may help fix this issue. That being said, we generally find that DDP training is faster and more stable, so we encourage the use of DDP, which doesn't seem to have this issue with results dicts at the moment. I will look into this in the meantime!
Thanks for the info!
I found out that DDP uses more memory than DP, so I've been preferring DP for a while now, to fit more data into each batch.
I'm not sure whether DDP's speedup outweighs DP's memory efficiency in the end, but I'll try it out. Thanks again!
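For reference, this is roughly the switch I will try, a minimal sketch assuming the same two-GPU setup (`module` again stands in for my LightningModule):

import pytorch_lightning as ptl

# Same setup as before, but using DistributedDataParallel instead of DataParallel.
trainer = ptl.Trainer(gpus=2, distributed_backend='ddp')
trainer.fit(module)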