Any ideas on how to debug this issue?
It is happening to me in many different models, after I refactored the Result logging in the training_step, validation_step and test_step methods, changing the old dictionary-based return to the new Result scheme, while training on two GPUs at the same time.
The error does not appear if I use distributed_backend='ddp' instead of 'dp' on the Trainer.
When running evaluation or test routines on the Trainer (either through the .fit evaluation at the end of an epoch or by calling .test directly),
it throws ValueError: All dicts must have the same number of keys.
After reading the error log I think it has something to do with the metric logging, but I can't figure out what exactly. The error appears very inconsistently across epochs and runs, so I'm looking for ideas on how I could get more details to get to the root of the issue.
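To illustrate what the error message itself is checking, here is a minimal, hypothetical sketch (not Lightning's actual code) of a DataParallel-style gather failing because the per-GPU result dicts do not have the same number of keys; the metric names and values are made up:

# Hypothetical illustration: two GPU replicas return result dicts with
# different key sets, so the gather step rejects them.
outputs = [
    {"val_loss": 0.52, "val_acc": 0.81},  # replica on GPU 0 logged two metrics
    {"val_loss": 0.47},                   # replica on GPU 1 logged only one
]
key_counts = {len(out) for out in outputs}
if len(key_counts) > 1:
    raise ValueError('All dicts must have the same number of keys')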
Stack Trace:
File "model_manager.py", line 263, in <module>
helper.train()
File "model_manager.py", line 97, in train
self.trainer.fit(self.module)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in fit results = self.accelerator_backend.train()
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_backend.py", line 97, in train
results = self.trainer.run_pretrain_routine(model)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
self.train()
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
self.run_training_epoch()
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
self.run_evaluation(test_mode=False)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 333, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 661, in evaluation_forward
output = model(*args)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 86, in forward
outputs = self.__gather_structured_result(outputs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 101, in __gather_structured_result
outputs = self.gather(outputs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 141, in gather
res = gather_map(outputs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 129, in gather_map
raise ValueError('All dicts must have the same number of keys')
ValueError: All dicts must have the same number of keys
Exception ignored in: <function tqdm.__del__ at 0x7f83fe2ecb80>
Traceback (most recent call last):
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1087, in __del__
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1294, in close
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1472, in display
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1090, in __repr__
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1434, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Steps to reproduce the behavior:
def training_step(self, batch, batch_idx):
    """
    Lightning calls this inside the training loop
    :param batch:
    :return:
    """
    # forward pass
    x, y = batch['x'], batch['y']
    y_pred = self.forward(x)
    # calculate loss
    loss = self.loss(y_pred, y)
    result = ptl.TrainResult(loss)
    result.log('train_loss', loss, prog_bar=True)
    return result

def validation_step(self, batch, batch_idx):
    """
    Lightning calls this inside the validation loop
    :param batch:
    :return:
    """
    x, y = batch['x'], batch['y']
    y_pred = self.forward(x)
    # calculate loss
    loss = self.loss(y_pred, y)
    # calculate accuracy
    labels_hat = torch.argmax(y_pred, dim=1)
    accuracy = torch.sum(y == labels_hat).item() / (len(y) * 1.0)
    accuracy = torch.tensor(accuracy)
    if self.on_gpu:
        accuracy = accuracy.cuda(loss.device.index)
    # Checkpoint model based on validation loss
    result = ptl.EvalResult(early_stop_on=None, checkpoint_on=loss)
    result.log('val_loss', loss, prog_bar=True)
    result.log('val_acc', accuracy, prog_bar=True)
    return result
Run the Trainer so that the training and evaluation steps complete a few times; a reproduction sketch follows below. The error will pop up at some random epoch. For me it usually appears within the first 20 epochs. Also, running Trainer.test() on a crashed run will probably fail with the same error. I think it has something to do with the Result objects, but I cannot easily get more detail, as I'm running the models on a remote server.
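A minimal sketch of the reproduction setup, assuming two GPUs and the DataParallel backend; `module` stands in for the LightningModule containing the steps above, and max_epochs=20 is only illustrative:

import pytorch_lightning as ptl

# `module` is assumed to be an instance of the LightningModule defined above.
trainer = ptl.Trainer(gpus=2, distributed_backend='dp', max_epochs=20)
trainer.fit(module)   # the ValueError appears during validation at a random epoch
trainer.test(module)  # calling .test on a crashed run often fails the same way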
Installation (conda, pip, source): conda

We are currently working on refactoring the accelerator backend, which may help fix this issue. That being said, we generally find that DDP training is faster and more stable, so we encourage the use of DDP, which doesn't seem to have this issue with results dicts at the moment. I will look into this in the meantime!
Thanks for the info!
I found out that DDP uses more memory than DP, so I've been preferring DP for a while now, to fit more data into each batch.
I'm not sure whether DDP's speedup outweighs DP's memory efficiency in the end, but I'll try it out. Thanks again!
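For reference, this is roughly the switch I will try, a minimal sketch assuming the same two-GPU setup (`module` again stands in for my LightningModule):

import pytorch_lightning as ptl

# Same setup as before, but using DistributedDataParallel instead of DataParallel.
trainer = ptl.Trainer(gpus=2, distributed_backend='ddp')
trainer.fit(module)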