This happens in the training loop.
ValueError: only one element tensors can be converted to Python scalars
From my observation, I believe this happens when the batch size is not divisible by the number of GPUs: for example, on the last batch of each epoch, or when you have 4 GPUs but set the batch size to 2.
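For reference, that ValueError is what PyTorch raises when float() or .item() is called on a tensor holding more than one element, so somewhere a multi-element tensor is being treated as a scalar. A minimal illustration:

```python
import torch

t = torch.tensor([0.5, 0.7])  # more than one element
float(t)                      # ValueError: only one element tensors can be converted to Python scalars
# t.mean().item()             # works once the tensor is reduced to a single element
```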
I think it would be nice to use only some of the GPUs the user specified, while printing a message telling them that the GPUs are not configured correctly. The current implementation simply throws an unfriendly error.
Do I understand correctly that this happens with every Lightning training run where batch_size < num_gpus?
Then I would also like to see a warning message like you described, and to automatically set num_gpus to the batch size.
However, we still have the problem that batch_size > num_gpus holds for every batch except the last one when drop_last=False is set on the DataLoader. What do we do then?
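One possible stop-gap, if losing the leftover samples is acceptable, is to let the DataLoader discard the incomplete final batch. A minimal sketch, where the dataset is just placeholder data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(100, 16), torch.randn(100, 1))  # placeholder data

# drop_last=True discards the incomplete final batch, so every batch that
# reaches the GPUs has the full batch_size (assumed here to be >= num_gpus).
train_loader = DataLoader(train_dataset, batch_size=12, shuffle=True, drop_last=True)
```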
Would the same error occur in torch.nn.DataParallel (without PL)?
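A quick way to check would be something like the sketch below; it needs a multi-GPU machine, and the tiny module is just a placeholder:

```python
import torch
import torch.nn as nn

class ToyLossModule(nn.Module):
    """Returns a scalar (0-dim) loss per replica, like a training_step would."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 1)

    def forward(self, x):
        return self.fc(x).mean()

model = nn.DataParallel(ToyLossModule().cuda(), device_ids=[0, 1, 2, 3])
x = torch.randn(3, 8).cuda()  # batch smaller than the number of GPUs

loss = model(x)      # DataParallel gathers the per-replica scalars into a 1-D tensor
print(loss.shape)    # torch.Size([3]) here, since only 3 replicas received data
float(loss)          # raises the same ValueError unless you reduce first, e.g. loss.mean()
```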
The last batch seems a bit tricky. I'm no expert on this, but I wonder if it's possible to send the data to only some of the specified GPUs (for example, with 4 GPUs in total, send the 3 sub-batches to the first 3 GPUs). I remember there's some send_batch_to_gpu function in PL.
@neggert @jeffling pls ^^
@Ir1d btw, the person in #1236 is getting the same error but doesn't have a GPU.
@Ir1d would you mind sending a PR or providing an example to replicate it?
I can provide an example next week. Currently I'm not sure how to fix this.
Hi @Borda, I've reproduced the same issue here:
https://github.com/Richarizardd/pl_image_classification
It's basic image classification with PL using the MNIST, CIFAR10, and ImageFolder datasets from torchvision. If you run the mnist_gpu1.yaml config file, you will get the same issue as @Ir1d.
Hi @Richarizardd
I looked at your code and found that in your validation_epoch_end you don't reduce the outputs properly. PL does not do this for you. This is intentional, right, @williamFalcon?
So, in your validation_epoch_end, instead of

```python
metric_total = 0
for output in outputs:
    metric_total += output[metric_name]
tqdm_dict[metric_name] = metric_total / len(outputs)
```

you should do the following:

```python
tqdm_dict[metric_name] = torch.stack([output[metric_name] for output in outputs]).mean()
```
I tested this by adding it to your code and it worked (no error).
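For context, the whole hook could look roughly like this (a minimal sketch; the metric names and the returned dict layout are assumptions about your setup):

```python
import torch

def validation_epoch_end(self, outputs):
    tqdm_dict = {}
    for metric_name in ("val_loss", "val_acc"):
        # Stack the per-step tensors (one per GPU/step) and reduce to a single scalar.
        tqdm_dict[metric_name] = torch.stack(
            [output[metric_name] for output in outputs]
        ).mean()
    return {"progress_bar": tqdm_dict, "log": tqdm_dict, "val_loss": tqdm_dict["val_loss"]}
```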
As far as I can tell, this is not a bug in PL. However, we could print a better error message.
@Ir1d you probably had the same mistake.

In @Richarizardd's case, the error was thrown in validation_epoch_end. Could you post your stack trace here so I can check if it's the same?
Your metric reduction looks fine to me.
I tried again and the issue is still there in PL v0.7.3.
Here's the whole log for a recent run. I set batch_size=3 on 4 GPUs and told PL to use all 4 of them, and the error happens on the first batch. (In another case, with drop_last not set and batch_size=12 on 4 GPUs, it happens on the last batch of the epoch, which suggests the error occurs whenever the batch size is smaller than the number of GPUs.)

I'll try to put together a minimal reproduction after I finish my midterm tests. Currently my code is available at https://github.com/ir1d/AFN, but the data is a bit large and might be hard to run.
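In the meantime, a minimal reproduction would probably look something like the sketch below (written against the 0.7-era API; the random dataset and toy model are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(16, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        ds = TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
        # batch_size=3 on 4 GPUs: every batch is smaller than num_gpus.
        return DataLoader(ds, batch_size=3)


# 4 GPUs with the dp backend, matching the failing configuration above.
trainer = pl.Trainer(gpus=4, distributed_backend="dp", max_epochs=1)
trainer.fit(ToyModel())
```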
@Ir1d I found the bug in the trainer code: it does not reduce the outputs if the output size of the training step does not equal num_gpus. I will make a PR to fix it.
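To illustrate the failure mode (illustrative only, not the actual trainer code): if the reduction is guarded by an exact-length check against num_gpus, it gets skipped whenever fewer GPUs received data, and a multi-element tensor leaks through to code that expects a scalar:

```python
import torch

num_gpus = 4
# Only 3 GPUs received samples (batch_size=3), so only 3 losses were gathered.
gathered = torch.tensor([0.31, 0.28, 0.30])

# A reduction guarded by an exact-length check silently skips the mean():
loss = gathered.mean() if gathered.numel() == num_gpus else gathered

float(loss)  # ValueError: only one element tensors can be converted to Python scalars
```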
@Ir1d The fix got merged. Please verify it against the latest master branch. Closing for now.