Pytorch-lightning: 0.9.0 converges to worse loss than 0.8.5 with exact same code!

Created on 29 Sep 2020  ·  7 comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I have code that was working quite nicely with 0.8.5. When I upgraded to 0.9.0, the final loss is WORSE (higher!) and fluctuates a bit. The code is exactly the same. This is very concerning.

(Also 0.9.0 forces tensorboard to downgrade from 2.3.0 to 2.2.0 and I can't really see the curves any more, but that is not my main concern.)

To Reproduce

Steps to reproduce the behavior:

  1. I am running on CPU
  2. My code is in two ipynb notebooks in this gist: https://gist.github.com/turian/e8fd87abb7adf7f357e685c96ec1ef85
  3. ddsp-snare-0-8-5-ipynb (https://gist.github.com/turian/e8fd87abb7adf7f357e685c96ec1ef85#file-ddsp-snare-0-8-5-ipynb) and ddsp-snare-0-9-0-ipynb (https://gist.github.com/turian/e8fd87abb7adf7f357e685c96ec1ef85#file-ddsp-snare-0-9-0-ipynb) are exactly the same, except that the first forces installation of pytorch-lightning 0.8.5 and the second 0.9.0.
  4. When 0.8.5 is run, it always gets down to a loss of around 12, and the resulting audio sounds great, even with different random initializations.
  5. When 0.9.0 is run, it only gets down to a loss of around 17 (sometimes a little better, depending on the random initialization) and the resulting audio sounds wrong.

Code sample

I have not yet been able to reduce this to a minimal reproduction. If that is necessary for you to help with this bug report, please let me know and I will endeavour to help however I can.

Expected behavior

0.9.0 should give results as good as 0.8.5, or we should be able to understand why it doesn't and fix it. We should also improve the docs to explain how to migrate to 0.9.0 if code changes are required.

Environment

  • CUDA:

    • GPU:

    • available: False

    • version: None

  • Packages:

    • numpy: 1.19.2

    • pyTorch_debug: False

    • pyTorch_version: 1.6.0

    • pytorch-lightning: 0.9.0 vs 0.8.5

    • tqdm: 4.48.2

  • System:

    • OS: Darwin

    • architecture: 64bit

    • processor: i386

    • python: 3.8.5

    • version: Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64

Labels: working as intended, question

All 7 comments

Can you confirm that in both versions you are using torch 1.6?

Yes, in both versions torch = 1.6

Hi,
I copied your notebooks to Google Colab and added a pl.seed_everything(100) call to the cell in which you call trainer.fit().
In both versions, I get exactly the same loss value after 10 epochs.

0.8.5:
https://drive.google.com/file/d/1KP8GRmY7fy_b5bRRU-1K1P3Z0N705rEH/view?usp=sharing

0.9.0
https://drive.google.com/file/d/1K_TL6W-sK_HdHBcncmy5irEMFMU7z_4Q/view?usp=sharing

If your model is sensitive to initialization, of course you will get different results.
When you compare runs, you need to set the seed.
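
For reference, a minimal sketch of such a seeded run (TinyModel and the random dataset are made-up stand-ins, not the model from the gist):

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class TinyModel(pl.LightningModule):
        # Stand-in LightningModule; the gist's DDSP model would go here instead.
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.mse_loss(self.layer(x), y)
            return {"loss": loss}  # dict return is accepted by both 0.8.5 and 0.9.0

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    # Seed Python, NumPy and torch BEFORE building the model and dataloader,
    # so weight initialization and batch order match across runs and versions.
    pl.seed_everything(100)

    train_loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)
    model = TinyModel()
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_loader)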

Also, please be careful with these notebooks: if you run the cell with trainer.fit multiple times, it will not train your model from scratch; it will simply continue, because the model's variables are still in memory from the previous run.
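
Concretely, with the sketch above, the difference between re-running the fit cell and a genuinely fresh run looks roughly like this:

    # Re-running only this cell does NOT train from scratch: `model` still holds
    # the weights updated by the previous trainer.fit() call, so a new Trainer
    # just keeps training the already-fitted model.
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_loader)

    # For a fresh, comparable run, re-seed and rebuild the model in the same cell.
    pl.seed_everything(100)
    model = TinyModel()
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_loader)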

Please confirm ASAP that my findings are correct.

Yes haha.

Please set the seed :) it's stated very boldly in the docs.

This is user error, closing.

@awaelchli could you please give me access to your colabs?

Yeah, sorry, the link was not public.

