Pytorch-lightning: 0.9.0 converges to worse loss than 0.8.5 with exact same code!

Created on 29 Sep 2020  ·  7 comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I have code that was working quite nicely with 0.8.5. When I upgraded to 0.9.0, the final loss is WORSE (higher!) and fluctuates a bit. The code is exactly the same. This is very concerning.

(Also 0.9.0 forces tensorboard to downgrade from 2.3.0 to 2.2.0 and I can't really see the curves any more, but that is not my main concern.)

To Reproduce

Steps to reproduce the behavior:

  1. I am running on CPU
  2. My code is in two ipynb notebooks in this gist: https://gist.github.com/turian/e8fd87abb7adf7f357e685c96ec1ef85
  3. ddsp-snare-0-8-5-ipynb (https://gist.github.com/turian/e8fd87abb7adf7f357e685c96ec1ef85#file-ddsp-snare-0-8-5-ipynb) and ddsp-snare-0-9-0-ipynb (https://gist.github.com/turian/e8fd87abb7adf7f357e685c96ec1ef85#file-ddsp-snare-0-9-0-ipynb) are exactly the same, except that the first forces installation of pytorch-lightning 0.8.5 and the second 0.9.0.
  4. When 0.8.5 is run, it always gets down to a loss of around 12, and the resulting audio sounds great, even with different random initializations.
  5. When 0.9.0 is run, it only gets down to a loss of around 17 (sometimes a little better, depending on the random initialization) and the resulting audio sounds wrong.

Code sample

I have not yet been able to reduce this to a minimal reproduction. If that is necessary for you to help with this bug report, please let me know and I will endeavour to help however I can.

Expected behavior

0.9.0 should give results as good as 0.8.5, or we should be able to understand why it doesn't and fix it. We should also improve the docs to explain how to migrate to 0.9.0 if code changes are required.

Environment

  • CUDA:

    • GPU:

    • available: False

    • version: None

  • Packages:

    • numpy: 1.19.2

    • pyTorch_debug: False

    • pyTorch_version: 1.6.0

    • pytorch-lightning: 0.9.0 vs 0.8.5

    • tqdm: 4.48.2

  • System:

    • OS: Darwin

    • architecture: 64bit

    • processor: i386

    • python: 3.8.5

    • version: Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64

Labels: working as intended, question

All 7 comments

Can you confirm that in both versions you are using torch 1.6?

Yes, in both versions torch = 1.6

Hi,
I copied your notebooks to Google Colab and added a pl.seed_everything(100) call to the cell in which you call trainer.fit().
In both versions, I get exactly the same loss value after 10 epochs.

0.8.5:
https://drive.google.com/file/d/1KP8GRmY7fy_b5bRRU-1K1P3Z0N705rEH/view?usp=sharing

0.9.0
https://drive.google.com/file/d/1K_TL6W-sK_HdHBcncmy5irEMFMU7z_4Q/view?usp=sharing

If your model is sensitive to initialization, of course you will get different results.
When you compare runs, you need to set the seed.
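
For reference, a minimal sketch of such a seeded run (TinyModel and the random dataset are made-up stand-ins, not the model from the gist):

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class TinyModel(pl.LightningModule):
        # Stand-in LightningModule; the gist's DDSP model would go here instead.
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.mse_loss(self.layer(x), y)
            return {"loss": loss}  # dict return is accepted by both 0.8.5 and 0.9.0

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    # Seed Python, NumPy and torch BEFORE building the model and dataloader,
    # so weight initialization and batch order match across runs and versions.
    pl.seed_everything(100)

    train_loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)
    model = TinyModel()
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_loader)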

Also, please be careful with these notebooks: if you run the cell with trainer.fit multiple times, it will not train your model from scratch; it will simply continue, because the model's variables are still in memory from the previous run.
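
Concretely, with the sketch above, the difference between re-running the fit cell and a genuinely fresh run looks roughly like this:

    # Re-running only this cell does NOT train from scratch: `model` still holds
    # the weights updated by the previous trainer.fit() call, so a new Trainer
    # just keeps training the already-fitted model.
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_loader)

    # For a fresh, comparable run, re-seed and rebuild the model in the same cell.
    pl.seed_everything(100)
    model = TinyModel()
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_loader)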

Please confirm ASAP that my findings are correct.

Yes haha.

Please set the seed :) it's stated very boldly in the docs.

This is user error, closing.

@awaelchli could you please give me access to your colabs?

Yeah, sorry, the link was not public.

