Pytorch-lightning: Training slows down with long epoch

Created on 4 Jul 2020 · 2 comments · Source: PyTorchLightning/pytorch-lightning

โ“ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I'm doing BERT transfer learning on a single GPU (the same happens with 2 or 4 GPUs) on a large dataset.
Each epoch has about 1.7M steps, and training speed slows down roughly linearly, to the point where the
estimated remaining time starts to increase.

Could this be related to pytorch_lightning concatenating the outputs of each step in order to pass them to training_epoch_end?

Is it possible to disable this behaviour, so that losses and logs are kept only briefly, written to disk, and then discarded?
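For context, here is a minimal sketch of the pattern in question, assuming the PyTorch Lightning 0.8.x dict-return API (the module, layer sizes, and helper names are illustrative, not the actual code from this issue). Everything returned from training_step is collected so it can be handed to training_epoch_end, so keeping that return value small and detached limits what has to be held across a 1.7M-step epoch.

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl


class FineTuner(pl.LightningModule):
    """Illustrative module; layer sizes and data layout are placeholders."""

    def __init__(self):
        super().__init__()
        self.classifier = nn.Linear(768, 2)  # stand-in for a BERT head

    def training_step(self, batch, batch_idx):
        features, labels = batch
        loss = nn.functional.cross_entropy(self.classifier(features), labels)
        # Whatever is returned here is accumulated for training_epoch_end,
        # so return only what is needed and detach tensors used for logging.
        return {"loss": loss, "log": {"train_loss": loss.detach()}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=2e-5)
```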

Code

What have you tried?

What's your environment?

  • Ubuntu 20.04 Server 4.15.0-108-generic
  • CUDA 10.2, CuDNN 7.6.5
  • PyTorch Lightning 0.8.4
  • PyTorch 1.5.1
Label: question

All 2 comments

Removing the log and progress_bar entries from the dict returned by training_step seems to solve the issue.
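As an illustration of that workaround (a sketch in the same 0.8.x dict-return style; shared_step is a hypothetical helper, not from the original code), the stripped-down step would return only the loss:

```python
def training_step(self, batch, batch_idx):
    loss = self.shared_step(batch)  # hypothetical helper computing the loss
    # No "log" or "progress_bar" dicts: there is far less per-step state
    # for Lightning to retain across a very long epoch.
    return {"loss": loss}
```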

I hadn't noticed that updating Ubuntu moved the CUDA drivers to version 11.0. Moreover, I removed those schedulers, and training speed now appears to be stable. Closing, since I no longer think this was related to pytorch_lightning.
