I'm doing BERT transfer-learning on a single GPU (the same happens with 2 or 4 GPUs...) and on a large dataset.
Each epoch has about 1.7M steps, and training slows down linearly over the course of the epoch, to the point that the estimated remaining time starts to increase.
Could this be related to the fact that pytorch_lightning concatenates the outputs of every step to pass them to training_epoch_end?
Is it possible to disable this behaviour, so that losses and logs are kept only briefly, written to disk, and then discarded?
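For reference, here is a minimal sketch of the pattern I mean, assuming the older dict-based logging API (pre-1.0 pytorch_lightning), where every dict returned by training_step is collected and handed to training_epoch_end as one big list; the module and keys below are illustrative stand-ins, not my actual model:

```python
import torch
import pytorch_lightning as pl


class BertFineTuner(pl.LightningModule):  # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(768, 2)  # placeholder for BERT + classification head

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.head(x), y)
        # Each of these dicts is kept until the epoch ends, so with ~1.7M steps
        # the collected list keeps growing over the whole epoch.
        return {
            "loss": loss,
            "log": {"train_loss": loss},
            "progress_bar": {"train_loss": loss},
        }

    def training_epoch_end(self, outputs):
        # `outputs` is the list of all per-step dicts accumulated over the epoch.
        avg_loss = torch.stack([o["loss"] for o in outputs]).mean()
        return {"log": {"avg_train_loss": avg_loss}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=2e-5)
```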
Removing the log and progress_bar entries from the dict returned by training_step seems to solve the issue.
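A minimal sketch of that workaround, under the same assumptions as above (names illustrative): return only the loss from training_step and drop the training_epoch_end hook, so there is nothing for Lightning to accumulate across the epoch.

```python
import torch
import pytorch_lightning as pl


class BertFineTunerLean(pl.LightningModule):  # hypothetical variant of the module above
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(768, 2)  # placeholder for BERT + classification head

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.head(x), y)
        # No "log"/"progress_bar" keys and no training_epoch_end hook,
        # so per-step outputs are not retained for the whole epoch.
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=2e-5)
```

On pytorch_lightning 1.x the same effect is usually obtained by calling self.log("train_loss", loss, prog_bar=True) inside training_step, which logs incrementally instead of returning the values for end-of-epoch collection.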
I didn't notice that updating Ubuntu had moved the CUDA drivers to version 11.0. Moreover, I removed the schedulers, and training speed now appears to be stable. Closing, since I no longer think this was related to pytorch_lightning.