I'm doing BERT transfer-learning on a single GPU (the same happens with 2 or 4 GPUs...) and on a large dataset.
Each epoch has about 1.7M steps, and training slows down linearly over the course of the epoch, to the point that the estimated remaining time starts to increase.
Could this be related to the fact that pytorch_lightning concatenates the outputs of every step to pass them to training_epoch_end?
Is it possible to disable this behaviour, so that losses and logs are kept only briefly, written to disk, and then discarded?
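For reference, here is a minimal sketch of the pattern I mean, assuming the older dict-based logging API (pre-1.0 pytorch_lightning), where every dict returned by training_step is collected and handed to training_epoch_end as one big list; the module and keys below are illustrative stand-ins, not my actual model:

```python
import torch
import pytorch_lightning as pl


class BertFineTuner(pl.LightningModule):  # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(768, 2)  # placeholder for BERT + classification head

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.head(x), y)
        # Each of these dicts is kept until the epoch ends, so with ~1.7M steps
        # the collected list keeps growing over the whole epoch.
        return {
            "loss": loss,
            "log": {"train_loss": loss},
            "progress_bar": {"train_loss": loss},
        }

    def training_epoch_end(self, outputs):
        # `outputs` is the list of all per-step dicts accumulated over the epoch.
        avg_loss = torch.stack([o["loss"] for o in outputs]).mean()
        return {"log": {"avg_train_loss": avg_loss}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=2e-5)
```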
Removing the log and progress_bar entries from the dict returned by training_step seems to solve the issue.
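A minimal sketch of that workaround, under the same assumptions as above (names illustrative): return only the loss from training_step and drop the training_epoch_end hook, so there is nothing for Lightning to accumulate across the epoch.

```python
import torch
import pytorch_lightning as pl


class BertFineTunerLean(pl.LightningModule):  # hypothetical variant of the module above
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(768, 2)  # placeholder for BERT + classification head

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.head(x), y)
        # No "log"/"progress_bar" keys and no training_epoch_end hook,
        # so per-step outputs are not retained for the whole epoch.
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=2e-5)
```

On pytorch_lightning 1.x the same effect is usually obtained by calling self.log("train_loss", loss, prog_bar=True) inside training_step, which logs incrementally instead of returning the values for end-of-epoch collection.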
I didn't notice that updating Ubuntu had moved the CUDA drivers to version 11.0. Moreover, I removed the schedulers, and training speed now appears to be stable. Closing, since I no longer think this was related to pytorch_lightning.