Pytorch-lightning: Logging on slurm stopped working

Created on 22 Jun 2020 · 5 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

Logging and checkpoint saving stopped working for me when I run experiments through the SLURM system.
I return log keys from training_epoch_end and validation_epoch_end.
Version 0.7.6 works.

To Reproduce

Steps to reproduce the behaviour:

  1. Define a TensorBoard logger.
  2. Run training through SLURM with sbatch ...
  3. No logs are written.

Code sample

Expected behaviour

Environment

  • PyTorch 1.4.0
  • PyTorch-lightning 0.8.1
  • Linux
  • Python 3.7.6
  • CUDA 10.1, cuDNN 7.6.5

Labels: bug / fix, help wanted

All 5 comments

Hi! Thanks for your contribution! Great first issue!

Hi, I think I'm having the same problem. When I run locally, logging works correctly (I'm sending to Comet). When I run on a cluster through SLURM using sbatch or srun, the experiments are created in Comet, but none of the logging works.

Edit: Downgraded to 0.7.6 and it works.
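
One way to compare a local run with an sbatch/srun run is to print the environment variables that a rank is typically derived from. The sketch below is only a diagnostic aid, not part of Lightning: the helper name collect_rank_env is hypothetical, and which of these variables a given Lightning version actually reads is an assumption to verify against its source.

```python
import os

# Variables commonly used to derive a process rank. The SLURM_* names are set
# by SLURM for each task; RANK and LOCAL_RANK are set by some distributed
# launchers. Which of them a given library version reads is an assumption.
RANK_VARS = (
    "SLURM_PROCID",
    "SLURM_LOCALID",
    "SLURM_NODEID",
    "SLURM_NTASKS",
    "RANK",
    "LOCAL_RANK",
)


def collect_rank_env(environ=None):
    """Return the rank-related variables, with None for any that are unset.

    Hypothetical helper for debugging; pass a dict to inspect something
    other than the current process environment.
    """
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in RANK_VARS}


if __name__ == "__main__":
    # Run this at the top of the training script, both locally and under SLURM.
    for name, value in collect_rank_env().items():
        print(f"{name}={value!r}")
```

If the values differ between the local run and the SLURM job, that gives a concrete starting point for checking how the rank ends up being set.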

I think this might be caused by how the rank id is set. I'm not totally sure, but the change could have come in here: https://github.com/PyTorchLightning/pytorch-lightning/pull/2231
My guess is that rank_zero_only malfunctions, such that the gated code is never executed.
See also the comment in https://github.com/PyTorchLightning/pytorch-lightning/issues/2278#issuecomment-646997797
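
To illustrate the suspected failure mode, here is a minimal sketch of a rank-gating decorator. It is a simplified stand-in written for this thread, not Lightning's actual rank_zero_only implementation; the names rank_zero_only_sketch and log_metrics are hypothetical. It shows how a wrongly set rank makes gated logging calls return silently without doing anything.

```python
from functools import wraps


def rank_zero_only_sketch(fn):
    """Run fn only when the recorded rank is 0 (hypothetical stand-in)."""

    @wraps(fn)
    def wrapped(*args, **kwargs):
        if rank_zero_only_sketch.rank == 0:
            return fn(*args, **kwargs)
        # On any other rank the call is skipped without an error or warning.
        return None

    return wrapped


# The rank is stored as an attribute on the decorator itself.
rank_zero_only_sketch.rank = 0

logged = []


@rank_zero_only_sketch
def log_metrics(metrics):
    """Pretend logger call; records the metrics it receives."""
    logged.append(metrics)
    return True


# Correct rank: the gated function runs.
rank_zero_only_sketch.rank = 0
ok_result = log_metrics({"val_loss": 0.5})

# Simulate the rank being set incorrectly on the only process:
# the same call now does nothing and reports no failure.
rank_zero_only_sketch.rank = 1
skipped_result = log_metrics({"val_loss": 0.4})
```

Because the skipped call raises nothing, the symptom is exactly "no logs" rather than a crash, which would be consistent with the reports above if the rank is mis-set under SLURM.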

If you want a quick fix, just remove this line. (Dirty solution)

Fixed by #2339

Please run from master or 0.8.2 on June 25
