Logging and checkpoint saving stopped working for me when I run experiments via the SLURM system.
I am logging via the `log` key in the dicts returned from `training_epoch_end`/`validation_epoch_end`.
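Concretely, I mean this pattern (a minimal sketch; the rest of the module is elided):

```python
import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    # training_step / validation_step etc. elided

    def validation_epoch_end(self, outputs):
        # Values under the 'log' key are forwarded to the attached logger
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        return {"val_loss": avg_loss, "log": {"val_loss": avg_loss}}
```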
Version 0.7.6 works.
Steps to reproduce the behaviour:
sbatch ...
Hi, I think I'm having the same problem. Running locally, logging works correctly (I'm sending to Comet), but when I run on a cluster through SLURM using sbatch or srun, the experiments are created in Comet, but none of the logging works.
Edit: Downgraded to 0.7.6 and it works.
I think this might be due to how the rank id is set. I'm not totally sure, but it could have been introduced here: https://github.com/PyTorchLightning/pytorch-lightning/pull/2231
I guess it's due to a malfunction in `rank_zero_only`, such that the gated code is never executed.
See also comment in https://github.com/PyTorchLightning/pytorch-lightning/issues/2278#issuecomment-646997797
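For context, here is a minimal sketch of the `rank_zero_only` pattern (simplified, not Lightning's exact code; the environment-variable lookup order is an assumption, shown only to illustrate the failure mode):

```python
import os
from functools import wraps

def rank_zero_only(fn):
    """Run fn only on the process whose resolved rank is 0 (simplified)."""
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn

# The rank is resolved once from environment variables. Under SLURM the
# process rank arrives via SLURM_PROCID; if that resolution goes wrong,
# every process ends up with a nonzero rank, and anything gated by the
# decorator (logger setup, checkpoint saving) silently never runs.
rank_zero_only.rank = int(os.environ.get("SLURM_PROCID",
                                         os.environ.get("LOCAL_RANK", 0)))
```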
If you want a quick fix, just remove this line (a dirty workaround).
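If editing the installed package isn't an option, an untested alternative is to reset the resolved rank in your own script before constructing the `Trainer` (assuming `rank_zero_only` is importable from `pytorch_lightning.utilities.distributed`, as in this era):

```python
# Untested workaround sketch: override the rank that rank_zero_only resolved
# from the SLURM environment, so the gated logging/checkpoint code runs again.
# Only safe for single-process jobs; in multi-GPU runs every process would log.
from pytorch_lightning.utilities.distributed import rank_zero_only

rank_zero_only.rank = 0  # force this process to be treated as rank 0
```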
Fixed by #2339
Please run from master, or use 0.8.2 once it is released on June 25.