Pytorch-lightning: ModelCheckpoint with custom filepath doesn't support training on multiple nodes

Created on 11 Aug 2020 · 6 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

When training on multiple nodes with a ModelCheckpoint that uses a custom filepath, a FileExistsError is raised by the following line of code: model_checkpoint.py#L127.

Maybe a try-except block is needed?
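For illustration, here is a minimal sketch of what such a try-except could look like. It assumes the failing line is a plain os.makedirs call on the checkpoint directory, which the thread suggests but does not show; the helper name ensure_dir and the directory name are hypothetical and not part of pytorch-lightning.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path`, tolerating the case where another process created it first.

    Hypothetical helper, not part of pytorch-lightning.
    """
    try:
        os.makedirs(path)
    except FileExistsError:
        # Another process won the race. Re-raise only if the existing
        # path is not a directory (e.g. a regular file is in the way).
        if not os.path.isdir(path):
            raise


base = tempfile.mkdtemp()
ckpt_dir = os.path.join(base, "checkpoints")  # hypothetical checkpoint directory

ensure_dir(ckpt_dir)  # the first process creates the directory
ensure_dir(ckpt_dir)  # a later process finds it already there, no error
```

The isdir check keeps the handler from hiding a genuinely wrong path, such as a file that has the same name as the intended directory.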

Priority P0 · bug / fix · help wanted

Source

angshine

All 6 comments

No I think we just need to pass in exist_ok=True into makedirs :)

awaelchli on 11 Aug 2020

👍2
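A minimal sketch of the exist_ok=True suggestion, using only the standard library. Threads stand in for the separate training processes here, and the function and directory names are hypothetical; this is not the code from model_checkpoint.py.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_checkpoint_dir(path):
    """Create the directory if needed and report whether it exists afterwards."""
    # exist_ok=True makes the call a no-op when the directory is already there,
    # so it does not matter which worker gets to create it first.
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


base = tempfile.mkdtemp()
ckpt_dir = os.path.join(base, "run", "checkpoints")  # hypothetical path

# Eight workers all try to create the same directory at roughly the same time.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(make_checkpoint_dir, [ckpt_dir] * 8))
```

Compared with a hand-written try-except, this keeps the intent in a single call. os.makedirs still raises FileExistsError if the path exists but is not a directory.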

@f4hy your PR added this line. Do you know a good way to fix it?

awaelchli on 11 Aug 2020

Ah sorry. I think I know what's up. I'll get a patch out this evening. Sorry!

f4hy on 11 Aug 2020

❤1

@angshine I found a few issues with the model checkpoint path stuff. Not 100% sure I found the particular bug you were seeing but I think this should fix it. Can you give my branch in the above PR a test? Sorry to have introduced this bug for you.

f4hy on 12 Aug 2020

Sorry for the late reply, but it seems that this bug has not been fully fixed. This line still raises an exception: tensorboard.compat.tensorflow_stub.errors.AlreadyExistsError: Directory already exists when training with DDP. I still need to manually add a try-except to ignore the exception.

angshine on 3 Sep 2020

@angshine That line has now been completely replaced on master. Can you give it another try? I hope #3320 has finally resolved this.

f4hy on 4 Sep 2020

👍1
