Pytorch-lightning: ModelCheckpoint with custom filepath doesn't support training on multiple nodes

Created on 11 Aug 2020  ·  6 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

When training on multiple nodes with a ModelCheckpoint that uses a custom filepath, training raises a FileExistsError at the following line of code: model_checkpoint.py#L127.

Maybe a try-except block is needed?
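The failure mode can be reproduced without any distributed setup: a plain os.makedirs call raises FileExistsError whenever the directory is already there, which is exactly what happens when a second DDP rank races the first one to the same checkpoint path. A minimal sketch (the checkpoint directory name here is hypothetical):

```python
import os
import tempfile

# Hypothetical checkpoint directory, standing in for a user-supplied filepath.
ckpt_dir = os.path.join(tempfile.mkdtemp(), "checkpoints")

os.makedirs(ckpt_dir)  # rank 0 creates the directory first

# A second process racing on the same path hits the error from this issue.
try:
    os.makedirs(ckpt_dir)
    raised = None
except FileExistsError as exc:
    raised = type(exc).__name__
```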

Priority P0 bug / fix help wanted

All 6 comments

No I think we just need to pass in exist_ok=True into makedirs :)
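The suggested fix is a one-argument change: with exist_ok=True, os.makedirs becomes idempotent, so every rank can attempt the creation without erroring. A minimal sketch:

```python
import os
import tempfile

ckpt_dir = os.path.join(tempfile.mkdtemp(), "checkpoints")

# exist_ok=True makes the call idempotent: every rank can run it safely.
os.makedirs(ckpt_dir, exist_ok=True)
os.makedirs(ckpt_dir, exist_ok=True)  # no FileExistsError on the second call

created = os.path.isdir(ckpt_dir)
```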

@f4hy your PR added this line. Do you know a good way to fix it?

Ah sorry. I think I know what's up. I'll get a patch out this evening. Sorry!

@angshine I found a few issues with the model checkpoint path stuff. Not 100% sure I found the particular bug you were seeing but I think this should fix it. Can you give my branch in the above PR a test? Sorry to have introduced this bug for you.

Sorry for the late reply, but it seems that this bug has not been fully fixed. This line still raises an exception: tensorboard.compat.tensorflow_stub.errors.AlreadyExistsError: Directory already exists when training with DDP. I still need to manually add a try-except to ignore the exception.
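The manual workaround mentioned here can be wrapped in a small helper. This is a hedged sketch, not Lightning's actual fix: the helper name and parameters are hypothetical, and it generalizes over makedirs-style calls (such as tensorboard's filesystem stub) that lack an exist_ok flag and raise their own "already exists" exception type:

```python
import os
import tempfile

def makedirs_ignoring_existing(path, makedirs_fn=os.makedirs,
                               exists_exc=FileExistsError):
    """Hypothetical helper: swallow 'already exists' errors from a
    makedirs-style call that has no exist_ok flag."""
    try:
        makedirs_fn(path)
    except exists_exc:
        pass  # another process created the directory first; that is fine

ckpt_dir = os.path.join(tempfile.mkdtemp(), "checkpoints")
makedirs_ignoring_existing(ckpt_dir)
makedirs_ignoring_existing(ckpt_dir)  # second call is silently ignored
ok = os.path.isdir(ckpt_dir)
```

For the tensorboard case from this comment, one would pass the stub's makedirs function and its AlreadyExistsError type instead of the os defaults.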

@angshine That line has been completely replaced now on master. Can you give it another try? I hope #3320 has finally resolved this.

