When training on multiple nodes using ModelCheckpoint with a custom filepath, a FileExistsError is raised by the following line of code: model_checkpoint.py#L127.
Maybe a try-except block is needed?
No, I think we just need to pass exist_ok=True into makedirs :)
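For reference, here is a minimal sketch of that suggestion using only the standard library. The `ensure_dir` helper and the directory names are made up for illustration and are not the actual code in model_checkpoint.py; the point is only that `os.makedirs(..., exist_ok=True)` does not fail when another process has already created the directory.

```python
import os
import tempfile


def ensure_dir(path: str) -> str:
    """Create `path` and any missing parents; do nothing if it already exists.

    Hypothetical helper for illustration, not part of any library.
    """
    os.makedirs(path, exist_ok=True)
    return path


# Stand-in for a custom checkpoint directory (assumed layout).
base = tempfile.mkdtemp()
ckpt_dir = os.path.join(base, "checkpoints", "run_0")

ensure_dir(ckpt_dir)  # first process creates the directory
ensure_dir(ckpt_dir)  # a second process making the same call does not raise
```

Note that `exist_ok=True` only covers the case where the path is already a directory. If the path exists as a regular file, `os.makedirs` still raises FileExistsError.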
@f4hy your PR added this line. Do you know a good way to fix it?
Ah sorry. I think I know what's up. I'll get a patch out this evening. Sorry!
@angshine I found a few issues with the model checkpoint path stuff. Not 100% sure I found the particular bug you were seeing but I think this should fix it. Can you give my branch in the above PR a test? Sorry to have introduced this bug for you.
Sorry for the late reply, but it seems that this bug has not been fully fixed. This line still raises an exception: tensorboard.compat.tensorflow_stub.errors.AlreadyExistsError: Directory already exists when training with DDP. I still need to manually add a try-except to ignore the exception.
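A workaround of that kind might look roughly like the sketch below. It is an assumption about the general pattern, not the exact patch used here: `makedirs_ignoring_exists` is a hypothetical helper, and it is demonstrated with the standard library's `os.makedirs` and `FileExistsError` rather than the TensorBoard function and `AlreadyExistsError` from the traceback. The exception type to ignore is passed in, so a different one could be supplied.

```python
import os
import tempfile


def makedirs_ignoring_exists(makedirs_fn, path, exists_errors=(FileExistsError,)):
    """Call makedirs_fn(path) and tolerate an "already exists" error.

    Hypothetical helper for illustration. Returns True if this call created
    the directory and False if it was already there.
    """
    try:
        makedirs_fn(path)
        return True
    except exists_errors:
        # Swallow the error only if the directory really is present
        # (this check assumes a local filesystem path).
        if not os.path.isdir(path):
            raise
        return False


log_dir = os.path.join(tempfile.mkdtemp(), "logs")
first = makedirs_ignoring_exists(os.makedirs, log_dir)   # creates the directory
second = makedirs_ignoring_exists(os.makedirs, log_dir)  # already exists, ignored
```

Catching only the specific "already exists" error, instead of a bare `except`, keeps permission errors and other failures visible.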
@angshine That line has been completely replaced on master now. Can you give it another try? I hope #3320 has finally resolved this.