When I use the TensorFlow Object Detection API to start a training run, interrupt it, and resume it later while TensorBoard is running, training fails because it tries to rename some checkpoint files that are apparently locked by TensorBoard:
2018-01-19 15:54:45.633575: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\
core\framework\op_kernel.cc:1192] Unknown: Failed to rename: C:/Users/Alex/Repositories/MusicObjec
tDetector-TF/MusicObjectDetector/data/checkpoints-faster_rcnn_inception_resnet_v2_atrous_muscima_p
retrained_with_stafflines_dimension_clustering2-train\model.ckpt-92013.index.tempstate676747125244
4121708 to: C:/Users/Alex/Repositories/MusicObjectDetector-TF/MusicObjectDetector/data/checkpoints
-faster_rcnn_inception_resnet_v2_atrous_muscima_pretrained_with_stafflines_dimension_clustering2-t
rain\model.ckpt-92013.index : Access is denied.
; Input/output error
INFO:tensorflow:Error reported to Coordinator: <class tensorflow.python.framework.errors_impl.Unkn
ownError'>, Failed to rename: C:/Users/Alex/Repositories/MusicObjectDetector-TF/MusicObjectDetecto
r/data/checkpoints-faster_rcnn_inception_resnet_v2_atrous_muscima_pretrained_with_stafflines_dimen
sion_clustering2-train\model.ckpt-92013.index.tempstate6767471252444121708 to: C:/Users/Alex/Repos
itories/MusicObjectDetector-TF/MusicObjectDetector/data/checkpoints-faster_rcnn_inception_resnet_v
2_atrous_muscima_pretrained_with_stafflines_dimension_clustering2-train\model.ckpt-92013.index :
Access is denied.
Would it be possible to make sure that TensorBoard does not lock out other processes? Or is it entirely impossible to read a file without locking it? I don't know what TensorBoard actually reads from the *.index file that would take longer than a split second, after which it could release the file immediately. I understand that loading the events from events.out.tfevents.*.* takes a while to process, but that apparently works.
The TensorBoard projector dashboard reads checkpoint files.
https://github.com/tensorflow/tensorboard/blob/dac74be470467f8d01a9e6ad2c3665c9a49f03bb/tensorboard/plugins/projector/projector_plugin.py#L175
To clarify, why does the Tensorflow Object Detection API rename checkpoint files? Thanks!
No Google people have any comments for this issue?
@ybsave The projector plugin is currently structured so that it opens a checkpoint reader and then fetches tensors from the reader on-demand, rather than attempting to consume the entire checkpoint all at once (since checkpoints can be quite large).
I don't think there's much TensorBoard can do to fix this issue on our end. I would recommend either asking the object detection API folks to ensure that their checkpoint logic doesn't attempt to re-write the same checkpoint (which I think is what's leading to the attempted rename here) or asking the TensorFlow folks to make the checkpoint reader robust to checkpoint file renaming (where they might just decide that's not something they care to support).
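For illustration, here is one way a writer could tolerate a short-lived lock on the rename target: retry with backoff. This is only a sketch of the general pattern, not what TensorFlow or the Object Detection API does. The function name `rename_with_retry` and its parameters are hypothetical. It also would not help if the other process keeps the file open for as long as it runs.

```python
import os
import time


def rename_with_retry(src, dst, attempts=5, delay=0.5, rename=os.replace):
    """Rename src to dst, retrying if the rename is refused.

    Hypothetical helper: on Windows, a rename can fail with
    PermissionError ("Access is denied") while another process has
    the destination open. Returns the number of tries it took.
    """
    for attempt in range(attempts):
        try:
            rename(src, dst)
            return attempt + 1
        except PermissionError:
            if attempt == attempts - 1:
                raise  # still locked after the last try; give up
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            time.sleep(delay * (2 ** attempt))
```

The `rename` parameter is there only so the retry logic can be exercised without a real lock; in normal use it stays at `os.replace`.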
I have found that I get this error if I have an Explorer window also watching the folder. I closed the Explorer window and the error stopped appearing. That leads me to think that it's Explorer that is locking the file and TensorBoard gets locked out.
Same thing happening to me as well. If you turn off the tensorboard and the explorer process, it works without hesitation.
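If closing the viewer is not an option, another possible workaround is to point the reader at a private copy of the checkpoint, so that it holds no long-lived handles on the files the trainer rewrites. The sketch below is an assumption-laden illustration, not something TensorBoard supports out of the box. The helper name `snapshot_checkpoint` is hypothetical, and it assumes checkpoint files are named `<prefix>.<suffix>` as in the log above. Copying still opens each file briefly, so this only narrows the window for a collision.

```python
import glob
import os
import shutil
import tempfile


def snapshot_checkpoint(prefix, dest_dir=None):
    """Copy the files of one checkpoint into a private directory.

    `prefix` is a path such as ".../model.ckpt-92013". Returns the
    matching prefix inside the snapshot directory.
    """
    if dest_dir is None:
        dest_dir = tempfile.mkdtemp(prefix="ckpt_snapshot_")
    # Match e.g. model.ckpt-92013.index and model.ckpt-92013.data-*
    for path in glob.glob(glob.escape(prefix) + ".*"):
        # Skip in-progress temp files like *.index.tempstate<digits>
        if ".tempstate" in os.path.basename(path):
            continue
        if os.path.isfile(path):
            shutil.copy2(path, dest_dir)
    return os.path.join(dest_dir, os.path.basename(prefix))
```

The returned prefix can then be handed to whatever reads the checkpoint, while the trainer stays free to rename files in the original directory.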