A problem occurs when running a training job on Google Cloud Machine Learning. Training seems to start (the initial loss and step log appear), but the checkpoint files can't be saved, so the job fails.

Thanks
Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
TensorFlow installed from
Same problem here! I was following the object detection quick start tutorial very closely, except I was using different data and ssd_mobilenet_v2_coco. I tried runtimes 1.8, 1.9 and 1.10 as well as standard_gpu and standard_p100 master and workers, all resulting in the same error as above.
The whole problem seems to be connected to this /tmp/... directory being used as the model directory. In my case it was a typo in the configuration yaml that caused it, but in this and the duplicate issue people are using the --train_dir argument. This should be --model_dir now; otherwise the instance of tf.estimator.Estimator will be created with None as its model directory. It then uses a local temporary directory, which apparently can cause issues when running in the cloud.
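To illustrate how a stale flag can silently leave the model directory unset, here is a minimal sketch using only the Python standard library. It is not the actual training script: it assumes the script ignores flags it does not recognize (mimicked here with argparse's parse_known_args), and the function names and the gs://my-bucket/train path are hypothetical. The second helper is one possible guard that fails fast instead of falling back to a temporary directory.

```python
import argparse


def parse_model_dir(argv):
    """Return the value of --model_dir, or None if the flag was not passed.

    Unrecognized flags (for example the older --train_dir) are ignored
    rather than rejected, which is the assumption this sketch relies on.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", default=None)
    args, _unknown = parser.parse_known_args(argv)
    return args.model_dir


def require_gcs_model_dir(model_dir):
    """Hypothetical guard: insist on an explicit gs:// model directory."""
    if not model_dir:
        raise ValueError(
            "model_dir is not set; pass --model_dir=gs://<bucket>/<path> "
            "(--train_dir is not read by this script)"
        )
    if not model_dir.startswith("gs://"):
        raise ValueError("expected a gs:// path, got: %r" % model_dir)
    return model_dir


# Passing the old flag leaves model_dir unset, so the guard would raise.
legacy = parse_model_dir(["--train_dir=gs://my-bucket/train"])

# Passing the new flag yields the bucket path, which the guard accepts.
current = require_gcs_model_dir(
    parse_model_dir(["--model_dir=gs://my-bucket/train"])
)
```

Calling a guard like this before constructing the Estimator would turn the silent fallback into an immediate, readable error at job start.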
Yeah, I fixed the problem by changing the --train_dir argument to the --model_dir argument in the job submission command, and now the checkpoint files are being saved to the specified bucket directory. The only resource I found that said to use --train_dir to save the checkpoint files is here.
Additionally I found further information about how the model directory in the Estimator is used to save the checkpoint file from the TensorFlow docs here.
Closing this issue since it's resolved. Thanks!