Models: OP_REQUIRES failed at save_restore_v2_ops.cc:109 : Not found: No such file or directory

Created on 14 Nov 2018 · 6 comments · Source: tensorflow/models

The problem occurs when running a training job on Google Cloud Machine Learning Engine. Training seems to start (the initial loss and step are logged), but the checkpoint files cannot be saved, so the job fails.

[Screenshot of the error log, 13 Nov 2018 17:45]

System information

  • What is the top-level directory of the model you are using: models/research/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS High Sierra
  • TensorFlow version (use command below): v1.9.0-0-g25c197e023 1.9.0
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce:
    gcloud ml-engine jobs submit training object_detection_$(date +%s) \
    --runtime-version 1.9 --python-version 3.5 \
    --job-dir=gs://csc3032-fyp/initial-wd-mobilenet/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,cocoapi/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-east1 \
    --config ~/cloud.yaml \
    -- \
    --train_dir=gs://csc3032-fyp/initial-wd-mobilenet/train \
    --pipeline_config_path=gs://csc3032-fyp/initial-wd-mobilenet/data/ssd_mobilenet_v1_coco.config

Thanks



All 6 comments

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update it if it is relevant in your case, or leave it as N/A? Thanks.
TensorFlow installed from

Same problem here! I was following the object detection quick start tutorial very closely, except that I was using different data and ssd_mobilenet_v2_coco. I tried runtimes 1.8, 1.9 and 1.10, as well as standard_gpu and standard_p100 machine types for the master and workers, all resulting in the same error as above.

#5704 discusses the same issue.

The whole problem seems to be connected to the /tmp/... directory being used as the model directory. In my case a typo in the configuration YAML caused it, but in this issue and the duplicate people are passing the --train_dir argument. It should be --model_dir now; otherwise the tf.estimator.Estimator instance is created with None as its model directory. It then falls back to a local temporary directory, which apparently causes problems when running in the cloud.
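To illustrate the fallback, here is a minimal sketch with a bare tf.estimator.Estimator (not the object_detection code itself; the trivial model_fn is only a placeholder, and the bucket path is the one from this issue): with no model directory the Estimator picks a local temporary directory, whereas an explicit GCS path keeps the checkpoints in the bucket.

    import tensorflow as tf

    def model_fn(features, labels, mode):
        # Trivial model_fn; it is never actually run here because train() is not called.
        loss = tf.constant(0.0)
        train_op = tf.no_op()
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # No model_dir: the Estimator falls back to a local /tmp/... directory,
    # which the Cloud ML Engine workers cannot use as a shared checkpoint location.
    estimator = tf.estimator.Estimator(model_fn=model_fn)
    print(estimator.model_dir)  # e.g. /tmp/tmpXXXXXXXX

    # Explicit model directory: checkpoints are written to the GCS bucket instead.
    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        model_dir='gs://csc3032-fyp/initial-wd-mobilenet/train')
    print(estimator.model_dir)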

Yeah, I fixed the problem by changing the --train_dir argument to the --model_dir argument in the job submission command, and now the checkpoint files are being saved to the specified bucket directory. The only resource I found that says to use --train_dir for saving the checkpoint files is here.
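For reference, a sketch of the corrected submission command, reconstructed from the command above with --train_dir replaced by --model_dir and everything else unchanged:

    gcloud ml-engine jobs submit training object_detection_$(date +%s) \
    --runtime-version 1.9 --python-version 3.5 \
    --job-dir=gs://csc3032-fyp/initial-wd-mobilenet/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,cocoapi/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-east1 \
    --config ~/cloud.yaml \
    -- \
    --model_dir=gs://csc3032-fyp/initial-wd-mobilenet/train \
    --pipeline_config_path=gs://csc3032-fyp/initial-wd-mobilenet/data/ssd_mobilenet_v1_coco.config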

Additionally, I found further information in the TensorFlow docs here about how the Estimator's model directory is used to save the checkpoint files.

Closing this issue since it's resolved. Thanks!

