Models: OP_REQUIRES failed at save_restore_v2_ops.cc:109 : Not found: No such file or directory

Created on 14 Nov 2018 · 6 comments · Source: tensorflow/models

The problem occurs when running a training job on Google Cloud Machine Learning Engine. Training seems to start (the initial loss and step are logged), but the checkpoint files cannot be saved, so the job fails.

[Screenshot of the error log, 13 Nov 2018 17:45]

System information

  • What is the top-level directory of the model you are using: models/research/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS High Sierra
  • TensorFlow version (use command below): v1.9.0-0-g25c197e023 1.9.0
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce:
    gcloud ml-engine jobs submit training object_detection_$(date +%s) \
    --runtime-version 1.9 --python-version 3.5 \
    --job-dir=gs://csc3032-fyp/initial-wd-mobilenet/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,cocoapi/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-east1 \
    --config ~/cloud.yaml \
    -- \
    --train_dir=gs://csc3032-fyp/initial-wd-mobilenet/train \
    --pipeline_config_path=gs://csc3032-fyp/initial-wd-mobilenet/data/ssd_mobilenet_v1_coco.config

Thanks



All 6 comments

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update it if it is relevant in your case, or leave it as N/A? Thanks.
TensorFlow installed from

Same problem here! I was following the object detection quick start tutorial very closely, except that I was using different data and ssd_mobilenet_v2_coco. I tried runtimes 1.8, 1.9 and 1.10, as well as standard_gpu and standard_p100 machine types for the master and workers, all resulting in the same error as above.

#5704 discusses the same issue.

The whole problem seems to be connected to the /tmp/... directory being used as the model directory. In my case a typo in the configuration YAML caused it, but in this issue and the duplicate people are passing the --train_dir argument. It should be --model_dir now; otherwise the tf.estimator.Estimator instance is created with None as its model directory. It then falls back to a local temporary directory, which apparently causes problems when running in the cloud.
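To illustrate the fallback, here is a minimal sketch with a bare tf.estimator.Estimator (not the object_detection code itself; the trivial model_fn is only a placeholder, and the bucket path is the one from this issue): with no model directory the Estimator picks a local temporary directory, whereas an explicit GCS path keeps the checkpoints in the bucket.

    import tensorflow as tf

    def model_fn(features, labels, mode):
        # Trivial model_fn; it is never actually run here because train() is not called.
        loss = tf.constant(0.0)
        train_op = tf.no_op()
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # No model_dir: the Estimator falls back to a local /tmp/... directory,
    # which the Cloud ML Engine workers cannot use as a shared checkpoint location.
    estimator = tf.estimator.Estimator(model_fn=model_fn)
    print(estimator.model_dir)  # e.g. /tmp/tmpXXXXXXXX

    # Explicit model directory: checkpoints are written to the GCS bucket instead.
    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        model_dir='gs://csc3032-fyp/initial-wd-mobilenet/train')
    print(estimator.model_dir)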

Yeah, I fixed the problem by changing the --train_dir argument to the --model_dir argument in the job submission command, and now the checkpoint files are being saved to the specified bucket directory. The only resource I found that says to use --train_dir for saving the checkpoint files is here.
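For reference, a sketch of the corrected submission command, reconstructed from the command above with --train_dir replaced by --model_dir and everything else unchanged:

    gcloud ml-engine jobs submit training object_detection_$(date +%s) \
    --runtime-version 1.9 --python-version 3.5 \
    --job-dir=gs://csc3032-fyp/initial-wd-mobilenet/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,cocoapi/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-east1 \
    --config ~/cloud.yaml \
    -- \
    --model_dir=gs://csc3032-fyp/initial-wd-mobilenet/train \
    --pipeline_config_path=gs://csc3032-fyp/initial-wd-mobilenet/data/ssd_mobilenet_v1_coco.config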

Additionally, I found further information in the TensorFlow docs here about how the Estimator's model directory is used to save the checkpoint files.

Closing this issue since it's resolved. Thanks!

