Hi. I am trying to execute training using Google TPU.
It says: Error recorded from training_loop: File system scheme '[local]' not implemented (file: '/tmp/tmpUUrctt/model.ckpt-0_temp_3546763fe0ab4b32a5353eb8f190192c')
I searched and found that this is a common error. The recommended solution is: All input files and the model directory must use a cloud storage bucket path (gs://bucket-name/...), and this bucket must be accessible from the TPU server.
I have double-checked and confirmed the following:
Could I be missing anything? Any help is much appreciated. Thank you.
Hi, which models are you using?
Here are some tutorials: https://cloud.google.com/tpu/docs/tutorials/resnet-2.x
Adding a few Cloud TPU team members to take a look.
@allenwang28 @gagika
It looks like model_dir is set to a local directory in /tmp/, can you double check to make sure that it is a path to a GCS bucket?
@saberkun I am using the ssd_mobilenet_v1_0.75 model.
@allenwang28 The training procedure created a new directory in my bucket, "model_dir", before failing. I assume that means the path is already correct.
It's possible that the directory is being created within data_dir, which is a GCS path. Both data_dir and model_dir should be set to GCS bucket paths.
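One way to check this before launching training is to confirm that every path passed to the job starts with gs://. This is only a minimal sketch in plain Python: the helper name find_local_paths and the bucket name my-bucket are hypothetical, and it checks the path prefix only, not whether the TPU server can actually access the bucket.

```python
def find_local_paths(paths):
    """Return the names of entries whose path does not start with gs://."""
    return sorted(
        name for name, path in paths.items()
        if not str(path).startswith("gs://")
    )


# Hypothetical settings: data_dir is on GCS, model_dir is a local temp dir
# like the one shown in the error message.
paths = {
    "data_dir": "gs://my-bucket/data",
    "model_dir": "/tmp/tmpUUrctt",
}

offenders = find_local_paths(paths)
print(offenders)  # ['model_dir']
```

Any name that shows up in the result is a candidate for the "File system scheme '[local]' not implemented" error.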
Hi.
I was able to execute on Cloud v3 TPUs using local files. An example here: https://github.com/sayakpaul/Generating-categories-from-arXiv-paper-titles/blob/master/TPU_Experimentation.ipynb.
Does this GCS requirement apply to tensorboard's log dir too? i.e. tf.summary.create_file_writer(logdir...)
Yes, for Cloud TPU usage, model_dir (which is the parent directory for most models' logdir) must be a GCS bucket. As a sanity check, I ran an experiment in which data_dir and model_dir were GCS buckets but logdir was replaced with a local directory, and it failed with this same error.
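To keep the log dir from drifting to a local path, one option is to derive it from model_dir instead of setting it separately. This is a rough sketch, not taken from any particular model's code: the helper summary_logdir, the "summaries" subdirectory, and the bucket name are all hypothetical.

```python
import posixpath


def summary_logdir(model_dir, subdir="summaries"):
    """Build a log dir under model_dir, refusing non-GCS parents."""
    if not model_dir.startswith("gs://"):
        raise ValueError(f"model_dir must be a gs:// path, got: {model_dir}")
    # posixpath.join keeps forward slashes on every platform.
    return posixpath.join(model_dir, subdir)


logdir = summary_logdir("gs://my-bucket/model_dir")
print(logdir)  # gs://my-bucket/model_dir/summaries
```

The resulting string can then be passed as the logdir argument to tf.summary.create_file_writer.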