I am trying to run a training job on Google Cloud using TensorFlow. I submitted the job with the following command:
gcloud ml-engine jobs submit training training_1 --job-dir=gs://object-detection-bucket-test/train --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz --module-name object_detection.train --region us-central1 --config object_detection/samples/cloud/cloud.yml --runtime-version=1.2 -- --train_dir=gs://object-detection-bucket-test/train --pipeline_config_path=gs://object-detection-bucket-test/data/ssd_mobilenet_v1_coco.config
But when I run the job, I get the following error. Any idea why?
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 235, in train
train_config.prefetch_queue_capacity, data_augmentation_options)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue
tensor_dict = create_tensor_dict_fn()
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next
dataset_builder.build(config)).get_next()
File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 138, in build
label_map_proto_file=label_map_proto_file)
File "/root/.local/lib/python2.7/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 210, in __init__
slim_example_decoder.LookupTensor(
AttributeError: 'module' object has no attribute 'LookupTensor'
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.
The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
The replica worker 3 exited with a non-zero status of 1. Termination reason: Error.
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
(Workers 0 through 4 each report the same traceback as the master.)
Tensorflow version: 1.3.0
I'm trying to train an ssd_inception_v2_coco model on my desktop and I'm getting the same error when running:
python ../../../tensorflow_models/research/object_detection/train.py ...
File "../../../tensorflow_models/research/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "../../../tensorflow_models/research/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/media/xxxx/Projects/tensorflow_models/research/object_detection/trainer.py", line 235, in train
train_config.prefetch_queue_capacity, data_augmentation_options)
File "/media/xxxx/Projects/tensorflow_models/research/object_detection/trainer.py", line 59, in create_input_queue
tensor_dict = create_tensor_dict_fn()
File "../../../tensorflow_models/research/object_detection/train.py", line 120, in get_next
dataset_builder.build(config)).get_next()
File "/media/masoud/DATA/Projects/tensorflow_models/research/object_detection/builders/dataset_builder.py", line 138, in build
label_map_proto_file=label_map_proto_file)
File "/media/xxxx/Projects/tensorflow_models/research/object_detection/data_decoders/tf_example_decoder.py", line 210, in __init__
slim_example_decoder.LookupTensor(
AttributeError: 'module' object has no attribute 'LookupTensor'
Tensorflow version: 1.4.1
I also have the same problem when trying to train a ssd_mobilenet model.
After updating TensorFlow to 1.6.0, it runs normally.
I ran into the same error as well. I am running tensorflow-gpu==1.4.1 on Ubuntu 16.04. I am using the latest commit of models/research:
$ git show
commit d9c430b3aa7c1b2515cfde6ae10973a5e6308cc7
Merge: ac6ab36 080795f
Author: Mark Daoust <[email protected]>
Date: Sun Mar 11 20:40:08 2018 -0700
Merge pull request #3561 from kopankom/fix/assigment-in-loop
unnecessary variable assignment in loop
I will try upgrading to TensorFlow 1.6.0.
Is there a known commit that works with TensorFlow 1.4?
UPDATE:
For TensorFlow 1.4 compatibility I used the fad6075359b852b9c0a4c6f1b068790d44a6441a commit instead:
$ git clone https://github.com/tensorflow/models/
$ cd models
$ git checkout fad6075359b852b9c0a4c6f1b068790d44a6441a
From there I was able to get past the LookupTensor error.
I then ran into a _second_ error when trying to run train.py:
File "/home/ubuntu/.virtualenvs/tfod_api/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 1023, in unstack
(axis, -value_shape.ndims, value_shape.ndims))
ValueError: axis = 0 not in [0, 0)
I encountered this error before and resolved it by adding anchorwise_output: true to the weighted_sigmoid and weighted_smooth_l1 blocks in my .config file:
loss {
  classification_loss {
    weighted_sigmoid {
      anchorwise_output: true  # add this
    }
  }
  localization_loss {
    weighted_smooth_l1 {
      anchorwise_output: true  # add this
    }
  }
  # any other fields already in your loss block stay unchanged
}
From there I was able to train the model.
I hope that helps someone!
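For anyone who would rather script the anchorwise_output change than edit the file by hand, here is a minimal sketch. It is a plain line-based text edit, not a protobuf parser, and it assumes each weighted_sigmoid and weighted_smooth_l1 block opens on its own line as in the snippet above. The helper name add_anchorwise_output is made up for illustration and is not part of the Object Detection API. It is only relevant if your checkout actually hits the ValueError.

```python
import re

# Loss blocks that, per the comment above, needed the extra field.
LOSS_BLOCKS = ("weighted_sigmoid", "weighted_smooth_l1")


def add_anchorwise_output(config_text):
    """Insert 'anchorwise_output: true' right after each target block opens.

    Hypothetical helper: works on text lines only and skips a block whose
    next line already sets anchorwise_output.
    """
    lines = config_text.splitlines()
    out = []
    for index, line in enumerate(lines):
        out.append(line)
        match = re.match(r"^(\s*)(\w+)\s*\{\s*$", line)
        if match and match.group(2) in LOSS_BLOCKS:
            next_line = lines[index + 1] if index + 1 < len(lines) else ""
            if "anchorwise_output" not in next_line:
                # Indent the new field two spaces deeper than the block opener.
                out.append(match.group(1) + "  anchorwise_output: true")
    return "\n".join(out) + "\n"
```

Typical use would be to read the pipeline config as text, pass it through add_anchorwise_output, and write the result to a new file so the original stays untouched.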
I'm trying to run a training job on Google Cloud and I'm getting the same AttributeError: 'module' object has no attribute 'LookupTensor'.
I'm trying @jrosebr1's fix, but I can't seem to check out the fad6075359b852b9c0a4c6f1b068790d44a6441a commit. I'm on Windows 10 with Python 2.7.
It comes back with this:
fatal: Not a git repository (or any of the parent directories): .git
You need to cd into models first, which is the repo you just cloned. I'll edit my response to include this.
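As a quick way to see why git reports "Not a git repository", the sketch below walks up from a directory looking for a .git entry, which is roughly how git decides whether you are inside a repository. The function name find_git_root is made up for this example, and it ignores special cases such as the GIT_DIR environment variable.

```python
import os


def find_git_root(start):
    """Return the nearest ancestor of `start` that contains '.git', or None."""
    path = os.path.abspath(start)
    while True:
        if os.path.exists(os.path.join(path, ".git")):
            return path
        parent = os.path.dirname(path)
        if parent == path:
            # Reached the filesystem root without finding a repository.
            return None
        path = parent


if __name__ == "__main__":
    # Run this from the directory where 'git checkout' failed.
    print(find_git_root(os.getcwd()))
```

If this prints None in the directory where you ran git checkout, you are still outside the cloned models folder.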
Thanks for the tip about checking out that particular commit, @jrosebr1. Interestingly enough, after I did that, it worked, and I didn't run into the second error that you mentioned above.
@Supersak80 Hm, interesting. Did your .prototxt file already include the update mentioned? I pulled mine from the bleeding edge of the repo.
@jrosebr1 Which .proto file has those parameters?
@Supersak80 Whoops, I didn't mean to say .prototxt. I meant the pipeline.config. In particular, I needed to update the Faster R-CNN and SSD configs for the COCO and Pets examples.
Upgrading to TensorFlow 1.5+ would resolve the missing LookupTensor issue. I'm also preparing a fix to make it TF 1.4 compatible (likely early next week).
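Before submitting a job, it may help to confirm whether the installed TensorFlow actually exposes the missing name. The following is a rough sketch: the helper missing_attributes is hypothetical, and the module path used in the example run is an assumption about where the slim TFExample decoder lives in TensorFlow 1.x, so adjust it if your install differs.

```python
import importlib


def missing_attributes(module_name, names):
    """Return the names from `names` that the imported module does not define."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


if __name__ == "__main__":
    # Assumed location of the slim decoder module in TensorFlow 1.x.
    target = "tensorflow.contrib.slim.python.slim.data.tfexample_decoder"
    try:
        # An empty list means LookupTensor is available in this install.
        print(missing_attributes(target, ["LookupTensor"]))
    except ImportError as exc:
        print("could not import %s: %s" % (target, exc))
```

A non-empty result suggests the installed TensorFlow predates the code in models/research, which matches the reports in this thread.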
Thanks so much @pkulzc!
@jrosebr1's suggestion worked for me, though I did not encounter the second error.
@jrosebr1, I followed your method on Ubuntu 14.04 with TensorFlow 1.4.0, Python 3.5, and protobuf 3.5.1. The following error occurred during training:
File "/home/dl/anaconda3/lib/python3.5/site-packages/google/protobuf/text_format.py", line 703, in _MergeField
(message_descriptor.full_name, name))
google.protobuf.text_format.ParseError: 167:3 : Message type "object_detection.protos.TrainConfig" has no field named "max_number_of_boxes".
Is it a problem with my protobuf version? Thanks for your help.
@LXWDL Sorry, I'm not sure. I'm not a TF developer. It could be an issue with your Protobuf version or something else, but I can't say which.
@LXWDL I ran the command 'protoc object_detection/protos/*.proto --python_out=.' again from the research folder, and the error has not occurred since.
@pkulzc do you have an updated estimate for the fix, or is it already committed?
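The protoc command above relies on the shell expanding *.proto, which the Windows command prompt does not do. A small sketch of the same step driven from Python follows. The function names are invented for this example, and it assumes protoc is on your PATH and that you pass the path to the research folder.

```python
import glob
import os
import subprocess


def build_protoc_command(research_dir):
    """Build the protoc argument list for every .proto file under protos/."""
    pattern = os.path.join(research_dir, "object_detection", "protos", "*.proto")
    # Paths are made relative so the command matches running inside research/.
    protos = [os.path.relpath(p, research_dir) for p in sorted(glob.glob(pattern))]
    return ["protoc"] + protos + ["--python_out=."]


def recompile_protos(research_dir):
    """Run protoc from inside research_dir; raises if protoc fails."""
    command = build_protoc_command(research_dir)
    if len(command) == 2:
        raise RuntimeError("no .proto files found under %s" % research_dir)
    subprocess.check_call(command, cwd=research_dir)
```

Calling recompile_protos with the path to models/research regenerates the *_pb2.py files, so the Python side matches the .proto definitions in your checkout.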
@relational it's already in.
Hi @pkulzc, it seems this issue is still not fixed on TensorFlow 1.4.
@FortiLeiZhang did you sync to HEAD? It's at least working with 1.4.0 in my local test environment.
Hi @pkulzc, thanks for your reply.
I checked this on 1.4.0/1.4.1/1.5.1/1.6.0, and the issue reported in this thread is gone. I think this bug can be closed.
However, a new error message appears on 1.4.0/1.4.1, but not on 1.5.1/1.6.0:
File "/home/usr/models/research/object_detection/utils/dataset_util.py", line 128, in read_dataset
tf.contrib.data.parallel_interleave(
AttributeError: 'module' object has no attribute 'parallel_interleave'
@FortiLeiZhang have you solved this problem?
@TyrionChou, I did not dig any further into 1.4.0. I switched to 1.6.0 and that version works.
@LeonidasCl, I encountered the same problem as you, but your method did not work for me. My Protobuf compilation works fine, so this confuses me a lot. Do you have any other ideas?
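Since both errors in this thread were reported on 1.4.x and not on 1.5.1/1.6.0, a simple version gate at the top of a training script can fail early with a clearer message. This is only a sketch: the helper names are made up, and the (1, 5) threshold is taken from the reports above rather than from any official compatibility table.

```python
def parse_version(version):
    """Turn '1.4.1' into (1, 4, 1), dropping suffixes such as 'rc0'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def version_at_least(version, minimum):
    """Return True if the version string is at or above the minimum tuple."""
    return parse_version(version) >= tuple(minimum)
```

In a script this could be used as version_at_least(tf.__version__, (1, 5)) after import tensorflow as tf, raising an error with an upgrade hint when it returns False.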
@aNothing, @pkulzc, it seems that /proto/ssd.proto has been modified recently. The new version removes the field
"optional bool batch_norm_trainable = 6 [default=true];"
Yes, this field has been deprecated. Please see this Stack Overflow question if anyone has an issue with the missing batch_norm_trainable field.
I'm also closing this issue, as the LookupTensor issue has been fixed, which is also stated in the FAQ.
Feel free to reopen this if the same issue happens after syncing to HEAD.