Models: Have issue training the model from scratch.

Created on 14 Sep 2017 · 3 Comments · Source: tensorflow/models

Please go to Stack Overflow for help and support:

http://stackoverflow.com/questions/tagged/tensorflow

Also, please understand that many of the models included in this repository are experimental and research-style code. If you open a GitHub issue, here is our policy:

It must be a bug or a feature request.
The form below must be filled out.

Here's why we have that policy: TensorFlow developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.

System information

What is the top-level directory of the model you are using:
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
TensorFlow installed from (source or binary):
TensorFlow version (use command below):
Bazel version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory:
Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

Describe the problem clearly here. Be sure to convey here why it's a bug in TensorFlow or a feature request.

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

25b3nk

👎1

All 3 comments

boa train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=mnist --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=mobilenet_v1
I downloaded the dataset and followed the installation steps, then converted MNIST to the TFRecord format. When I ran train_image_classifier.py from the slim folder of the models repo with the command above, I got the following logs. (NOTE: I am using Anaconda Python aliased as boa, and have stock Python alongside.)

WARNING:tensorflow:From train_image_classifier.py:468: softmax_cross_entropy (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.softmax_cross_entropy instead. Note that the order of the logits and labels arguments has been changed.
WARNING:tensorflow:From /opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/losses/python/losses/loss_ops.py:398: compute_weighted_loss (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.compute_weighted_loss instead.
WARNING:tensorflow:From /opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/losses/python/losses/loss_ops.py:151: add_loss (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.add_loss instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
2017-09-14 11:23:12.377137: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-14 11:23:12.377158: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-14 11:23:12.377162: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-14 11:23:12.377165: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-14 11:23:12.377169: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-14 11:23:14.890614: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:0 for node 'fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Cannot assign a device to node 'gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/device:GPU:0"](gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/Shape, gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/Shape_1)]]

Caused by op u'gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs', defined at:
File "train_image_classifier.py", line 574, in <module>
tf.app.run()
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train_image_classifier.py", line 534, in main
var_list=variables_to_train)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/deployment/model_deploy.py", line 297, in optimize_clones
optimizer, clone, num_clones, regularization_losses, **kwargs)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/deployment/model_deploy.py", line 261, in _optimize_clone
clone_grad = optimizer.compute_gradients(sum_loss, **kwargs)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 386, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 560, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 368, in _MaybeCompile
return grad_fn() # Exit early
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 560, in <lambda>
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/math_grad.py", line 609, in _SubGrad
rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 411, in _broadcast_gradient_args
name=name)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()

...which was originally created as op u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub', defined at:
File "train_image_classifier.py", line 574, in <module>
tf.app.run()
[elided 0 identical lines from previous traceback]
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train_image_classifier.py", line 474, in main
clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "train_image_classifier.py", line 457, in clone_fn
logits, end_points = network_fn(images)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/nets/nets_factory.py", line 114, in network_fn
return func(images, num_classes, is_training=is_training)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/nets/mobilenet_v1.py", line 323, in mobilenet_v1
conv_defs=conv_defs)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/nets/mobilenet_v1.py", line 232, in mobilenet_v1_base
scope=end_point)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 927, in convolution
outputs = normalizer_fn(outputs, **normalizer_params)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 528, in batch_norm
outputs = layer.apply(inputs, training=is_training)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 320, in apply
return self.__call__(inputs, **kwargs)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 290, in __call__
outputs = self.call(inputs, **kwargs)

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/device:GPU:0"](gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/Shape, gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/Shape_1)]]

Traceback (most recent call last):
File "train_image_classifier.py", line 574, in <module>
tf.app.run()
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train_image_classifier.py", line 570, in main
sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 725, in train
master, start_standard_services=False, config=session_config) as sess:
File "/opt/anaconda2/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 788, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 949, in managed_session
start_standard_services=start_standard_services)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 706, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 262, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device to node 'gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/device:GPU:0"](gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/Shape, gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/Shape_1)]]

Caused by op u'gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs', defined at:
File "train_image_classifier.py", line 574, in <module>
tf.app.run()
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train_image_classifier.py", line 534, in main
var_list=variables_to_train)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/deployment/model_deploy.py", line 297, in optimize_clones
optimizer, clone, num_clones, regularization_losses, **kwargs)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/deployment/model_deploy.py", line 261, in _optimize_clone
clone_grad = optimizer.compute_gradients(sum_loss, **kwargs)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 386, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 560, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 368, in _MaybeCompile
return grad_fn() # Exit early
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 560, in <lambda>
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/math_grad.py", line 609, in _SubGrad
rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 411, in _broadcast_gradient_args
name=name)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()

...which was originally created as op u'MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub', defined at:
File "train_image_classifier.py", line 574, in <module>
tf.app.run()
[elided 0 identical lines from previous traceback]
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train_image_classifier.py", line 474, in main
clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "train_image_classifier.py", line 457, in clone_fn
logits, end_points = network_fn(images)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/nets/nets_factory.py", line 114, in network_fn
return func(images, num_classes, is_training=is_training)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/nets/mobilenet_v1.py", line 323, in mobilenet_v1
conv_defs=conv_defs)
File "/home/csb/path/to/projects/RnD/mobilenet/tensorflow_models/slim/nets/mobilenet_v1.py", line 232, in mobilenet_v1_base
scope=end_point)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 927, in convolution
outputs = normalizer_fn(outputs, **normalizer_params)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 528, in batch_norm
outputs = layer.apply(inputs, training=is_training)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 320, in apply
return self.__call__(inputs, **kwargs)
File "/opt/anaconda2/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 290, in __call__
outputs = self.call(inputs, **kwargs)

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/device:GPU:0"](gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/Shape, gradients/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/Shape_1)]]

Please help me with this issue; I am not very experienced with TensorFlow. At least help me understand what the error might be and how to diagnose it the next time I face a similar issue.

boa -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
The above command gives

('unknown', '1.1.0')

I am using the stock git repo and have made no changes. There is no GPU on my PC. OS: Linux Ubuntu 14.04.
TensorFlow was installed using conda install tensorflow.
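
For checking a similar error next time: the message itself names the device the graph asked for and the devices the process actually has. Below is a minimal sketch in plain Python that pulls both out of the message text. The helper name parse_device_error is hypothetical, and the regular expressions assume the wording shown in the log above, which may differ in other TensorFlow versions.

```python
import re


def parse_device_error(message):
    """Return (requested_device, available_devices) parsed from a
    'Cannot assign a device' message, or None if the text does not match.

    Hypothetical helper: the patterns assume the wording seen in the log above.
    """
    requested = re.search(r"explicit device specification '([^']+)'", message)
    available = re.search(r"available devices: (.+)", message)
    if not (requested and available):
        return None
    # The available devices are printed as a comma-separated list.
    devices = [d.strip() for d in available.group(1).split(",") if d.strip()]
    return requested.group(1), devices


# Error text copied from the log above.
message = (
    "Cannot assign a device to node 'gradients/MobilenetV1/MobilenetV1/"
    "Conv2d_0/BatchNorm/moments/sufficient_statistics/Sub_grad/"
    "BroadcastGradientArgs': Could not satisfy explicit device specification "
    "'/device:GPU:0' because no devices matching that specification are "
    "registered in this process; available devices: "
    "/job:localhost/replica:0/task:0/cpu:0"
)

requested, devices = parse_device_error(message)
has_gpu = any("gpu" in d.lower() for d in devices)
print("requested:", requested)
print("available:", devices)
print("GPU available:", has_gpu)
```

If the requested device is a GPU and the available list contains only cpu entries, the graph was pinned to hardware that this process does not have.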

25b3nk on 14 Sep 2017

👎1

Have you read the instructions?

Please go to Stack Overflow for help and support:
http://stackoverflow.com/questions/tagged/tensorflow
If you open a GitHub issue, here is our policy:
It must be a bug or a feature request.
The form below must be filled out.
We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed.

This is a problem with your own setup, not a bug. The error you received is very clear: TF can't find a GPU. Try at least to read what the program prints. It happens because you've installed the CPU-only version of TensorFlow from the repositories.
Go to Stack Overflow.
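
Since the machine has no GPU, one possible way forward is to ask the slim script to place the model clones on the CPU. The sketch below only builds the command line and does not run it. It assumes the checked-out train_image_classifier.py defines a --clone_on_cpu flag (worth confirming in the script's flag definitions before relying on it); build_train_command and the /tmp paths are hypothetical placeholders.

```python
def build_train_command(has_gpu, train_dir, dataset_dir, python_bin="python"):
    """Build the argument list for the slim training script.

    Assumption: the script accepts --clone_on_cpu; verify this in your checkout.
    """
    args = [
        python_bin,
        "train_image_classifier.py",
        "--train_dir=" + train_dir,
        "--dataset_name=mnist",
        "--dataset_split_name=train",
        "--dataset_dir=" + dataset_dir,
        "--model_name=mobilenet_v1",
    ]
    if not has_gpu:
        # Without a GPU, request CPU placement for the model clones.
        args.append("--clone_on_cpu=True")
    return args


# Hypothetical paths; substitute your own TRAIN_DIR and DATASET_DIR.
command = build_train_command(
    has_gpu=False, train_dir="/tmp/train_logs", dataset_dir="/tmp/mnist"
)
print(" ".join(command))
```

Printing the command first makes it easy to compare against the one used in the original report before running it by hand.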

UndeadBlow on 14 Sep 2017

👍1

This question is better asked on Stack Overflow since it is not a bug or feature request. There is also a larger community that reads questions there. Thanks!