Models: Error running train_lenet example -- no supported GPU kernel for 'Predictions/Softmax' ?

Created on 15 Nov 2017  ·  13 Comments  ·  Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: slim
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): windows 7
  • TensorFlow installed from (source or binary): binary using pip install
  • TensorFlow version (use command below): 1.4.0 / 1.5.0 dev GPU
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 8.0 / 6.1
  • GPU model and memory: K1100M 4G (only 2G recognized by CUDA)
  • Exact command to reproduce:

Describe the problem

I am trying to run this example on Windows and get the following error when training:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'Predictions/Softmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
     [[Node: Predictions/Softmax = Softmax[T=DT_FLOAT, _device="/device:GPU:0"](Predictions/Reshape)]]


The TensorFlow installation and CUDA are OK, since I can still train object-detection flawlessly.
I tried the 1.5 nightly dev version but got the same error.
Thanks in advance for any help.

Full trace

WARNING:tensorflow:From ***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py:400: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From ***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py:468: softmax_cross_entropy (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.softmax_cross_entropy instead. Note that the order of the logits and labels arguments has been changed.
WARNING:tensorflow:From ***\Anaconda3-5.0.0\lib\site-packages\tensorflow\contrib\losses\python\losses\loss_ops.py:399: compute_weighted_loss (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.compute_weighted_loss instead.
WARNING:tensorflow:From ***\Anaconda3-5.0.0\lib\site-packages\tensorflow\contrib\losses\python\losses\loss_ops.py:152: add_arg_scope.<locals>.func_with_args (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.add_loss instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
2017-11-15 16:33:40.904052: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2017-11-15 16:33:41.038052: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1031] Found device 0 with properties: 
name: Quadro K1100M major: 3 minor: 0 memoryClockRate(GHz): 0.7055
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.46GiB
2017-11-15 16:33:41.038052: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0, compute capability: 3.0)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Cannot assign a device for operation 'Predictions/Softmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
     [[Node: Predictions/Softmax = Softmax[T=DT_FLOAT, _device="/device:GPU:0"](Predictions/Reshape)]]

Caused by op 'Predictions/Softmax', defined at:
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py", line 574, in <module>
    tf.app.run()
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py", line 474, in main
    clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\deployment\model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py", line 457, in clone_fn
    logits, end_points = network_fn(images)
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\nets\nets_factory.py", line 135, in network_fn
    return func(images, num_classes, is_training=is_training, **kwargs)
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\nets\lenet.py", line 77, in lenet
    end_points['Predictions'] = prediction_fn(logits, scope='Predictions')
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 2598, in softmax
    predictions = nn.softmax(logits_2d)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 1667, in softmax
    return _softmax(logits, gen_nn_ops._softmax, dim, name)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 1610, in _softmax
    return compute_op(logits, name=name)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4367, in _softmax
    "Softmax", logits=logits, name=name)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\framework\ops.py", line 3042, in create_op
    op_def=op_def)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\framework\ops.py", line 1521, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'Predictions/Softmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
     [[Node: Predictions/Softmax = Softmax[T=DT_FLOAT, _device="/device:GPU:0"](Predictions/Reshape)]]

Traceback (most recent call last):
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\client\session.py", line 1323, in _do_call
    return fn(*args)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\client\session.py", line 1293, in _run_fn
    self._extend_graph()
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\client\session.py", line 1354, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'Predictions/Softmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
     [[Node: Predictions/Softmax = Softmax[T=DT_FLOAT, _device="/device:GPU:0"](Predictions/Reshape)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py", line 574, in <module>
    tf.app.run()
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py", line 570, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 746, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "***\Anaconda3-5.0.0\lib\contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\training\supervisor.py", line 992, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\training\supervisor.py", line 820, in stop
    ignore_live_threads=ignore_live_threads)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\training\coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "***\Anaconda3-5.0.0\lib\site-packages\six.py", line 686, in reraise
    raise value
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\training\supervisor.py", line 981, in managed_session
    start_standard_services=start_standard_services)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\training\supervisor.py", line 718, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\training\session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\client\session.py", line 889, in run
    run_metadata_ptr)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\client\session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _do_run
    options, run_metadata)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\client\session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'Predictions/Softmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
     [[Node: Predictions/Softmax = Softmax[T=DT_FLOAT, _device="/device:GPU:0"](Predictions/Reshape)]]

Caused by op 'Predictions/Softmax', defined at:
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py", line 574, in <module>
    tf.app.run()
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py", line 474, in main
    clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\deployment\model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\train_image_classifier.py", line 457, in clone_fn
    logits, end_points = network_fn(images)
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\nets\nets_factory.py", line 135, in network_fn
    return func(images, num_classes, is_training=is_training, **kwargs)
  File "***\Anaconda3-5.0.0\Lib\site-packages\tensorflow\models\research\slim\nets\lenet.py", line 77, in lenet
    end_points['Predictions'] = prediction_fn(logits, scope='Predictions')
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 2598, in softmax
    predictions = nn.softmax(logits_2d)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 1667, in softmax
    return _softmax(logits, gen_nn_ops._softmax, dim, name)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 1610, in _softmax
    return compute_op(logits, name=name)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4367, in _softmax
    "Softmax", logits=logits, name=name)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\framework\ops.py", line 3042, in create_op
    op_def=op_def)
  File "***\Anaconda3-5.0.0\lib\site-packages\tensorflow\python\framework\ops.py", line 1521, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'Predictions/Softmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
     [[Node: Predictions/Softmax = Softmax[T=DT_FLOAT, _device="/device:GPU:0"](Predictions/Reshape)]]

awaiting model gardener

Most helpful comment

Add allow_soft_placement=True to the session config:
tf.ConfigProto(allow_soft_placement=True)

All 13 comments

@thenbasilmanran, any advice to offer?

I ran into the same problem with another model of slim. For me, a workaround for the problem was to add "with tf.device('/cpu:0'):" in front of the softmax definition. I think, in your case this is probably in front of line 77 in lenet.py.

@philippwerner Thank you for the help! I encountered the same problem for lenet and mobilenet as well.
Could you give some more details on where you added the line? Is it gen_nn_ops.py ?

@jingyibo123: I only fixed the problem for NASnet, but I think the fix should work for the other networks as well if used at the right line of code. If you want to fix it for all networks, the softmax() function in tensorflow/python/ops/nn_ops.py, line 1667 is probably a good place to add it. Try to replace

return _softmax(logits, gen_nn_ops._softmax, dim, name)

by

with tf.device('/cpu:0'):
    return _softmax(logits, gen_nn_ops._softmax, dim, name)

@all: I know this is only a quick and dirty fix. We are still looking for a clean solution that only uses CPU if no GPU kernel is available. Maybe there should be some exception handling + warning to stderr?
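
The fallback wished for here (use the CPU only when no GPU kernel exists, and warn on stderr) can be illustrated with a toy model. This is plain Python, not TensorFlow's actual placer: the `KERNELS` table and the `place()` function are hypothetical names made up for this sketch, and the `Softmax` entry simply mimics the situation reported in this thread.

```python
import sys

# Hypothetical kernel registry: op type -> device types that have a kernel.
# "Softmax" is listed as CPU-only to mimic the failure reported in this thread.
KERNELS = {
    "Conv2D": {"CPU", "GPU"},
    "Softmax": {"CPU"},
}


def place(op_type, requested, allow_soft_placement=False):
    """Return the device type an op would run on in this toy model."""
    supported = KERNELS.get(op_type, set())
    if requested in supported:
        return requested
    if allow_soft_placement and "CPU" in supported:
        # Soft placement: warn and fall back instead of failing.
        print(
            "warning: no %s kernel for %s, falling back to CPU" % (requested, op_type),
            file=sys.stderr,
        )
        return "CPU"
    # Hard placement: refuse, as in the error at the top of this issue.
    raise ValueError(
        "Cannot assign a device for operation %r: no supported kernel "
        "for %s devices is available." % (op_type, requested)
    )
```

With `allow_soft_placement=False` the call `place("Softmax", "GPU")` raises, while with `allow_soft_placement=True` it returns `"CPU"` after printing a warning. Ops that do have a GPU kernel stay on the GPU either way, which is the difference from wrapping the op in `tf.device('/cpu:0')` unconditionally.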

@jingyibo123 I have the same error as you. I tried philippwerner's method, but it failed. Did you ever solve it?

Having exactly the same issue running locally with a GeForce GTX 980TI. Not a problem on an AWS EC2 K80 though. Any update by any chance? Thx

@mrry We're getting a bunch of reports that the softmax GPU kernel isn't available on Windows. Is this expected?

AFAICT we compile the op for GPU and test it as part of the Windows GPU build: search for "softmax_op_gpu.cu.cc" in this build log output.

@philippwerner Thank you, that worked for me.

@seppestaes philippwerner's solution works well.
Since the GPU kernel of Softmax has been compiled, we shouldn't have this issue anymore in the future.

Add allow_soft_placement=True to the session config:
tf.ConfigProto(allow_soft_placement=True)

Edit train_image_classifier.py as follows. This worked for me:
    ###########################
    # Kicks off the training. #
    ###########################

    session_config = tf.ConfigProto(allow_soft_placement=True)

    slim.learning.train(
            train_tensor,
            logdir=FLAGS.train_dir,
            master=FLAGS.master,
            is_chief=(FLAGS.task == 0),
            init_fn=_get_init_fn(),
            summary_op=summary_op,
            number_of_steps=FLAGS.max_number_of_steps,
            log_every_n_steps=FLAGS.log_every_n_steps,
            save_summaries_secs=FLAGS.save_summaries_secs,
            save_interval_secs=FLAGS.save_interval_secs,
            sync_optimizer=optimizer if FLAGS.sync_replicas else None,
            session_config=session_config,
            )
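
To check that the softmax op really falls back to the CPU after this change, the same config can also turn on device placement logging. The sketch below is an assumption-laden variant, not part of the original fix: `make_session_config` is a hypothetical helper name, and the guarded import and the `tf.compat.v1` lookup are only there so the snippet loads across TensorFlow versions (or without TensorFlow at all).

```python
try:
    import tensorflow as tf
except ImportError:  # lets the snippet load on machines without TensorFlow
    tf = None


def make_session_config():
    """Build a session config with soft placement; returns None without TensorFlow."""
    if tf is None:
        return None
    # TF 1.x exposes tf.ConfigProto directly; later releases keep it under tf.compat.v1.
    config_cls = getattr(tf, "ConfigProto", None) or tf.compat.v1.ConfigProto
    return config_cls(
        allow_soft_placement=True,  # fall back to CPU when an op has no GPU kernel
        log_device_placement=True,  # log the device chosen for each op
    )


session_config = make_session_config()
```

Passing this `session_config` to `slim.learning.train` as in the edit above should make the log show which device `Predictions/Softmax` was assigned to.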

Found the problem: TensorFlow version 1.2 doesn't know how to work with TPU and fails when it tries to get a free GPU.
