Serving: Cannot assign a device for operation 'save_1/ShardedFilename'

Created on 13 Jun 2017 · 6 Comments · Source: tensorflow/serving

Ubuntu 14.04.5 LTS
CUDA 8.0
Bazel 0.5.0
TensorFlow Serving installed as per the installation page

When I load my SavedModel into the tensorflow_model_server, I get the following output/error:

2017-06-12 17:08:59.735992: I tensorflow_serving/model_servers/main.cc:155] Building single TensorFlow model file config:  model_name: tfrec model_base_path: /u2/tf_models/20170118115237/export/ model_version_policy: 0
2017-06-12 17:08:59.736452: I tensorflow_serving/model_servers/server_core.cc:375] Adding/updating models.
2017-06-12 17:08:59.736493: I tensorflow_serving/model_servers/server_core.cc:421]  (Re-)adding model: tfrec
2017-06-12 17:08:59.838431: I tensorflow_serving/core/basic_manager.cc:698] Successfully reserved resources to load servable {name: tfrec version: 8}
2017-06-12 17:08:59.838485: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: tfrec version: 8}
2017-06-12 17:08:59.838525: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: tfrec version: 8}
2017-06-12 17:08:59.838600: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /u2/tf_models/20170118115237/export/8
2017-06-12 17:08:59.838645: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:227] Loading SavedModel from: /u2/tf_models/20170118115237/export/8
2017-06-12 17:09:01.481736: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-12 17:09:01.481770: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-12 17:09:01.481777: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-12 17:09:01.481782: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-12 17:09:01.481788: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-06-12 17:09:04.364379: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:938] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:09:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
2017-06-12 17:09:04.364442: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:959] DMA: 0
2017-06-12 17:09:04.364452: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:969] 0:   Y
2017-06-12 17:09:04.364472: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1028] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:09:00.0)
2017-06-12 17:09:11.453574: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:275] Loading SavedModel: fail. Took 11614926 microseconds.
2017-06-12 17:09:11.643941: E tensorflow_serving/util/retrier.cc:38] Loading servable: {name: tfrec version: 8} failed:
Invalid argument: Cannot assign a device for operation 'save_1/ShardedFilename': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
ShardedFilename: CPU
         [[Node: save_1/ShardedFilename = ShardedFilename[_output_shapes=[[]], _device="/device:GPU:0"](save_1/StringJoin, save_1/ShardedFilename/shard, save_1/num_shards)]]

I can load the model just fine outside of TF Serving (perhaps because of 'allow_soft_placement=True'). I'm not sure why the saver is assigned to the GPU -- I even tried explicitly assigning it to the CPU prior to exporting the SavedModel, but with no change.

Specifying 'clear_devices=True' during the export works around this error, but leads to other problems and it shouldn't really be necessary to clear all device specifications.

I think this is related to issue #403, though that has gone over two months without a response.


wingsbr


All 6 comments

I'm not sure what's going on, but a next step would be to inspect the graph def protobuf file to see what device assignments are recorded in there. If you force all nodes to cpu and they show up as cpu in the graph def file, then the problem may lie in the TF-Core (not TF-Serving) layer. If, OTOH, the graph def file disagrees with what you requested for the assignments, then the model saver may be at fault.
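One way to do that inspection without loading TensorFlow is to scan the text export (the 'saved_model.pbtxt' written by `builder.save(as_text=True)`) and list each node's recorded device. The following is only a rough sketch: it assumes the usual protobuf text layout (one field per line, `node {` opening each node), and the sample text is a hand-written excerpt for illustration, not the actual file from this issue.

```python
def node_devices(pbtxt_text):
    """Yield (name, op, device) for every `node { ... }` block in text-format output."""
    depth = 0          # current brace nesting depth
    node_depth = None  # depth of the node body we are inside, if any
    fields = {}
    for raw in pbtxt_text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            depth += 1
            if line == "node {" and node_depth is None:
                node_depth, fields = depth, {}
        elif line == "}":
            if node_depth == depth:  # closing brace of the node itself
                yield (fields.get("name", ""), fields.get("op", ""),
                       fields.get("device", ""))
                node_depth = None
            depth -= 1
        elif node_depth == depth and ":" in line:
            # Only read fields directly inside the node, not inside nested attrs.
            key, _, value = line.partition(":")
            if key in ("name", "op", "device"):
                fields[key] = value.strip().strip('"')


# Hand-written excerpt (hypothetical), shaped like a SavedModel text export.
SAMPLE_PBTXT = '''
meta_graphs {
  graph_def {
    node {
      name: "save_1/ShardedFilename"
      op: "ShardedFilename"
      input: "save_1/StringJoin"
      device: "/device:GPU:0"
      attr {
        key: "_output_shapes"
        value {
          list {
            shape {
            }
          }
        }
      }
    }
    node {
      name: "save_1/StringJoin"
      op: "StringJoin"
      device: "/device:CPU:0"
    }
  }
}
'''

for name, op, device in node_devices(SAMPLE_PBTXT):
    print(f"{op:20s} {device or '(unassigned)':18s} {name}")
```

For a real export, pass the contents of the 'saved_model.pbtxt' file instead of the sample string and look for ops pinned to a device you did not request.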

Also, you have probably already done this, but just in case: be sure you ran ./configure for TensorFlow with CUDA enabled.

chrisolston on 13 Jun 2017

Also, be sure you used --config=cuda when you ran "bazel build tensorflow_serving/...".

chrisolston on 14 Jun 2017

I tracked that op back to the SavedModel export (tf.saved_model.builder.SavedModelBuilder). When I exclude that, the offending op is not included in the graph. However, I'm still puzzled about why the op is being placed on the GPU, since I'm explicitly assigning all of the SavedModelBuilder code to the CPU:

with tf.device('/cpu:0'):
        builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
        ...

I was able to resolve the issue (and get the server to load) by modifying the device assignments in the exported 'saved_model.pbtxt' file. This also required changing the device of every 'ConcatV2', 'Pack', and 'SparseToDense' op, all of which had been assigned to the GPU and had no corresponding kernel.
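For anyone who wants to script that manual edit, here is a minimal sketch of the idea. It is not the exact procedure used above: the function name and sample text are made up for illustration, it assumes the standard text layout in which each node's `op:` line precedes its `device:` line, and it does not touch colocation attributes.

```python
# Ops named in this thread as lacking a usable GPU kernel for this model.
CPU_ONLY_OPS = {"ShardedFilename", "ConcatV2", "Pack", "SparseToDense"}


def move_ops_to_cpu(pbtxt_text, cpu_only_ops=CPU_ONLY_OPS,
                    cpu_device="/device:CPU:0"):
    """Return pbtxt_text with the device of the listed ops rewritten to the CPU."""
    out = []
    depth = 0
    node_depth = None   # depth of the node body we are inside, if any
    current_op = None
    for raw in pbtxt_text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            depth += 1
            if line == "node {" and node_depth is None:
                node_depth, current_op = depth, None
        elif line == "}":
            if node_depth == depth:
                node_depth = None
            depth -= 1
        elif node_depth == depth:
            if line.startswith("op:"):
                current_op = line.partition(":")[2].strip().strip('"')
            elif line.startswith("device:") and current_op in cpu_only_ops:
                indent = raw[: len(raw) - len(raw.lstrip())]
                raw = f'{indent}device: "{cpu_device}"'
        out.append(raw)
    return "\n".join(out) + ("\n" if pbtxt_text.endswith("\n") else "")


# Hand-written sample (hypothetical node names).
SAMPLE_GRAPH = '''node {
  name: "stack"
  op: "Pack"
  device: "/device:GPU:0"
}
node {
  name: "dense/MatMul"
  op: "MatMul"
  device: "/device:GPU:0"
}
'''

print(move_ops_to_cpu(SAMPLE_GRAPH))
```

Only the 'Pack' node changes device here; the 'MatMul' node keeps its GPU assignment. Work on a copy of the export, since a mistake in the text file will stop the model from loading.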

I'm thinking the reason this wasn't a problem in a regular TF environment was due to 'allow_soft_placement=True'. Is there any way to enable soft placement in TF serving?

wingsbr on 14 Jun 2017


What else is inside your "with"? Anything that might override the device? Is the actual save call in there (versus merely constructing the builder)?

SavedModelBuilder invokes the tensorflow saver to generate the graphdef, so if there is a bug it's probably in that layer.

re: allow_soft_placement: We could add it quite easily, via an edit to tensorflow_serving/model_servers/main.cc. Just add a flag and pipe it to session_bundle_config.session_config.allow_soft_placement. Perhaps you can send us a PR for that.

chrisolston on 15 Jun 2017

The full "with" is as follows:

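    import tensorflow as tf
    from tensorflow.python.saved_model import signature_constants
    from tensorflow.python.saved_model import signature_def_utils
    from tensorflow.python.saved_model import tag_constants

    # export_dir, session, inputs, and outputs are defined earlier in the export script.
    with tf.device('/cpu:0'):
        builder = tf.saved_model.builder.SavedModelBuilder(export_dir)

        prediction_signature = signature_def_utils.build_signature_def(
            inputs=inputs,
            outputs=outputs,
            method_name=signature_constants.PREDICT_METHOD_NAME)

        legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
        builder.add_meta_graph_and_variables(session,
                                            [tag_constants.SERVING],
                                            signature_def_map={
                                                'recog': prediction_signature,
                                            },
                                            clear_devices=True,
                                            legacy_init_op=legacy_init_op)
        builder.save()              # writes the binary saved_model.pb
        builder.save(as_text=True)  # also writes saved_model.pbtxt for inspection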

I re-built the tensorflow_model_server with flags to enable the missing instructions mentioned in the first post (SSE4.1, SSE4.2, AVX, AVX2, FMA) and that seemingly fixed a number of problems. Between that, re-enabling 'clear_devices=True' (as shown above), and enabling soft placement in main.cc it seems that everything is now working.

Thank you for the help!

wingsbr on 15 Jun 2017


The following error message does not necessarily mean that the failure came from assigning the op to a GPU device.

Cannot assign a device for operation 'save_1/ShardedFilename': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

I will explain what I found and how.

Grep for "Cannot assign a device for operation" in the tensorflow/tensorflow folder.
This leads to Placer::Run() in tensorflow/tensorflow/core/common_runtime/placer.cc:

    Status status = colocation_graph.GetDevicesForNode(node, &devices);
    if (!status.ok()) {
      return AttachDef(
          errors::InvalidArgument("Cannot assign a device for operation '",
                                  node->name(), "': ", status.error_message()),
          *node);
    }

Next, review colocation_graph.GetDevicesForNode().
Looking at the code, it appears that the function tries to locate the given node on the given device. If it cannot find the node on that device, it uses soft placement to look for the node on other devices. This is why the function can report an error that names a device type other than the one the caller explicitly assigned.

Once the function fails to find the node on any device, it reports an error.

So the error can happen if:

1. There is no such device, or
2. There is no such node.

For case 1, the error message would be
'Operation was explicitly assigned to ??? but available devices are [???, ???]'.

For case 2, the error message would be
'Cannot assign a device for operation '???': Could not satisfy explicit device specification '???' because...'.
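To make the two messages easier to tell apart, here is a toy model of explicit placement with an optional soft-placement fallback. It is a simplified sketch, not the logic in placer.cc: the `place` function, the `PlacementError` class, and the kernel table are all invented for illustration, and colocation groups are ignored.

```python
class PlacementError(ValueError):
    """Raised when the toy placer cannot honour a device request."""


def device_type(name):
    # '/device:GPU:0' -> 'GPU'
    return name.split(":")[1]


def place(op, requested, kernels, available, allow_soft_placement=False):
    """Pick a device for `op`.

    kernels:   dict mapping op name -> set of device types that have a kernel
    available: list of device names present on this machine
    """
    supported = [d for d in available
                 if device_type(d) in kernels.get(op, set())]
    if requested in supported:
        return requested
    if allow_soft_placement and supported:
        return supported[0]  # fall back to any device that has a kernel
    if requested not in available:
        raise PlacementError(
            f"Operation was explicitly assigned to {requested} "
            f"but available devices are {available}")
    raise PlacementError(
        f"Could not satisfy explicit device specification '{requested}' "
        f"because no supported kernel for {device_type(requested)} "
        f"devices is available.")


# Illustrative kernel table: ShardedFilename only has a CPU kernel here.
KERNELS = {"ShardedFilename": {"CPU"}, "MatMul": {"CPU", "GPU"}}
DEVICES = ["/device:CPU:0", "/device:GPU:0"]

print(place("MatMul", "/device:GPU:0", KERNELS, DEVICES))
print(place("ShardedFilename", "/device:GPU:0", KERNELS, DEVICES,
            allow_soft_placement=True))
```

In this model, asking for 'ShardedFilename' on '/device:GPU:0' without soft placement raises the second kind of message, which matches the behaviour reported in the first post.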

So the most likely reason the OP got this error is that the freezing operation did not export the 'save_1/ShardedFilename' node from the original graph.

That is the bug the TensorFlow team needs to chase down.