Serving: performance issue of tensorflow serving

Created on 24 Mar 2017 · 12 Comments · Source: tensorflow/serving

Hi there!

Could you help me? I have the following issue: my trained model predicts 10x slower via TensorFlow Serving than via the Python version. I debugged the server-side code and found that this piece of code consumes all the time:

  TF_RETURN_IF_ERROR(bundle->session->Run(run_options, input_tensors,
                                          output_tensor_names, {}, &outputs,
                                          &run_metadata));

The full listing is here:

Status SavedModelPredict(const RunOptions& run_options, ServerCore* core,
                         const PredictRequest& request,
                         PredictResponse* response) {
  // Validate signatures.
  ServableHandle<SavedModelBundle> bundle;
  TF_RETURN_IF_ERROR(core->GetServableHandle(request.model_spec(), &bundle));

  const string signature_name = request.model_spec().signature_name().empty()
                                    ? kDefaultServingSignatureDefKey
                                    : request.model_spec().signature_name();
  auto iter = bundle->meta_graph_def.signature_def().find(signature_name);
  if (iter == bundle->meta_graph_def.signature_def().end()) {
    return errors::FailedPrecondition(
        "Default serving signature key not found.");
  }
  SignatureDef signature = iter->second;

  std::vector<std::pair<string, Tensor>> input_tensors;
  std::vector<string> output_tensor_names;
  std::vector<string> output_tensor_aliases;
  TF_RETURN_IF_ERROR(PreProcessPrediction(signature, request, &input_tensors,
                                          &output_tensor_names,
                                          &output_tensor_aliases));
  std::vector<Tensor> outputs;
  RunMetadata run_metadata;
  auto t1 = std::chrono::steady_clock::now();
  TF_RETURN_IF_ERROR(bundle->session->Run(run_options, input_tensors,
                                          output_tensor_names, {}, &outputs,
                                          &run_metadata));
  auto t2 = std::chrono::steady_clock::now();

  auto result = PostProcessPredictionResult(signature, output_tensor_aliases,
                                            outputs, response);
  auto t3 = std::chrono::steady_clock::now();
  // Debug timing: p1 covers Session::Run(), p2 covers post-processing.
  std::cout << "p1 t2-t1: "
            << std::chrono::duration<double, std::milli>(t2 - t1).count()
            << " msec" << std::endl;
  std::cout << "p2 t3-t2: "
            << std::chrono::duration<double, std::milli>(t3 - t2).count()
            << " msec" << std::endl;
  return result;
}

This method is executed by tensorflow_serving/model_servers/main.cc.

I used Python/Keras for training. The model was exported as follows:

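from keras import backend as K
from keras.models import Model, load_model
import tensorflow as tf
# `exporter` is the legacy session_bundle exporter module; its import is not
# shown in the original post.

def convert_model(source, export_path, export_version):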
    K.set_learning_phase(0)
    model = load_model(source)
    sess = K.get_session()

    config = model.get_config()
    weights = model.get_weights()
    model = Model.from_config(config)
    model.set_weights(weights)

    saver = tf.train.Saver(sharded=True)
    model_exporter = exporter.Exporter(saver)
    model_exporter.init(sess.graph.as_graph_def(), named_graph_signatures={
        'inputs': exporter.generic_signature({'first': model.input[0], 'second': model.input[1]}),
        'outputs': exporter.generic_signature({'predictions': model.output})})
    model_exporter.export(export_path, tf.constant(export_version), sess)

Performance measurements (for both the Python and TensorFlow Serving versions) were taken with the GPU disabled. Do you have any ideas?

Best regards,
Mikhail
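
A comparison like this is easier to reason about when both paths are timed by the same harness, with warm-up calls excluded. The following is a minimal sketch, not the measurement code used in this thread: `time_calls`, `summarize`, and the stand-in workload are hypothetical names, and `predict` would be replaced by either the Python model call or the gRPC request to the model server.

```python
import statistics
import time


def time_calls(predict, n_warmup=5, n_runs=50):
    """Return per-call latencies in milliseconds for a zero-argument callable."""
    # Warm-up calls are not timed (first calls often include one-off setup cost).
    for _ in range(n_warmup):
        predict()
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Summarize latencies with median, approximate 95th percentile, and max."""
    ordered = sorted(latencies)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Stand-in workload (assumption); swap in the real prediction call.
    def fake_predict():
        sum(i * i for i in range(10_000))

    print(summarize(time_calls(fake_predict)))
```

Reporting the median and a tail percentile rather than a single average makes it easier to see whether the slowdown is constant per request or caused by occasional long waits.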

performance

myurushkin

Most helpful comment

Here's an example of the contents of a batching parameters file:
max_batch_size { value: 200 }
batch_timeout_micros { value: 10000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 1 }

For your IDE question I don't know the answer. Perhaps you can open a separate issue for that, and maybe somebody from the community would be familiar with qtcreator.

chrisolston on 24 Mar 2017

👍5 🎉1
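
A file like the one above can also be generated from a script, which may help when trying several parameter combinations during tuning. This is only a sketch: `render_batching_parameters` is a hypothetical helper, and it assumes every field uses the `name { value: N }` form shown in the example.

```python
def render_batching_parameters(params):
    """Render a dict of integer settings in the text format shown above."""
    lines = []
    for name, value in params.items():
        # Each field is a wrapper message holding a single integer value.
        lines.append(f"{name} {{ value: {int(value)} }}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_batching_parameters({
        "max_batch_size": 200,
        "batch_timeout_micros": 10000,
        "max_enqueued_batches": 1000000,
        "num_batch_threads": 1,
    })
    print(text, end="")
```

The resulting text would be saved to a file whose path is passed via --batching_parameters_file together with --enable_batching.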

All 12 comments

If the Session::Run() method is what consumes all the time, then the cost is likely in the TensorFlow layer (not tensorflow-serving), with one exception: can you check whether you are getting a wrapped Session, e.g. BatchingSession? It's possible that you've configured batching with "bad" tuning parameters that are slowing you down, e.g. a long timeout such that the batcher waits a long time before running a batch.

chrisolston on 24 Mar 2017
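
To make the timeout concern concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model that is not taken from the tensorflow-serving source: requests arrive at a steady rate, and a batch is run either when it is full or when the timeout expires after its first request. `first_request_wait_micros` is a hypothetical name, and time spent waiting for a free batch thread is ignored.

```python
def first_request_wait_micros(requests_per_second, max_batch_size,
                              batch_timeout_micros):
    """Estimate how long the first request in a batch waits before it runs."""
    if requests_per_second <= 0:
        # No further traffic: only the timeout can release the batch.
        return float(batch_timeout_micros)
    micros_between_requests = 1_000_000 / requests_per_second
    # Time until the remaining (max_batch_size - 1) slots are filled.
    micros_to_fill = (max_batch_size - 1) * micros_between_requests
    return min(float(batch_timeout_micros), micros_to_fill)


if __name__ == "__main__":
    # Low traffic: the batch never fills, so the full timeout is paid.
    print(first_request_wait_micros(10, 128, 10000))
    # High traffic: the batch fills well before the timeout.
    print(first_request_wait_micros(100000, 128, 10000))
```

Under this model, a long timeout at a low request rate adds close to the whole timeout to each request's latency, which would show up inside the time measured around Session::Run().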

Yes, BatchingSession is used.
As for the batching parameters, I didn't understand how to use platform_config_file/batching_parameters_file, so I added several command-line parameters:

      tensorflow::Flag("max_batch_size", &max_batch_size, "max batch size"),
      tensorflow::Flag("num_batch_threads", &num_batch_threads, "num of batch threads"),
      tensorflow::Flag("batch_timeout_micros", &batch_timeout_micros, "batch timeout micros"),

After that I used them in the following way:

 if (platform_config_file.empty()) {
    SessionBundleConfig session_bundle_config;
    // Batching config
    if (enable_batching) {
      BatchingParameters* batching_parameters =
          session_bundle_config.mutable_batching_parameters();
      if (batching_parameters_file.empty()) {
        batching_parameters->mutable_thread_pool_name()->set_value(
            "model_server_batch_threads");
        // Wrapper-typed fields are set through their mutable accessors.
        batching_parameters->mutable_num_batch_threads()->set_value(
            num_batch_threads);
        batching_parameters->mutable_max_batch_size()->set_value(
            max_batch_size);
        batching_parameters->mutable_batch_timeout_micros()->set_value(
            batch_timeout_micros);
        // max_enqueued_batches is left unset (the block that set it was
        // commented out), so it prints as 0 below.

        std::cout << "--------------------------\n";
        std::cout << "num_batch_threads: " << batching_parameters->num_batch_threads().value() << std::endl;
        std::cout << "max_batch_size: " << batching_parameters->max_batch_size().value() << std::endl;
        std::cout << "max_enqueued_batches: " << batching_parameters->max_enqueued_batches().value() << std::endl;
        std::cout << "batch_timeout_micros: " << batching_parameters->batch_timeout_micros().value() << std::endl;
        std::cout << "--------------------------\n";
      } else {
        *batching_parameters =
            ReadProtoFromFile<BatchingParameters>(batching_parameters_file);
      }
    } else if (!batching_parameters_file.empty()) {
      CHECK(false)  // Crash ok
          << "You supplied --batching_parameters_file without "
             "--enable_batching";
    }

    session_bundle_config.mutable_session_config()
        ->set_intra_op_parallelism_threads(tensorflow_session_parallelism);
    session_bundle_config.mutable_session_config()
        ->set_inter_op_parallelism_threads(tensorflow_session_parallelism);
    options.platform_config_map = CreateTensorFlowPlatformConfigMap(
        session_bundle_config, use_saved_model);
  } else {
    options.platform_config_map = ParsePlatformConfigMap(platform_config_file);
  }

Example of cmd arguments:

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=my-model-name --model_base_path=model_path --enable_batching --max_batch_size=128 --num_batch_threads=8 --batch_timeout_micros=2

This is the output:

num_batch_threads: 8
max_batch_size: 128
max_enqueued_batches: 0
batch_timeout_micros: 2
--------------------------

I tried to tune batch_timeout_micros/num_batch_threads/max_batch_size in several ways, but it didn't help.

myurushkin on 24 Mar 2017

Thanks for the detailed explanation. Some thoughts:

The simplest way to configure batching is with the (new) --batching_parameters_file flag. Just create a file containing an ascii representation of a BatchingParameters proto, and pass the file path as the flag value.
Since your concern is the performance of the tf-serving c++ case versus python, I would suggest trying to set up an apples-to-apples comparison. Assuming the python case is not doing any batching, then how about starting with batching disabled in the c++ case?
Batching parameter tuning can be tricky. See tensorflow_serving/batching/README.md for some ideas. Unfortunately there is no universal best approach because it depends on your model, system environment, and latency requirements. One thing I did notice is that your batch_timeout_micros is very small. It's possible that you are getting batches of size 1, which would mean you are paying for the overhead of batching but getting none of the benefit. If your model is very cheap, then it's possible that the overhead of batching costs more than running the model itself.

-Chris

chrisolston on 24 Mar 2017

👍3 🎉1
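
The "batches of size 1" point can be checked with simple arithmetic. The sketch below assumes steady request arrivals and a hypothetical request rate; `estimated_batch_size` is an illustrative name, not a tensorflow-serving API.

```python
def estimated_batch_size(requests_per_second, max_batch_size,
                         batch_timeout_micros):
    """Estimate how many requests end up in one batch under steady arrivals."""
    # Requests expected to arrive while the first request waits for the timeout.
    arrivals_during_timeout = requests_per_second * batch_timeout_micros / 1_000_000
    return min(max_batch_size, 1 + int(arrivals_during_timeout))


if __name__ == "__main__":
    # The flags from this thread (max_batch_size=128, batch_timeout_micros=2)
    # at an assumed 1000 requests per second.
    print(estimated_batch_size(1000, 128, 2))
    # The same assumed rate with a 10 ms timeout.
    print(estimated_batch_size(1000, 128, 10000))
```

With a 2-microsecond timeout, the request rate would have to be on the order of hundreds of thousands of requests per second before a second request could join the batch, so at ordinary rates each batch holds a single request.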

Thank you very much, Chris!

Could you please show me an example of a file containing a BatchingParameters proto? I didn't find any examples in the repo.
I also had a problem with developing and patching TensorFlow-related code. Is it possible to use qtcreator or another IDE with the bazel-based TensorFlow Serving code? At the moment I make patches in Notepad with no debugger support.

> Since your concern is the performance of the tf-serving c++ case versus python, I would suggest trying to set up an apples-to-apples comparison. Assuming the python case is not doing any batching, then how about starting with batching disabled in the c++ case?

No, I used Python with batching, so I believe my performance comparison is fair.

Best,
Mikhail

myurushkin on 24 Mar 2017

👍1

For your IDE question I don't know the answer. Perhaps you can open a separate issue for that, and maybe somebody from the community would be familiar with qtcreator.

chrisolston on 24 Mar 2017

👍5 🎉1

Thank you so much!

Best regards,
Mikhail

myurushkin on 24 Mar 2017

Hi @myurushkin, it looks like the performance and batching config related questions are resolved. Closing the issue but please feel free to reopen if there is anything unresolved. Thanks!

sukritiramesh on 18 Apr 2017

Hi,

Does anyone know how to export a trained TensorFlow model? How do I generate a signature? I already have a trained model.
How do I export it for TensorFlow Serving, and how do I write an (RPC) client?

avanish123 on 29 May 2017

👍1

Export:

import os

import tensorflow as tf
from tensorflow.python.saved_model import (builder, signature_constants,
                                           signature_def_utils, tag_constants,
                                           utils)

# EXPORT_DIR is defined elsewhere in the author's code (not shown).
def save_model(pred_data, exp_pred, session, version=1):
    # Convert both path components to bytes so they can be joined safely.
    export_path = os.path.join(tf.compat.as_bytes(EXPORT_DIR),
                               tf.compat.as_bytes(str(version)))
    save_builder = builder.SavedModelBuilder(export_path)

    inputs = {'source': utils.build_tensor_info(pred_data[0]),
              'target_indices': utils.build_tensor_info(pred_data[1]),
              'target_values': utils.build_tensor_info(pred_data[2]),
              'target_shape': utils.build_tensor_info(pred_data[3]),
              'source_len': utils.build_tensor_info(pred_data[4])
              }

    outputs = {'decoded': utils.build_tensor_info(exp_pred[0]),
               'error_rate': utils.build_tensor_info(exp_pred[1])
               }

    prediction_signature = signature_def_utils.build_signature_def(
        inputs=inputs, outputs=outputs,
        method_name=signature_constants.PREDICT_METHOD_NAME)
    # Register the signature under a custom name and as the default.
    signature_def_map = {
        'predict_speech': prediction_signature,
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: prediction_signature}
    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
    tags = [tag_constants.SERVING]

    save_builder.add_meta_graph_and_variables(
        sess=session, tags=tags, signature_def_map=signature_def_map,
        legacy_init_op=legacy_init_op)
    return save_builder.save()

Client:

import tensorflow as tf
from grpc.beta import implementations

import serve.predict_pb2 as predict_pb2
import serve.prediction_service as service

host, port = '192.168.8.38', 8500
channel = implementations.insecure_channel(host, int(port))
stub = service.beta_create_PredictionService_stub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'DLSTM_L4_W256_R1'
request.model_spec.signature_name = 'predict_speech'

# `test` is the author's dataset object (not shown): test[1] is used as the
# number of batches and test[2] as a batch iterator.
num_test_batches = test[1]
for ti in range(num_test_batches):
    test_data = next(test[2])
    request.inputs['source'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[0].value))
    request.inputs['target_indices'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[1].value))
    request.inputs['target_values'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[2].value))
    request.inputs['target_shape'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[3].value))
    request.inputs['source_len'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[4].value))

    result = stub.Predict(request, 50.0)  # 50-second timeout

shahinkl on 29 Jun 2017

@myurushkin Were you able to fix your performance issues by configuring the batching parameters?

karthikvadla on 27 Oct 2018

Hi @karthikvadla,
Actually, I managed to tune my C++ service, but for several reasons (the difficulty of supporting C++ code and gRPC latency overhead) I decided to reimplement the whole service in Go (without TensorFlow Serving).

myurushkin on 29 Oct 2018

I am using batching for prediction but receive this error:
"Batching session Run() input tensors must have equal 0th-dimension size"
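
Going by the wording of the message, every input tensor in a request needs the same size in its first (batch) dimension. A client-side check along the following lines may help locate the offending input before the request is sent. This is a hypothetical sketch: `mismatched_batch_dims` and the example shapes are illustrative and not part of any TensorFlow API.

```python
def mismatched_batch_dims(input_shapes):
    """Return {input name: 0th-dimension size} if the sizes disagree, else {}."""
    # A scalar input (empty shape) has no 0th dimension; record it as None.
    first_dims = {name: (shape[0] if len(shape) > 0 else None)
                  for name, shape in input_shapes.items()}
    if len(set(first_dims.values())) <= 1:
        return {}
    return first_dims


if __name__ == "__main__":
    # Example shapes (assumed): 'source_len' has 1 row while 'source' has 16.
    shapes = {"source": (16, 100, 39), "source_len": (1,)}
    print(mismatched_batch_dims(shapes))
```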