Hi there!
Could you help me? I have the following issue: my trained model predicts about 10x slower via TensorFlow Serving than the Python version. I've debugged the server-side code and found that this piece of code consumes all the time:
TF_RETURN_IF_ERROR(bundle->session->Run(run_options, input_tensors,
                                        output_tensor_names, {}, &outputs,
                                        &run_metadata));
The full listing is here:
Status SavedModelPredict(const RunOptions& run_options, ServerCore* core,
                         const PredictRequest& request,
                         PredictResponse* response) {
  // Validate signatures.
  ServableHandle<SavedModelBundle> bundle;
  TF_RETURN_IF_ERROR(core->GetServableHandle(request.model_spec(), &bundle));
  const string signature_name = request.model_spec().signature_name().empty()
                                    ? kDefaultServingSignatureDefKey
                                    : request.model_spec().signature_name();
  auto iter = bundle->meta_graph_def.signature_def().find(signature_name);
  if (iter == bundle->meta_graph_def.signature_def().end()) {
    return errors::FailedPrecondition(
        "Default serving signature key not found.");
  }
  SignatureDef signature = iter->second;
  std::vector<std::pair<string, Tensor>> input_tensors;
  std::vector<string> output_tensor_names;
  std::vector<string> output_tensor_aliases;
  TF_RETURN_IF_ERROR(PreProcessPrediction(signature, request, &input_tensors,
                                          &output_tensor_names,
                                          &output_tensor_aliases));
  std::vector<Tensor> outputs;
  RunMetadata run_metadata;
  auto t1 = std::chrono::steady_clock::now();
  TF_RETURN_IF_ERROR(bundle->session->Run(run_options, input_tensors,
                                          output_tensor_names, {}, &outputs,
                                          &run_metadata));
  auto t2 = std::chrono::steady_clock::now();
  auto result = PostProcessPredictionResult(signature, output_tensor_aliases,
                                            outputs, response);
  auto t3 = std::chrono::steady_clock::now();
  std::cout << "p1 t2-t1: "
            << std::chrono::duration<double, std::milli>(t2 - t1).count()
            << " msec" << std::endl;
  std::cout << "p2 t3-t2: "
            << std::chrono::duration<double, std::milli>(t3 - t2).count()
            << " msec" << std::endl;
  return result;
}
This method is executed by tensorflow_serving/model_servers/main.cc.
I used Python/Keras for training. The model was exported as follows:
def convert_model(source, export_path, export_version):
    K.set_learning_phase(0)
    model = load_model(source)
    sess = K.get_session()
    config = model.get_config()
    weights = model.get_weights()
    model = Model.from_config(config)
    model.set_weights(weights)
    saver = tf.train.Saver(sharded=True)
    model_exporter = exporter.Exporter(saver)
    model_exporter.init(sess.graph.as_graph_def(), named_graph_signatures={
        'inputs': exporter.generic_signature({'first': model.input[0], 'second': model.input[1]}),
        'outputs': exporter.generic_signature({'predictions': model.output})})
    model_exporter.export(export_path, tf.constant(export_version), sess)
Performance measurements (for both the Python and TensorFlow Serving versions) were taken with the GPU disabled. Do you have any ideas?
Best regards,
Mikhail
If the Session::Run() call is what consumes all the time, then the slowdown is likely in the TensorFlow layer (not TensorFlow Serving), with one exception: can you check whether you are getting a wrapped Session, e.g. BatchingSession? It's possible that you've configured batching with "bad" tuning parameters that are slowing you down, e.g. a long timeout that makes the batcher wait a long time.
Yes, BatchingSession is used.
As for the batching parameters, I didn't understand how to use platform_config_file/batching_parameters_file, so I added several command-line flags:
tensorflow::Flag("max_batch_size", &max_batch_size, "max batch size"),
tensorflow::Flag("num_batch_threads", &num_batch_threads, "num of batch threads"),
tensorflow::Flag("batch_timeout_micros", &batch_timeout_micros, "batch timeout micros"),
After that I used them in the following way:
if (platform_config_file.empty()) {
  SessionBundleConfig session_bundle_config;
  // Batching config
  if (enable_batching) {
    BatchingParameters* batching_parameters =
        session_bundle_config.mutable_batching_parameters();
    if (batching_parameters_file.empty()) {
      batching_parameters->mutable_thread_pool_name()->set_value(
          "model_server_batch_threads");
      {
        auto* param = new google::protobuf::Int64Value();
        param->set_value(num_batch_threads);
        batching_parameters->set_allocated_num_batch_threads(param);
      }
      {
        auto* param = new google::protobuf::Int64Value();
        param->set_value(max_batch_size);
        batching_parameters->set_allocated_max_batch_size(param);
      }
      {
        auto* param = new google::protobuf::Int64Value();
        param->set_value(batch_timeout_micros);
        batching_parameters->set_allocated_batch_timeout_micros(param);
      }
      /*
      {
        auto* param = new google::protobuf::Int64Value();
        param->set_value(max_enqueued_batches);
        batching_parameters->set_allocated_max_enqueued_batches(param);
      }
      */
      {
        auto* param = new google::protobuf::Int64Value();
        param->set_value(batch_timeout_micros);
        batching_parameters->set_allocated_batch_timeout_micros(param);
      }
      std::cout << "--------------------------\n";
      std::cout << "num_batch_threads: " << batching_parameters->num_batch_threads().value() << std::endl;
      std::cout << "max_batch_size: " << batching_parameters->max_batch_size().value() << std::endl;
      std::cout << "max_enqueued_batches: " << batching_parameters->max_enqueued_batches().value() << std::endl;
      std::cout << "batch_timeout_micros: " << batching_parameters->batch_timeout_micros().value() << std::endl;
      std::cout << "--------------------------\n";
    } else {
      *batching_parameters =
          ReadProtoFromFile<BatchingParameters>(batching_parameters_file);
    }
  } else if (!batching_parameters_file.empty()) {
    CHECK(false)  // Crash ok
        << "You supplied --batching_parameters_file without "
           "--enable_batching";
  }
  session_bundle_config.mutable_session_config()
      ->set_intra_op_parallelism_threads(tensorflow_session_parallelism);
  session_bundle_config.mutable_session_config()
      ->set_inter_op_parallelism_threads(tensorflow_session_parallelism);
  options.platform_config_map = CreateTensorFlowPlatformConfigMap(
      session_bundle_config, use_saved_model);
} else {
  options.platform_config_map = ParsePlatformConfigMap(platform_config_file);
}
Example command-line arguments:
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=my-model-name --model_base_path=model_path --enable_batching --max_batch_size=128 --num_batch_threads=8 --batch_timeout_micros=2
This is the output:
num_batch_threads: 8
max_batch_size: 128
max_enqueued_batches: 0
batch_timeout_micros: 2
--------------------------
I tried to tune batch_timeout_micros/num_batch_threads/max_batch_size in several ways, but it didn't help.
Thanks for the detailed explanation. Some thoughts:
The simplest way to configure batching is with the (new) --batching_parameters_file flag. Just create a file containing an ascii representation of a BatchingParameters proto, and pass the file path as the flag value.
Since your concern is the performance of the tf-serving c++ case versus python, I would suggest trying to set up an apples-to-apples comparison. Assuming the python case is not doing any batching, then how about starting with batching disabled in the c++ case?
Batching parameter tuning can be tricky. See tensorflow_serving/batching/README.md for some ideas. Unfortunately there is no universal best approach because it depends on your model, system environment, and latency requirements. One thing I did notice is that your batch_timeout_micros is very small. It's possible that you are getting batches of size 1, which would mean you are paying for the overhead of batching but getting none of the benefit. If your model is very cheap, then it's possible that the overhead of batching costs more than running the model itself.
-Chris
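To make the point about a very small batch_timeout_micros concrete, here is a rough Python sketch. It is a simplified model of timeout-based batching, not the actual TensorFlow Serving scheduler: the function `form_batches` and the arrival pattern (one request every 100 microseconds) are assumptions made up for illustration.

```python
def form_batches(arrival_times_us, max_batch_size, batch_timeout_us):
    """Group sorted request arrival times (in microseconds) into batches.

    Simplified model (an assumption, not the real scheduler): a batch opens
    when its first request arrives and closes once it is full or once
    batch_timeout_us has elapsed since it opened.
    """
    batches = []
    current = []
    for t in arrival_times_us:
        batch_full = len(current) >= max_batch_size
        timed_out = bool(current) and (t - current[0] > batch_timeout_us)
        if current and (batch_full or timed_out):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Hypothetical load: 1000 requests, one every 100 microseconds.
arrivals = [i * 100 for i in range(1000)]

# Timeout of 2 microseconds, as in the command line above.
tiny = form_batches(arrivals, max_batch_size=128, batch_timeout_us=2)
# Timeout of 10000 microseconds for comparison.
longer = form_batches(arrivals, max_batch_size=128, batch_timeout_us=10000)

print("timeout=2:     batches =", len(tiny), " largest =", max(len(b) for b in tiny))
print("timeout=10000: batches =", len(longer), " largest =", max(len(b) for b in longer))
```

Under this toy model the 2-microsecond timeout never lets a second request join a batch, so every batch has size 1, while the longer timeout groups roughly a hundred requests per batch. Real behaviour also depends on num_batch_threads and on how long the model itself takes to run.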
Thank you very much, Chris!
Could you please show me an example of a file containing a BatchingParameters proto? I didn't find any examples in the repo.
I also have a problem with developing and patching TensorFlow-related code. Is it possible to use Qt Creator or another IDE with the Bazel-based TensorFlow Serving code? At the moment I make patches in Notepad with no debugger support.
> Since your concern is the performance of the tf-serving c++ case versus python, I would suggest trying to set up an apples-to-apples comparison. Assuming the python case is not doing any batching, then how about starting with batching disabled in the c++ case?
No, I used Python with batching, so I believe my performance comparison is fair.
Best,
Mikhail
Here's an example of the contents of a batching parameters file:
max_batch_size { value: 200 }
batch_timeout_micros { value: 10000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 1 }
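If it helps, a file in this format can be generated from a script. The following is only a sketch: the helper `render_batching_parameters` and the output path are made up for illustration, and the values are simply the ones from the example above, not tuned recommendations.

```python
import os
import tempfile


def render_batching_parameters(params):
    """Render a dict as lines of the form `name { value: N }`."""
    lines = [f"{name} {{ value: {int(value)} }}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


# Values copied from the example above.
params = {
    "max_batch_size": 200,
    "batch_timeout_micros": 10000,
    "max_enqueued_batches": 1000000,
    "num_batch_threads": 1,
}

text = render_batching_parameters(params)

# Hypothetical location; any readable path works.
path = os.path.join(tempfile.gettempdir(), "batching_parameters.txt")
with open(path, "w") as f:
    f.write(text)

print(text, end="")
```

The resulting path is then passed via --batching_parameters_file together with --enable_batching; per the server code quoted earlier in this thread, supplying the file without --enable_batching aborts at startup.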
For your IDE question I don't know the answer. Perhaps you can open a separate issue for that, and maybe somebody from the community would be familiar with qtcreator.
Thank you so much!
Best regards,
Mikhail
Hi @myurushkin, it looks like the performance and batching config related questions are resolved. Closing the issue but please feel free to reopen if there is anything unresolved. Thanks!
Hi,
Does anyone know how to export a trained TensorFlow model and how to generate a signature? I already have a trained model.
How do I export it for TensorFlow Serving, and how do I write an RPC client?
Export:
`def save_model(pred_data, exp_pred, session, version=1):
    export_path = os.path.join(EXPORT_DIR, tf.compat.as_bytes(str(version)))
    save_builder = builder.SavedModelBuilder(export_path)
    inputs = {'source': utils.build_tensor_info(pred_data[0]),
              'target_indices': utils.build_tensor_info(pred_data[1]),
              'target_values': utils.build_tensor_info(pred_data[2]),
              'target_shape': utils.build_tensor_info(pred_data[3]),
              'source_len': utils.build_tensor_info(pred_data[4])
              }
    outputs = {'decoded': utils.build_tensor_info(exp_pred[0]),
               'error_rate': utils.build_tensor_info(exp_pred[1])
               }
    prediction_signature = signature_def_utils.build_signature_def(inputs=inputs, outputs=outputs, method_name=signature_constants.PREDICT_METHOD_NAME)
    signature_def_map = {'predict_speech': prediction_signature, tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: prediction_signature}
    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
    tags = [tag_constants.SERVING]
    save_builder.add_meta_graph_and_variables(sess=session, tags=tags, signature_def_map=signature_def_map, legacy_init_op=legacy_init_op)
    return save_builder.save()
Client:
import tensorflow as tf
from grpc.beta import implementations
import serve.predict_pb2 as predict_pb2
import serve.prediction_service as service
host, port = '192.168.8.38', 8500
channel = implementations.insecure_channel(host, int(port))
stub = service.beta_create_PredictionService_stub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'DLSTM_L4_W256_R1'
request.model_spec.signature_name = 'predict_speech'
num_test_batches = test[1]
for ti in range(num_test_batches):
    test_data = next(test[2])
    request.inputs['source'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[0].value))
    request.inputs['target_indices'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[1].value))
    request.inputs['target_values'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[2].value))
    request.inputs['target_shape'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[3].value))
    request.inputs['source_len'].CopyFrom(tf.contrib.util.make_tensor_proto(test_data[4].value))
    result = stub.Predict(request, 50.0)`
@myurushkin Are you able to fix your performance issues by configuring batch_parameters?
Hi @karthikvadla,
Actually, I managed to tune my C++ service, but for several reasons (the difficulty of supporting C++ code and gRPC latency overhead) I decided to reimplement the whole service in Go (without TensorFlow Serving).
I am using batching for prediction but receive this error:
"Batching session Run() input tensors must have equal 0th-dimension size"
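Regarding the "input tensors must have equal 0th-dimension size" error reported above: the message says that all input tensors in a single request must share the same size along their first dimension, since batching treats that dimension as the batch dimension. A client-side check along these lines may help locate the offending input before sending the request. This is only a sketch: `check_batch_dims` is a made-up helper, and the shapes below are invented for illustration, reusing input names that appear in this thread.

```python
def check_batch_dims(input_shapes):
    """Return the shared 0th-dimension size, or raise ValueError on a mismatch.

    input_shapes maps an input name to its shape as a tuple of ints.
    """
    if not input_shapes:
        raise ValueError("no inputs given")
    sizes = {}
    for name, shape in input_shapes.items():
        if len(shape) == 0:
            raise ValueError(f"input '{name}' is a scalar and has no 0th dimension")
        sizes[name] = shape[0]
    if len(set(sizes.values())) > 1:
        raise ValueError(f"inputs disagree on 0th-dimension size: {sizes}")
    return next(iter(sizes.values()))


# Invented shapes: 4 examples per request.
consistent = {"source": (4, 120, 39), "source_len": (4,)}
print("batch size:", check_batch_dims(consistent))

# Invented shapes: one input has 57 rows instead of 4, so the check fails.
mismatched = {"source": (4, 120, 39), "source_len": (4,), "target_indices": (57, 2)}
try:
    check_batch_dims(mismatched)
except ValueError as err:
    print("rejected:", err)
```

One common way to end up with a mismatch is passing the components of a sparse tensor as separate inputs: its indices have one row per non-zero element rather than one row per example. Whether that is the cause in any particular model is something to verify against the actual shapes.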