Serving: Is it normal that predicting a single ImageNet image with the exported Inception v3 model in TensorFlow Serving takes around 15 seconds?

Created on 10 Apr 2017 · 19 Comments · Source: tensorflow/serving

I exported the Inception v3 model in .pb + variables format using the Inception SavedModel Python script in the example folder and deployed it on my TensorFlow Serving machine, built with bazel. When I use the Inception client in the example folder to send an ImageNet image (for example, a rose) to the server for prediction, the end-to-end time is 15 seconds.
I also added code to time only the single statement result = stub.Predict(request, request_timeout), and that call alone takes more than 14 seconds. I have another model, AlexNet, which takes about 2.5 seconds to predict one image. Is this normal, or is something wrong with the sample code or the model? Or is the prediction time proportional to the depth of the model, so that deeper models take much longer than shallower ones even for a single image? Would the average time per image drop if I predicted a batch of images at once? For reference, my machine is an ordinary virtual machine: Linux on Intel, 4 CPU cores and 8 GB of memory.
Another question: are there any best practices or performance tuning methods for TensorFlow Serving? This latency (I am not sure whether prediction time counts as latency) is not acceptable. Is it plausible that the prediction time would drop by a factor of 5-10 on a more powerful machine, say 16 cores and 32 GB of memory, or by more than a factor of 10 with GPU acceleration?
Thanks a lot for answering so many questions! If they are not appropriate here, I could also post them on StackOverflow, but I am not sure anyone there would answer them to the point.
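
To make timings like the ones above comparable, it can help to time the call several times and keep the first call apart from the rest, since a first request may include one-time setup cost. The following is a minimal sketch, not code from the Inception client: measure_latency and summarize are hypothetical helper names, and the sleep stands in for the real stub.Predict(request, request_timeout) call.

```python
import statistics
import time


def measure_latency(call, repeats=10, warmup=1):
    """Run call() warmup + repeats times; return (warmup_times, steady_times) in seconds."""
    def timed():
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    warmup_times = [timed() for _ in range(warmup)]
    steady_times = [timed() for _ in range(repeats)]
    return warmup_times, steady_times


def summarize(samples):
    """Return min, median and max of a non-empty list of timings."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in for the real request (assumption). In the client this would be:
    #   lambda: stub.Predict(request, request_timeout)
    fake_predict = lambda: time.sleep(0.01)
    first, rest = measure_latency(fake_predict, repeats=5)
    print("first call:", first)
    print("steady state:", summarize(rest))
```

If the first call is far slower than the median of the rest, the single-request number mostly reflects start-up cost rather than steady-state inference time.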

performance

All 19 comments

I have a similar issue.

I am using one of the pretrained models from this repo https://github.com/davidsandberg/facenet for face embedding. The model is converted into two formats: one is just a frozen graph .pb, the other is also a frozen graph, but saved in the tf-serving format, .pb + (empty) variables.

It takes 0.9 seconds to embed 10 images (one by one) if I run the model inside a Python process using the plain TensorFlow library.
It takes around 26 seconds to embed 10 images (one by one) if I run the model inside TensorFlow Serving and query it over gRPC from a Python process. Batching helps, but not much: it cuts the time to around 22 seconds.

So tf-serving seems to be more than 20 times slower than just using the TF library on my laptop.
The specs, if they matter: 4th-gen Core i7, 8 GB RAM, no GPU. For TensorFlow Serving, TensorFlow is compiled without any optimizations; for Python it is just installed via pip, which also appears to be without any optimizations. TensorFlow Serving runs inside a Docker image (Dockerfile.devel from the docs).

Any ideas what might be the problem here?

Would it be possible to use something like pprof to see where the performance snag is?

You could try compiling in optimized mode (bazel build -c opt ...).

I have a similar issue. I serve Faster RCNN in TF Serving (https://github.com/endernewton/tf-faster-rcnn). Running it in Python in a Docker container on my laptop (no GPU) takes about 8-10 seconds per image. Serving the exact same model with TF Serving in the same Docker container takes around 6 minutes, which is more than 30 times slower! Batching is not enabled, and I basically followed the same steps as in the MNIST tutorial. Does anyone have an idea what the issue could be?

I fixed my problem. I always got the following warnings:

2017-05-04 14:22:00.608491: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-04 14:22:00.608523: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-04 14:22:00.608529: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-04 14:22:00.608533: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-04 14:22:00.608538: W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 

Initially I did not worry about them, since I also got them in Python, where the model ran much faster. But since I could not find anything else, I started addressing them. After compiling with AVX2, FMA, etc. it runs fast: 4-5 seconds on CPU in TF Serving vs 8-10 seconds in Python (and 6 minutes in TF Serving without optimization)!

So here is how to compile TF Serving with the optimizations mentioned above:

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.2    //tensorflow_serving/model_servers:tensorflow_model_server

Hope it helps any of you with your performance issues.

Source: http://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructionsx
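
One way to avoid typing the flags by hand is to derive them from the cpu_feature_guard warnings themselves. The following is a minimal sketch, not part of TensorFlow Serving: FLAG_FOR_FEATURE and copts_from_log are hypothetical names, and the mapping is an assumption that only covers the five instruction sets named in the log above.

```python
import re

# Instruction-set names as printed by cpu_feature_guard, mapped to the
# compiler flags used in this thread (assumed mapping, five entries only).
FLAG_FOR_FEATURE = {
    "SSE4.1": "-msse4.1",
    "SSE4.2": "-msse4.2",
    "AVX": "-mavx",
    "AVX2": "-mavx2",
    "FMA": "-mfma",
}

WARNING_RE = re.compile(r"wasn't compiled to use (\S+) instructions")


def copts_from_log(log_text):
    """Return --copt flags for each known instruction set named in the warnings."""
    features = []
    for name in WARNING_RE.findall(log_text):
        # Skip unknown names and duplicates, keep first-seen order.
        if name in FLAG_FOR_FEATURE and name not in features:
            features.append(name)
    return ["--copt=" + FLAG_FOR_FEATURE[name] for name in features]


if __name__ == "__main__":
    sample = (
        "W cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled "
        "to use AVX2 instructions, but these are available on your machine\n"
        "W cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled "
        "to use FMA instructions, but these are available on your machine\n"
    )
    target = "//tensorflow_serving/model_servers:tensorflow_model_server"
    print(" ".join(["bazel", "build", "-c", "opt"] + copts_from_log(sample) + [target]))
```

Feeding it the server's startup log prints a build command with one --copt per missing instruction set.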

@markusnagel I think 4-6 seconds is still quite a long time. In Python I was able to predict an image with a frozen ResNet v2 model in about 1 second (excluding initializing/loading the model). With Serving it takes about 15 seconds. I'm compiling with optimizations now and will post an update.

@markusnagel, hi! Could you please explain how you served Faster RCNN in TF Serving?
I can't find out how to avoid using PyFunc, or how to serve it.
https://github.com/endernewton/tf-faster-rcnn/issues/113#issuecomment-306112530

@vaklyuenkov see my explanation in that issue. Next time please tag me there, then we don't spam this issue with unrelated things. Thanks.

I've run into the same problem: the Python implementation is about 10 times faster than tensorflow_model_server.

I want to emphasize the importance of building in optimized mode (bazel build -c opt ...), as Vinu pointed out above. That can make a huge difference for the performance of c++ binaries.

For the next posts on this thread, please specify whether you used -c opt (and if not, retry with -c opt before posting :). Let's rule out that basic problem before going deeper.

I've used the following options:

--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.2

I also tested Inception V3 and got the same problem.

I am using the command:

bazel build --config=mkl --config=opt --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma tensorflow_serving/...

to compile the serving code by default.

Here are some performance numbers I got on CPU (Intel Core i7-5930K CPU @ 3.50GHz). Batch size = 1.

| Configuration | Time (per inference) | Slowdown |
| --- |:---:|:---:|
| TensorFlow 1.0 | 0.062 s | 1x |
| TensorFlow Serving (with SSE/AVX) | 1.815 s | 29.2x |
| TensorFlow Serving (without SSE/AVX) | 3.596 s | 57.9x |

I am running the server and client on the same machine and, therefore, the communication cost is negligible.

I used perf to profile the execution. TensorFlow and TensorFlow Serving give very similar profiles (Eigen is running most of the time), which makes it difficult to find the bottleneck.

Any idea how this slowdown happens? Thanks.
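
For reference, the slowdown column is just each time divided by the plain TensorFlow baseline. A small sketch of that arithmetic follows (slowdown is a hypothetical helper name). With the rounded times from the table the ratios come out a little different in the last digit, which suggests the original figures were computed from unrounded timings; that is an inference, not something stated in the comment.

```python
def slowdown(seconds, baseline_seconds):
    """Return how many times slower `seconds` is than the baseline."""
    return seconds / baseline_seconds


# Times per inference as reported in the table (batch size = 1).
BASELINE = 0.062
TIMINGS = {
    "TensorFlow Serving (with SSE/AVX)": 1.815,
    "TensorFlow Serving (without SSE/AVX)": 3.596,
}

if __name__ == "__main__":
    for name, seconds in TIMINGS.items():
        print(f"{name}: {slowdown(seconds, BASELINE):.1f}x")
```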

In case people find typing out the full list of compiler flags annoying or potentially fragile: I just discovered that --copt=-march=native is valid and enables every instruction-set extension supported by the CPU of the build machine:

bazel build -c opt --copt=-march=native tensorflow_serving/...

However, I think MKL isn't being used since the performance is still worse than running the same SavedModel with Tensorflow in Python. Is there any way to get MKL working?
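
Because -march=native targets the CPU of the build machine, the resulting binary may not run on a machine with an older CPU. A quick way to see which of the instruction sets discussed in this thread a given x86 Linux host supports is to read the flags line of /proc/cpuinfo. This is a minimal sketch under those assumptions; parse_cpu_flags and supported are hypothetical helper names.

```python
import os

# /proc/cpuinfo spellings of the instruction sets discussed in this thread.
WANTED = ("sse4_1", "sse4_2", "avx", "avx2", "fma")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supported(cpuinfo_text, wanted=WANTED):
    """Map each wanted flag name to True/False depending on CPU support."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # only exists on Linux
    if os.path.exists(path):
        with open(path) as handle:
            print(supported(handle.read()))
    else:
        print("no /proc/cpuinfo on this platform")
```

Running it on both the build machine and the serving machine shows whether a binary built with -march=native on one is likely to be usable on the other.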

Problem solved. Please compile the serving code with

bazel build --config=mkl --config=opt --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-O3 tensorflow_serving/...

The only addition is the option --copt=-O3. With it, all my performance numbers become reasonable.

Hi @jiecaoyu

What does the --copt=-O3 option do? Why was it omitted?

Hi @mahnunchik

I think it forces the C/C++ compiler to optimize the code very heavily for performance.

I am not sure why it is omitted by default. I write GEMM libraries for my own use and always enable -O3.

In my case, preparing the tensor proto for the request input (using tf.contrib.util.make_tensor_proto()) in the serving client seems to take too much time: this step costs 1.8 s in my code, using the CPU only. Maybe this post could help by replacing this function.
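
To check whether the time goes into building the request or into the RPC itself, each client stage can be timed separately. The following is a minimal sketch using only the standard library: StageTimer is a hypothetical helper, and the sleeps are stand-ins for the real make_tensor_proto and stub.Predict calls.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage."""

    def __init__(self):
        self.totals = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def report(self):
        """Return (name, seconds) pairs, slowest stage first."""
        return sorted(self.totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("build request"):
        time.sleep(0.02)  # stand-in for make_tensor_proto(...)
    with timer.stage("rpc"):
        time.sleep(0.01)  # stand-in for stub.Predict(request, request_timeout)
    for name, seconds in timer.report():
        print(f"{name}: {seconds:.3f}s")
```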

@CrowbarKZ You mentioned that you were able to serve a frozen graph (.pb file) with empty variables in tf-serving. May I ask how you did it?

@atulpant I don't remember exactly, but I think I just followed the official docs on how to export a model for TensorFlow Serving (https://www.tensorflow.org/serving/serving_basic#train_and_export_tensorflow_model).
