Serving: Batching parameters

Created on 6 Mar 2017 · 16 Comments · Source: tensorflow/serving

What is the best way to use batching parameters such as max_batch_size, batch_timeout_micros, and num_batch_threads? I tried passing them when running the query client.

In the example below I have 100 images and want to batch them in groups of 10, but the query runs for all 100 images at once instead of 10 at a time.
bazel-bin/tensorflow_serving/example/demo_batch --server=localhost:9000 --max_batch_size = 10

Also, for batch scheduling, how can I make it run every 10 seconds after the first batch is done? I would appreciate any ideas.

gpu utilization contributions welcome feature performance

sskgit

Most helpful comment

I am running TFS (custom build for GPU) with the standard InceptionV3 model.

Amazon EC2 P2.xlarge instance:

export CUDA_HOME=/usr/local/cuda \
       TF_NEED_CUDA=1 \
       TF_CUDA_CLANG=0 \
       GCC_HOST_COMPILER_PATH=/usr/bin/gcc \
       TF_CUDA_VERSION=8.0 \
       CUDA_TOOLKIT_PATH=/usr/local/cuda \
       TF_CUDNN_VERSION=6 \
       CUDNN_INSTALL_PATH=/usr/local/cuda \
       CC_OPT_FLAGS="-c opt --copt=-mavx --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --config=opt --config=cuda" \
       PYTHON_BIN_PATH="/usr/bin/python" \
       USE_DEFAULT_PYTHON_LIB_PATH=1 \
       TF_NEED_JEMALLOC=1 \
       TF_NEED_GCP=0 \
       TF_NEED_HDFS=0 \
       TF_ENABLE_XLA=0 \
       TF_NEED_OPENCL=0 \
       TF_NEED_MKL=0 \
       TF_NEED_MPI=0 \
       TF_NEED_VERBS=0

Build tensorflow_model_server:

bazel build -c opt --copt=-mavx --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --config=opt --config=cuda \
    --crosstool_top=@local_config_cuda//crosstool:toolchain \
    tensorflow_serving/model_servers:tensorflow_model_server

Run the server using a batching config file:

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/inception-export --enable_batching --batching_parameters_file=batching.conf

batching.conf file contents:

max_batch_size { value: 16 }
batch_timeout_micros { value: 100000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 4 }
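
For scripting around a file like this one, a minimal Python sketch can read the four wrapped values back into a dictionary. The regular expression only handles the `name { value: N }` shape shown above; it is not a general text-format protobuf parser, and `parse_batching_conf` is a hypothetical helper name, not part of TensorFlow Serving.

```python
import re

# Matches entries of the form: name { value: 123 }
_FIELD = re.compile(r"(\w+)\s*\{\s*value:\s*(-?\d+)\s*\}")


def parse_batching_conf(text):
    """Return {field_name: int_value} for every `name { value: N }` entry."""
    return {name: int(value) for name, value in _FIELD.findall(text)}


conf_text = """
max_batch_size { value: 16 }
batch_timeout_micros { value: 100000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 4 }
"""

params = parse_batching_conf(conf_text)
print(params["max_batch_size"], params["num_batch_threads"])
```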

I set num_batch_threads equal to the number of CPU cores, i.e. 4.
I varied batch_timeout_micros between 0, 1000, 10000, 100000 and 500000.
I varied max_batch_size between 8, 16, 32, 64, 128 and 256.
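
A sweep like this can be enumerated in a few lines of Python. This is only a sketch: it assumes every size is paired with every timeout (the comment does not say whether the full cross product was run), and it keeps max_enqueued_batches and num_batch_threads at the values from batching.conf.

```python
import itertools

BATCH_TIMEOUT_MICROS = [0, 1000, 10000, 100000, 500000]
MAX_BATCH_SIZE = [8, 16, 32, 64, 128, 256]
NUM_BATCH_THREADS = 4  # fixed to the number of CPU cores in this test


def sweep():
    """Yield one parameter dict per (max_batch_size, batch_timeout_micros) pair."""
    for size, timeout in itertools.product(MAX_BATCH_SIZE, BATCH_TIMEOUT_MICROS):
        yield {
            "max_batch_size": size,
            "batch_timeout_micros": timeout,
            "max_enqueued_batches": 1000000,
            "num_batch_threads": NUM_BATCH_THREADS,
        }


runs = list(sweep())
print(len(runs))  # number of server configurations to benchmark
```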

The TensorFlow Serving initialisation logs show that the GPU is visible and utilised.

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1

will show how the GPU is being utilised while the performance script runs.

Test 1: Single process with 100 threads per process and 1 request per thread
Results:
Without batching (average response time over multiple executions: 2.75 sec):

avg. resp. time (msec) | failure rate % | model
2898.545 | 0.00% | inception

With batching, using the config above (average response time over multiple executions: 1.1 sec):

avg. resp. time (msec) | failure rate % | model
1179.988 | 0.00% | inception

Test 2: Single process with 100 threads per process and 10 requests per thread
Results:
Without batching (average response time over multiple executions: 4.5 sec):

avg. resp. time (msec) | failure rate % | model
4525.389 | 0.00% | inception

With batching, using the config above (average response time over multiple executions: 1.4 sec):

avg. resp. time (msec) | failure rate % | model
1377.638 | 0.00% | inception
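
To put the four measurements side by side, the ratio of the unbatched to the batched response time can be computed directly. The sketch below uses only the single-run figures from the tables above; the dictionary layout and the `speedup` helper are illustrative. It works out to about 2.5x for Test 1 and about 3.3x for Test 2.

```python
# Average response times in milliseconds, copied from the tables above.
results_msec = {
    "Test 1": {"without_batching": 2898.545, "with_batching": 1179.988},
    "Test 2": {"without_batching": 4525.389, "with_batching": 1377.638},
}


def speedup(without_batching, with_batching):
    """How many times faster the batched run responded on average."""
    return without_batching / with_batching


ratios = {name: speedup(**times) for name, times in results_msec.items()}
for name, ratio in ratios.items():
    print("{}: {:.2f}x faster with batching".format(name, ratio))
```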

The numbers look better with batching. There is a TF benchmark for training: https://www.tensorflow.org/performance/benchmarks
Is there any benchmark for TensorFlow Serving using the InceptionV3 model?

sreddybr3 on 26 Nov 2017

👍14 ❤2

All 16 comments

Check out batching/README.md.

chrisolston on 6 Mar 2017

Thanks @chrisolston. I went through the README.md for BatchingSession and BasicBatchScheduler. Most of the parameters needed are in these 2 files.

However, based on query client (inception_client.py), it doesn't look like it calls basic_batch_scheduler and/or BasicBatchScheduler. So, the question is how are these session and scheduling parameters passed to the query client? Looks like I am missing something...

sskgit on 7 Mar 2017

With the ModelServer binary the batching takes place in the server not the client. I just checked and unfortunately the batching parameters are currently hard-coded to the defaults. It would be easy to extend model_servers/main.cc to be able to accept the parameters via a textual proto file (analogous to model_config_file) containing a BatchingParameters proto, if you're up for making a (simple) code contribution :)

As a work-around, you can hard-code some specific values into the BatchingParameters proto in main.cc.
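
To illustrate how max_batch_size and batch_timeout_micros interact on the server side, here is a toy Python model. It is a deliberate simplification and not TensorFlow Serving's scheduler: it ignores threads, queue limits and processing time, and assumes a batch closes either when it is full or when the next request arrives more than the timeout after the batch opened. The function name `form_batches` is hypothetical.

```python
def form_batches(arrival_micros, max_batch_size, batch_timeout_micros):
    """Group request arrival times (in microseconds) into batches.

    A batch opens when its first request arrives. It closes when it already
    holds max_batch_size requests, or when the next request arrives more than
    batch_timeout_micros after the batch opened.
    """
    batches = []
    current = []
    for t in sorted(arrival_micros):
        if current and (len(current) >= max_batch_size
                        or t - current[0] > batch_timeout_micros):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# 100 requests arriving at the same instant, batched in groups of 10.
burst = [0] * 100
print([len(b) for b in form_batches(burst, 10, 100000)])

# The same 100 requests spaced one second apart never share a batch,
# because each arrives long after the 100 ms timeout has passed.
spaced = [i * 1000000 for i in range(100)]
print(len(form_batches(spaced, 10, 100000)))
```

In this model the client cannot choose the batch size; it can only influence batching through how closely in time its requests arrive.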

chrisolston on 7 Mar 2017

On second thought, since it's only a few lines of code I will add the flag. I'm working on the change internally at Google, and if things go smoothly it will propagate to open-source in a week or so.

chrisolston on 7 Mar 2017

@chrisolston Thanks for the clarification. Will try the workaround for now.

sskgit on 7 Mar 2017

As of now, tensorflow_model_server seems to support batching_parameters_file as an argument. Can someone point me to a template or specification for this file? I searched around and could not find anything quickly.

abuvaneswari on 8 Sep 2017

As the flag documentation states, it's an ascii protobuf for the BatchingParameters proto (defined in session_bundle_config.proto). You can find information about the ascii protobuf format elsewhere. It basically looks like:
message {
field1: value1
field2: value2
}
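
For the batching parameters specifically, the files posted in this thread wrap each number as `name { value: N }`. A small Python sketch that renders a dictionary in that shape is below; `render_batching_parameters` is a hypothetical helper, and the example values are the ones from the batching.conf posted in this thread, not recommended settings.

```python
def render_batching_parameters(params):
    """Render {name: int} as one `name { value: N }` line per field."""
    return "".join(
        "{} {{ value: {} }}\n".format(name, value)
        for name, value in params.items()
    )


text = render_batching_parameters({
    "max_batch_size": 16,
    "batch_timeout_micros": 100000,
    "max_enqueued_batches": 1000000,
    "num_batch_threads": 4,
})
print(text, end="")
```

The resulting text can be written to a file and passed to the server via --batching_parameters_file.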

chrisolston on 8 Sep 2017

What is the default setting for batching parameters when enable_batching=true? I looked at the following config file, played with the parameter settings and ran the model server against each of the settings, but the results do not match up with what I get when I do not supply any file name to batching_parameters_file (but have enable_batching=true)..
serving/tensorflow_serving/servables/tensorflow/testdata/batching_config.txt

abuvaneswari on 12 Sep 2017

The default values are in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/batching/basic_batch_scheduler.h

chrisolston on 12 Sep 2017

Hi @sreddybr3,
I'm trying to use batching to speed up inference. In my setup TensorFlow is not built in optimized mode, but that should be fine for batching. In my test case the input shape is [32, 112, 112, 3], so in batching.conf I set max_batch_size to 32. The test (say, 500 requests) takes just as long to finish, and if I increase max_batch_size the performance gets even worse. I also tweaked num_batch_threads, which does not seem to help much. Do you have any thoughts? Thanks!
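
One thing that may be worth checking here, assuming that a batch's size is counted in examples along the first tensor dimension rather than in requests (my reading of the batching README, not something confirmed in this thread): a request of shape [32, 112, 112, 3] would already contain 32 examples. The sketch below shows how many such requests could then be merged into one batch for a few values of max_batch_size; `requests_per_batch` is a hypothetical helper.

```python
def requests_per_batch(examples_per_request, max_batch_size):
    """Whole requests that fit in one batch if size is counted in examples."""
    return max_batch_size // examples_per_request


EXAMPLES_PER_REQUEST = 32  # first dimension of the [32, 112, 112, 3] input

for max_batch_size in (32, 64, 128):
    merged = requests_per_batch(EXAMPLES_PER_REQUEST, max_batch_size)
    print("max_batch_size={}: {} request(s) per batch".format(max_batch_size, merged))
```

Under that assumption, max_batch_size 32 leaves no room to merge requests, which would be consistent with seeing no speed-up at that setting.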

pharrellyhy on 24 Jul 2018

😕1 👍1

This is a great discussion and points to a need for documentation of batching for modelserver binary. I have opened #1379 to add docs - if anyone has any thoughts or suggestions please comment on that issue :)

misterpeddy on 14 Jun 2019

I am trying to use a batching config file via --batching_parameters_file. Where does this file have to be placed in order to be used?

TheR3d1 on 5 Jul 2019

Just give the absolute path of the file. With Docker, you will have to mount your local folder or place the file in the container itself.
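
As a rough sketch of the Docker case, the Python snippet below assembles a `docker run` argument list that mounts a host-side config file and points --batching_parameters_file at the path inside the container. All paths, the model name and the `docker_serving_args` helper are hypothetical; the flags themselves are the ones used elsewhere in this thread.

```python
import os


def docker_serving_args(host_conf, host_model_dir, model_name,
                        container_conf="/models/config/batching.conf"):
    """Build a `docker run` argv that mounts the model and the batching config."""
    host_conf = os.path.abspath(host_conf)            # bind mounts need absolute paths
    host_model_dir = os.path.abspath(host_model_dir)
    return [
        "docker", "run", "-p", "8501:8501",
        "-v", "{}:/models/{}".format(host_model_dir, model_name),
        "-v", "{}:{}".format(host_conf, container_conf),
        "-e", "MODEL_NAME={}".format(model_name),
        "-t", "tensorflow/serving",
        "--enable_batching=true",
        # The server reads the file at its path inside the container.
        "--batching_parameters_file={}".format(container_conf),
    ]


args = docker_serving_args("/host/config/batching.conf",
                           "/host/models/inception", "inception")
print(" ".join(args))
```

The list can be passed to `subprocess.run(args)` on a machine where Docker is installed.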

aaur0 on 18 Aug 2019

Can we reload the batch config after the server starts? I see that the model config has such a function.

mikezhang95 on 20 May 2020

👍2

Same problem. I run tf-serving like this:

sudo docker run -p 8501:8501 -d --name="tf_serving" \
--mount type=bind,source=/mnt1/zhaodachuan/tf_model/push/lr,target=/models/push_lr \
-v /mnt1/zhaodachuan/tf-serving/config_file/batch_size.config:/models/config/batch_size.config \
-e MODEL_NAME=push_lr -t tensorflow/serving --enable_batching=true \
--batching_parameters_file=/models/config/batch_size.config

And my batch_size.config is:

max_batch_size { value: 1000000 }
batch_timeout_micros { value: 0 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 8 }

It is just as slow as when I don't use --enable_batching=true.

DachuanZhao on 27 Sep 2020
