Serving: Batching parameters

Created on 6 Mar 2017 · 16 Comments · Source: tensorflow/serving

What is the best way to use batching parameters such as max_batch_size, batch_timeout_micros and num_batch_threads? I tried passing them while running the query client.

In the example below I have 100 images and want to batch them in groups of 10, but the query runs for all images instead of 10:
bazel-bin/tensorflow_serving/example/demo_batch --server=localhost:9000 --max_batch_size = 10

Also, for batch scheduling, how can I make it run every 10 seconds after the first batch is done? I would appreciate some ideas.
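
For illustration only, grouping 100 inputs into chunks of 10 on the client side could look like the sketch below. The `chunked` helper and the placeholder image names are assumptions and are not part of the demo client; the sketch only shows client-side grouping and does not configure anything on the server.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Placeholder names standing in for the 100 images from the question.
images = [f"image_{i:03d}.jpg" for i in range(100)]
batches = list(chunked(images, 10))

# Number of groups and the size of the first group.
print(len(batches), len(batches[0]))
```

Each group could then be sent as one request whose first dimension is the group size, assuming the exported model accepts a variable batch dimension.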

gpu utilization contributions welcome feature performance

Most helpful comment

I am running TFS (custom build for GPU) with the standard InceptionV3 model.

Amazon EC2 P2.xlarge instance:

export CUDA_HOME=/usr/local/cuda \
       TF_NEED_CUDA=1 \
       TF_CUDA_CLANG=0 \
       GCC_HOST_COMPILER_PATH=/usr/bin/gcc \
       TF_CUDA_VERSION=8.0 \
       CUDA_TOOLKIT_PATH=/usr/local/cuda \
       TF_CUDNN_VERSION=6 \
       CUDNN_INSTALL_PATH=/usr/local/cuda \
       CC_OPT_FLAGS="-c opt --copt=-mavx --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --config=opt --config=cuda" \
       PYTHON_BIN_PATH="/usr/bin/python" \
       USE_DEFAULT_PYTHON_LIB_PATH=1 \
       TF_NEED_JEMALLOC=1 \
       TF_NEED_GCP=0 \
       TF_NEED_HDFS=0 \
       TF_ENABLE_XLA=0 \
       TF_NEED_OPENCL=0 \
       TF_NEED_MKL=0 \
       TF_NEED_MPI=0 \
       TF_NEED_VERBS=0
  • build tensorflow_model_server
bazel build -c opt --copt=-mavx --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --config=opt --config=cuda \
    --crosstool_top=@local_config_cuda//crosstool:toolchain \
    tensorflow_serving/model_servers:tensorflow_model_server
  • run server using batching config file
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/inception-export --enable_batching --batching_parameters_file=batching.conf

batching.conf file contents

max_batch_size { value: 16 }
batch_timeout_micros { value: 100000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 4 }

I set num_batch_threads equal to the number of CPU cores, i.e. 4.
I varied batch_timeout_micros across 0, 1000, 10000, 100000 and 500000.
I varied max_batch_size across 8, 16, 32, 64, 128 and 256.
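
As a rough sketch, one way to generate a batching.conf for every combination of the values listed above is shown below. The `render_batching_config` helper is hypothetical, and sweeping the full grid is an assumption; the comment does not say whether every combination was tested.

```python
from itertools import product

# Values taken from the sweep described in this comment.
BATCH_TIMEOUTS_MICROS = [0, 1000, 10000, 100000, 500000]
MAX_BATCH_SIZES = [8, 16, 32, 64, 128, 256]


def render_batching_config(max_batch_size, batch_timeout_micros,
                           max_enqueued_batches=1000000, num_batch_threads=4):
    """Return file contents in the same layout as the batching.conf above."""
    return (
        f"max_batch_size {{ value: {max_batch_size} }}\n"
        f"batch_timeout_micros {{ value: {batch_timeout_micros} }}\n"
        f"max_enqueued_batches {{ value: {max_enqueued_batches} }}\n"
        f"num_batch_threads {{ value: {num_batch_threads} }}\n"
    )


# One config string per (max_batch_size, batch_timeout_micros) pair.
configs = {
    (size, timeout): render_batching_config(size, timeout)
    for size, timeout in product(MAX_BATCH_SIZES, BATCH_TIMEOUTS_MICROS)
}

print(len(configs))
print(configs[(16, 100000)])
```

Each string can be written to its own file and passed to the server with --batching_parameters_file, restarting the server between runs.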

The TensorFlow Serving initialisation logs show that the GPU is visible and utilised.

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1

will show how the GPU is being utilised while the performance script runs.

Test 1: Single process with 100 threads per process and 1 request per thread
Results:
Without batching (average response time over multiple executions is 2.75 sec):

avg. resp. time (msec) | failure rate % | model
2898.545 | 0.00% | inception

With batching using the above-mentioned config (average response time over multiple executions is 1.1 sec):

avg. resp. time (msec) | failure rate % | model
1179.988 | 0.00% | inception

Test 2: Single process with 100 threads per process and 10 requests per thread
Results:
Without batching (average response time over multiple executions is 4.5 sec):

avg. resp. time (msec) | failure rate % | model
4525.389 | 0.00% | inception

With batching using the above-mentioned config (average response time over multiple executions is 1.4 sec):

avg. resp. time (msec) | failure rate % | model
1377.638 | 0.00% | inception

The numbers look better with batching. We have a TF benchmark for training: https://www.tensorflow.org/performance/benchmarks
Do we have any benchmark for TensorFlow Serving using the InceptionV3 model?
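
To put the tables above in relative terms, the small sketch below divides the non-batched average response time by the batched one. The dictionary keys and the `speedup` helper are names chosen for illustration; the figures are the single-run values from the tables. It works out to roughly 2.5x for test 1 and 3.3x for test 2.

```python
# Average response times in msec, copied from the result tables above.
results = {
    "test 1 (100 threads x 1 request)": {"no_batching": 2898.545, "batching": 1179.988},
    "test 2 (100 threads x 10 requests)": {"no_batching": 4525.389, "batching": 1377.638},
}


def speedup(no_batching, batching):
    """Ratio of the non-batched to the batched average response time."""
    return no_batching / batching


speedups = {name: speedup(**times) for name, times in results.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x faster with batching")
```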

All 16 comments

Check out batching/README.md.

Thanks @chrisolston. I went through the README.md for BatchingSession and BasicBatchScheduler. Most of the parameters needed are in these 2 files.

However, based on the query client (inception_client.py), it doesn't look like it calls basic_batch_scheduler and/or BasicBatchScheduler. So the question is: how are these session and scheduling parameters passed to the query client? It looks like I am missing something...

With the ModelServer binary, the batching takes place in the server, not the client. I just checked and unfortunately the batching parameters are currently hard-coded to the defaults. It would be easy to extend model_servers/main.cc to accept the parameters via a textual proto file (analogous to model_config_file) containing a BatchingParameters proto, if you're up for making a (simple) code contribution :)

As a work-around, you can hard-code some specific values into the BatchingParameters proto in main.cc.
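
To make the idea of server-side batching concrete, here is a toy model of how a size limit and a timeout could interact. It is only a sketch under simplifying assumptions and is not TensorFlow Serving's scheduler: `form_batches` and the arrival times are hypothetical, a batch is closed when it is full or when the next request arrives at or after the timeout measured from the batch's first request, and threads and queue limits are ignored.

```python
from typing import List


def form_batches(arrival_micros: List[int], max_batch_size: int,
                 batch_timeout_micros: int) -> List[List[int]]:
    """Group request arrival times (in microseconds) into batches."""
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrival_micros):
        # Close the open batch if it has waited at least the timeout.
        if current and t - current[0] >= batch_timeout_micros:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it reaches the size limit.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times: four requests close together, two much later.
arrivals = [0, 10, 20, 30, 5000, 5010]
for size, timeout in [(4, 1000), (2, 1000), (4, 0)]:
    sizes = [len(b) for b in form_batches(arrivals, size, timeout)]
    print(f"max_batch_size={size} batch_timeout_micros={timeout} -> {sizes}")
```

In this toy model the client never sees the grouping; it sends individual requests and the grouping happens entirely on the receiving side.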

On second thought, since it's only a few lines of code I will add the flag. I'm working on the change internally at Google, and if things go smoothly it will propagate to open-source in a week or so.

@chrisolston Thanks for the clarification. Will try the workaround for now.

As of now, tensorflow_model_server seems to support batching_parameters_file as an argument. Can someone point me to a template or specification for this file? I searched around and could not find anything in a timely manner.

As the flag documentation states, it's an ascii protobuf for the BatchingParameters proto (defined in session_bundle_config.proto). You can find information about the ascii protobuf format elsewhere. It basically looks like:
message {
field1: value1
field2: value2
}
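
As a rough illustration of that layout, the sketch below reads the `name { value: N }` form used by the batching.conf contents quoted in this thread. It is a simplified regex reader written for this example, not a protobuf text-format parser; the `parse_wrapped_ints` name is hypothetical. A real parser would use protobuf's text_format module together with the generated BatchingParameters class.

```python
import re

# Matches entries such as: max_batch_size { value: 16 }
_FIELD = re.compile(r"(\w+)\s*\{\s*value\s*:\s*(-?\d+)\s*\}")


def parse_wrapped_ints(text):
    """Return {field_name: int_value} for every `name { value: N }` entry."""
    return {name: int(value) for name, value in _FIELD.findall(text)}


# Contents copied from the batching.conf quoted in this thread.
sample = """
max_batch_size { value: 16 }
batch_timeout_micros { value: 100000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 4 }
"""

params = parse_wrapped_ints(sample)
print(params)
```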

What is the default setting for batching parameters when enable_batching=true? I looked at the following config file, played with the parameter settings and ran the model server against each of the settings, but the results do not match what I get when I do not supply any file name to batching_parameters_file (but have enable_batching=true):
serving/tensorflow_serving/servables/tensorflow/testdata/batching_config.txt

Hi @sreddybr3,
I'm trying to use batching to speed up inference. In my setup, TensorFlow is not built in optimized mode, but that should be OK for batching. In my test case the input shape is [32, 112, 112, 3], so in batching.conf I set max_batch_size to 32. With this setting the test (say, 500 requests) takes the same time to finish, and if I increase max_batch_size the performance is even worse. I also tweaked the value of num_batch_threads, which does not seem to help much. Do you have any thoughts? Thanks!

This is a great discussion and points to a need for documentation of batching for the ModelServer binary. I have opened #1379 to add docs. If anyone has any thoughts or suggestions, please comment on that issue :)

I am trying to use a batching config file via --batching_parameters_file. Where does this file have to be placed in order to be used?

Just give the absolute path of the file. In the case of Docker, you will have to mount your local folder or have the file in the container itself.
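
A small sketch of that advice: the helper below checks that both paths are absolute and builds the Docker volume option and the server flags. The `batching_args` function and the example paths are hypothetical; the `-v` option and the two server flags are the ones used in commands elsewhere in this thread.

```python
import posixpath


def batching_args(host_path, container_path):
    """Build the docker volume option and the model-server batching flags."""
    for label, path in (("host", host_path), ("container", container_path)):
        if not posixpath.isabs(path):
            raise ValueError(f"{label} path must be absolute: {path}")
    # Bind-mount the config file from the host into the container.
    docker_opts = ["-v", f"{host_path}:{container_path}"]
    # The server reads the file at its path inside the container.
    server_flags = [
        "--enable_batching=true",
        f"--batching_parameters_file={container_path}",
    ]
    return docker_opts, server_flags


# Example paths are placeholders, not taken from the thread.
docker_opts, server_flags = batching_args(
    "/home/user/batching.conf", "/models/config/batching.conf"
)
print(" ".join(docker_opts))
print(" ".join(server_flags))
```

The docker options go before the image name and the server flags after it, as in the docker run command shown later in this thread.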

Can we reload the batching config after the server starts? I see that the model config has such a function.

Same problem. I run tf-serving like this:

sudo docker run -p 8501:8501 -d --name="tf_serving" \
--mount type=bind,source=/mnt1/zhaodachuan/tf_model/push/lr,target=/models/push_lr \
-v /mnt1/zhaodachuan/tf-serving/config_file/batch_size.config:/models/config/batch_size.config \
-e MODEL_NAME=push_lr -t tensorflow/serving --enable_batching=true \
--batching_parameters_file=/models/config/batch_size.config

And my batch_size.config is:

max_batch_size { value: 1000000 }
batch_timeout_micros { value: 0 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 8 }

It is as slow as when I don't use --enable_batching=true.
