serving server does not infer batch

Created on 19 Jul 2019 · 15 Comments · Source: tensorflow/serving

OS: Ubuntu 18.04
docker image: tensorflow/serving:latest-gpu
Tensorflow: 2.0.0a0

I'm using tensorflow/serving:latest-gpu, but I don't think it is using the GPU.
A request with input shape [48, 256, 256, 3] takes about 12 seconds,
but a request with input shape [1, 256, 256, 3] takes about 0.3 seconds.

This is the request code:

import json

import numpy as np
import requests

# infer_preprocess, img, mask, and FLAGS are defined elsewhere.
# grids shape: [4, 48, 256, 256, 3]
grids, positions, inds = infer_preprocess(img, mask, FLAGS.n_grids)

headers = {'content-type': 'application/json'}
predictions = []

for grid in grids:
    # A grid shape: [48, 256, 256, 3]
    grid = grid / 127.5 - 1
    data = json.dumps({
        'signature_name': 'serving_default', 'instances': grid.tolist()
    })

    json_response = requests.post('http://10.113.66.143:30256/v1/models/new_test:predict', data=data, headers=headers)

    prediction = json.loads(json_response.text)
    print('DONE')
    try:
        prediction = np.array(prediction['predictions'])
        predictions.append(prediction)
    except KeyError:
        # On failure the response has an 'error' field instead of 'predictions'.
        print(prediction['error'])

And this is batching_parameters.conf:

num_batch_threads { value: 48 }
batch_timeout_micros { value: 5000 }
max_batch_size { value: 20000001 }

I ran server.sh:

sudo nvidia-docker run -t --rm -p 8501:8501 -v ~/models:/root/models --name serve tensorflow/serving:latest-gpu --enable_batching=true --batching_parameters_file=/root/models/batching_parameters.txt --model_config_file=/root/models/model_specific.conf

I guess the serving server runs inference on the inputs one by one.
How can I make the server run inference on a whole batch?

awaiting response performance

All 15 comments

@jusonn ,
Can you please refer to this link and confirm whether it helps? Thanks.

@rmothukuru
Thanks, but I think it's too low-level for me.

My problem is that the serving server does not run inference on batched data the way a training step does, e.g. feeding a batch of 64 examples into the model at once.

I have the same problem as you. Were you able to solve it?

What @rmothukuru posted is the library-level batching guide - if that's too low-level, please take a look at the config-level batching guide.

However, please do keep in mind that TF Serving batching works at the inter-request level: it creates a latency-aware queue to batch together requests that arrive independently before running them through the graph. This is useful if you have many independent clients calling TF Serving that cannot coordinate with one another.
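To make the idea of inter-request batching concrete, here is a minimal simulation of a timeout-based queue. It is only a sketch of the concept, not TF Serving's actual scheduler: the function name `group_into_batches`, its parameters, and the arrival times are all hypothetical, and each request is assumed to carry exactly one example.

```python
from typing import List


def group_into_batches(arrival_times_us: List[int],
                       batch_timeout_us: int,
                       max_batch_size: int) -> List[List[int]]:
    """Group request indices into batches with a simple timeout-based rule.

    A batch is opened by the first waiting request. It is closed when it
    holds max_batch_size requests, or when a request arrives at least
    batch_timeout_us after the batch was opened.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    opened_at = 0
    for idx, t in enumerate(arrival_times_us):
        # Close the open batch if this request arrives after its deadline.
        if current and t - opened_at >= batch_timeout_us:
            batches.append(current)
            current = []
        if not current:
            opened_at = t
        current.append(idx)
        # Close the batch as soon as it is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Five independent requests; arrival times in microseconds (made-up values).
    arrivals = [0, 1000, 2000, 9000, 9500]
    print(group_into_batches(arrivals, batch_timeout_us=5000, max_batch_size=4))
```

In this toy model a lone client that sends one request and waits for the reply never shares a batch with anyone, which is why the batching configuration has no visible effect in that situation.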

If you're sending multiple requests from the same client, then it makes little sense to configure batching on TF Serving. What you want instead is to stack the different examples together along the zeroth dimension and send all of them in a single request. This is what you seem to have done: you stacked 48 examples together and sent them with a single call to TF Serving. At that point nothing about the batching configuration is relevant, because you only have a single request. The entire thing gets fed into session.run at once, and from then on it is executing your graph on your hardware just as it would if you called session.run() in Python (i.e. it's out of Serving's domain).
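As a small illustration of stacking examples along the zeroth dimension, the sketch below builds one REST predict body from several per-example arrays. The helper name `build_predict_payload` and the dummy 4x4 images are assumptions made for the example; the `signature_name` and `instances` fields are the same ones used in the request code at the top of this issue.

```python
import json

import numpy as np


def build_predict_payload(examples, signature_name="serving_default"):
    """Stack per-example arrays along axis 0 and wrap them in a predict body."""
    batch = np.stack(examples, axis=0)  # shape: [N, H, W, C]
    body = {"signature_name": signature_name, "instances": batch.tolist()}
    return batch.shape, json.dumps(body)


if __name__ == "__main__":
    # Three dummy 4x4 RGB images stand in for real inputs.
    images = [np.zeros((4, 4, 3), dtype=np.float32) for _ in range(3)]
    shape, payload = build_predict_payload(images)
    print(shape)
    # The payload would then be sent with a single POST to the model's
    # :predict endpoint, as in the request code earlier in this issue.
```

With this approach the server sees one request whose first dimension is the batch size, so no server-side batching configuration is involved.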

If you're observing the latency climb linearly with the number of examples you're batching, that's a sign that some portion of your graph execution is not vectorized. It could be I/O, pre- or post-processing, or other portions that are forced onto the CPU. If you'd like us to debug and help, please provide your model and example requests and we'll take a look.
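One rough way to check whether latency grows linearly with batch size is to fit a straight line to a few measurements. The sketch below assumes latency can be modelled as a fixed cost plus a per-example cost; the function name `fit_latency` is hypothetical and the numbers are made up for illustration, not measured in this thread.

```python
import numpy as np


def fit_latency(batch_sizes, latencies_s):
    """Least-squares fit of latency = fixed_cost + per_example_cost * batch_size."""
    slope, intercept = np.polyfit(batch_sizes, latencies_s, deg=1)
    return float(intercept), float(slope)


if __name__ == "__main__":
    # Made-up measurements in seconds, for illustration only.
    sizes = [1, 8, 16, 32, 48]
    times = [0.30, 2.05, 4.05, 8.05, 12.05]
    fixed, per_example = fit_latency(sizes, times)
    print(f"fixed cost ~{fixed:.2f}s, per-example cost ~{per_example:.2f}s")
```

If the per-example cost dominates the fixed cost, most of the time is being spent on work that scales with the number of examples, which is the pattern described above.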

@unclepeddy Thanks for the precise information.
https://gist.github.com/jusonn/1e77e70d238b645ab4aa73351370a5fb
Here are my model and example request.
If more information is needed, please tell me.

@aaroey I've done some basic testing and was able to reproduce this behavior - what would be the best way of profiling the GPU to understand why batching does not help the inference time?

@unclepeddy I'm not sure how request batching works in TF Serving. As for why batching doesn't help the inference time, I think that:

  • When batching is enabled, TF Serving is still using the GPU for inference. As a quick verification, if you run a model that has a MaxPoolingOp with NCHW format (which is only supported on GPU), it should work (I verified that).
  • It's possible that the overhead of the REST layer hides the latency improvement. I ran a similar test with the official ResNet model using batch size 128, with this batching config:
    num_batch_threads { value: 16 }
    batch_timeout_micros { value: 5000 }
    max_batch_size { value: 128 }
    Processing one request using the REST API took 5s, while using gRPC took 0.57s (~10x faster).
  • With the same official ResNet model and batching config, processing 128 requests with batch size 1 took 2.24s, so batching does give a speedup.

@jusonn @unclepeddy could you try the gRPC API and see how it performs?
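One possible contributor to the REST overhead is payload size: the REST API carries tensors as JSON text, while gRPC requests carry them in a protobuf encoding that can hold raw bytes. The sketch below only compares the two sizes for a small random batch. The helper name `payload_sizes` and the stand-in shape are assumptions, and it does not measure serialization or parsing time.

```python
import json

import numpy as np


def payload_sizes(batch):
    """Return (json_bytes, raw_float32_bytes) for one batch of inputs."""
    json_bytes = len(json.dumps({"instances": batch.tolist()}).encode("utf-8"))
    raw_bytes = batch.astype(np.float32).nbytes
    return json_bytes, raw_bytes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A small stand-in batch; the input in this issue is [48, 256, 256, 3].
    batch = rng.uniform(-1.0, 1.0, size=(2, 32, 32, 3))
    json_bytes, raw_bytes = payload_sizes(batch)
    print(f"JSON: {json_bytes} bytes, raw float32: {raw_bytes} bytes")
```

Running the same comparison on the real input shape gives a rough idea of how much more data the JSON path has to encode, transfer, and parse per request.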

@aaroey
Should I optimize the graph further, for example by using the NCHW format?
If so, is there a guide that explains how to optimize a model for serving?

If you're using the official ResNet model, I think only NCHW is accepted (@aaroey please confirm).
As suggested above, can you try running it with gRPC to validate that REST isn't the bottleneck?

@unclepeddy I think both can be run on GPU, but on non-Volta architectures or in non-FP16 use cases it will always convert NHWC to NCHW, adding extra cost. However, before running the graph, the graph optimizer (grappler) should have already done the conversion at the graph level, to minimize the overall cost. Also see https://github.com/tensorflow/tensorflow/issues/8286.

I tried the gRPC API with my model, with input sizes from 1x256x256x3 to 48x256x256x3.
I measured the gRPC request time for each input size.

Request time increases as the input tensor becomes larger.
However, I am not sure whether that is because the serving server does not run inference on the batch as a whole, or because the larger input tensor makes the gRPC transfer slower.

0.09607696533203125  input_tensor_size: 1x256x256x3
0.006090879440307617
0.003573894500732422
0.005861043930053711
0.007306098937988281
0.013091087341308594
0.01860499382019043
0.016251325607299805
0.026106834411621094
0.0329890251159668
0.030529022216796875
0.035444021224975586
0.047998905181884766
0.05263996124267578
0.042489051818847656
0.054933786392211914
0.04903006553649902
0.053598880767822266
0.05814504623413086
0.056781768798828125
0.05947089195251465
0.05758523941040039
0.0656280517578125
0.07093977928161621
0.07089090347290039
0.0668189525604248
0.07346105575561523
0.07367134094238281
0.08284163475036621
0.0810396671295166
0.08105087280273438
0.0848538875579834
0.09189009666442871
0.09484314918518066
0.09187889099121094
0.09624099731445312
0.1020050048828125
0.09921789169311523
0.10622882843017578
0.10701394081115723
0.10987281799316406
0.10901403427124023
0.10963582992553711
0.11941289901733398
0.1192789077758789
0.11253595352172852
0.12174415588378906
0.12098407745361328 input_tensor_size: 48x256x256x3

There's a performance guide I'm merging very soon that should help with this.

Thanks. I just came up with a question.
Is the TF Serving server designed for online inference, i.e. running inference right away when an input arrives rather than collecting inputs into a mini-batch?

Performance guide [1] and TensorBoard inference instructions [2] are now available.

To answer your question, yes it's designed for online inference. The inter-request batching feature is not turned on by default [3].

[1] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/performance.md
[2] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/tensorboard.md
[3] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/serving_config.md#batching-configuration

Nice guide!
I didn't know that inference profiling existed, or that Prometheus could be used with TF Serving.
My questions are answered now, so I'm going to close this issue.
Thanks for the documentation.
