Serving: TF-serving Model Inference: increase inference concurrency with persistent latency

Created on 3 Oct 2019 · 5 Comments · Source: tensorflow/serving

I am serving one LSTM model, and when I send 20 concurrent requests to TF-serving, the latency of each request is not consistent.

I am serving TEXT-MODEL (LSTM)

This is my tf-serving start command:

export CUDA_VISIBLE_DEVICES=0 && export LD_LIBRARY_PATH="/usr/local/nccl2/lib" && /home/serving/serving-bin/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=8004 --model_config_file=/home/serving/serving-bin/serving_models/config_files/test.conf --enable_batching=true --enable_model_warmup=true --per_process_gpu_memory_fraction=0.9 --grpc_channel_arguments=[grpc.max_concurrent_streams=300]

I need to serve 100 req/sec per model with steady latency, but that seems quite impossible with TF-serving. Is there a flag I am missing?

What I am trying to say is: if I send 100 streaming requests concurrently to a single model running on TF-serving, the first 15 return in 300-400 ms, and then the latency of the other 80-85 requests increases (i.e., the 100th request returns its prediction after 3 sec). Basically, latency is not consistent across concurrent requests.
I am asking whether there is a flag that keeps the latency of individual requests consistent, so that all 100 requests return in 300-400 ms.
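The pattern described above is what simple queueing would produce. A rough sketch, assuming (this is not stated in the post) that the server effectively runs about 15 requests at a time and each takes about 0.4 s; `completion_times` is a hypothetical helper, and real TF Serving scheduling is more involved:

```python
import math

def completion_times(num_requests, slots, service_time_s):
    """Completion time of each request when all arrive at t=0 and the
    server runs `slots` requests at a time (assumed, simplified model)."""
    return [math.ceil((i + 1) / slots) * service_time_s
            for i in range(num_requests)]

# Assumed numbers: 15 parallel slots, 0.4 s per inference.
times = completion_times(100, 15, 0.4)
print(f"first {times[0]:.1f}s, last {times[-1]:.1f}s")  # prints: first 0.4s, last 2.8s
```

Under these assumptions the 100th request waits for six earlier waves, which lands close to the 3 sec reported.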

Also, out of necessity I can't batch the requests, even though I know that is faster.

Any help is appreciated, thanks.

awaiting response performance


gr8Adakron

Most helpful comment

Unfortunately, variance in latency, especially during concurrent request execution, is expected.

You can try using the tensorflow_model_server APT package, or building your own with the right optimization settings as per the instructions here.

Another thing you can try is setting the flags tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism, ideally to the number of CPU cores on your machine. For example:

$ tensorflow_model_server --port=8500 --rest_api_port=8501 \
--tensorflow_intra_op_parallelism=4 \
--tensorflow_inter_op_parallelism=4 \
--model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME}
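If the server is launched from a script, the core count can be filled in automatically. A minimal sketch, assuming Python is used as the launcher; `build_server_args` is a hypothetical helper (not part of TF Serving), and the model name and path are placeholders:

```python
import os

def build_server_args(model_name, model_base_path,
                      port=8500, rest_api_port=8501, cores=None):
    """Build a tensorflow_model_server argument list with both
    parallelism flags set to the machine's CPU core count."""
    cores = cores or os.cpu_count() or 1
    return [
        "tensorflow_model_server",
        f"--port={port}",
        f"--rest_api_port={rest_api_port}",
        f"--tensorflow_intra_op_parallelism={cores}",
        f"--tensorflow_inter_op_parallelism={cores}",
        f"--model_name={model_name}",
        f"--model_base_path={model_base_path}/{model_name}",
    ]

# Placeholder model name and base path; print the command instead of running it.
print(" ".join(build_server_args("my_model", "/models")))
```

The resulting list could be passed to subprocess.run once the binary is on the PATH.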

vinuraja on 11 Oct 2019


All 5 comments

@ewilderj @kchodorow @lamberta any help please?

gr8Adakron on 3 Oct 2019

@gr8Adakron Can you please provide a script that reproduces this issue? Thanks!

gowthamkpr on 3 Oct 2019

I am talking about consistent latency for any model. In case it helps, here is the client script I use to call the model:

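import numpy as np
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

class Serving():
    def __init__(self):
        """
        Client for a TF-serving gRPC endpoint.
        host: localhost/127.0.0.1
        port: 8004
        """
        super(Serving, self).__init__()
        self.host    = "127.0.0.1"
        self.port    = "8004"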

    def predictResponse_into_nparray(self,response,output_tensor_name):
        dims         = response.outputs[output_tensor_name].tensor_shape.dim
        shape        = tuple(d.size for d in dims)
        return np.reshape(response.outputs[output_tensor_name].float_val, shape)

    def create_connection(self,port):
        self.hostport = f"{self.host}:{port}"
        #..> new api-version
        # channel       = grpc.insecure_channel(self.hostport)
        # stub          = prediction_service_pb2_grpc.PredictionServiceStub(channel)

        #..> old api-version
        channel       = implementations.insecure_channel(self.host, int(port))
        stub          = prediction_service_pb2.beta_create_PredictionService_stub(channel)

        return stub

    def serving_model_prediction(self,batch_of_input):
        port_number                       = "8004"
        model_name                        = "bert"
        stub                              = self.create_connection(port_number)
        #print('\t - Enter serving_model_prediction')
        request                           = predict_pb2.PredictRequest()
        request.model_spec.name           = model_name
        request.model_spec.signature_name = tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
        inputs                            = ["label_ids", "input_ids", "input_mask", "segment_ids"]
        output_tensor                     = "output"
        for single_input in inputs:
            request.inputs[single_input].CopyFrom(
                tf.contrib.util.make_tensor_proto(batch_of_input[single_input], dtype='int64'))
        result_future = stub.Predict.future(request, 10)
        result = result_future.result()
        # result = stub.Predict(request, 30)
        #print(result)
        output                            = self.predictResponse_into_nparray(result,output_tensor)
        #print(f'\t - Exiting serving_model_prediction {output}')
        del request
        return output

If I send 500 concurrent inference requests, all 500 should return predictions within 500 ms, or whatever the best throughput of the model inference allows.
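To check whether that expectation holds, per-request latency has to be measured from a common start time. A small sketch of such a harness, using a fake predict function in place of the real gRPC stub; `measure_latencies` and `make_fake_predict` are hypothetical names, and the 4 slots and 0.02 s service time are made-up values:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latencies(predict_fn, payloads, max_workers):
    """Send all payloads concurrently; return the seconds elapsed from
    the common start until each response came back."""
    start = time.perf_counter()

    def timed_call(payload):
        predict_fn(payload)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(timed_call, payloads))

def make_fake_predict(slots, service_time_s):
    """Stand-in for a server that runs only `slots` requests at once."""
    gate = threading.Semaphore(slots)

    def fake_predict(payload):
        with gate:                      # wait for a free slot
            time.sleep(service_time_s)  # simulated inference time
        return payload

    return fake_predict

latencies = measure_latencies(make_fake_predict(4, 0.02), range(20), max_workers=20)
print(f"min {min(latencies):.3f}s  max {max(latencies):.3f}s")
```

With the real client, predict_fn would wrap serving_model_prediction; the spread between min and max shows how much queueing the requests see.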

I guess TF-Serving is a well-developed server, so this functionality ought to exist already, unless I am missing something.
Hope it helps!

gr8Adakron on 4 Oct 2019

@gowthamkpr