I am serving a single LSTM model, and when I send 20 concurrent requests to TF-Serving, the latency of each request is not consistent.
I am serving TEXT-MODEL (LSTM)
This is my tf-serving start command:
export CUDA_VISIBLE_DEVICES=0 && export LD_LIBRARY_PATH="/usr/local/nccl2/lib" && /home/serving/serving-bin/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=8004 --model_config_file=/home/serving/serving-bin/serving_models/config_files/test.conf --enable_batching=true --enable_model_warmup=true --per_process_gpu_memory_fraction=0.9 --grpc_channel_arguments=[grpc.max_concurrent_streams=300]
I need per-model inference to sustain 100 req/sec, but that seems impossible with TF-Serving. Is there a flag I am missing?
What I am trying to say is: if I send 100 streaming requests concurrently to a single model running on TF-Serving, the first 15 return output in 300-400 ms, and then the latency of the remaining 80-85 requests keeps increasing (e.g. the 100th request returns its prediction after 3 s). In other words, latency is not consistent across concurrent requests.
I am asking whether there is a flag that keeps the latency of each individual request consistent, so that all 100 requests return output in 300-400 ms.
Also, because of my requirements, I can't batch the requests. I know batching is faster, but it is not an option for me.
Any help is appreciated, thanks.
@ewilderj @kchodorow @lamberta any help please?
@gr8Adakron Can you please provide a script that lets us reproduce this issue? Thanks!
The question applies to any model; what I am after is consistent latency. If it helps, here is the client script I use to call the model:
import numpy as np
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2


class Serving():
    def __init__(self):
        """
        Connection defaults: host 127.0.0.1, port 8004
        (the port TF-Serving was started on).
        """
        super(Serving, self).__init__()
        self.host = "127.0.0.1"
        self.port = "8004"

    def predictResponse_into_nparray(self, response, output_tensor_name):
        # Reshape the flat float_val list to the shape stored in the response.
        dims = response.outputs[output_tensor_name].tensor_shape.dim
        shape = tuple(d.size for d in dims)
        return np.reshape(response.outputs[output_tensor_name].float_val, shape)

    def create_connection(self, port):
        self.hostport = f"{self.host}:{port}"
        #..> new api-version
        # channel = grpc.insecure_channel(self.hostport)
        # stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
        #..> old api-version
        channel = implementations.insecure_channel(self.host, int(port))
        stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
        return stub

    def serving_model_prediction(self, batch_of_input):
        port_number = self.port
        model_name = "bert"
        # Note: a new channel and stub are created on every call.
        stub = self.create_connection(port_number)
        request = predict_pb2.PredictRequest()
        request.model_spec.name = model_name
        request.model_spec.signature_name = tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
        inputs = ["label_ids", "input_ids", "input_mask", "segment_ids"]
        output_tensor = "output"
        # Copy each named input into the request as an int64 tensor.
        for single_input in inputs:
            request.inputs[single_input].CopyFrom(
                tf.contrib.util.make_tensor_proto(batch_of_input[single_input], dtype='int64'))
        # Issue the call asynchronously with a 10 s timeout, then block on the result.
        result_future = stub.Predict.future(request, 10)
        result = result_future.result()
        # result = stub.Predict(request, 30)
        output = self.predictResponse_into_nparray(result, output_tensor)
        del request
        return output
If I send 500 concurrent inference requests, all 500 should return predictions within 500 ms, or within whatever time the model's best inference throughput allows.
I assume TF-Serving is a mature serving system, so this capability ought to exist already, unless I am missing something.
hope it helps!
@gowthamkpr
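To make the behaviour described above easier to reproduce, a small harness along these lines could record per-request latency under concurrency. This is only a sketch: `measure_latencies` and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in that simulates a server handling one request at a time. In a real test it would be replaced by the actual client call, for example `Serving().serving_model_prediction`.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def measure_latencies(predict_fn, payloads, max_workers):
    """Call predict_fn once per payload from a thread pool.

    Returns the wall-clock latency of each call, in seconds.
    """
    def timed_call(payload):
        start = time.perf_counter()
        predict_fn(payload)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(timed_call, payloads))


# Hypothetical stand-in for the real gRPC call (assumption): a "server"
# that can only work on one request at a time.
_server_lock = threading.Lock()


def fake_predict(payload, service_time=0.01):
    with _server_lock:
        time.sleep(service_time)


if __name__ == "__main__":
    latencies = sorted(measure_latencies(fake_predict, range(20), max_workers=20))
    median = latencies[len(latencies) // 2]
    print(f"min={latencies[0] * 1000:.0f} ms  "
          f"median={median * 1000:.0f} ms  "
          f"max={latencies[-1] * 1000:.0f} ms")
```

With the stand-in, the spread between the minimum and maximum latency grows with the number of concurrent callers, which is the pattern reported in this thread.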
Unfortunately, variance in latency, especially during concurrent request execution, is expected.
You can try using the tensorflow_model_server APT package, or building your own with the right optimization settings as per the instructions here.
Another thing you can try is setting the flags tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism, ideally to the number of CPU cores in your machine. For example:
$ tensorflow_model_server --port=8500 --rest_api_port=8501 \
--tensorflow_intra_op_parallelism=4 \
--tensorflow_inter_op_parallelism=4 \
--model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME}
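One way to see why latency variance under concurrency is expected is a back-of-the-envelope queueing model. The sketch below assumes, purely for illustration, that the server works through simultaneous arrivals in waves of a fixed size and that each wave takes a fixed time. The function name `completion_time` and the values 15 and 0.4 s are assumptions chosen to mirror the numbers reported in this thread, not measured properties of TF-Serving.

```python
import math


def completion_time(k, concurrency, service_time):
    """Time in seconds at which the k-th request (1-indexed) finishes,
    if simultaneous arrivals are served in waves of `concurrency` requests
    and each wave takes `service_time` seconds."""
    return math.ceil(k / concurrency) * service_time


# Assumed values, chosen only to mirror the numbers reported in this thread.
CONCURRENCY = 15     # requests the server handles at once (assumption)
SERVICE_TIME = 0.4   # seconds per wave (assumption)

for k in (1, 15, 16, 50, 100):
    t = completion_time(k, CONCURRENCY, SERVICE_TIME)
    print(f"request {k:3d} finishes at ~{t:.1f} s")
```

Under these assumptions the first 15 requests finish at about 0.4 s and the 100th at about 2.8 s, close to the roughly 3 s reported above. In this model, later requests wait in a queue, so raising the server's effective parallelism shortens the tail but does not make all latencies equal.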