Serving: Serving large models

Created on 23 Jul 2020 · 10 Comments · Source: tensorflow/serving

System information

Tensorflow - 2.3.0-rc1
CUDA-10
TensorRT-6

Describe the problem

I am trying to convert a GPT2 model; the saved model is about 1.9 GB. This causes a problem when I try to deploy it with TF Serving, because it hits the 1 GB protobuf limit. I also tried not building the TRT engines before deployment, but that did not affect the size of saved_model.pb. Is there a reason for the limit on the protobuf? If not, is there a way to increase it?

Additional context

To speed up serving, I am forced to use a TF-TRT saved model.

awaiting response bug

bharatv007

👍1

Most helpful comment

System information

Tensorflow - 2.3.0-rc1
CUDA-10
TensorRT-6
TF Serving : from docker image tensorflow/serving:2.2.0-gpu

Describe the problem

I am converting a GPT2 model with TF-TRT for optimized inference; the saved model is about 1.9 GB after conversion by TensorRT. This causes a problem when I try to deploy it with TF Serving, because it hits the 1 GB protobuf limit. I also tried not building the TRT engines before deployment, but that did not affect the size of saved_model.pb. Is there a reason for the limit on the protobuf? If not, is there a way to increase it?
Here is the link to my models.
https://drive.google.com/drive/folders/1EAXCqySLBqLMek7iBHis7LamUn8KRyFo?usp=sharing

Source code / logs

This is the code I use to convert any saved model with TF-TRT. It outputs a TF saved model with TRT engines (the engines are built the first time the model is called and are then cached). I have verified that my converted model works, and for larger batches I see a 20% gain in speed. I want to serve the same model using TF Serving.

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt


def trt_convert(saved_model_dir, output_dir, precision='FP32'):
    """Convert a TF saved model to a TF-TRT saved model. Only runs on Nvidia GPUs.

    Args:
        saved_model_dir (str): directory of the TF saved model
        output_dir (str): output directory for the TF-TRT model
        precision (str, optional): precision of the converted model. Defaults to 'FP32'.
    """
    params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
        maximum_cached_engines=10000,
        precision_mode=precision,
        use_calibration=(precision == 'INT8'),  # calibration is only used for INT8
        max_workspace_size_bytes=(1 << 32))
    converter = trt.TrtGraphConverterV2(input_saved_model_dir=saved_model_dir,
                                        conversion_params=params)
    converter.convert()
    converter.save(output_dir)


model_dir = 'sample_tf'
output_dir = 'sample_tf/converted'
trt_convert(model_dir, output_dir, precision='FP16')  # converts the model

# Load the converted model and test it
saved_model_loaded = tf.saved_model.load(output_dir)
graph_func = saved_model_loaded.signatures['serving_default']
inputs = tf.random.uniform((4, 20), maxval=1000, dtype=tf.int64)
graph_func(context=inputs,
           n_samples=tf.constant(1),
           next_n=tf.constant(15),
           temperature=tf.constant(1.0))

I tested my converted model and it runs without any issues. However, when I try to deploy it with TF Serving, I get this error:
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/io/coded_stream.cc:192] A protocol message was rejected because it was too big (more than 1073741824 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in net/proto2/io/public/coded_stream.h.
2020-07-17 14:45:45.677211: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:364] SavedModel load for tags { serve }; Status: fail: Data loss: Can't parse /mnt/model/1/saved_model.pb as binary proto. Took 929926 microseconds.

Other attempts

I tried exporting a smaller model (<1 GB) and I was able to serve it.

bharatv007 on 28 Jul 2020

👍3
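
The limit quoted in the error log, 1073741824 bytes, is 2^30 bytes (1 GiB). A minimal sketch of a check that could be run before deployment, assuming the file that matters is saved_model.pb inside a versioned export directory such as /mnt/model/1. The function name and the `limit` parameter are hypothetical helpers for illustration, not part of TF Serving:

```python
import os

# Limit taken from the libprotobuf error message: 1073741824 bytes = 2**30.
PROTOBUF_LIMIT_BYTES = 1 << 30


def exceeds_protobuf_limit(export_dir, limit=PROTOBUF_LIMIT_BYTES):
    """Return True if export_dir/saved_model.pb is larger than `limit` bytes."""
    pb_path = os.path.join(export_dir, "saved_model.pb")
    return os.path.getsize(pb_path) > limit
```

Calling `exceeds_protobuf_limit("/mnt/model/1")` on the 1.9 GB export described above would be expected to return True, while the smaller model that served successfully would return False.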

All 10 comments

@bharatv007,
Can you please elaborate on the issue you are facing? To expedite the troubleshooting process, please provide a code snippet that reproduces the issue reported here. Thanks!

rmothukuru on 28 Jul 2020

👍1

@bharatv007 This is similar to issue #1686. Can we please close this issue and track it in a single place? Let me know if you think otherwise. Thanks!

gowthamkpr on 7 Aug 2020

As mentioned in another thread, the limitation comes from TensorFlow, which inherits it from protobuf. The limitation is on the .pb file. If this is caused by a large constant, you could use a tf.Variable instead and load the value via an assign. That avoids including the value in the .pb file.

shadowdragon89 on 7 Aug 2020
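
The tf.Variable suggestion above can be sketched as follows. This is an illustration on a plain SavedModel export only, not on the TF-TRT converter output; the `Lookup` module, its table shape, and the random weights are hypothetical stand-ins, and it assumes TensorFlow 2.x with NumPy installed:

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


class Lookup(tf.Module):
    """Hypothetical module that holds a large table as a variable."""

    def __init__(self, rows, cols):
        super().__init__()
        # Placeholder zeros; the real values are assigned after construction.
        self.table = tf.Variable(tf.zeros([rows, cols]), trainable=False)

    @tf.function(input_signature=[tf.TensorSpec([None], tf.int64)])
    def __call__(self, ids):
        return tf.gather(self.table, ids)


weights = np.random.rand(4096, 64).astype(np.float32)  # stand-in for real weights
module = Lookup(*weights.shape)
module.table.assign(weights)  # load the values via assign

export_dir = tempfile.mkdtemp()
tf.saved_model.save(module, export_dir)

# The variable values go to the variables/ folder, so saved_model.pb stays small.
pb_size = os.path.getsize(os.path.join(export_dir, "saved_model.pb"))
print("saved_model.pb bytes:", pb_size, "weights bytes:", weights.nbytes)
```

Had the table been created with tf.constant inside the traced function instead, its bytes would be serialized into the graph and counted against the .pb size.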

I noticed that when I save the model as a TensorFlow saved model, it creates a saved_model.pb file plus variables and assets folders. The saved_model.pb is 688 KB, the size of the variables folder matches the model size, and assets is empty. When I convert it with TensorRT, it creates the same three, this time with the TRT engines in the assets folder, but saved_model.pb is 1.9 GB. I assume TensorRT saves the model weights as constants in the .pb file rather than as variables.

bharatv007 on 10 Aug 2020
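
The comparison described above (saved_model.pb versus the variables and assets folders) can be reproduced with a small script. A minimal sketch using only the standard library; the function names are hypothetical, and the directory layout is the one described in the comment:

```python
import os


def dir_size(path):
    """Total size in bytes of all files under `path` (0 if it does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def saved_model_breakdown(export_dir):
    """Sizes in bytes of the three parts of a SavedModel directory."""
    pb_path = os.path.join(export_dir, "saved_model.pb")
    return {
        "saved_model.pb": os.path.getsize(pb_path) if os.path.exists(pb_path) else 0,
        "variables": dir_size(os.path.join(export_dir, "variables")),
        "assets": dir_size(os.path.join(export_dir, "assets")),
    }
```

Running `saved_model_breakdown` on both the original and the converted export directories shows where the bytes moved after conversion.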

This seems to be caused by the TF-TRT converter storing the variables in the NodeDef. It would be good to get an answer on why the converter does so and whether it can be changed.

shadowdragon89 on 27 Aug 2020
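
One way to check whether the weights ended up inside NodeDefs is to list the largest Const nodes in saved_model.pb. A rough sketch, assuming the `tensorflow` Python package is installed; `largest_constants` is a hypothetical helper, and it is untested on files near 2 GB, where the Python protobuf parser may have limits of its own:

```python
from tensorflow.core.protobuf import saved_model_pb2


def largest_constants(pb_path, top_k=5):
    """Return (serialized_bytes, node_name) for the largest Const nodes."""
    saved_model = saved_model_pb2.SavedModel()
    with open(pb_path, "rb") as f:
        saved_model.ParseFromString(f.read())

    sizes = []
    for meta_graph in saved_model.meta_graphs:
        graph_def = meta_graph.graph_def
        nodes = list(graph_def.node)
        # Function bodies live in the graph's function library.
        for function in graph_def.library.function:
            nodes.extend(function.node_def)
        for node in nodes:
            if node.op == "Const":
                sizes.append((node.attr["value"].tensor.ByteSize(), node.name))
    return sorted(sizes, reverse=True)[:top_k]
```

If a few Const nodes account for most of the file size, that supports the reading that the converter froze the variables into the graph.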

@bharatv007,
Can you please respond to @shadowdragon89's comment above? Thanks!

rmothukuru on 18 Sep 2020

@shadowdragon89 I do not understand what you mean. Can you please elaborate on that?

bharatv007 on 18 Sep 2020

Can you please test this with the 2.4.0-rc3 GPU docker images? The upcoming Serving 2.4.0 should have the fix from tensorflow/tensorflow@dc3099c.