Serving: Serving large models

Created on 23 Jul 2020 · 10 Comments · Source: tensorflow/serving

System information

Tensorflow - 2.3.0-rc1
CUDA-10
TensorRT-6

Describe the problem

I am trying to convert a GPT2 model; the saved model is about 1.9 GB. This causes a problem when I try to deploy it with TF Serving, because it hits the protobuf limit of 1 GB. I also tried not building the TRT engines before deployment, but that did not affect the size of saved_model.pb. Is there a reason for the limit on the protobuf? If not, is there a way to increase it?

Additional context

To speed up serving, I am forced to use a TF-TRT saved model.

awaiting response bug

Most helpful comment

System information

Tensorflow - 2.3.0-rc1
CUDA-10
TensorRT-6
TF Serving : from docker image tensorflow/serving:2.2.0-gpu

Describe the problem

I am converting a GPT2 model with TF-TRT for optimized inference; the saved model is about 1.9 GB after conversion by TensorRT. This causes a problem when I try to deploy it with TF Serving, because it hits the protobuf limit of 1 GB. I also tried not building the TRT engines before deployment, but that did not affect the size of saved_model.pb. Is there a reason for the limit on the protobuf? If not, is there a way to increase it?
Here is the link to my models.
https://drive.google.com/drive/folders/1EAXCqySLBqLMek7iBHis7LamUn8KRyFo?usp=sharing

Source code / logs

This is the code to convert any saved model using TF-TRT; it outputs a TF saved model with TRT engines (each engine is built the first time it is called and is then cached). I have verified that my converted model works, and for larger batches I see a 20% gain in speed. I want to serve the same model using TF Serving.

import tensorflow as tf
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt
def trt_convert(saved_model_dir, output_dir, precision='FP32'):
    """To convert TF saved model to TF-TRT saved model. Only runs on Nvidia GPUs.

    Args:
        saved_model_dir (str): directory to tf saved model
        output_dir (str): output directory to save TF-TRT model
        precision (str, optional): Precision of the converted model. Defaults to 'FP32'.
    """

    params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(maximum_cached_engines=10000,
                                                        precision_mode=precision,
                                                        use_calibration=precision == 'INT8',
                                                        max_workspace_size_bytes=(1 << 32))
    converter = trt.TrtGraphConverterV2(input_saved_model_dir=saved_model_dir, conversion_params=params)
    converter.convert()
    converter.save(output_dir)
    return

model_dir = 'sample_tf'
output_dir = 'sample_tf/converted'
trt_convert(model_dir, output_dir, precision='FP16') # converts the model
# Load the converted model and test it
saved_model_loaded = tf.saved_model.load(output_dir)
graph_func = saved_model_loaded.signatures['serving_default']
inputs = tf.random.uniform((4, 20), maxval=1000, dtype=tf.int64)
graph_func(context=inputs,
           n_samples=tf.constant(1),
           next_n=tf.constant(15),
           temperature=tf.constant(1.0))

When I test my converted model, it runs without any issues. But when I try to deploy it with TF Serving, I get this error:
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/io/coded_stream.cc:192] A protocol message was rejected because it was too big (more than 1073741824 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in net/proto2/io/public/coded_stream.h.
2020-07-17 14:45:45.677211: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:364] SavedModel load for tags { serve }; Status: fail: Data loss: Can't parse /mnt/model/1/saved_model.pb as binary proto. Took 929926 microseconds.

Other attempts

I tried exporting a smaller model <1GB and I was able to serve it.
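
Whether an export will hit this error can be checked before deployment by comparing the size of saved_model.pb with the limit quoted in the error message (1073741824 bytes). The following is a minimal sketch using only the standard library. The helper name `exceeds_protobuf_limit` is made up for illustration, and it assumes the usual SavedModel layout with saved_model.pb at the top of the export directory.

```python
import os

# Limit quoted in the libprotobuf error message above (1 GiB).
PROTOBUF_LIMIT_BYTES = 1073741824


def exceeds_protobuf_limit(export_dir, limit=PROTOBUF_LIMIT_BYTES):
    """Return True if saved_model.pb in export_dir is larger than limit bytes.

    Hypothetical helper for illustration: it only looks at the file size
    and does not parse the protobuf.
    """
    pb_path = os.path.join(export_dir, "saved_model.pb")
    return os.path.getsize(pb_path) > limit
```

For the converted model in this report, `exceeds_protobuf_limit('sample_tf/converted')` would be expected to return True, given the reported 1.9 GB file.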

All 10 comments

@bharatv007,
Can you please elaborate on the issue you are facing? To expedite the troubleshooting process, please provide a code snippet that reproduces the issue reported here. Thanks!

@bharatv007 This is similar to issue #1686. Can we please close this issue and track it in a single place? Let me know if you think otherwise. Thanks!

As mentioned in another thread, the limitation comes from TensorFlow, which inherits it from protobuf. The limit applies to the .pb file. If the size is caused by a large constant, you could use a tf.Variable instead and load the value via an assign. That keeps the value out of the .pb file.
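
A rough sketch of the tf.Variable suggestion, for a model you export yourself (it does not change what the TF-TRT converter writes). The `Embedding` module, its table shape, and the random values are all invented for illustration. The point is only that a value held in a tf.Variable and loaded with assign() ends up in the variables folder rather than in saved_model.pb.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


class Embedding(tf.Module):
    """Toy module (hypothetical) whose large table is a tf.Variable."""

    def __init__(self, rows, cols):
        super().__init__()
        # Placeholder zeros; the real values are loaded with assign() below.
        self.table = tf.Variable(tf.zeros([rows, cols]), trainable=False)

    @tf.function(input_signature=[tf.TensorSpec([None], tf.int32)])
    def __call__(self, ids):
        return tf.gather(self.table, ids)


rows, cols = 2000, 128
module = Embedding(rows, cols)
# Load the (here random) values into the variable instead of baking them
# into the graph as a constant.
module.table.assign(np.random.rand(rows, cols).astype(np.float32))

export_dir = tempfile.mkdtemp()
tf.saved_model.save(module, export_dir)

table_bytes = rows * cols * 4  # float32 values
pb_bytes = os.path.getsize(os.path.join(export_dir, "saved_model.pb"))
print("saved_model.pb bytes:", pb_bytes, "table bytes:", table_bytes)
```

In this sketch saved_model.pb should come out much smaller than the table itself, because the table values are written to the variables folder of the export.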

I noticed that when I save the model as a TensorFlow SavedModel, it creates saved_model.pb plus variables and assets folders. saved_model.pb is 688 KB, the variables folder matches the model size, and assets is empty. When I convert it with TensorRT, it creates the same three, this time with TRT engines in the assets folder, but saved_model.pb is 1.9 GB. I assume TensorRT saves the weights as constants, not variables, in the .pb file.
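
The comparison described above can be repeated for any export directory with a few lines of standard-library Python. This is only a sketch: `saved_model_sizes` is a hypothetical helper name, and it simply sums file sizes per top-level entry (saved_model.pb, variables, assets) without inspecting their contents.

```python
import os


def saved_model_sizes(export_dir):
    """Return {top-level entry name: total size in bytes} for export_dir."""
    sizes = {}
    for entry in sorted(os.listdir(export_dir)):
        path = os.path.join(export_dir, entry)
        if os.path.isdir(path):
            # Sum every file below the folder (e.g. variables/, assets/).
            total = 0
            for root, _dirs, files in os.walk(path):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
            sizes[entry] = total
        else:
            sizes[entry] = os.path.getsize(path)
    return sizes
```

Running it on both the original and the converted directory (for example 'sample_tf' and 'sample_tf/converted' from the script above) shows where the bytes moved between the two exports.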

This seems to be caused by the TRT converter storing the variables in the NodeDef. It would be good to get an answer on why the converter does so and whether it can be changed.

@bharatv007,
Can you please respond to @shadowdragon89's comment above? Thanks!

@shadowdragon89 I do not understand what you mean. Can you please elaborate on that?

Can you please test this with the 2.4.0-rc3 GPU Docker images? The upcoming Serving 2.4.0 should include the fix from tensorflow/tensorflow@dc3099c.
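
When retesting with a newer image, one way to confirm that the model actually loaded is to query TF Serving's REST model status endpoint (`/v1/models/<name>`). The sketch below makes several assumptions: the REST port is 8501, the server runs on localhost, and the model name passed in (for example 'gpt2') matches the name the server was started with. The helper names are invented for illustration.

```python
import json
import urllib.request


def model_status_url(model_name, host="localhost", port=8501):
    """Build the model status URL (host and port are assumptions)."""
    return "http://{}:{}/v1/models/{}".format(host, port, model_name)


def available_versions(status):
    """Return the version strings reported as AVAILABLE in a parsed status response."""
    return [
        v["version"]
        for v in status.get("model_version_status", [])
        if v.get("state") == "AVAILABLE"
    ]


def fetch_status(model_name, host="localhost", port=8501):
    """Fetch and parse the status JSON from a running server."""
    with urllib.request.urlopen(model_status_url(model_name, host, port)) as resp:
        return json.load(resp)
```

With a server running, `available_versions(fetch_status('gpt2'))` returns an empty list if no version of the model loaded successfully, which is what a load failure such as the one reported above would produce.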

Thanks, I will try that and get back.
