Tensorrt: How to deploy an ONNX model with int8 calibration?

Created on 18 May 2020  ·  16 Comments  ·  Source: NVIDIA/TensorRT

Hello, I'm trying to do INT8 calibration on an ONNX model with the C++ API. I see there are INT8 samples for Caffe models and an ONNX MNIST sample, but how do I quantize an ONNX model? Are there any samples or guidance to follow? Thank you.

INT8 QAT

All 16 comments

Hi,

For INT8 calibration, you'll need to provide your own calibration data and implement an Int8 Calibrator. There's a decent example of some of those things here: https://github.com/rmccorm4/tensorrt-utils/tree/master/classification/imagenet

You can also try quantization-aware training (QAT) when training in the original framework, such as TF, and export this to ONNX with a tool like tf2onnx. I believe there is some support for these FakeQuant* nodes in both tf2onnx and TensorRT: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qat-tf

@rmccorm4 Hi, thank you for your reply. It helped, and my work is going well now. The INT8 version of PSENet is only 30 ms faster than the FP32 version on a V100, which is a smaller speedup than I expected. I wish you could give more detailed instructions on INT8 calibration. The BatchStream class you offer only supports images in .batch or .ppm format, which is definitely not user-friendly. Thank you for your help again : )

The link I referenced actually expects .jpeg images out of the box, which is the ImageNet dataset's format: https://www.github.com/rmccorm4/tensorrt-utils/tree/master/classification%2Fimagenet%2FImagenetCalibrator.py

But all it does is read the images into numpy arrays and normalize them. You can similarly represent any kind of data as numpy arrays/matrices/etc.; you'll just have to tweak the code a little bit.

@rmccorm4 Yeaaah, but I'm working with the C++ API : ) What I'm trying to say is that the developer guide and samples don't cover certain cases. For example, I'm trying to do INT8 calibration on an ONNX model with the C++ API. From your documentation I can't figure out how to feed in a stream of .jpg images, or whether I should build the INT8 engine in onnx2TRTmodel() or read the calibration table in loadTRTmodel(). There are many things we need to figure out ourselves. It would be better if there were more detailed instructions, but you guys still do a great job. Thank you!

Hi @le8888e ,

Since calibration is typically done offline, my personal recommendation is that using Python will be faster and easier, as there are many tools and libraries to load and normalize data (numpy, tensorflow, pytorch, pycuda, etc.).

You can calibrate using the python API, save the calibration cache to a file, and then load the calibration cache later in C++ if you wish. You can also do the calibration with the C++ API, but I just think it's a bit more complicated to handle the data, and typically requires setting up OpenCV and other libraries for the average use case.

I don't have much experience with the C++ API outside of inference (load engine, create context, infer inputs). Everything before the inference stage (parse model, create network, set builder flags, build engine, save engine to file, etc.) can typically be done offline, and therefore with the Python API (or even trtexec). You can then serialize your engine to a file and load it at runtime with the C++ API.

Hi @rmccorm4 ,

So in the engine-building phase, we call
config->setFlag(BuilderFlag::kINT8);
and
buildEngineWithConfig(*network, *config);
then save the engine to a file.

In the inference phase, we load the engine from local disk. If I follow the steps above, will inference run in INT8 mode? In my experiment INT8 runs only slightly faster than FP32, so I'm wondering whether some steps are missing.
Here is my main code; would you mind checking it? : )
tensorRT.txt
Also, in which phase does TensorRT read the calibration table from local disk? When a calibrator is created, readCalibrationCache() will be called. But in the inference phase, creating the engine from a file does not create a calibrator, so is the calibration data saved in the engine file?

Thank you.

I have a question. The official documentation gives this example:

To run the AlexNet network on DLA using trtexec in INT8 mode, issue:
./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --useDLACore=1 --int8 --allowGPUFallback

Does this official example convert the model directly to INT8? If I then add --saveEngine=AlexNet.trt, does that mean AlexNet.trt is already a quantized model?

If so, why do I need my own dataset and calibration table? Or is the official --int8 option only meant for testing?

Hi @rmccorm4 , I've generated an offline calibration table with your Python scripts, and that raises a question: since a cache already exists, how can I load the calibration table with the C++ API without implementing an Int8 Calibrator? My code has to run on different platforms, so I cannot just export offline engines with trtexec. Looking forward to your reply, thank you.

Hi @llk2why ,

(For others) If your use case can generate engines offline, you can just read in the calibration cache using trtexec:

trtexec --fp16 --int8 --calib=<calibration_cache_file> --onnx=model.onnx

Regarding "My code has to run on different platforms, so I cannot just export offline engines with trtexec":

You can implement a very simple/minimal calibrator, where I believe the only methods you actually need to implement are readCalibrationCache and writeCalibrationCache.

For every other method, I believe you can just give a dummy implementation that makes it clear a pre-calibrated cache file is expected:

{
    throw std::runtime_error{"Not Implemented"};
}
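
A minimal sketch of the cache-only calibrator described above. To keep it compilable without the TensorRT headers, `CalibratorInterface` is a hypothetical stand-in that mirrors the method names of `nvinfer1::IInt8EntropyCalibrator2`; in a real build you would derive from the TensorRT class instead and match the exact signatures of your TensorRT version. `CacheOnlyCalibrator` and its constructor argument are illustrative names, not part of TensorRT. If your version declares these methods `noexcept`, return `false` from `getBatch` rather than throwing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for nvinfer1::IInt8EntropyCalibrator2 so that this
// sketch compiles without TensorRT. Derive from the real class in practice.
struct CalibratorInterface {
    virtual ~CalibratorInterface() = default;
    virtual std::int32_t getBatchSize() const = 0;
    virtual bool getBatch(void* bindings[], const char* names[], std::int32_t nbBindings) = 0;
    virtual const void* readCalibrationCache(std::size_t& length) = 0;
    virtual void writeCalibrationCache(const void* cache, std::size_t length) = 0;
};

// Calibrator that never feeds data; it only hands back an existing cache file.
class CacheOnlyCalibrator : public CalibratorInterface {
public:
    explicit CacheOnlyCalibrator(std::string cachePath) : mCachePath(std::move(cachePath)) {
        assert(!mCachePath.empty());
    }

    // Queried by the builder even when a cache is present.
    std::int32_t getBatchSize() const override { return 1; }

    // No calibration data is available, so asking for a batch is an error.
    bool getBatch(void* bindings[], const char* names[], std::int32_t nbBindings) override {
        (void)bindings; (void)names; (void)nbBindings;
        throw std::runtime_error{"Not Implemented: a pre-calibrated cache file is required"};
    }

    // Load the whole cache file into memory and return a pointer to it.
    // Returns nullptr with length 0 when the file is missing or empty.
    const void* readCalibrationCache(std::size_t& length) override {
        mCache.clear();
        std::ifstream in(mCachePath, std::ios::binary);
        if (in) {
            mCache.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
        }
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    // The cache already exists on disk, so there is nothing to write.
    void writeCalibrationCache(const void* cache, std::size_t length) override {
        (void)cache; (void)length;
    }

private:
    std::string mCachePath;
    std::vector<char> mCache;  // must outlive the pointer returned above
};
```

The buffer is kept as a member because the pointer returned from `readCalibrationCache` has to stay valid after the call returns.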

Some extra notes:

  1. You may also be able to just use/call the trtexec source code in your application: https://github.com/NVIDIA/TensorRT/blob/master/samples/opensource/trtexec/trtexec.cpp
  2. Sample calibrator implementation here: https://github.com/NVIDIA/TensorRT/blob/master/samples/common/EntropyCalibrator.h. The type of calibrator used/implemented here should not matter, as the scales are already fixed in the calibration cache file and are simply read in by your calibrator implementation to set the dynamic ranges of each tensor to (-scale, +scale).
  3. Calibrator implementation used by trtexec here: https://github.com/NVIDIA/TensorRT/blob/master/samples/common/sampleEngines.cpp#L157-L252

You should be able to take out the calibrator parts from some of these links above and use your calibration cache file.
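
To illustrate what a fixed per-tensor dynamic range means in practice, here is a generic sketch of symmetric INT8 quantization. It is not TensorRT's internal code: the function names, the `amax` parameter (the absolute bound of the range), and the saturation to [-128, 127] are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Generic symmetric INT8 quantization: the dynamic range [-amax, +amax]
// is mapped linearly onto the integer range, saturating outside it.
std::int8_t quantizeSymmetric(float x, float amax) {
    assert(amax > 0.0f);
    const float step = amax / 127.0f;           // real-valued width of one INT8 step
    const float q = std::nearbyint(x / step);   // round to the nearest integer
    return static_cast<std::int8_t>(std::min(127.0f, std::max(-128.0f, q)));
}

// Map a quantized value back to an approximate real value.
float dequantizeSymmetric(std::int8_t q, float amax) {
    assert(amax > 0.0f);
    return static_cast<float>(q) * (amax / 127.0f);
}
```

Values beyond the range saturate, so a range that is too narrow clips activations while one that is too wide wastes resolution. That is why the ranges stored in the cache matter more than which calibrator class reads them back.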

@rmccorm4 Thank you, it seems to make sense, I will give it a try right now.

@rmccorm4 It works, with one addition to this:

"You can implement a very simple/minimal calibrator, where I believe the only methods you actually need to implement are readCalibrationCache and writeCalibrationCache."

getBatchSize was called as well, so I just implemented it with return 1;. I am not sure whether the batch size has any side effects.

Thanks for the update @llk2why. Going to close this as the description and resolution seemed to work for you.

Hi @le8888e ,

I'm trying to convert an ONNX model (UNet in my case) to an INT8 engine with C++, as you did.
I've searched for a while but couldn't find any example of creating a calibration file with C++ (I only found many in Python).
Since I'm on a Windows computer and cannot install TensorRT with Python, that work can't be done the way I expected.
I'd like to know how you created the calibration file from images and completed the INT8 engine conversion in C++.
Could you please share how to do this in detail?
Thanks in advance for any help or advice!

Hi @cocoyen1995 ,

First, you need to implement a class int8EntroyCalibrator like the one in this file:
tensorRT.txt

Then, in the step that converts the ONNX model to a TRT engine, declare an instance of int8EntroyCalibrator, like
calibrator = new int8EntroyCalibrator(maxBatchSize, calibration_images, calibration_table_save_path);

Then pass the calibrator to the config with config->setInt8Calibrator(calibrator);

config is declared by auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

Remember that you have to apply exactly the same image preprocessing during calibration and inference. You can refer to the function prepareImage in the file I uploaded.
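
One way to keep calibration and inference consistent is to route both through a single preprocessing function. The sketch below is illustrative and is not taken from the attached tensorRT.txt: the name `preprocessHwcToChw`, the interleaved 8-bit HWC input layout, and the `scale`/`mean` parameters are all assumptions. It converts HWC bytes into a planar CHW float buffer; use whatever layout and normalization your model was actually trained with.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an interleaved 8-bit HWC image to a planar float CHW buffer,
// applying out = pixel * scale - mean to every value.
std::vector<float> preprocessHwcToChw(const std::vector<std::uint8_t>& hwc,
                                      std::size_t height, std::size_t width,
                                      std::size_t channels, float scale, float mean) {
    assert(hwc.size() == height * width * channels);
    std::vector<float> chw(hwc.size());
    for (std::size_t h = 0; h < height; ++h) {
        for (std::size_t w = 0; w < width; ++w) {
            for (std::size_t c = 0; c < channels; ++c) {
                const std::size_t src = (h * width + w) * channels + c;  // HWC index
                const std::size_t dst = (c * height + h) * width + w;    // CHW index
                chw[dst] = static_cast<float>(hwc[src]) * scale - mean;
            }
        }
    }
    return chw;
}
```

With `scale = 1.0f / 255.0f` and `mean = 0.0f` this maps pixels to [0, 1]; with `scale = 1.0f / 127.5f` and `mean = 1.0f` it maps them to [-1, 1]. Whichever pair matches training should be used both when the calibrator loads its batches and on the inference path.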

For more details, you can refer to TensorRT's official INT8 example code.

Hope this helps. Feel free to write in Chinese if you prefer, since my English is not that good and may confuse you, lol.

Hi, @le8888e ,

Thanks for your quick reply!
I'll try that out and let you know the result ^^
Have a nice weekend!
If I run into other problems along the way I'll ask again. Thanks in advance!

Hi @le8888e ,

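Sorry to bother you again. I just took a close look at the code you provided, and I have a question about the prepareImage() function:
if my model was trained with HWC ordering, and the inference program I have written so far also uses HWC, do I need to switch to CHW?
Also, if my model's input was originally normalized with 1/255.0, do I still need to change it to 1/127.5?
(Or, if it's convenient, could I add you on WeChat to ask in more detail? My ID is cocococoyenyen. This part of the discussion seems quite detailed, and I'm not sure whether it's appropriate to ask here.)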
