Tensorrt: How to get INT8 calibration cache format in TensorRT?

Created on 18 Jun 2020 · 10 Comments · Source: NVIDIA/TensorRT

Hi,

I am working on converting a floating-point deep model to an INT8 model using TensorRT. Instead of having TensorRT generate the cache file, I would like to generate my own cache file for TensorRT to use for calibration. However, the open-source TensorRT codebase does not give much detail about the calibration cache file format. Can anyone give some hints?

Thanks

INT8 question

liming312

❤1 👍1

Most helpful comment

Hi @liming312 ,

I'm not sure how deep the open source portion goes, but your best bet is probably to look through https://github.com/onnx/onnx-tensorrt/blob/master/builtin_op_importers.cpp and track down what each op you're interested in maps to during conversion.

rmccorm4 on 23 Jun 2020

👍2

All 10 comments

Hi @liming312,

I believe the only unintuitive part of the Calibration Cache file is the hex values next to each layer. These are explained in more detail in this thread: https://forums.developer.nvidia.com/t/calibrationtable-and-executable-engine/109352/6
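To illustrate the hex values mentioned above: as far as can be told from the linked forum thread, each one is the IEEE-754 bit pattern of a float32 scale, and the tensor's dynamic range is roughly scale * 127. The sketch below converts in both directions under that assumption. The function names are hypothetical helpers, not part of the TensorRT API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <sstream>
#include <string>

// Decode the 8-digit hex string found next to a tensor name into the
// float32 value whose IEEE-754 bit pattern it spells out.
inline float hexToScale(const std::string& hex)
{
    assert(hex.size() == 8);
    const std::uint32_t bits = static_cast<std::uint32_t>(std::stoul(hex, nullptr, 16));
    float scale = 0.0f;
    std::memcpy(&scale, &bits, sizeof(scale)); // reinterpret the bits, no numeric conversion
    return scale;
}

// Encode a float32 scale as 8 lowercase hex digits.
inline std::string scaleToHex(float scale)
{
    std::uint32_t bits = 0;
    std::memcpy(&bits, &scale, sizeof(bits));
    std::ostringstream out;
    out << std::hex << std::setw(8) << std::setfill('0') << bits;
    return out.str();
}

// Assumption (from the linked thread, not verified against TensorRT source):
// scale = absMax / 127, so the symmetric range is [-scale * 127, +scale * 127].
inline float scaleToAbsMax(float scale)
{
    return scale * 127.0f;
}
```

For example, `scaleToHex(0.5f)` yields `3f000000`, and `hexToScale("3f800000")` yields `1.0f`.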

Also, if you've already computed the range of activations for each layer, you can manually set the ranges yourself with the TensorRT API, and let TensorRT create the calibration cache from your values: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#set_tensor_mp_c
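A minimal sketch of setting the ranges manually, assuming you already have one absolute-max value per tensor name. It is written as a template so that it compiles without the TensorRT headers. With the real API, `Network` would be `nvinfer1::INetworkDefinition`, and the calls used here (`getNbInputs`, `getInput`, `getNbLayers`, `getLayer`, `getNbOutputs`, `getOutput`, `getName`, `setDynamicRange`) are the corresponding TensorRT methods. The helper names `setSymmetricRange` and `applyDynamicRanges` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Look up the tensor by name and give it a symmetric range [-absMax, +absMax].
// Returns false if no range is known for this tensor or the call fails.
template <typename Tensor>
bool setSymmetricRange(Tensor* tensor, const std::unordered_map<std::string, float>& absMaxByName)
{
    assert(tensor != nullptr);
    const auto it = absMaxByName.find(tensor->getName());
    if (it == absMaxByName.end())
    {
        return false;
    }
    return tensor->setDynamicRange(-it->second, it->second);
}

// Walk the network inputs and every layer output, applying the known ranges.
// Returns the number of tensors that did not receive a range.
template <typename Network>
int applyDynamicRanges(Network& network, const std::unordered_map<std::string, float>& absMaxByName)
{
    int missing = 0;
    for (int i = 0; i < network.getNbInputs(); ++i)
    {
        if (!setSymmetricRange(network.getInput(i), absMaxByName))
        {
            ++missing;
        }
    }
    for (int i = 0; i < network.getNbLayers(); ++i)
    {
        auto* layer = network.getLayer(i);
        for (int j = 0; j < layer->getNbOutputs(); ++j)
        {
            if (!setSymmetricRange(layer->getOutput(j), absMaxByName))
            {
                ++missing;
            }
        }
    }
    return missing;
}
```

A non-zero return value is a hint that some tensor names in your table do not match the names TensorRT assigned after parsing, so it is worth printing the missing names before building the engine.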

rmccorm4 on 23 Jun 2020

Hi @rmccorm4 ,

Thank you so much for the links. I have successfully done this part. I have one more question, if you don't mind. I used ResNet-18 in PyTorch. In the calibration cache file there are three layers named

(Unnamed Layer* 48) [Constant]_output
(Unnamed Layer* 50) [Shuffle]_output
(Unnamed Layer* 51) [Scale]_output
I don't know the implementation details of TRT. My guess is that the first is related to average pooling and the second to resize, but I'm not sure. Can you give some hints about what these layers are?

The context of the three layers is listed below. The network structure is shown in the attached image. Some more info from the ONNX model viewer: the outputs of the last ReLU, GlobalAveragePool, Flatten, MatMul and Add are named 86, 87, 88, 90 and 92 respectively.

Thanks again.

  "86": {
    "scale": 0.38845354318618774,
    "min": 0,
    "max": 0,
    "offset": 0
  },
  "87": {
    "scale": 0.38845354318618774,
    "min": 0,
    "max": 0,
    "offset": 0
  },
  "88": {
    "scale": 0.38845354318618774,
    "min": 0,
    "max": 0,
    "offset": 0
  },
  "(Unnamed Layer* 48) [Constant]_output": {
    "scale": 0.005822753068059683,
    "min": 0,
    "max": 0,
    "offset": 0
  },
  "90": {
    "scale": 0.24475298821926117,
    "min": 0,
    "max": 0,
    "offset": 0
  },
  "(Unnamed Layer* 50) [Shuffle]_output": {
    "scale": 0.24128367006778717,
    "min": 0,
    "max": 0,
    "offset": 0
  },
  "(Unnamed Layer* 51) [Scale]_output": {
    "scale": 0.24169020354747772,
    "min": 0,
    "max": 0,
    "offset": 0
  },
  "92": {
    "scale": 0.24475298821926117,
    "min": 0,
    "max": 0,
    "offset": 0
  }
}

liming312 on 23 Jun 2020

👍1

Hi @liming312 ,

Sorry I'm not too sure. ~I don't think there will generally be a 1:1 mapping between original model layers and TRT engine layers, due to the optimizations and layer fusions that take place under the hood when building the engine. You'll probably have to experiment with it.~

As you said, it may be up to implementation details in the TensorRT ONNX parser that certain layers aren't translated 1:1 and may be combined, or broken out into a combination of several smaller ops.

rmccorm4 on 23 Jun 2020

👍1

@rmccorm4

Thank you very much! Just to confirm, NVIDIA has not released the code for this part?

liming312 on 23 Jun 2020

👍1

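Hi @liming312 ,

I'm not sure how deep the open source portion goes, but your best bet is probably to look through https://github.com/onnx/onnx-tensorrt/blob/master/builtin_op_importers.cpp and track down what each op you're interested in maps to during conversion.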

rmccorm4 on 23 Jun 2020

👍2

Hi @rmccorm4,

Sorry to bother you. I'm trying to create an INT8 inference engine.
After some research, my understanding is that I either have to
prepare the training images and label file to generate a cache file for building the INT8 engine in TRT,
or get each tensor's min/max values and handle this part manually in code.

Here is what I've done so far for the engine conversion (the function ONNX2TRT()):
FP16 works fine, but INT8 seems to produce the same result as the FP16 engine
(although the engine file is a bit smaller).
The inference time and results are all the same, and it seems that the lack of the two elements mentioned above is the reason.

I'm on a Windows computer and couldn't install the Python version of TensorRT,
and I'm not familiar with Docker either, so I want to use the second method to complete the INT8 engine conversion.

So far, I've used Netron to inspect each node of my model and read the tensors' min/max values from the info it shows.
I found the instructions for this part in the documentation,
but I'm confused about how to apply them, because I only have the min/max values for the convolution layers of my model;
the input/output layer doesn't show any info of tensor's value(through netron).

And I saw MNIST example, the INT8 engine conversion only set like this

config->setFlag(BuilderFlag::kINT8);

but it also adds a part that handles INT8's dynamic range while constructing the network(?),
and only for the input and output layers(??)

So I don't know how to arrange this part of code correctly...

What is the right way to get the input/output tensor value ranges so that I can complete the conversion?

cocoyen1995 on 11 Sep 2020

Hello @cocoyen1995 , thanks for your question.

> so I want to use the second method to complete the INT8 engine conversion.

FYI, you can also use a calibration cache in C++; please check the sample code in
https://github.com/NVIDIA/TensorRT/blob/master/samples/common/EntropyCalibrator.h#L66
https://github.com/NVIDIA/TensorRT/blob/master/samples/opensource/sampleINT8/sampleINT8.cpp#L238

> the input/output layer doesn't show any info of tensor's value(through netron).

The tensor information is not part of the model; only the weights and constant values are stored in the model and visible in Netron. If you want to use your own calibration algorithm, you have to run inference on the calibration data in your favorite framework with all the intermediate layers marked as outputs, so that you can collect the data distribution for every tensor, and then decide the proper min/max using your own algorithm.
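The collection step described above could be sketched as follows. This is only an illustration under stated assumptions: `RangeCollector` is a hypothetical helper, the activation values are assumed to have been dumped from your framework as plain float vectors per tensor name, and max or percentile selection stands in for whatever algorithm you actually choose.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Accumulates |activation| values per tensor over all calibration batches.
// Keeping every value is memory-hungry; a histogram would be the usual
// replacement in a real tool, but this keeps the sketch short.
class RangeCollector
{
public:
    void observe(const std::string& tensorName, const std::vector<float>& values)
    {
        auto& magnitudes = mMagnitudes[tensorName];
        for (float v : values)
        {
            magnitudes.push_back(std::fabs(v));
        }
    }

    // percentile in (0, 1]: 1.0 is plain max calibration, smaller values
    // clip outliers. Returns 0 for a tensor that was never observed.
    float absMax(const std::string& tensorName, double percentile = 1.0) const
    {
        assert(percentile > 0.0 && percentile <= 1.0);
        const auto it = mMagnitudes.find(tensorName);
        if (it == mMagnitudes.end() || it->second.empty())
        {
            return 0.0f;
        }
        std::vector<float> sorted = it->second;
        std::sort(sorted.begin(), sorted.end());
        const std::size_t index
            = static_cast<std::size_t>(std::ceil(percentile * sorted.size())) - 1;
        return sorted[index];
    }

private:
    std::unordered_map<std::string, std::vector<float>> mMagnitudes;
};
```

Each resulting value can then be used as a symmetric range of [-absMax, +absMax] for the tensor with the matching name.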

> And I saw MNIST example, the INT8 engine conversion only set like this

This is not a good example of how to use INT8 calibration. The "fake" scale works there because the MNIST task is not challenging. In complex cases like BERT, we have to generate the INT8 ranges with an algorithm: https://github.com/NVIDIA/TensorRT/blob/release/7.1/demo/BERT/builder.py#L585

For more detail about calibration, please check http://arxiv.org/abs/2004.09602

Thanks.

ttyio on 30 Sep 2020

Closing since there has been no activity in 3 weeks. Please reopen if you have more questions, thanks!

ttyio on 26 Oct 2020

Sorry, I didn't notice there was a new reply...
For those who might be looking for more details,
I've finished the calibration process successfully with the reply provided here.
Thanks again for your reply!

cocoyen1995 on 26 Oct 2020

> Sorry, I didn't notice there was a new reply...
> For those who might be looking for more details,
> I've finished the calibration process successfully with the reply provided here.
> Thanks again for your reply!

thank you too :-)

ttyio on 26 Oct 2020
