Starting from TFLite importer to relay sounds great. cc @jroesch @ajtulloch @yzhliu

tqchen on 31 Dec 2018

If you want to support transforming quantized model, be careful to transform ops like quantize to small ops like multiply and add for reusing kernels and optimizations like fusion

ZihengJiang on 4 Jan 2019

If you want to support transforming quantized model, be careful to transform ops like quantize to small ops like multiply and add for reusing kernels and optimizations like fusion

Thanks for the reminder. However, I don't fully understand it. Do you mean I should be careful with the quantize ops, or with the multiply / add ops? If we import an existing quantized model such as a TFLite one, we shouldn't see quantize ops any more.

FrozenGene on 4 Jan 2019

Hi, I recently wrote some code to read in the tflite quantized examples and translate them to nnef output. Their operations are pretty similar to nnvm ops. I translated the two mobilenets and the four inception models. There's a cmake config that pulls down all the models and converts them. Please feel free to use whatever you want from it. I forked the NNEF Tools project, https://github.com/jnorwood and put the converter under the contrib/converters/tflite_converters/tflite_to_nnef

I only added processing for the ops I needed, and I only did quantized data. tflite uses uint8 quantization, btw, with offsets for both weights and features. Biases are int32. NNEF passes quantization configuration in a separate file from the graph. Also, note that tflite uses nhwc everywhere.

jnorwood on 19 Mar 2019

@FrozenGene I am interested in contributing to this Issue. Is it possible to share the progress?

anijain2305 on 15 May 2019

Hey @anijain2305, thanks for your interest. Currently, I am working on https://github.com/dmlc/tvm/pull/3141. After that, I will start on it. BTW, our internal support is based on NNVM and is complete: we get the same results as TFLite, with better performance than TFLite. However, I have to spare some time to port it to Relay before making the PR. I also have to say that I am busy this month with our product development, and the work will have to go through the open source process in my company. I will @ you when that PR is ready.

FrozenGene on 15 May 2019

Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this is coming from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.

Discussion

Other non-TVM related links that were used to understand quantization

GemmLowP - Doc
TFlite reference code

Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for Mxnet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)

List of required operators - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

It will be good if we can agree on Relay ops - its inputs/outputs and the attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (other quantized_* operators will be on the same lines as that of quantized_conv2d)

Op quantize

def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the 
    FP32 input data to int8/uint8 tensor.

    Parameters
    -----------
    data: FP32 tensor
           The input tensor in FP32.

    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    Returns
    -------
    quantized_data: int8/uint8 tensor
           The quantized tensor.

    """

Key points to discuss

The scale and zero_point calculations happen outside the relay graph, i.e., the framework parsers will have to compute the scale and offset if only min and max are provided. Reference implementation in TFLite. This can also be thought as a framework parser utils where we can handle min/max, symmetric/asymmetric etc and generate the scale and zero_point as frameworks handles them.
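As a minimal sketch of such a parser-side utility, assuming an asymmetric scheme where the float range is stretched over the full integer range: the helper names `choose_quant_params` and `quantize_ref` are hypothetical and are not part of the proposed Relay API, and the rounding details may differ from what TFLite does.

```python
import numpy as np

def choose_quant_params(rmin, rmax, out_dtype="uint8"):
    """Hypothetical parser-side helper: derive (scale, zero_point) from a float range."""
    info = np.iinfo(np.dtype(out_dtype))
    qmin, qmax = info.min, info.max
    # The range must contain 0.0 so that real zero maps exactly onto an integer.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, 0  # degenerate all-zero range
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize_ref(data, scale, zero_point, out_dtype="uint8"):
    """NumPy reference for the proposed quantize op (assumed semantics)."""
    info = np.iinfo(np.dtype(out_dtype))
    q = np.round(data / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(out_dtype)

scale, zero_point = choose_quant_params(-1.0, 3.0)
q = quantize_ref(np.array([-1.0, 0.0, 3.0], dtype="float32"), scale, zero_point)
```

Here the range [-1, 3] gives a scale of 4/255 and a zero point of 64, so the three inputs land on the two ends of the uint8 range and on the zero point itself.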

Op quantized_conv2d

def quantized_conv2d(quantized_data, quantized_kernel, 
        input_scale, input_zero_point,
        kernel_scale, kernel_zero_point,
        output_scale, output_zero_point,
        out_dtype,

        # All the old remaining ones from conv2d
        strides=(1, 1),
        padding=(0, 0),
        dilation=(1, 1),
        groups=1,
        channels=None,
        kernel_size=None,
        data_layout="NCHW",
        kernel_layout="OIHW",
        out_layout=""):
    """

    Quantized conv2d takes the quantized data and kernel tensors, together with
    their scale and zero_point attributes, and produces an int8/uint8 output tensor.
    The scale and zero_point calculations happen outside the relay graph, i.e.,
    the framework parsers will have to compute the scale and offset if only min
    and max are provided.

    Parameters
    -----------
    quantized_data: int8/uint8 tensor
           The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
           The quantized kernel tensor in int8/uint8.

    input_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
           The output scale is set during the quantization process using training/calibration.
           The float scalar to scale the quantized_output int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
           The output zero point is set during the quantization process using training/calibration.
           The zero point of the quantized_output distribution.

    out_dtype: String
           The dtype of the quantized_output. Can only be int8/uint8.
           The requantization from int32 to int8/uint8 is a part of the op compute.

    ..... Other attributes are same as before.


    Returns
    -------
    quantized_output: int8/uint8 tensor
           The quantized tensor.

    """

Key points to discuss further

This op has a set of computations that could ideally be pre-computed, but this is difficult because constant folding only works across Relay ops and not within a single Relay op. This has been discussed in more detail on the discuss forum.
- First pre-computable - The core computation has some compute with kernel (Term 2 and Term 4 in the above link) that will be the part of tvm compute. This is very hard to avoid. We need a fused compute to get the best performance.
- Second pre-computable - The output scale and zero_point are used to calculate int multiplier and shifts to keep all the computations in Int domain. This computation changes for each op (e.g. concat will handle this in a different manner compared to conv). So, this computation is also kept inside quantized_conv2d op. This can be avoided by changing the API and replacing output_scale with output_multiplier and output_shift. But, this seems very specific to TFLite and one might want to handle the output_scale and output_offset in a different manner. I am not sure about this part, so please comment.
The op already has the requantization portion accounted for. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of output quantized tensor and not for requantization).
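To make the pre-computation question concrete, here is a small numeric check of the usual expansion of an asymmetric quantized dot product, sum((q_a - z_a) * (q_w - z_w)) = sum(q_a * q_w) - z_w * sum(q_a) - z_a * sum(q_w) + n * z_a * z_w. This is a sketch under assumptions: the term numbering below is mine and does not necessarily match the numbering in the linked forum post, the zero points are illustrative, and a conv2d output element is treated as one such dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 27  # e.g. a 3x3 kernel over 3 input channels
q_a = rng.integers(0, 256, size=n).astype(np.int32)  # uint8-range activations
q_w = rng.integers(0, 256, size=n).astype(np.int32)  # uint8-range weights
z_a, z_w = 128, 151  # illustrative zero points, not taken from any model

# Direct form: subtract the zero points, then multiply and accumulate.
direct = np.sum((q_a - z_a) * (q_w - z_w))

# Expanded form.
term1 = np.sum(q_a * q_w)   # depends on input data and kernel
term2 = z_w * np.sum(q_a)   # depends on input data, so only known at run time
term3 = z_a * np.sum(q_w)   # depends only on the kernel, known at compile time
term4 = n * z_a * z_w       # constant
expanded = term1 - term2 - term3 + term4
```

The two results agree exactly in integer arithmetic; the terms that involve only the kernel and the zero points are the ones that could in principle be folded ahead of time.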

Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.
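In terms of arithmetic, dequantization under the affine scheme discussed here is real = scale * (q - zero_point). A minimal NumPy sketch, where `dequantize_ref` is a hypothetical reference helper rather than the proposed Relay op:

```python
import numpy as np

def dequantize_ref(quantized_data, scale, zero_point):
    """NumPy reference for affine dequantization (assumed semantics)."""
    # Widen to int32 first so that subtracting the zero point cannot wrap around.
    shifted = quantized_data.astype(np.int32) - zero_point
    return (scale * shifted).astype(np.float32)

q = np.array([0, 64, 255], dtype=np.uint8)
x = dequantize_ref(q, scale=4.0 / 255.0, zero_point=64)
```

With a scale of 4/255 and a zero point of 64, the three values come back to roughly -1, 0 and 3, each within one quantization step of the original float value.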

def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the 
    int8/uint8 tensor to FP32 tensor.

    Parameters
    -----------
    quantized_data: int8/uint8 quantized input tensor
           The input tensor in int8/uint8.

    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
           The dequantized tensor.

    """

anijain2305 on 29 May 2019

@anijain2305

For the q_conv2d, we will add two more arguments.

  output_min=0, 
  output_max=0

These will be used to restrict the output range, which can be calculated ahead of time; see TFLite's CalculateActivationRangeUint8 function.
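A rough Python sketch of how such a range could be derived, loosely modelled on what that TFLite function is understood to do: the real-valued activation bounds are quantized with the output scale and zero point and intersected with the uint8 range. The helper name `activation_range_uint8` and the string activation names are hypothetical.

```python
def activation_range_uint8(activation, scale, zero_point):
    """Hypothetical helper: map a fused activation to clamp bounds in the uint8 domain."""
    qmin, qmax = 0, 255

    def quantize(real_value):
        return zero_point + int(round(real_value / scale))

    if activation == "none":
        return qmin, qmax
    if activation == "relu":
        return max(qmin, quantize(0.0)), qmax
    if activation == "relu6":
        return max(qmin, quantize(0.0)), min(qmax, quantize(6.0))
    raise ValueError(f"unsupported activation: {activation}")

# Output tensor whose scale already spans exactly [0, 6]: relu6 adds no extra clamping.
tight = activation_range_uint8("relu6", 6.0 / 255.0, 0)
# Output tensor with a wider range: relu6 narrows the clamp bounds.
wide = activation_range_uint8("relu6", 0.05, 10)
```

In the first case the bounds stay at 0 and 255, and in the second they become 10 and 130, so the bounds are expressed in quantized units rather than as the float values 0 and 6.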

From my experience, we needn't q_relu. But we need q_add / q_concate and so on. I suggest we use MobilenetV2 quant model for example, which is used very widely and have common ops we should consider. For example, depthwise convolution / add / pool and so on.

FrozenGene on 29 May 2019

From my experience, we needn't q_relu. But we need q_add / q_concate and so on. I suggest we use MobilenetV2 quant model for example,

Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though.

Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations.

Also, the MobilenetV2 q_add inputs require rescale... but in both q_concat and q_add you can recalculate the prior op downscale multipliers so you can eliminate the extra rescales.

Also, depending on your allocation capabilities, you can get rid of all concats.

jnorwood on 29 May 2019

Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

And, maybe we can put the quantization parameters in the tensor, as the scale and zero point describe the INT8 tensor data rather than the op. The ops are supposed to read these parameters and get things done.

jackwish on 29 May 2019

For the q_conv2d, we will add two more arguments.
  output_min=0, 
  output_max=0
These will be used for restrict the output range, which could be calculated previously.

I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.

anijain2305 on 29 May 2019

Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though.

Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations.

Makes sense. For now, I was thinking of not worrying about depth-wise conv, so I decided to take Inception V3 into account. Given that we are at the starting point, I don't have a strong inclination towards any network. My motive is to focus on getting the right infrastructure in place at the start and to showcase it with one large network. The performance micro-optimizations can then be phased in.

anijain2305 on 29 May 2019

Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d. This will be set to int32 for TFLite, Caffe2, QNNPACK. But, if some network needs accumulation in FP32, then it will support that as well.

And, maybe we can put the quantization parameters in tensor, as the scale and zero point are describing the INT8 tensor data rather than the op. The op are supposed to read these parameters and get things done.

Not sure about this. The good thing is that the conv2d relay operator can be shared across FP32 and quantized tensor types. The bad thing is that the compute now depends on the quantized tensor type. This might require new Relay optimizations, preventing us from fully using the existing infrastructure.

anijain2305 on 29 May 2019

For the q_conv2d, we will add two more arguments.
  output_min=0, 
  output_max=0
These will be used for restrict the output range, which could be calculated previously.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.

Whether or not we have a fused activation function, we always need output_min / output_max, because the conv gives us an int32 result but we need a uint8 result, so we must restrict int32 to uint8. If we don't have a fused activation function (quantized TFLite models have no fused activation in many cases), output_min / output_max will be 0 / 255 to restrict the int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think it is better to put these two into the conv arguments. We could then avoid producing another clamp: it is simply computed in conv2d's requantize int32 -> uint8 step, which is natural.

FrozenGene on 29 May 2019

For the q_conv2d, we will add two more arguments.
  output_min=0, 
  output_max=0
These will be used for restrict the output range, which could be calculated previously.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.
No matter whether we have fused activation function, we always need output_min / output_max. Because we will get conv int32 result, but we will need uint8 result. So we must restrict int32 to uint8. If we don't have fused activation function, (When we have quantized model of TFLite, we don't have fused activation many cases), the output_min / output_max will be 0 / 255 to restrict int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we are better put these two into conv argument. And we could avoid producing another clamp, just be calculated in conv2d requantize int32 -> uint8 process and it is nature.

In the case where the activation is not fused, the values have to be clamped to 0/255, i.e. the uint8 range, which is basically what out_dtype specifies. So, we do not need any extra information in quantized_conv2d for going back to uint8/int8 other than out_dtype. Correct?

Now, if the activation is fused, I agree that we will have two clamps: one inside quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay pass that replaces two back-to-back clamps with a single clamp operator.

The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So, it is necessary to come up with the right abstractions first. The performance can then be achieved by writing Relay passes.
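The rewrite behind merging two back-to-back clamps is simple interval intersection. A minimal sketch in plain Python, not an actual Relay pass; `merge_clamps` and `clamp` are hypothetical names, and the second range below is an arbitrary example of relu6 bounds expressed in quantized units:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def merge_clamps(first, second):
    """Combine clamp(clamp(x, *first), *second) into a single (lo, hi) pair."""
    lo = max(first[0], second[0])
    hi = min(first[1], second[1])
    if lo > hi:
        # Disjoint ranges collapse to a constant; this sketch does not handle that case.
        raise ValueError("clamp ranges do not overlap")
    return lo, hi

saturate = (0, 255)   # clamp inside the quantized conv (uint8 range)
relu6_q = (10, 130)   # illustrative relu6 bounds in the quantized domain
merged = merge_clamps(saturate, relu6_q)
```

For overlapping ranges the single merged clamp gives the same result as the two nested clamps for every input, which is what would make such a pass safe.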

anijain2305 on 29 May 2019

For the q_conv2d, we will add two more arguments.
  output_min=0, 
  output_max=0
These will be used for restrict the output range, which could be calculated previously.
I see what you are saying, but I am not sure if this is the right approach. In my opinion, it will be better to put it out of conv. The reason we have these 2 extra min/maxes is because of fused activation in TFLite. It seems better to keep it separate so that both MxNet and TFLite can share quantized_conv2d. In case of TFLite, when we see a fused conv, we can add one more clamp operator in the sequence of ops at the end.
No matter whether we have fused activation function, we always need output_min / output_max. Because we will get conv int32 result, but we will need uint8 result. So we must restrict int32 to uint8. If we don't have fused activation function, (When we have quantized model of TFLite, we don't have fused activation many cases), the output_min / output_max will be 0 / 255 to restrict int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we are better put these two into conv argument. And we could avoid producing another clamp, just be calculated in conv2d requantize int32 -> uint8 process and it is nature.
In the case the activation is not fused, the values have to clamped to 0/255 or uint8 range, which is basically the out_dtype. So, we do not need any extra information for the quantized_conv2d for going back to uint8/int8 other than out_dtype. Correct?

Now, If the activation is fused, I agree that we will have two clamps now. One inside the quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay that replaces two back-to-back clamping with one clamp Relay operator.

The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So, it is necessary to come up with right abstractions first. The performance can be then be achieved by writing Relay passes.

Yes, I agree that when we don't have an activation, we don't need anything extra. However, another thing we should consider is how to integrate with other libraries, such as QNNPACK. QNNPACK also needs output min / output max: https://github.com/pytorch/QNNPACK/blob/master/include/qnnpack.h#L62-L63

FrozenGene on 29 May 2019

Here are some points to discuss:

namespace for the tflite quantize style dialect
List of ops that might need tvm's compute declaration
set of possible passes that lower the rest into the core ops

Some of the discussions involve fusion, and that is something where TVM might be able to help. For example, in the current symmetric scheme, clip, relu6, and subsequent downcasting ops are automatically fused into the conv2d ops, while the conv2d op itself can simply output int32 (because follow-up ops will get fused).

I agree with @anijain2305 that we could try to get something minimum that is working, then start thinking about possible rewriting rules to get to some useful patterns if we decide that manual intervention is necessary.

Ideally, we should have a generic schedule template that works for any fused patterns, just as those in the current symmetric version, so we do not need to have all the different variants of fused conv2d ops

also cc @vinx13 @ZihengJiang

tqchen on 30 May 2019

I want to point out that the min and max values you mentioned are not related to the activation range in the original model. They are saturation values. In the case of mobilenet, for example, which uses relu_6 everywhere, I'm printing out the min and max activation values from the tflite mobilenet V2 below. The model uses uint8 downscale between layers, and uses the min and max value to clamp/saturate the values to 0..255 for all layers in that model. The thing it could be used for (but isn't here) is for more or fewer quantization bits or for signed int quantization ... but tflite is using all uint8 quantization for MobilenetV2.

the amin and amax values below are tflite output_activation_min, output_activation_max from their quantized reference ops for conv and dw_conv.

(base) jay@jay-desktop:~/tensorflow/tensorflow/lite/dbg$ grep conv mod2.log
---------conv in_h=224, in_w=224,out_h=112,out_w=112,f_h=3,f_w=3,mpy=1992157658,shft=-7,amin=0, amax=255
-------dwconv in_h=112, in_w=112,out_h=112,out_w=112,f_h=3,f_w=3,mpy=1254985768,shft=-1,amin=0, amax=255
---------conv in_h=112, in_w=112,out_h=112,out_w=112,f_h=1,f_w=1,mpy=2090511665,shft=-5,amin=0, amax=255
-------dwconv in_h=112, in_w=112,out_h=56,out_w=56,f_h=3,f_w=3,mpy=1729896231,shft=-1,amin=0, amax=255
---------conv in_h=56, in_w=56,out_h=56,out_w=56,f_h=1,f_w=1,mpy=2081950125,shft=-6,amin=0, amax=255
-------dwconv in_h=56, in_w=56,out_h=56,out_w=56,f_h=3,f_w=3,mpy=2080045879,shft=-4,amin=0, amax=255
---------conv in_h=56, in_w=56,out_h=56,out_w=56,f_h=1,f_w=1,mpy=1890535782,shft=-6,amin=0, amax=255
-------dwconv in_h=56, in_w=56,out_h=28,out_w=28,f_h=3,f_w=3,mpy=1151606277,shft=-5,amin=0, amax=255
---------conv in_h=28, in_w=28,out_h=28,out_w=28,f_h=1,f_w=1,mpy=2089579858,shft=-7,amin=0, amax=255
-------dwconv in_h=28, in_w=28,out_h=28,out_w=28,f_h=3,f_w=3,mpy=1410648286,shft=-4,amin=0, amax=255
---------conv in_h=28, in_w=28,out_h=28,out_w=28,f_h=1,f_w=1,mpy=1767908551,shft=-7,amin=0, amax=255
-------dwconv in_h=28, in_w=28,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1850037283,shft=-6,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1260482936,shft=-6,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1269068532,shft=-4,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1456865727,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1464063813,shft=-4,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1364297475,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1948805937,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=2136047634,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1671906928,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1327474777,shft=-6,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1330877207,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1497258311,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=7,out_w=7,f_h=3,f_w=3,mpy=1076915935,shft=-6,amin=0, amax=255
---------conv in_h=7, in_w=7,out_h=7,out_w=7,f_h=1,f_w=1,mpy=1124144746,shft=-6,amin=0, amax=255
-------dwconv in_h=7, in_w=7,out_h=7,out_w=7,f_h=3,f_w=3,mpy=1083785823,shft=-2,amin=0, amax=255
---------conv in_h=7, in_w=7,out_h=7,out_w=7,f_h=1,f_w=1,mpy=1240259613,shft=-5,amin=0, amax=255
---------conv in_h=1, in_w=1,out_h=1,out_w=1,f_h=1,f_w=1,mpy=1553319078,shft=-10,amin=0, amax=255

jnorwood on 30 May 2019

similarly, for the tflite quantized inception v3 model, all those output_activation_min, output_activation_max are 0 and 255
I'll attach a zip file with the log.
inv3.zip

jnorwood on 30 May 2019

to explain a little further ... during training they determine the range of input values, and they determine the downscale multiplier that will shrink the observed range to 0..255 (for the uint8 quantization). The fp downscale multiplier is converted to integer mpy and right-shift constants, which are the mpy and shft values in my log. At inference time, the downscaled accumulator (after applying the downscale) may be outside the uint8 quantization range, and so they clamp/saturate to that range. In these current models, they are using uint8 quantization ... so the range is 0..255, but it appears to me they are providing the min and max to support other numbers of bits in the quantization. I see support for several 4 bit gpu implementations recently, so maybe this is to support something like that.
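A simplified sketch of that conversion and of the downscale-then-saturate step, under a few assumptions: the logged shft is taken to be a power-of-two exponent (negative meaning a right shift), the multiplier is a positive value below 1, and rounding is done in a single round-half-up step, which is not bit-exact with the TFLite reference kernels. The function names are hypothetical.

```python
import math

def quantize_multiplier(real_multiplier):
    """Split a positive float multiplier into (int32 mantissa, power-of-two exponent)."""
    # frexp gives real = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(real_multiplier)
    q = int(round(mantissa * (1 << 31)))
    if q == (1 << 31):  # rounding pushed the mantissa up to 1.0
        q //= 2
        exponent += 1
    return q, exponent

def requantize(acc, mpy, shft, zero_point=0, qmin=0, qmax=255):
    """Compute acc * mpy * 2**(shft - 31), rounded to nearest, then saturate."""
    total_right_shift = 31 - shft  # assumes shft <= 30
    rounding = 1 << (total_right_shift - 1)
    scaled = (acc * mpy + rounding) >> total_right_shift
    return max(qmin, min(qmax, scaled + zero_point))

# mpy/shft of the first conv in the log above; the accumulator values are made up.
small = requantize(10000, 1992157658, -7)
large = requantize(10**6, 1992157658, -7)   # exceeds the uint8 range, so it saturates
```

With these constants the effective multiplier is roughly 0.00725, so an accumulator of 10000 maps to 72 while an accumulator of one million saturates at 255.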

jnorwood on 30 May 2019

Some comments for @anijain2305 's reply :)

Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d. This will be set to int32 for TFLite, Caffe2, QNNPACK. But, if some network needs accumulation in FP32, then it will support that as well.

A network uses operators (or layers, or whatever we'd like to call them) regardless of the accumulation format. The format is part of a software system's mechanism. So, I guess we don't need an accumulator_dtype, and out_dtype is what we want. The discussion is about whether we put requantization inside the conv2d op.

And, maybe we can put the quantization parameters in tensor, as the scale and zero point are describing the INT8 tensor data rather than the op. The op are supposed to read these parameters and get things done.

Not sure about this. The good thing is the conv2d relay operator can be shared across FP32 and quantized tensor types. The bad thing is compute depends on the quantized tensor type now. This might require new Relay optimizations, preventing us to fully use the existing infrastructure.

I was suggesting extending the existing tensor type rather than introducing a new tensor type. I assume that this won't lead to new Relay optimizations :)

EDIT: Btw, the channel-wise quantization parameter is likely to be included in TensorFlow/TFLite, also the TVM stack as a roadmap. In this way, it could be easier to manage a tensor described parameter.

jackwish on 30 May 2019

Regarding @jnorwood 's comments on output min/max of conv2d.

Your observations about the values of output min max are correct. But they are still activations. One thing I always try to deliver is that, the INT8 values in quantization are a representation of original FP32 values.

When we talk about ReLU6 activations, it means that in FP32 format the op outputs FP32 values in the range [0, 6]. For INT8 quantization, the INT8 data is a representation of the FP32 values, which means the output min/max (typically [0, 255] of INT8 type in the pre-provided quantized MobileNet) represent [0, 6] of FP32 type: the INT8 0/255 is actually FP32 0/6. Apply the output scale (0.023528477177023888) to the activation min/max and we get a value range like [0, 5.999761581420898] (from the output of the first conv of the pre-provided quantized MobileNet).
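The arithmetic can be checked in a couple of lines. This sketch assumes a zero point of 0 for that output tensor, which is consistent with the quoted range starting at 0 but is not stated explicitly; `represented_range` is a hypothetical helper name.

```python
def represented_range(scale, zero_point, qmin=0, qmax=255):
    """Real-valued interval covered by the integer interval [qmin, qmax]."""
    return scale * (qmin - zero_point), scale * (qmax - zero_point)

lo, hi = represented_range(0.023528477177023888, zero_point=0)
```

The lower bound is exactly 0 and the upper bound comes out just under 6, matching the quoted range up to floating-point precision.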

Conclusions can easily be drawn once we have this in mind :)

jackwish on 30 May 2019

I would suggest to design the infrastructure that supports both symmetric/asymmetric quantization. We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.

namespace for the tflite quantize style dialect

I think this is required for both asymmetric and symmetric quantization. These ops will be rewritten to low-level instructions by a Relay pass. How about using relay.op._quantization as the namespace? So, the operations can be relay.op._quantization.conv2d or relay.op._quantization.quantize.

List of ops that might need tvm's compute declaration

I am not sure yet. The only unknown to me are the special rounding operations that are used in converting the Floating point to Integer multiplication in scaling the quantized conv matrix. But, they might already be covered in current low-level ops.

set of possible passes that lower the rest into the core ops

I was hoping to re-use the FForwardRewrite infrastructure to lower the ops. Do you anticipate more passes here?

anijain2305 on 31 May 2019

We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.

All the tflite quantized models I've tested use the asymmetric uint8 quantization. If you are planning to use those as examples, it will be hard to debug if you throw in the change to symmetric.

jnorwood on 31 May 2019

We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.

All the tflite quantized models I've tested use the asymmetric uint8 quantization. If you are planning to use those as examples, it will be hard to debug if you throw in the change to symmetric.

TensorFlow quantization-aware training supports both asymmetric/symmetric. We are seeing asymmetric models because it is the default. If we'd like to start from symmetric approach, set the symmetric and go on. Which, requires extra effort I think...

jackwish on 31 May 2019

This is most probably out of the context of the issue, but is it possible for all of the people commenting here to join a conference call for an hour and figure out the next steps? I can take notes and document them here for everybody else to see. I think it will be more productive.

anijain2305 on 31 May 2019

re "conference calls". I totally agree that calling or in person sync will speed up reaching consensus. Doing most of the development in the public archivable process is preferred https://docs.tvm.ai/contribute/committer_guide.html#public-archive-principle

We do need to acknowledge the overhead of the asynchronous communication, but should also acknowledge the gains we get by leaving a trace for the broader community. I would encourage us to try to rely more on asynchronous communication in public channels first. The main bottleneck of asynchronous discussion is the overhead of latency and a good way to improve it is to

Here is a possible proposal:

Everyone who is primarily driving this process sends out a proposal
- List of points to be discussed.
- List of questions.
- List of pros and cons of decisions, if there is a decision being made.
Everyone can critique

We could also use Slack for semi-sync chats, but please note that everything related to design decisions needs to be properly sent back to the public channel. I understand that there is more overhead in this approach, but I believe it is a price worth paying to get more people involved.

tqchen on 1 Jun 2019

TensorFlow quantization-aware training supports both asymmetric/symmetric. We are seeing asymmetric models because it is the default. If we'd like to start from symmetric approach, set the symmetric and go on. Which, requires extra effort I think...

You might also consider symmetric signed int8 for weights, and unsigned uint8 for source and destination, since uint8 will give an extra bit of precision following activations. Intel appears to preferentially support this form in their examples, and their new DLBoost avx512 vector instructions also appear to preferentially support this form.

https://intel.github.io/mkl-dnn/ex_int8_simplenet.html

https://www.intel.ai/nervana/wp-content/uploads/sites/53/2018/05/Lower-Numerical-Precision-Deep-Learning-Inference-Training.pdf

These instructions enable lower precision multiplies with higher precision accumulates. Multiplying two 8-bit values and accumulating the result to 32-bits requires 3 instructions and requires one of the 8-bit vectors to be in unsigned int8 (u8) format, the other in signed int8 (s8) format with the accumulation in signed int32 (s32) format.

jnorwood on 1 Jun 2019

TensorFlow quantization-aware training supports both asymmetric/symmetric. We are seeing asymmetric models because it is the default. If we'd like to start from symmetric approach, set the symmetric and go on. Which, requires extra effort I think...

You might also consider symmetric signed int8 for weights, and unsigned uint8 for source and destination, since uint8 will give an extra bit of precision following activations. Intel appears to preferentially support this form in their examples, and their new DLBoost avx512 vector instructions also appear to preferentially support this form.

https://intel.github.io/mkl-dnn/ex_int8_simplenet.html

https://www.intel.ai/nervana/wp-content/uploads/sites/53/2018/05/Lower-Numerical-Precision-Deep-Learning-Inference-Training.pdf

These instructions enable lower precision multiplies with higher precision accumulates. Multiplying two 8-bit values and accumulating the result to 32-bits requires 3 instructions and requires one of the 8-bit vectors to be in unsigned int8 (u8) format, the other in signed int8 (s8) format with the accumulation in signed int32 (s32) format.

I am sorry, but I fail to see the connection between your comment that uint8 will give an extra bit of precision following activations and the material you listed. Would you please make it a bit clearer? AFAIK, uint8 and int8 have the same value capacity, so there could be no extra precision.

jackwish on 2 Jun 2019

@jackwish
If relu activations are used, there is no need to use half of the representation space for negative values; thus the extra bit of precision.
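A rough way to see the "extra bit" is to compare step sizes. This is only a sketch under stated assumptions: activations follow a ReLU6 and lie in [0, 6], both schemes use a zero offset, and the helper name `step_size` is made up for illustration.

```python
def step_size(real_min, real_max, num_codes):
    """Width of one quantization step when the real range is spread over num_codes codes."""
    return (real_max - real_min) / (num_codes - 1)

# Symmetric signed int8 with zero offset: codes -127..127 have to span [-6, 6],
# even though values after ReLU6 never land in the negative half.
int8_step = step_size(-6.0, 6.0, 255)

# Unsigned uint8 with zero offset: codes 0..255 span only [0, 6].
uint8_step = step_size(0.0, 6.0, 256)

ratio = int8_step / uint8_step
print(int8_step, uint8_step, ratio)
```

The ratio comes out close to 2, i.e. the unsigned representation resolves the same post-activation range roughly twice as finely.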

eqy on 2 Jun 2019

This makes sense.

Best Regards
Zhenhua

jackwish on 2 Jun 2019

Ok, let's try to finalize the high-level design points. Let's first discuss the

Namespace for the tflite quantize style dialect

Requirements

This should support both symmetric and asymmetric.
These ops should never go through codegen. They will be lowered to low-level Relay ops (like existing conv, round etc) using FForwardRewrite or similar kind of Relay infrastructure.
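To make the lowering idea concrete, here is a minimal NumPy sketch of how a quantize op could decompose into primitive steps (divide, round, add, clip, cast). It is not the Relay implementation; the function names are hypothetical, and `np.round` rounds halves to even, which may differ from a given framework's rounding.

```python
import numpy as np

def quantize_lowered(data, scale, zero_point, out_dtype="uint8"):
    """Quantize expressed as a chain of small ops, each of which maps to an existing primitive."""
    info = np.iinfo(out_dtype)
    scaled = np.divide(data, scale)                  # divide (or multiply by 1/scale)
    rounded = np.round(scaled)                       # round
    shifted = rounded + zero_point                   # add
    clipped = np.clip(shifted, info.min, info.max)   # clip to the dtype range
    return clipped.astype(out_dtype)                 # cast

def dequantize_lowered(quantized, scale, zero_point):
    """Inverse mapping: subtract the zero point, then multiply by the scale."""
    return scale * (quantized.astype("int32") - zero_point)

x = np.array([-1.0, 0.0, 0.5, 10.0])
q = quantize_lowered(x, scale=0.02, zero_point=128)
print(q, dequantize_lowered(q, 0.02, 128))
```

The last input saturates at 255 because it falls outside the representable range for this scale and zero point.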

Proposal

How about using relay.op._quantization as the namespace? So, the operations can be relay.op._quantization.conv2d or relay.op._quantization.quantize.

Pros

Separation of concerns - Restricts the number of ops for which TVM compute has to be written.
Good readability/debugging - Framework parsing will be easier compared to directly lowering to low-level Relay ops. Also, one can look at the quantized annotation ops and understand the quantization flow.

Cons

Getting the best performance might require some new Relay passes. It might require working on a peephole optimizer or some complicated fusion. (Symmetric quantization might already work very well with existing Relay infrastructure. Asymmetric most probably will need more efforts.)

Let me know your thoughts on this. As we achieve consensus, I can start prototyping these operators with stub implementations.

anijain2305 on 4 Jun 2019

@FrozenGene @jackwish can you also try to send a proposal as well? it would be great to have a global picture of what is in everyone's mind

tqchen on 5 Jun 2019

@tqchen We are very busy with an internal project at the moment. I will talk with @jackwish next Monday. However, sending the proposal may have to wait until we finish this project. Sorry for that.

FrozenGene on 7 Jun 2019

After NCHW support was removed from tflite.py 3 weeks ago (#3141), no TFLite models can be compiled for ARM CPU and Mali GPU.

apivovarov on 11 Jun 2019

@tqchen @FrozenGene @jackwish

I have added a prototype patch. I think it will be helpful to use that patch to drive the discussion further.

anijain2305 on 14 Jun 2019

@anijain2305 I looked through the code quickly and I understand your idea (combining operators to build q_conv2d). However, as commented before, how do we integrate with QNNPACK when we don't have output_min / output_max? I think we could have these two arguments; if MXNet doesn't have them, we could leave them at their default values.

FrozenGene on 14 Jun 2019

@FrozenGene Thanks for replying. I might be wrong, but I don't think it is a good design to take one codegen backend like QNNPACK and make changes all the way into Relay APIs to make the connection. In my opinion, APIs must be minimal.

But your point about using QNNPACK is completely valid. I have been thinking about that myself, dreading the painful experience of writing tensorized kernels for Intel x86, and hoping to somehow use OpenVINO/MKLDNN. But, similarly, I don't think adding MKLDNN/OpenVINO arguments to the Relay API would be the right choice either.

One way to handle this is to separate the Relay operator API that we are discussing from the infrastructure for using external codegen like QNNPACK. I think it is entirely possible to write Relay passes for each codegen backend and then rewrite/fuse the Relay ops in a manner that the codegen backend can understand. In this case, we do not let backend-specific idiosyncrasies creep into the Relay op API, while also having a well-defined infrastructure that shows how to add external codegens.

anijain2305 on 15 Jun 2019

@anijain2305 I understand your thought, and I agree we should keep the API minimal. However, no matter which way we go, q_conv2d's int32 output has to be clamped into the uint8 range. If you don't pass min / max, you still need to do output = std::max(output, 0) and output = std::min(output, 255) before returning output. So why not set the defaults output_min = 0 / output_max = 255 and make the computation output = std::max(output, output_min) and output = std::min(output, output_max), which would be suitable for TFLite / MXNet / QNNPACK and so on? API design is very important, and we should consider as much as we can (TFLite / MXNet, and other libraries too; QNNPACK is a very high performance library on ARM CPU, so we cannot avoid discussing it in my opinion). Otherwise we will need tricky workarounds in the future. This is the point I wished to express before.

FrozenGene on 15 Jun 2019

@FrozenGene a clarifying question to your above comment. If we pass in the output scale and shift, can we not compute int32 -> int8 by simply adding more nodes in the graph?

shoubhik on 15 Jun 2019

@FrozenGene For the output_min and max, isn't the out_dtype enough? If it's uint8, we can clamp at 0 and 255. If it's int8, we can clamp at -128 and 127. I don't see any reason the values would be any different, unless you want to fuse the quantized relu into the quantized convolution from the very start. Please let me know if I am misunderstanding something. I think we should not fuse operators in the frontend and should let Relay graph fusion take care of that.
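The two positions can be sketched side by side: the clamp bounds default to the range of out_dtype, and an optional output_min / output_max can narrow them. This is an illustrative NumPy sketch only; the function name `saturate` and its signature are assumptions, not a proposed Relay API.

```python
import numpy as np

def saturate(acc, out_dtype="uint8", output_min=None, output_max=None):
    """Clamp an int32 accumulator into out_dtype, optionally with tighter explicit bounds."""
    info = np.iinfo(out_dtype)
    lo = info.min if output_min is None else max(info.min, output_min)
    hi = info.max if output_max is None else min(info.max, output_max)
    return np.clip(acc, lo, hi).astype(out_dtype)

acc = np.array([-300, -5, 100, 300], dtype=np.int32)
print(saturate(acc, "uint8"))                  # bounds derived from the dtype alone
print(saturate(acc, "int8"))
print(saturate(acc, "uint8", output_max=130))  # explicit bound, e.g. from a fused activation
```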

Let's see what others think about this. @tqchen @yzhliu @ZihengJiang What are your thoughts on this?

anijain2305 on 15 Jun 2019

The tflite quantized convolution reference implementation passes in both limits as int32 values. It appears to me this would let them simulate smaller than 8 bit quantizations, if that is something you want to support.

this is from tensorflow/lite/kernels/internal/reference/conv.h

    acc = MultiplyByQuantizedMultiplier(acc, output_multiplier,
                                        output_shift);
    acc += output_offset;
    acc = std::max(acc, output_activation_min);
    acc = std::min(acc, output_activation_max);
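The order of operations in that excerpt (rescale, add offset, then saturate) can be mirrored in a few lines of NumPy. This sketch deliberately replaces the integer fixed-point MultiplyByQuantizedMultiplier with a plain floating-point multiply, so it is not bit-exact with TFLite; `requantize_reference` and `real_multiplier` are hypothetical names.

```python
import numpy as np

def requantize_reference(acc, real_multiplier, output_offset, act_min=0, act_max=255):
    """Downscale an int32 accumulator, add the output offset, then saturate."""
    # Floating-point stand-in for the fixed-point multiplier/shift pair.
    scaled = np.round(acc * real_multiplier).astype(np.int64)
    shifted = scaled + output_offset
    return np.clip(shifted, act_min, act_max).astype(np.uint8)

acc = np.array([-20000, 0, 12345, 100000], dtype=np.int64)
out = requantize_reference(acc, real_multiplier=0.005, output_offset=10)
print(out)
```

Note that the clamp happens after the offset is added, so the bounds are expressed in the quantized output domain, not in real values.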

jnorwood on 15 Jun 2019

It appears to me this would let them simulate smaller than 8 bit quantizations.

If simulating smaller than 8 bits is the case, 8 bits should be able to hold the activation min/max value.

jackwish on 16 Jun 2019

@FrozenGene a clarifying question to your above comment. If we pass in the output scale and shift, can we not compute int32 -> int8 by simply adding more nodes in the graph?

I don't fully understand your comment. Do you mean we could avoid the int32 -> int8 computation? If so, I think we cannot. We need the requantize operation (int32 -> int8) (https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/internal/reference/conv.h#L171)

FrozenGene on 16 Jun 2019

It appears to me this would let them simulate smaller than 8 bit quantizations.

If _simulating smaller than 8 bits_ is the case, 8 bits should be able to hold the activation min/max value.

8 bits could hold it. But what are the values of output_min / output_max? I think this is the point @jnorwood wants to make, because we cannot simply use out_dtype to decide the value range. If we insert a clip op in the frontend, I think that could also handle it; we need some logic to calculate the min / max. See my next comment.

FrozenGene on 16 Jun 2019

@FrozenGene For the output_min and max, isn't the out_dtype enough? If it's uint8, we can clamp at 0 and 255. If it's int8, we can clamp at -128 and 127. I don't see any reason the values would be any different, unless you want to fuse the quantized relu into the quantized convolution from the very start. Please let me know if I am misunderstanding something. I think we should not fuse operators in the frontend and should let Relay graph fusion take care of that.

Let's see what others think about this. @tqchen @yzhliu @ZihengJiang What are your thoughts on this?

I think it is ok. If we do it this way, we should insert one clamp if we have an activation,
like in our TFLite frontend:

    # If we have fused activations
    if fused_activation_fn != ActivationFunctionType.NONE:
        if weight_tensor_type == TensorType.UINT8:
            # implement this function
            output_min, output_max = self.calculate_activation_range_uint8(
                output_scale, output_zero_point, fused_activation_fn)
            # insert clip
            out = _op.clip(out, output_min, output_max)
        out = self.convert_fused_activation_function(out, fused_activation_fn)

FrozenGene on 16 Jun 2019

I think it is ok. If we do it this way, we should insert one clamp if we have an activation,
like in our TFLite frontend

Yes, I agree with that. That's exactly what I was thinking.

anijain2305 on 16 Jun 2019

The min and max are not conditional on existence of activation operation in the original model. They are there to saturate the downscaled and offset adjusted 32 bit signed int accumulator to the min and max value of the uint8 quantized bit range.

Although the quantized conv result is held in uint8, it could be static casted to signed int8, or even fewer than 8 bit quantization. That would require both min and max saturations, as in the reference tflite quantized conv implementation.

jnorwood on 16 Jun 2019

Although the quantized conv result is held in uint8, it could be static casted to signed int8, or even fewer than 8 bit quantization. That would require both min and max saturations, as in the reference tflite quantized conv implementation

Ah, I see. That finally makes sense.
So, this is not about activation. This is about what representation one is using for storing the floating point values. For example, if it is 7-bits, we will need the output min/max saturations. Cool, I will add them into the API and add corresponding documentation.

anijain2305 on 17 Jun 2019

So, this is not about activation.

Of course it comes from activation, and is related to zero point and scale. For this min/max activation:

1. They are even named with activation when used in the computing kernel.
2. The min/max is generated at the prepare stage of convolution.
3. The function in 2 eventually calls CalculateActivationRangeQuantizedImpl.
4. Min/max are set to the representable value range of a data type ONLY when no activation is found in the fused operator.
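A Python sketch of that range computation may help ground the discussion. It follows the RELU and RELU6 branches quoted elsewhere in this thread from kernel_util.cc; the string activation names and the function names are illustrative assumptions, and rounding half away from zero is used here as a stand-in for TfLiteRound.

```python
import math

def round_half_away(x):
    """Round to nearest integer, halves away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def activation_range_quantized(activation, scale, zero_point, qmin=0, qmax=255):
    """Return (act_min, act_max) in the quantized output domain."""
    def quantize(real_value):
        return zero_point + round_half_away(real_value / scale)

    if activation == "relu":
        return max(qmin, quantize(0.0)), qmax
    if activation == "relu6":
        return max(qmin, quantize(0.0)), min(qmax, quantize(6.0))
    # No fused activation: the full representable range of the data type.
    return qmin, qmax

# Output scale calibrated so that [0, 6] fills the whole uint8 range.
print(activation_range_quantized("relu6", 6.0 / 255, 0))
# Output range wider than [0, 6]: the clamp is strictly tighter than 0..255.
print(activation_range_quantized("relu6", 0.05, 10))
```

Under these assumptions, whether the bounds come out as 0 and 255 depends on how the output scale and zero point relate to the activation's real range, which is consistent with both observations reported in this thread.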

jackwish on 17 Jun 2019

yes, right. The scaling constant computed during training is based on the range of values seen after fused in activations (at least that is true for the tflite quantized models I've looked at). That includes being after the relu6 positive clipping also. During inference, the min and max saturation values are just handling saturation of values seen outside the range expected from the training... whether or not there was a fused in activation operation during training.

jnorwood on 17 Jun 2019

It appears to me this would let them simulate smaller than 8 bit quantizations.

If _simulating smaller than 8 bits_ is the case, 8 bits should be able to hold the activation min/max value.

8 bits could hold it. But what are the values of output_min / output_max? I think this is the point @jnorwood wants to make, because we cannot simply use out_dtype to decide the value range. If we insert a clip op in the frontend, I think that could also handle it; we need some logic to calculate the min / max. See my next comment.

I was saying that the reasoning "It appears to me this would let them simulate smaller than 8 bit quantizations" might not be the only possibility.

jackwish on 17 Jun 2019

During inference, the min and max saturation values are just handling saturation of values seen outside the range expected from the training...

I guess the saturation is exactly what activations (ReLU family) mean, semantically. :)

jackwish on 17 Jun 2019

Although the quantized conv result is held in uint8, it could be static casted to signed int8, or even fewer than 8 bit quantization. That would require both min and max saturations, as in the reference tflite quantized conv implementation

Ah, I see. That finally makes sense.
So, this is not about activation. This is about what representation one is using for storing the floating point values. For example, if it is 7-bits, we will need the output min/max saturations. Cool, I will add them into the API and add corresponding documentation.

See @jackwish's comment. As my calculate_activation_range_uint8 code implies, only when there is no activation do we get the full range of the data type, i.e. without an activation we have 0 - 255 for uint8. If we have RELU6, we get https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L152

So what if we use 7 bits? We could still use 8 bits to represent output_min / output_max in conv's compute kernel, i.e. the kernel's output_min / output_max is 0 / 255. But in our frontend, it would look like this:

    # If we are 7-bits
    if weight_tensor_type == TensorType.UINT7:
        # implement this function
        output_min, output_max = self.calculate_activation_range_uint7(
            output_scale, output_zero_point, fused_activation_fn)
        # insert clip
        out = _op.clip(out, output_min, output_max)

That is to say, no matter whether we have an activation, we will have one clip. If there is no activation, we clamp to 0 / 127, because the kernel represents the range as 0 / 255, the 8-bit range. If we have an activation, for example RELU6, the code changes too https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L152:

   *act_min = std::max(qmin, quantize(0.0));
   *act_max = std::min(qmax, quantize(6.0));

Here qmin is 0 and qmax is 127.

So, if we decide to insert a clip operator in the frontend, we can handle fewer than 8 bits too.

One potential optimization:
If TVM supports a data type like UINT7, we could use the same logic as for UINT8, which means we could avoid inserting a clip operator in the frontend when there is no activation (just set out_dtype to UINT7). However, I don't think this would be the bottleneck.

FrozenGene on 17 Jun 2019

I guess the saturation is exactly what activations (ReLU family) mean, semantically. :)

In the case of the tflite quantized models I've looked at, the batch normalization and relu6 operations in training are fused into the conv operations used during inference. You probably need to fuse the relu6 to match their results.

This paper removes the relu6 and batch norm associated with the depthwise convs in a mobilenet modification. You would still need the min and max values for those depthwise conv operations even though there is no fused activation. So, that is all I was trying to say ... those min and max values are really to saturate the quantization range, rather than representing an activation operation.

https://arxiv.org/pdf/1803.08607.pdf

jnorwood on 17 Jun 2019

https://arxiv.org/pdf/1803.08607.pdf

Qualcomm's way? Let us look at Google's TFLite model:

The quantized model we have does not remove RELU6 in dw conv / conv. I think we should focus on TFLite's code / TFLite's way.

Coming back to Qualcomm's paper: if we decide to support that way, we could also write the logic in the frontend and insert the correct clip operator. However, I think we have no obvious reason to support it.

FrozenGene on 17 Jun 2019

If no activation, we will clamp it to 0 / 127.

In the tflite quantized conv implementation ( I posted an excerpt from their code previously) the offset is added in prior to the clamping. The tflite quantized models in their repository used uint8 asymmetric quantization with non-zero offsets for activations and weights and int32 for biases . In that case min and max values passed into the quantized conv are always 0 and 255.

It appears to me, though, that whoever wrote that conv code might have also considered supporting the return of signed int8 quantized values, since they provided a signed int32 min saturation value. If signed int8 quantization is a tflite quantization conversion option, then it may be a good idea to make sure this case is covered.

The intel quantization uses fixed 0 offset uint8 for activations and fixed 0 offset int8 for weights and fixed 0 int32 for biases. That simplifies the terms of the convolution inner loops ( a lot, as has been discussed here before). It also reflects Intel's avx512 DLBoost hardware int8 capabilities/limitations. So, probably a good idea to support that mode.
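The simplification from zero offsets can be checked numerically. The sketch below expands a dot product with zero points za and zw into four terms and verifies it against the direct form; with za = zw = 0 only the first term survives. The function names and the sample zero points are illustrative assumptions.

```python
import numpy as np

def dot_direct(a, w, za, zw):
    """Dot product of offset-corrected codes, computed directly."""
    a = a.astype(np.int64)
    w = w.astype(np.int64)
    return int(np.sum((a - za) * (w - zw)))

def dot_expanded(a, w, za, zw):
    """Same value, expanded into the four terms an inner loop would need."""
    a = a.astype(np.int64)
    w = w.astype(np.int64)
    n = a.size
    return int(np.sum(a * w)          # term 1: plain integer dot product
               - zw * np.sum(a)       # term 2: depends on the input data
               - za * np.sum(w)       # term 3: depends only on the weights
               + n * za * zw)         # term 4: constant

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=64).astype(np.uint8)
w = rng.integers(0, 256, size=64).astype(np.uint8)

print(dot_direct(a, w, 128, 121), dot_expanded(a, w, 128, 121))
print(dot_direct(a, w, 0, 0) == int(np.sum(a.astype(np.int64) * w.astype(np.int64))))
```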

jnorwood on 17 Jun 2019

In that case min and max values passed into the quantized conv are always 0 and 255.

Not true. When there is activation, the range is not always 0 ~ 255. For example RELU,

     auto quantize = [scale, zero_point](float f) {
    return zero_point + static_cast<int32_t>(TfLiteRound(f / scale));
     };
    *act_min = std::max(qmin, quantize(0.0));
    *act_max = qmax;

We have verified that computing it this way makes the result the same as TFLite's.

FrozenGene on 17 Jun 2019

In the tflite quantized Mobilenet v1, from the repository, the first conv operation has input data with a non-zero offset. The offset is 128. So either provide a conv which uses signed int8 and 0 offset, or do what tflite does and handle it as quantized uint8 convolution with 128 offset value.
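The two options describe the same real values. As a small sketch (the helper name is hypothetical), shifting uint8 codes with offset 128 down by 128 yields int8 codes with offset 0, and the offset-corrected values are identical:

```python
import numpy as np

def uint8_to_int8_codes(q_u8, zero_point_u8=128):
    """Re-express uint8 codes as int8 codes; the zero point shifts by the same 128."""
    shifted = q_u8.astype(np.int16) - 128   # widen first so the subtraction cannot wrap
    return shifted.astype(np.int8), zero_point_u8 - 128

q_u8 = np.array([0, 1, 128, 200, 255], dtype=np.uint8)
q_i8, zero_point_i8 = uint8_to_int8_codes(q_u8)

# Offset-corrected values agree, so scale * (q - zero_point) is unchanged.
corrected_u8 = q_u8.astype(np.int32) - 128
corrected_i8 = q_i8.astype(np.int32) - zero_point_i8
print(q_i8, zero_point_i8, np.array_equal(corrected_u8, corrected_i8))
```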

You can see the quantization offsets in Netron in the node properties input data

mobilenetv2

jnorwood on 17 Jun 2019

Not true. When there is activation, the range is not always 0 ~ 255. For example RELU,

I believe tflite extends the quantization range so it always includes 0, as done in the gemmlowp quantization example below. I have dumped my min and max saturation input values from the six quantized tflite models (two mobilenets and four inceptions). They are all 0 and 255.

https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc

// Given the min and max values of a float array, return
// reasonable quantization parameters to use for this array.
QuantizationParams ChooseQuantizationParams(float min, float max) {
  // We extend the [min, max] interval to ensure that it contains 0.
  // Otherwise, we would not meet the requirement that 0 be an exactly
  // representable value.
  min = std::min(min, 0.f);
  max = std::max(max, 0.f);

jnorwood on 17 Jun 2019

Not true. When there is activation, the range is not always 0 ~ 255. For example RELU,

I believe tflite extends the quantization range so it always includes 0, as done in the gemmlowp quantization example below. I have dumped my min and max saturation input values from the six quantized tflite models (two mobilenets and four inceptions). They are all 0 and 255.

https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc
// Given the min and max values of a float array, return
// reasonable quantization parameters to use for this array.
QuantizationParams ChooseQuantizationParams(float min, float max) {
  // We extend the [min, max] interval to ensure that it contains 0.
  // Otherwise, we would not meet the requirement that 0 be an exactly
  // representable value.
  min = std::min(min, 0.f);
  max = std::max(max, 0.f);

I think you may not have fully understood my previous comment. One question I want to ask: do your quantized models have conv + relu / relu6 like our model? If not, the range is obviously 0 ~ 255, no matter how many models there are. Please see: https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L138 @jackwish and I have emphasized this function many times.

Please construct one quantized model like us:

I am sure you will observe a different result.

FrozenGene on 17 Jun 2019

I think you may not have fully understood my previous comment. One question I want to ask: do your quantized models have conv + relu / relu6 like our model? If not, the range is obviously 0 ~ 255, no matter how many models there are. Please see: https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L138 @jackwish and I have emphasized this function many times.

The quantized mobilenet v1 inference model is from the tflite model repository. The training model includes relu6 and batch normalization operations, but these are fused into convolution operations in the inference model, as the Netron diagram shows.

The link you reference shows floating point activation values that would be applied during training. They do represent the range bound that would be expected of the upscaled values in the accumulator in the inference model. However the min and max saturation values passed into the inference quantized convolution are applied _after downscale_ ... I previously provided the code and the link. They are int32 values, not float values. They are applied after both downscale and offset are applied. They are 0..255 even though the scaled up range expected is 0..6 from the fused-in relu6 operation.

If the convolution and relu operations were separate, you would still see 0 and 255 for those min and max values because they are applied after downscale and after offset are applied to the convolution accumulator. The min and max values only function to saturate the downscaled result to the quantized uint8 bit range, avoiding wrap-around overflow/underflow of the 8 bit value if the downscaled accumulator were simply masked to 8 bits.
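The difference between saturating and simply masking to 8 bits is easy to demonstrate. A minimal sketch, with hypothetical helper names:

```python
def mask_to_8_bits(acc):
    """Keep only the low 8 bits: out-of-range values wrap around."""
    return acc & 0xFF

def saturate_to_uint8(acc, act_min=0, act_max=255):
    """Clamp to the bounds: out-of-range values stick at the nearest end."""
    return min(max(acc, act_min), act_max)

for acc in (-5, 100, 300):
    print(acc, mask_to_8_bits(acc), saturate_to_uint8(acc))
```

A slightly-too-large accumulator of 300 wraps to 44 when masked but saturates to 255, and -5 wraps to 251 but saturates to 0.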

jnorwood on 17 Jun 2019

@jnorwood Having read the long discussion again, I finally understand what you are trying to say. Let me ask this question: considering ReLU6 in float, do you think it is saturating input float values into [0, 6]?

jackwish on 18 Jun 2019

@jnorwood Having read the long discussion again, I finally understand what you are trying to say. Let me ask this question: considering ReLU6 in float, do you think it is saturating input float values into [0, 6]?

The 0..6.0 float clamping is applied during training if relu6 is used as activation. It may also be used to force the range for creating the downscale constants and offsets that are used in inference. That seems so, from your activation code excerpt.

The gemmlowp example indicates that they always extend a range if it doesn't include 0. I believe their reason was that an exact zero representation is needed in the range... perhaps for padding. I didn't see that in the activation code excerpt, but perhaps that is handled elsewhere.

On the quantized inference side, those min and max values are applied after the downscale and offset are applied, and it seems to me more appropriate to recognize that they are needed for the quantization bits saturation whether or not an activation operation was used in the training model.

I've only seen 0 and 255 for those input min and max values in the six quantized tflite models I've converted. I dumped them all to check.

No, there is no saturation being applied to input values during inference. The input values are uint8 in the tflite models. There is extra info stored in the model file indicating the input range and offset. In some model operations that input info is needed for rescale ... For example in the multiple input concat operations in the inception_v3 model, the input ranges are different, so a rescale is required.

The tf training models associated with the quantized tflite models have activation and bn operations that are effectively fused together with the conv, along with the fake quantization ops. No separate activation nodes appear in the associated inference models.

jnorwood on 18 Jun 2019

I think you may not have fully understood my previous comment. One question I want to ask: do your quantized models have conv + relu / relu6 like our model? If not, the range is obviously 0 ~ 255, no matter how many models there are. Please see: https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L138 @jackwish and I have emphasized this function many times.

The quantized mobilenet v1 inference model is from the tflite model repository. The training model includes relu6 and batch normalization operations, but these are fused into convolution operations in the inference model, as the Netron diagram shows.

The link you reference shows floating point activation values that would be applied during training. They do represent the range bound that would be expected of the upscaled values in the accumulator in the inference model. However the min and max saturation values passed into the inference quantized convolution are applied _after downscale_ ... I previously provided the code and the link. They are int32 values, not float values. They are applied after both downscale and offset are applied. They are 0..255 even though the scaled up range expected is 0..6 from the fused-in relu6 operation.

If the convolution and relu operations were separate, you would still see 0 and 255 for those min and max values because they are applied after downscale and after offset are applied to the convolution accumulator. The min and max values only function to saturate the downscaled result to the quantized uint8 bit range, avoiding wrap-around overflow/underflow of the 8 bit value if the downscaled accumulator were simply masked to 8 bits.

I have emphasized that the model diagram shows a quantized model. Let me show more detail of the properties:

This is to say, not every relu / relu6 can be fused into the convolution in TFLite's quantized models in a real production environment. MobilenetV1 is just one simple reference; we should consider more. Then what is the min / max now? It is what the previous code https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L138 computes, NOT just simply 0 ~ 255.

FrozenGene on 18 Jun 2019

I'm using the tensorflow tflite quantized model, mobilenet_v1_1.0_224_quant.tflite, from
https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

I view it with Netron, which shows no relu6 nodes. It also shows no fused relu6 nodes in the node properties. So... if you are discussing some different model, I can't comment on it without further info how to repeat it.

I dump the min and max parameters passed to the reference implementation of quantized conv, and they are all 0 and 255.

I created a tf branch which automates this dump, https://github.com/jnorwood/tensorflow/tree/tflite_cmake
It was last updated 14 days ago.
there is a readme.md in https://github.com/jnorwood/tensorflow/blob/tflite_cmake/tensorflow/lite/README_CMAKE.md
that shows how I built it using cmake and how to execute the command where I dumped the data, including the min and max parameters. I just ran it again and am attaching the screen capture, showing that all the min and max inputs are 0,255 for that inference model.

Screenshot from 2019-06-18 10-41-59

jnorwood on 18 Jun 2019

Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this is coming from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.

Discussion

Other non-TVM related links that were used to understand quantization

GemmLowP - Doc

TFlite reference code

Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for Mxnet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)

List of required operators - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

It will be good if we can agree on Relay ops - its inputs/outputs and the attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (other quantized_* operators will be on the same lines as that of quantized_conv2d)

Op quantize
def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the 
    FP32 input data to int8/uint8 tensor.

    Parameters
    -----------
    data: FP32 tensor
           The input tensor in FP32.

    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    Returns
    -------
    quantized_data: int8/uint8 tensor
           The quantized tensor.

    """
Key points to discuss

The scale and zero_point calculations happen outside the relay graph, i.e., the framework parsers will have to compute the scale and offset if only min and max are provided. Reference implementation in TFLite. This can also be thought as a framework parser utils where we can handle min/max, symmetric/asymmetric etc and generate the scale and zero_point as frameworks handles them.
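As a sketch of what such a parser utility might look like, the following follows the idea of gemmlowp's ChooseQuantizationParams quoted earlier in this thread: extend the range to contain 0, derive the scale, then pick a zero point inside the code range. The function name and defaults are assumptions, and Python's `round` (halves to even) may differ from a given framework's rounding.

```python
def choose_quantization_params(real_min, real_max, qmin=0, qmax=255):
    """Derive (scale, zero_point) from a float range for an asymmetric quantized type."""
    # Extend the interval so that 0.0 is exactly representable.
    real_min = min(real_min, 0.0)
    real_max = max(real_max, 0.0)
    if real_max == real_min:
        raise ValueError("degenerate range: min and max are both 0")

    scale = (real_max - real_min) / (qmax - qmin)
    # The code that real value 0.0 maps to, before clamping and rounding.
    initial_zero_point = qmin - real_min / scale
    zero_point = int(round(min(max(initial_zero_point, qmin), qmax)))
    return scale, zero_point

print(choose_quantization_params(0.0, 6.0))    # post-ReLU6 style range
print(choose_quantization_params(-2.0, 6.0))   # range that straddles zero
print(choose_quantization_params(1.0, 5.0))    # range extended down to include 0
```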

Op quantized_conv2d
def quantized_conv2d(quantized_data, quantized_kernel, 
        input_scale, input_zero_point,
        kernel_scale, kernel_zero_point,
        output_scale, output_zero_point,
        out_dtype,

        # All the old remaining ones from conv2d
        strides=(1, 1),
        padding=(0, 0),
        dilation=(1, 1),
        groups=1,
        channels=None,
        kernel_size=None,
        data_layout="NCHW",
        kernel_layout="OIHW",
        out_layout=""):
    """

    Quantized 2D convolution. It takes the quantized int8/uint8 data and kernel
    tensors together with their scale and zero_point attributes and produces an
    int8/uint8 output tensor. The scale and zero_point calculations
    happen outside the relay graph, i.e., the framework parsers will have to compute
    the scale and offset if only min and max are provided. 

    Parameters
    -----------
    quantized_data: int8/uint8 tensor
           The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
           The quantized kernel tensor in int8/uint8.

    input_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
           The output scale is set during the quantization process using training/calibration.
           The float scalar to scale the quantized_output int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
           The output zero point is set during the quantization process using training/calibration.
           The zero point of the quantized_output distribution.

    out_dtype: String
           The dtype of the quantized_output. Can only be int8/uint8.
           The requantization from int32 to int8/uint8 is a part of the op compute.

    ..... Other attributes are same as before.


    Returns
    -------
    quantized_output: int8/uint8 tensor
           The quantized tensor.

    """
Key points to discuss further

This op has a set of computations that could ideally be pre-computed, but this is difficult because fold-constant only works across Relay ops and not within a Relay op. This has been discussed in more detail on the discuss forum.

First pre-computable - The core computation has some compute with the kernel (Term 2 and Term 4 in the above link) that will be part of the TVM compute. This is very hard to avoid. We need a fused compute to get the best performance.

Second pre-computable - The output scale and zero_point are used to calculate int multiplier and shifts to keep all the computations in Int domain. This computation changes for each op (e.g. concat will handle this in a different manner compared to conv). So, this computation is also kept inside quantized_conv2d op. This can be avoided by changing the API and replacing output_scale with output_multiplier and output_shift. But, this seems very specific to TFLite and one might want to handle the output_scale and output_offset in a different manner. I am not sure about this part, so please comment.

The op already has the requantization portion accounted for. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of output quantized tensor and not for requantization).

Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.
def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the 
    int8/uint8 tensor to FP32 tensor.

    Parameters
    -----------
    quantized_data: int8/uint8 quantized input tensor
           The input tensor in int8/uint8.

    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
           The dequantized tensor.

    """
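
As a reference for the arithmetic that the quantize and dequantize ops describe, here is a small pure-Python sketch operating on flat lists. The function names, the `DTYPE_RANGE` table, and the round-half-away-from-zero choice are assumptions for illustration and are not the proposed Relay implementation.

```python
import math

# Saturation range per output dtype (illustrative table, not a TVM API).
DTYPE_RANGE = {"int8": (-128, 127), "uint8": (0, 255)}


def _round_half_away(x):
    # Round to nearest, ties away from zero (C++ std::round behavior).
    r = math.floor(abs(x) + 0.5)
    return int(r if x >= 0 else -r)


def quantize(data, scale, zero_point, out_dtype="uint8"):
    """q = clamp(round(x / scale) + zero_point, qmin, qmax)."""
    qmin, qmax = DTYPE_RANGE[out_dtype]
    out = []
    for x in data:
        q = _round_half_away(x / scale) + zero_point
        out.append(max(qmin, min(qmax, q)))
    return out


def dequantize(quantized_data, scale, zero_point):
    """x = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in quantized_data]


q = quantize([-1.0, 0.0, 1.0, 200.0], scale=0.5, zero_point=128)
print(q)
print(dequantize(q, scale=0.5, zero_point=128))
```

Note that the last input saturates at the top of the uint8 range, so it does not survive the round trip; the saturation bounds depend on the quantized dtype.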

We need to add in_dtype in the dequantize op as the calculations will be different, especially the range to use.

shoubhik on 19 Jun 2019

We need to add in_dtype in the dequantize op as the calculations will be different, especially the range to use.

Guess the input tensor has such information already?

jackwish on 19 Jun 2019

We need to add in_dtype in the dequantize op as the calculations will be different, especially the range to use.

Guess the input tensor has such information already?

@jackwish, the input data is generally an Expr, which can be a Var, an IntImm, or some other type of Expr. How will I get in_dtype from an Expr?

shoubhik on 25 Jun 2019

Sorry for the delayed reply in this discussion due to recent conference trips. Here are a few thoughts.

Let us put a concise namespace for the quantization dialect. Two possible candidates:

relay.op.qnn, e.g. relay.op.qnn.conv2d
- The qnn name is consistent with QNNPack
relay.op.tflite
- The op name is a dialect.

In both cases, they are a dialect of relay, which means by default we do not want to introduce special implementation, but instead will translate them into existing core ops. We need to have a special op_level for these core ops.

I still think we should minimize the number of operators, and directly translate to lower ops if possible. This includes things like quantize/dequantize, and qnn.concat. Please discuss this alternative and list pros and cons.

tqchen on 29 Jun 2019

Thanks @tqchen

Of the two choices, I am inclined towards relay.op.qnn. My hope is that different frameworks converge to the same qnn ops. relay.op.tflite seems very specific as of now. I agree that these new ops should have a special op_level.

I am still unclear about where to draw the boundary between directly translating to lower ops and creating a new qnn op. For example, if we are going for devices that do not have any FP32 compute units, we might have to create a long sequence of existing Relay ops to approximate the FP32 computation with fixed point/integer computation. So, encapsulating them would be a good idea.

Basically, we need some kind of abstraction that can be shared across frameworks for these framework operations. For now, I was treating this abstraction as a new qnn relay op. The rationale behind this choice is that once we convert from the framework to a Relay graph, we can eyeball the graph and make some sense by reading the graph. Directly translating will lose the readability of the Relay quantized graph.

However given the tradeoffs, we can very well create a new class that can be shared across frameworks. What are your thoughts on this?

anijain2305 on 1 Jul 2019

re "we might have to create a long sequence of existing Relay ops to approximate the FP32 computation".

This is certainly a problem for traditional frameworks, but won't be a problem for tvm/relay. Because we have automatic fusion and code generation, the long sequence of ops will be fused again into a single fused op. We can generate code that is as efficient, and sometimes even more efficient (because we can fuse different ops together). So I will always recommend breaking things down to primitive ops if possible.

tqchen on 1 Jul 2019

I completely agree with breaking down into primitive ops. Even the relay.op.qnn should be broken down into primitive ops. If the primitive op does not exist, we will discuss and maybe create one. I understand the Relay fusion part. I am trying to make another point.

I am trying to understand when to directly translate to primitive ops OR create a new qnn op that will be later lowered to primitive ops using a relay pass. If the lowering sequence is very long, it might be better to create a new qnn op.

PS - The first Relay pass that we can run is qlower or qrewrite (can be a part of framework parser as well, if it looks ugly in build_module) and the resulting sequence will only be a sequence of existing relay primitive ops.

anijain2305 on 1 Jul 2019

For

relay.op.qnn, e.g. relay.op.qnn.conv2d The qnn name is consistent with QNNPack

and

My hope is that different frameworks converge to same qnn ops.

AFAIK, QNNPACK takes the quantization approach of TensorFlow/TFLite. I think that when we talk about an op in this scenario, we mean the quantization arithmetic formula itself rather than how to translate it into code, and the formula is the same for QNNPACK and TensorFlow/TFLite. So I guess one dialect should be enough for them. And I guess the convergence is more reasonable if qnn simply stands for _generic_ quantized nn, not QNNPACK.

jackwish on 1 Jul 2019

@jackwish Yes, qnn stands for a generic quantized nn, and not QNNPACK. I think @tqchen also means the same thing.

anijain2305 on 1 Jul 2019

OK, seems we are converging to qnn. Perhaps we could propose the list of op names

tqchen on 2 Jul 2019

Finally, we are starting to converge :)

I am proposing them on the basis of Resnet network for now.

relay.op.qnn.conv2d
relay.op.qnn.dense
relay.op.qnn.relu
relay.op.qnn.max_pool2d
relay.op.qnn.avg_pool2d
relay.op.qnn.concat (used in Inception)

relay.op.qnn.quantize
relay.op.qnn.dequantize

anijain2305 on 2 Jul 2019

All of above qnn ops will be lowered to existing Relay primitive ops using some Relay pass (for example, using ForwardRewrite infra). For example - relay.op.qnn.conv2d can be lowered to
~~~
fn (%quantized_data: Tensor[(2, 1, 2, 4), uint8], %weight: Tensor[(3, 1, 2, 2), uint8]) -> Tensor[(2, 3, 1, 3), uint8] {
%0 = nn.conv2d(%quantized_data, %weight, kernel_size=[2, 2], out_dtype="int32")
%1 = cast(%0, dtype="float32")
%2 = multiply(%1, 0.25098f)
%3 = round(%2)
%4 = cast(%3, dtype="int32")
%5 = clip(%4, a_min=0, a_max=255)
cast(%5, dtype="uint8")
}

~~~
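
For readers who want to check the tail of this lowering numerically, the cast / multiply / round / clip / cast chain can be sketched in plain Python on a flat list of int32 accumulator values. The multiplier 0.25098 is taken from the snippet above; the function name `requantize_float` and the round-half-away-from-zero behavior are assumptions, so treat this as a reference sketch rather than the op's actual definition.

```python
import math


def requantize_float(acc_int32, multiplier, qmin=0, qmax=255):
    """Float-path requantize: cast -> multiply -> round -> clip."""
    out = []
    for acc in acc_int32:
        scaled = float(acc) * multiplier            # cast to float32 and multiply
        rounded = math.floor(abs(scaled) + 0.5)     # round, ties away from zero (assumed)
        rounded = int(rounded if scaled >= 0 else -rounded)
        out.append(max(qmin, min(qmax, rounded)))   # clip so the value fits in uint8
    return out


print(requantize_float([0, 100, -40, 2000], 0.25098))
```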

I have yet to understand what needs to be done with softmax. Will have to look at a quantized model to understand.

anijain2305 on 2 Jul 2019

I have yet to understand what needs to be done with softmax.

Maybe computing softmax in float, as it seems that we are not expecting everything in integer (just like your conv2d lowering proposal)?

jackwish on 2 Jul 2019

[Images 0001, 0002, 0003: design diagrams attached to the original comment]

anijain2305 on 5 Jul 2019

@anijain2305 Generally good. Regarding hardware performance, say on ARM CPU: for depthwise convolution, we can optimize even without tensorize. After optimizing int8 with a pure TVM schedule (no tensorize), we could also beat QNNPACK (on some workloads we tested, by more than 50% on the ARM64 platform).

However, for normal convolution, it is hard to achieve the best performance without tensorize. When we use tensorize, we combine bias_add / requantize into qnn.conv2d to avoid memory access. From @jackwish's previous investigation, we found this is very important for performance on ARM CPU. So, if we implement it as in the diagram, this is my only concern.

FrozenGene on 5 Jul 2019

@FrozenGene Thanks for the quick feedback on the design.

I understand the performance concern. Let's try to tackle them in fusion. Fusion already performs compute_inline to bring the computation at right location. Hopefully, with some tagging and with some arm-twisting, we can achieve same tensorize schedule that you are suggesting.

anijain2305 on 5 Jul 2019

I just want to point out, again, that the output_activation_min and output_activation_max are required even if there is no specified activation operation, since they provide saturation to the quantization range ... avoiding overflow error.

Also, if you fuse activation operations during training, prior to the re-quantization, then you gain the extra bit of resolution for quantization. I believe tflite has done this in all their quantized inference models in their repository.

jnorwood on 5 Jul 2019

@jnorwood Yes, I understand your point. We can use the clip to saturate the values even if Relu was not fused. It fits in the design and the proposed abstractions.

anijain2305 on 6 Jul 2019

@tqchen What are your thoughts?

Seems like we are agreeing on the proposed design abstraction. There is a concern of not being able to achieve the best schedule performance. We can try to tackle it with fusion and schedule_tagging.

anijain2305 on 6 Jul 2019

Can we elaborate a bit if avg_pool2d, relu is necessary or if they are more of a direct mapping to the standard ops? Do we allow mix of standard ops and qnn ones？

tqchen on 6 Jul 2019

@tqchen, if we use avg_pool2d, we also need to modify it, though the change is small. For example, we should accumulate the UInt8 sum in Int16 to avoid overflow. In our internal implementation, we use q_avg_pool2d to distinguish it from avg_pool2d. Relu shouldn't need to be modified. However, if we have activation functions, we should compute output_min / output_max with calculate_activation_range_uint8, as mentioned before, and then insert a clip operator.

FrozenGene on 6 Jul 2019

[Image: q_conv2d]

anijain2305 on 6 Jul 2019

@tqchen Added the case for qrelu. (I think the asymmetric lowering can be improved further, but that's not the point.)

Similarly for quantized avg pool2d, as @FrozenGene mentioned, we will still need to upcast the tensor to int32 to avoid saturation. Additionally, we would need to handle the zero points.

anijain2305 on 7 Jul 2019

Do we allow mix of standard ops and qnn ones？

The framework-parsed graph might have a mix (as shown in the lowering of qconv2d). But in the relay.build function, my first pass would be a quantize_rewrite pass that converts all the qnn ops to existing Relay ops, resulting in a graph consisting of only primitive ops.

anijain2305 on 7 Jul 2019

I agree that mixed-precision might make avg_pool2d's case a bit tricky. However, assuming that the zero-point won't change, we might just do avg_pool2d(x.astype("i32")).astype("i8").

max_pool2d though should be the same given that the maximum rule is the same regardless of zero point.
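
To illustrate why the upcast matters, here is a small sketch that averages one pooling window of uint8 values twice: once with a wide accumulator and once with a simulated 8-bit accumulator that wraps around. Both function names are hypothetical, and the round-to-nearest division is an assumption about the desired rounding, not something fixed by this thread.

```python
def avg_pool_window_wide(window):
    """Average uint8 values using a wide accumulator (stands in for int32)."""
    acc = 0
    for q in window:
        acc += q
    # Integer division with round-to-nearest for non-negative sums.
    return (acc + len(window) // 2) // len(window)


def avg_pool_window_wrapped(window):
    """Same average, but with an accumulator that wraps at 8 bits."""
    acc = 0
    for q in window:
        acc = (acc + q) & 0xFF  # simulate uint8 overflow
    return acc // len(window)


window = [200, 220, 240, 250]
print(avg_pool_window_wide(window))     # sum is 910, which needs more than 8 bits
print(avg_pool_window_wrapped(window))  # wrapped sum gives a wrong average
```

Since input and output share the same scale and zero point, the average of the quantized values is already the quantized average, which is why no zero-point correction appears in the sketch.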

Most of the current operators' lowering rules cast the domain back to float and then back into i32, as in the case of qnn.relu. This could be quite inefficient. In most cases of the current symmetric quantization, we try to keep everything in i32 as much as possible.

In particular, refer to the current quantization pass: every value could sit in a domain, which could be fixed point with an implied scale, or floating point. Conversion between domains might be necessary and should be conducted in a minimum way. The default way always converts the integer domain back to f32 and uses f32 to exchange values between layers, which may not be the most efficient way.

tqchen on 7 Jul 2019

In particular, refer to the current quantization pass: every value could sit in a domain, which could be fixed point with an implied scale, or floating point. Conversion between domains might be necessary and should be conducted in a minimum way. The default way always converts the integer domain back to f32 and uses f32 to exchange values between layers, which may not be the most efficient way.

So, I think we are trying to make two things work together here, which are very difficult to merge.
The first is to perform the quantization in the framework and then convert it to a Relay graph. This is what this issue is trying to focus on. The other is to perform the quantization in TVM itself. Your comment that the conversion between the two domains should be minimal applies to the entity that quantizes the network. For example, relu, bias_add, etc. are all fused in TFLite Conv2d for the same reason.

If we are converting the framework-quantized model to a Relay graph, then I think we should perform the same computation as defined by the framework-quantized graph. If the original graph has domain conversions, then we will have to respect that as well. We can perform some graph optimizations, like removing a dequantize followed by a quantize if they have the same quantization parameters. I think even with all these inefficiencies, our fusion algorithms and fast kernels should be able to provide better performance than the framework execution of the quantized graph.

Please let me know your thoughts on this.

anijain2305 on 7 Jul 2019

You can also view the domain conversion minimization as an optimization pass here. The resulting graph is, to some extent, semantically equivalent to the original one that does the conversion to f32 back and forth. The idea is that we can be smarter when lowering qnn ops into the Relay sequence.

For example, when lowering the qconv2d -> qrelu sequence, we don't have to convert the result of qconv2d to f32 and then back to i8, they can be represented directly in the i8 domain without having to get back to f32. The mechanism in the current realize might help in this case.

There are also two separation steps in current tvm's quantizer. We always first make the choice(this step was done by other frameworks), and then decide how to best translate to low-level operator(realize stage in quantization). The realize stage in current quantization part would serve as a good reference.

tqchen on 7 Jul 2019

To elaborate further about the choice of the domain and how it is relatively independent of which operator you would like to perform. Many operators can actually perform the computation using different number representations (domains)

It means how you should represent the number of a certain layer in either of the two ways. We can represent 2.5 by

f32: stored as val_f32 , where val_f32=2.5
i8: stored as val_i8 * scale + zero_pt (val_i8=25, scale=0.1, zero_pt = 0)

Each operator in the qnn could take value from either f32 or i8. In the default setting of the current proposal if the value is from f32, it first converts its representation from f32-> i8, then perform the computation internally in i8, then convert back to f32.

So in the default lowering rules you proposed, every quantized operator has three stages(say qnn.relu)
convert_to_i8_dom -> relu_in_i8 -> convert_to_fp32_dom. However, when we have two consecutive ops that can perform operations in a different domain, in this case, fixed pt domain, we do not have to convert the domain into f32, then back to i8, instead we can directly do the domain conversion and possibly gain more efficiencies.

tqchen on 7 Jul 2019

Thanks @tqchen for the detailed explanation.

Actually, my proposal is simpler. My qnn.relu does not convert to the three stages that you mentioned. It only performs the relu_in_i8.

The frameworks (at least TFLite and MxNet) do not go back to FP32 unless the operator is not supported in i8 format or accuracy is very bad in i8.

For example, TFLite qconv2d will translate to qnn.conv2d + qnn.requantize or, as you explained, conv_in_i8/i32 -> convert_to_int8 domain, but there won't be any FP32.

To complete the picture, suppose the quantized framework graph is (fw stands for framework)

fw.quantize -> fw.qconv2d -> fw.qrelu -> fw.dequantize

The Relay graph would be

qnn.quantize -> qnn.conv2d -> qnn.requantize -> qnn.relu -> qnn.dequantize
convert_to_i8 -> conv_in_i8/i32 -> convert_to_i8 -> relu_in_i8 -> convert_to_FP32

Essentially, if the framework does not convert back to FP32 in between, we would not go to FP32.

anijain2305 on 7 Jul 2019

To complete the picture, suppose the quantized framework graph is (fw stands for framework)

fw.quantize -> fw.qconv2d -> fw.qrelu -> fw.dequantize

If you do the qconv2d and qrelu operations sequentially, using their analogous fp operations, the output from qrelu will have the (potentially worse) resolution of the initial qconv2d. So, you need to be careful if you are trying to use the fully sequential, separate operation results as a reference.

I can see that you might want the graph to represent all the operations prior to optimizing the implementation. I just want to point out that the qrelu implementation can avoid the lowered resolution and can be completely cost free by revising the downscale multiplier and zero point of a preceding quantized output operation (qconv2d in this case). It is cost free because the clipping values are required in any case to do the quantized range saturation.

The operation of revising the downscale multiplier of a previous graph operation is also useful to achieve zero cost replacement of the scale normalization operations in the quantized concat operations in the inception models.

jnorwood on 7 Jul 2019

I can see that you might want the graph to represent all the operations prior to optimizing the implementation. I just want to point out that the qrelu implementation can avoid the lowered resolution and can be completely cost free by revising the downscale multiplier and zero point of a preceding quantized output operation (qconv2d in this case). It is cost free because the clipping values are required in any case to do the quantized range saturation.

Yes, you are correct. And that's what exactly TFLite does. In the case of fused TFLite conv2d, the conversion will be different

TFLite.conv2d (fused relu)

will be converted to following Relay graph

qnn.conv2d -> nn.bias_add -> qnn.requantize -> clip

In this case, the cost-free conversion is manifested in the clip operation.

We will have to add framework parsers for each framework, and most probably the resulting sequence of operators will be different for each framework.

My example in my last comment was to explain the fp32 and i8 boundaries and domain conversions of my proposal that @tqchen was pointing out.

anijain2305 on 7 Jul 2019

Several comments :)

Regarding @anijain2305 's ReLU proposal.

The symmetric and asymmetric paths may merge into one, the asymmetric, where the zero point for the symmetric approach is 0. Actually, this is a bit more complicated regarding the input tensor type and the expected output tensor type when handling the ReLU family:

| input type | output type| how to handle|
|-----------|------------|-----------------|
| int8/uint8| int8/uint8| Clipping out the unwanted value range, taking zero point into consideration|
| int32 | int32 | Assuming the int32 is symmetric, such that clipping out the unwanted value range should be fine for ReLU. But, what about ReLU6? |
|int32| int8/uint8| the scale and zero point of the input and output may take into consideration. This will break into ReLU with input/output type int32 and a Requantization in the proposal. Considering ReLU6, the integer representation of the FP32 6.0 should be calculated, otherwise, we can hardly know the expected output integer value range. |

The list is not necessarily complete. As I stated before, we need to keep in mind how the floating point value is represented in integer, and how we can arrange the arithmetic to preserve the floating point computation being represented.
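
As an illustration of the first and third rows, the clipping bounds for the ReLU family can be derived from the output scale and zero point by quantizing the real values 0.0 and 6.0. The function below is a hedged sketch in the spirit of the calculate_activation_range_uint8 helper mentioned earlier in this thread; its name, signature, and use of Python's `round` are assumptions.

```python
def calculate_activation_range(activation, scale, zero_point, qmin=0, qmax=255):
    """Return the (clip_min, clip_max) pair in the quantized domain."""

    def quantize(real_value):
        # Integer representation of a real value under (scale, zero_point).
        return zero_point + int(round(real_value / scale))

    if activation == "none":
        # Saturation to the dtype range is still required without an activation.
        return qmin, qmax
    if activation == "relu":
        return max(qmin, quantize(0.0)), qmax
    if activation == "relu6":
        return max(qmin, quantize(0.0)), min(qmax, quantize(6.0))
    raise ValueError("unsupported activation: %s" % activation)


print(calculate_activation_range("relu6", scale=0.0625, zero_point=10))
```

The resulting pair can be fed directly to a clip operator, so the activation costs nothing beyond the saturation that is needed anyway.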

Similarly for quantized avg pool2d, as @FrozenGene mentioned, we will still need to upcast the tensor to int32 to avoid saturation. Additionally, we would need to handle the zero points.

The zero point is not needed when handling pooling. The UINT8 representation of FP32 doesn't need to be updated under the semantics of pooling.

It seems that we have put in many Quantize/Dequantize ops to make the quantized ops reuse existing nn ops, either explicitly or implicitly. This could be bad for performance. I guess some passes may need to be introduced to handle this.

jackwish on 7 Jul 2019

OK, given that most of the qnn ops are already in integer domain, we might be just fine. Minimization of requantize is still useful. And in the case when the scale is a power of two, use shift and normalize might be better than float scale and round

tqchen on 7 Jul 2019

Maybe scales are rarely a power of two (I assume you mean values such as 0100b, 0.0010b). They basically have long fractional parts.

jackwish on 7 Jul 2019

And in the case when the scale is a power of two, use shift and normalize might be better than float scale and round

Yes, the shift and normalize can be done completely in the integer/fixed-point domain instead of going to floating point (even if the scales are not a power of 2). I have been prototyping that. Will update.

anijain2305 on 7 Jul 2019

Thanks everybody for the fruitful discussion. I think we are gradually reaching convergence :)

I have been prototyping the qnn.conv2d and qnn.requantize at https://github.com/dmlc/tvm/pull/3367

I still have a few loose ends to fix. I will update once I am done, and then we can discuss whether the implementation makes sense.

anijain2305 on 7 Jul 2019

One thing to be careful about is that when using shift and normalize, right shift corresponds to round down as opposed to round to nearest, an additional 0.5 equivalence needs to be added to get the round behavior

tqchen on 7 Jul 2019

an additional 0.5 equivalence needs to be added to get the round behavior

if followed by relu, you can skip extra round processing for negative values.

otherwise for negative values you need to subtract 0.5 equivalent.

if using convergent nearest/even rounding, also need to handle the boundary cases for even/odd decision.

jnorwood on 7 Jul 2019

One thing to be careful about is that when using shift and normalize, right shift corresponds to round down as opposed to round to nearest, an additional 0.5 equivalence needs to be added to get the round behavior

Yes, I think it is a little more complicated. The std::round of -2.5 is -3. Therefore, for negative numbers the rounder is not 0.5; it's actually the decimal equivalent of 0.0111111b (the number is represented in 2's complement). This brings in a lot more new instructions. I will add some examples.

if followed by relu, you can skip extra round processing for negative values.

Aaah, very nice observation.

if using convergent nearest/even rounding, also need to handle the boundary cases for even/odd decision.

Let's skip this for the first implementation. Once we have normal rounding working, we can add a rounding parameter to the op which can lead to a different sequence of instructions.

anijain2305 on 7 Jul 2019

slight difference in a single point(0.5) is fine and likely won’t have an impact on final acc

tqchen on 7 Jul 2019

slight difference in a single point(0.5) is fine and likely won’t have an impact on final acc

Yeah, I was planning to add a rounding param to the op. For "ceil", we could just add a 0.5 rounding without worrying about negative values. For "round", we can be more precise. By default, we can choose "ceil". What do you think?

Update - Maybe not, "ceil" is confusing. Let me think and come up with better terms (like round-away-from-zero etc.).

anijain2305 on 7 Jul 2019

In other cases that do not directly correspond to 0.5, the behavior is still consistent with round if you add 0.5; this includes negative values, because right shift corresponds to floor division.
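
The rounding variants discussed in the last few comments can be compared with a short integer-only sketch. Python's `>>` on negative integers is an arithmetic (floor) shift, which matches the behavior being described. The three function names are made up for this illustration.

```python
def shift_floor(x, shift):
    """Plain right shift: rounds toward negative infinity."""
    return x >> shift


def shift_round_half_up(x, shift):
    """Add the '0.5 equivalent' before shifting: ties go toward +infinity."""
    return (x + (1 << (shift - 1))) >> shift


def shift_round_half_away(x, shift):
    """Ties go away from zero, like C++ std::round."""
    rounder = 1 << (shift - 1)
    if x < 0:
        # For negatives, use one unit less than the 0.5 equivalent.
        rounder -= 1
    return (x + rounder) >> shift


# With shift=1 the fixed-point values 5 and -5 represent 2.5 and -2.5.
for value in (5, -5):
    print(value, shift_floor(value, 1),
          shift_round_half_up(value, 1),
          shift_round_half_away(value, 1))
```

The half-up and half-away variants only disagree on exact negative ties such as -2.5; for every other input they return the same result, which is the "single point" difference mentioned above.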

tqchen on 7 Jul 2019

Let me try to follow up on your discussion by comparing with our internal implementation. For round (in requantize): when we get the input_scale / kernel_scale / output_scale, we want to derive the shift / multiplier (see: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/quantization_util.cc#L52 ) and pass them to the real requantize (i.e. corresponding to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/common.h#L148).
If we decide to do it in the Python frontend like our internal implementation (we use Python to compute it and pass the shift / multiplier to the real requantize mentioned above), we should note that Python 3's round rounds to the nearest "even" number, not away from zero like C++ std::round. We could implement it very easily:

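import math

def _round_away_from_zero(value):
    # Round half away from zero like C++ std::round
    # (Python 3's built-in round() rounds half to even).
    abs_value = abs(value)
    result = math.floor(abs_value) + math.floor(2 * (abs_value % 1))
    return result if value >= 0 else -result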

Maybe this is the rounding you are discussing, right?

FrozenGene on 8 Jul 2019

slight difference in a single point(0.5) is fine and likely won’t have an impact on final acc

Yeah, I was planning to add a rounding param to the op. For "ceil", we could just add a 0.5 rounding without worrying about negative values. For "round", we can be more precise. By default, we can choose "ceil". What do you think?

Update - Maybe not, "ceil" is confusing. Let me think and come up with better terms (like round-away-from-zero etc.).

If your round is the concept from my previous comment, maybe round is better, and it is the same as TFLite. IMO, if we cannot get the same result as TFLite, we cannot tell where things go wrong when the model is large, and we could have problems when deploying in an industry environment, because the algorithm team often verifies accuracy in TFLite, not in TVM.

FrozenGene on 8 Jul 2019

tflite computes the output_multiplier and output_shift integer parameters from a double input in the call to QuantizeMultiplier. These are the integer downscale multiplier and right-shift divider parameters.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/micro/kernels/conv.cc

    double real_multiplier = 0.0;
    TF_LITE_ENSURE_STATUS(GetQuantizedConvolutionMultipler(
        context, input, filter, bias, output, &real_multiplier));
    int exponent;
    QuantizeMultiplier(real_multiplier, &data->output_multiplier, &exponent);
    data->output_shift = -exponent;

I'd suggest you print out their output_multiplier and output_shift values for comparison, since errors can start there.

Their downscale operations are implemented in int64.
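
For anyone wanting to print and compare these values from the Python side, here is a rough sketch of the same idea: split the real multiplier into a 31-bit fixed-point mantissa plus an exponent, then apply it to an int32 accumulator using only integer arithmetic. The function names are mine, and the downscale below rounds once (half away from zero) on the full product, whereas TFLite rounds in two steps, so results may occasionally differ by one from TFLite.

```python
import math


def quantize_multiplier(real_multiplier):
    """Split a positive float into (Q31 mantissa, exponent)."""
    if real_multiplier == 0.0:
        return 0, 0
    mantissa, exponent = math.frexp(real_multiplier)  # mantissa in [0.5, 1)
    quantized = int(round(mantissa * (1 << 31)))
    if quantized == (1 << 31):
        # Rounding pushed the mantissa up to 1.0; renormalize.
        quantized //= 2
        exponent += 1
    return quantized, exponent


def multiply_by_quantized_multiplier(acc, quantized, exponent):
    """Approximate acc * real_multiplier with integer-only arithmetic."""
    total_right_shift = 31 - exponent
    assert total_right_shift >= 1, "multiplier too large for this sketch"
    product = acc * quantized  # fits in int64 when acc and quantized are int32
    rounder = 1 << (total_right_shift - 1)
    if product < 0:
        rounder -= 1  # ties away from zero for negative products
    return (product + rounder) >> total_right_shift


q, e = quantize_multiplier(0.25098)
print(q, e)
print([multiply_by_quantized_multiplier(a, q, e) for a in (100, -40, 2000)])
```

Comparing `q` and `-e` against TFLite's output_multiplier and output_shift is a quick first check before comparing tensors.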

jnorwood on 8 Jul 2019

The discussion in this thread has gotten quite long, and we are converging. I recommend we close this thread and open a new RFC thread, "QNN Dialect", with the latest API proposals that @anijain2305 and @shoubhik are putting together (please also include the related APIs in TF/QNN for reference to back the decision).

This way we keep the community informed and we can move forward with these implementations. I hope we can get +1 representation from different groups who are interested in this direction, in particular @jnorwood @ajtulloch @FrozenGene @yzhliu

Thanks everyone for the hard work.

tqchen on 19 Jul 2019

@anijain2305 can you lead the proposal discussion?

tqchen on 19 Jul 2019

I agree, we should move the proposal to a new thread.
Yes, I can lead the proposal discussion.

anijain2305 on 19 Jul 2019

@anijain2305 can you open the RFC thread? Sorry for being a bit formal in this case, we want to set an example for the first dialect public discussions.

tqchen on 20 Jul 2019

@tqchen Thanks for reminding. Just created one :)

anijain2305 on 20 Jul 2019

Let us move to https://github.com/dmlc/tvm/issues/3591

tqchen on 20 Jul 2019

Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for Mxnet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)

A quick question here since I can't see this mentioned on #3591

Is this network going to be quantized per tensor as well as the new per-channel quantization that is appearing in tflite 2.0 ? IIUC, tf1.13 has per tensor quantization rather than the per channel quantization. i.e. more interestingly can the relay design support both ?

https://www.tensorflow.org/lite/performance/quantization_spec?source=post_page---------------------------#per-axis_vs_per-tensor

regards
Ramana

u99127 on 29 Jul 2019

Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for Mxnet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)

A quick question here since I can't see this mentioned on #3591

Is this network going to be quantized per tensor as well as the new per-channel quantization that is appearing in tflite 2.0 ? IIUC, tf1.13 has per tensor quantization rather than the per channel quantization. i.e. more interestingly can the relay design support both ?

https://www.tensorflow.org/lite/performance/quantization_spec?source=post_page---------------------------#per-axis_vs_per-tensor

regards
Ramana

Good question. We have only supported TF1.13 quantization. TF2.0 has separate per-channel scales, which were not considered in the previous discussion. Seems there is a gap here. cc @anijain2305

FrozenGene on 30 Jul 2019

Tvm: [RFC][Quantization] Support quantized models from TensorflowLite
