Tvm: [QNN] [RFC] QNN Dialect -- Prequantize Models

Created on 20 Jul 2019 · 25Comments · Source: apache/tvm

We are proposing a new dialect named QNN, that introduces a quantized version of existing relay operators. The goal is to support the models that have been pre-quantized in the framework.

Some important notes about QNN dialect are

QNN operators are lowered to existing Relay operators to ensure that we can reuse Relay infrastructure.
Code resides in new directory. Python files are in python/relay/qnn and CPP files are in src/relay/qnn.
QNN, like any other dialect, introduces new Relay passes. These passes only deal with QNN ops (like lowering of QNN ops to existing Relay ops). For any generic optimization, we rely on existing Relay passes.

We can use this thread to discuss various open questions. Some of these questions can be
1) Code organization, namespaces, API discussion.
2) QNN operator lowering - Infrastructure, correct sequence of Relay operations etc.
3) Ways to efficiently add new operators with minimal engineering efforts.
4) Requirements (if any) of new generic Relay passes to achieve good performance.
5) Any new bugs that arise as we start testing integer computations more thoroughly.

The idea of QNN dialect was a result of discussion at Issue https://github.com/dmlc/tvm/issues/2351. Thanks @tqchen @FrozenGene @jackwish @jnorwood @shoubhik for the discussions.

First few PRs for the QNN dialect

Requantize operator - https://github.com/dmlc/tvm/pull/3531
Quantize and Dequantize operator - https://github.com/dmlc/tvm/pull/3512

RFC

Source

anijain2305

All 25 comments

also cc @ajtulloch @ZihengJiang @vinx13 @eqy @jroesch , @anijain2305 can you please list the API proposals and the reference APIs(in tflite etcs?) Then we can try to get everyone's thoughts on these specific API designs

tqchen on 20 Jul 2019

Let's start with just Requantize to keep it focussed. I would suggest to add +1 if one likes the API, to show that one agrees.

QNN proposal

~
def requantize(data,
input_scale,
input_zero_point,
output_scale,
output_zero_point,
rounding="AWAY_FROM_ZERO",
out_dtype="int8"):
r"""Requantized operator.
The requantize operator converts one quantized tensor representation to
another quantized tensor representation. For the output tensor, we are
provided with output scale and zero point. The computation is as follows
Q_output = zp_output + (scale_input)/(scale_ouptut) * (Q_input - zp_input)
Parameters
----------
data : tvm.relay.Expr
The input data to the operator.
input_scale: float
The quantization scale for the input tensor.
input_zero_point: int
The zero point of the input tensor.
output_scale: float
The quantization scale for the output tensor.
output_zero_point: int
The zero point of the output tensor.
rounding : string, optional
Defines the rounding direction when the value is midway between two
representable values.
out_dtype : str, optional
Specifies the output data type for mixed precision conv2d.
Returns
-------
result : tvm.relay.Expr
The computed result.
"""
~

TF Requantize

~~~
Arguments:

scope: A Scope object
input_min: The float value that the minimum quantized input value represents.
input_max: The float value that the maximum quantized input value represents.
requested_output_min: The float value that the minimum quantized output value represents.
requested_output_max: The float value that the maximum quantized output value represents.
out_type: The type of the output. Should be a lower bit depth than Tinput.
~~~

The min/max in TF are represented by scale and zero_point in QNN
rounding in QNN proposal is to give a choice between standard rounding (away_from_zero) and round_towards_zero, presenting a performance-accuracy knob.

anijain2305 on 20 Jul 2019

@FrozenGene @tqchen @u99127 Can you please approve the above API, so that we can discuss/move to next discussion. I have so many things to discuss :)

anijain2305 on 23 Jul 2019

@anijain2305 Let me look at it afternoon of today or evening.

FrozenGene on 23 Jul 2019

👍1

@anijain2305 to increase the throughput, can you also list other APIs, one per post and we use lazy consensus, wait for a week to see people's feedback then summarize.

tqchen on 23 Jul 2019

👍1

There are a couple of things in this gemmlowp quantization example, which you should perhaps consider how it could be supported in your requantize api.

ability to specify that range should be extended to include a zero value.
ability to specify that the range should be nudged so that exact zero is an integer value.
https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc

jnorwood on 23 Jul 2019

@jnorwood Thanks for the comment. Both good points. I will keep those abilities, though, outside of the scope of requantize op.
Another function (not necessarily a Relay operator) can take min/max, a config (like nudge that zero is exactly representable) and generates scale and zero point as per the given choice. As of now, I am thinking of them as utility functions.

anijain2305 on 23 Jul 2019

QNN Conv2D operator

Tensorflow

~
tf.nn.quantized_conv2d(
input,
filter,
min_input,
max_input,
min_filter,
max_filter,
strides,
padding,
out_type=tf.dtypes.qint32,
dilations=[1, 1, 1, 1],
name=None
)
~

MxNet

~
mxnet.symbol.contrib.quantized_conv(
data=None,
weight=None,
bias=None,
min_data=None,
max_data=None,
min_weight=None,
max_weight=None,
min_bias=None,
max_bias=None,
kernel=_Null,
stride=_Null,
dilate=_Null,
pad=_Null,
num_filter=_Null,
num_group=_Null,
workspace=_Null,
no_bias=_Null,
cudnn_tune=_Null,
cudnn_off=_Null,
layout=_Null,
name=None,
attr=None,
out=None, **kwargs)
~

TFLite

~~~
inline void Conv(
const ConvParams& params,
const RuntimeShape& input_shape,
const uint8* input_data,
const RuntimeShape& filter_shape,
const uint8* filter_data,
const RuntimeShape& bias_shape,
const int32* bias_data,
const RuntimeShape& output_shape,
uint8* output_data,
const RuntimeShape& im2col_shape,
uint8* im2col_data, void* cpu_backend_context) {

Here ConvParams are

struct ConvParams {
PaddingType padding_type;
PaddingValues padding_values;
// TODO(starka): This was just "stride", so check that width+height is OK.
int16 stride_width;
int16 stride_height;
int16 dilation_width_factor;
int16 dilation_height_factor;
// uint8 inference params.
// TODO(b/65838351): Use smaller types if appropriate.
int32 input_offset;
int32 weights_offset;
int32 output_offset;
int32 output_multiplier;
int output_shift;
// uint8, etc, activation params.
int32 quantized_activation_min;
int32 quantized_activation_max;
// float activation params.
float float_activation_min;
float float_activation_max;
};
~~~

QNNPACK

~
enum qnnp_status qnnp_create_convolution2d_nhwc_q8(
uint32_t input_padding_top,
uint32_t input_padding_right,
uint32_t input_padding_bottom,
uint32_t input_padding_left,
uint32_t kernel_height,
uint32_t kernel_width,
uint32_t subsampling_height,
uint32_t subsampling_width,
uint32_t dilation_height,
uint32_t dilation_width,
uint32_t groups,
size_t group_input_channels,
size_t group_output_channels,
uint8_t input_zero_point,
float input_scale,
uint8_t kernel_zero_point,
float kernel_scale,
const uint8_t* kernel,
const int32_t* bias,
uint8_t output_zero_point,
float output_scale,
uint8_t output_min,
uint8_t output_max,
uint32_t flags,
qnnp_operator_t* convolution_out)
~

QNN Proposal

~~~
def conv2d(data,
weight,
input_zero_point,
kernel_zero_point,
strides=(1, 1),
padding=(0, 0),
dilation=(1, 1),
groups=1,
channels=None,
kernel_size=None,
data_layout="NCHW",
kernel_layout="OIHW",
out_layout="",
out_dtype="int32"):
r"""Quantized 2D convolution.

This operator convolves quantized weight with quantized data. The scale of
the output quantized tensor is the product of the weight_scale and
input_scale of the input quantized tensors. The zero point of the output
quantized tensor is 0. By default, the dtype of output is int32. Please also
refer to Requantize operator to understand how to scale back the int32
output to (u)int8.


Parameters
----------
data : tvm.relay.Expr
    The input data to the operator.

weight : tvm.relay.Expr
    The weight expressions.

input_zero_point: int
       The zero point of the data distribution.

kernel_zero_point: int
       The zero point of the quantized_kernel distribution.

strides : tuple of int, optional
    The strides of convolution.

padding : tuple of int, optional
    The padding of convolution on both sides of inputs before convolution.

dilation : tuple of int, optional
    Specifies the dilation rate to be used for dilated convolution.

groups : int, optional
    Number of groups for grouped convolution.

channels : int, optional
    Number of output channels of this convolution.

kernel_size : tuple of int, optional
    The spatial of the convolution weight.

data_layout : str, optional
    Layout of the input.

kernel_layout : str, optional
    Layout of the weight.

out_layout : str, optional
    Layout of the output, by default, out_layout is the same as data_layout

out_dtype : str, optional
    Specifies the output data type for mixed precision conv2d.

Returns
-------
result : tvm.relay.Expr
    The computed result.
"""

~~~

Key points

API is very close to TF API.
Others have bias in the API to perform fused computation. Instead of complicating the Conv2D API, our proposal is to add a bias_add after the qnn.conv2d. Relay fusion will itself fuse the computation.
TFLite and QNNPACK also have output_scale and output_zero_point. This is again because they want to fuse conv + bias + requantize in one op. We will have all of these 3 ops separately, connected by the framework parser depending on the framework graph. We will rely on Relay fusion.
TFLite also has output_min_activation/max_Activation because of fused relu. We can easily call a clip operator to do that.

An example of conversion from TFLite to Relay graph will look like this.

anijain2305 on 23 Jul 2019

@anijain2305 Could we also list the api of TFLite and QNNPACK? I think both should be considered. Because we will parse TFLite model and QNNPACK is an good quantization accelerated library

FrozenGene on 23 Jul 2019

The reference integer implementation of tflite conv2d has optional bias parameters which are pointers to 32 bit signed values, where a null pointer can be used to skip the operation.

Is it the intention to always separate out the bias add operation in tvm apis?

one other comment ... I've seen an implementation of quantized conv2d where the 32 bit signed bias values were preloaded to the accumulator, rather than clearing accumulator prior to the mac loop. I don't recall a discussion of it. I'm wondering if this a common enhancement to reduce overflow/underflow possibility of the signed 32 bit accumulations when a bias is expected?

jnorwood on 23 Jul 2019

@FrozenGene Updated the Conv2D API. Also, added a diagram explaining how to go from TFLite to Relay operators.

anijain2305 on 23 Jul 2019

@jnorwood Yes, bias is kept outside as a separate operator. But, this can be fused with the qnn.conv2d.

Regarding the accumulation point, if we perform fusion and add the bias in int32 in the accumulator at the end, is it any different than preloading the accumulator? We need to ensure that op is fused i.e. the bias addition happens in the same accumulator where conv2d has just finished.

anijain2305 on 23 Jul 2019

Regarding the accumulation point, if we perform fusion and add the bias in int32 in the accumulator at the end, is it any different than preloading the accumulator?

When preloading a negative bias, a signed 32 bit accumulator positive accumulate range is extended (before overflow), for example. Maybe the result from a post bias_add is the same for most implementations, but signed int overflow behavior is undefined in the C standards... so the order of bias_add operations might matter.

I saw the bias preload used in some paper. I'll check my notes and see if I can find it.

jnorwood on 23 Jul 2019

@anijain2305 Could we list the ConvParams of TFLite? I think it is more clean what TFLite's convolution computation need.

For the diagram, I think maybe we could avoid the intermediate clip. Because we have one clip inside requantize operator. If we accept one output_min / output_max, then we just do one clip in requantize, no intermediate clip.

FrozenGene on 24 Jul 2019

@FrozenGene Thanks for the suggestion to add ConvParams. Just updated the original post.

@FrozenGene I dont think requantize should take output_min and output_max. We can use requantize after/before any operator, where relu might not be at all applicable. Instead, I would suggest having two clip operators. And then relying on Relay passes to optimize the graph - in this case converting two back-to-back clip into a single clip.

We had similar discussion here - https://github.com/dmlc/tvm/issues/2351#issuecomment-502418630

anijain2305 on 24 Jul 2019

I was looking into PR #3531 and #3512 , and noticed that the PRs are going to support 32 bits quantization. I'd like to move to this RFC thread to loop more people with my consideration.

Before going far, let me clarify that, Requantize is obviously needed to support int32 input which is the accumulated multiplication intermediate result in conv2d when tensor type is int8. The problem is about whether Quantize/Dequantize of which the output/input needs int32 - which means a 32 bit integer quantization approach. (or even 16 bit)

First, which is trivial, if we can use 32 bit, why not 32 bit floating point rather than integer. Yes, there is devices supporting only integer, but, the industry shows that 8 bit is enough for accuracy and they are even moving to 4 bit. So, maybe we don't need to be so generic I guess. The less is more :)

Second, I wonder that it's impossible to support 32 bit due to arithmetic limitations.

Let's start from 8 bit. Considering we are using 8 bit quantization approach, of which the tensor value range is (-2^7, 2^7]. Multiplying two int8 generates value range (-2^14, 2^14]. As the accumulation could be of thousands, it's not safe to accumulate them in int16, but int32. That is why we need int32 in int8 quantization.
Similarly, taking 16 bit. The original value range is (-2^15, 2^15], and multiplied to be (-2^30, 2^30]. When accumulating, we should use 64 bit integer. Hmmm, int64 is still available though needs to be emulated sometimes on very low-low-end device. In this way, Requantize needs to support 64 bit input if we are going to enable 16 bit support.
Now, 32 bit. (-2^31, 2^31] multiplied to (-2^62, 2^62] which should be hold in 64 bit registers. 128 bit integer is required to accumulate int64, unless there are only severals of them.

That's the general talking. Strong assumptions may be introduced to handle the issues raised above, but I can hardly see the benefits. Maybe 8 bit quantization is enough so far :)

jackwish on 25 Jul 2019

👍1

Thanks @jackwish
This is a very good analysis. Everything makes sense. I upvote for restricting to (u)int8 for now for Quantize and Dequantize.

If in future, we see (u)int16, we can tackle then. int32 is highly unlikely (why not just go to FP32 as you say).

anijain2305 on 25 Jul 2019

@ajtulloch @antinucleon @ZihengJiang @jroesch, would be great if you can also review this week and sign off the APIs

tqchen on 25 Jul 2019

on the discussion of the intermediate formats supported ...

The intel avx512 vnni/dlboost operations are simd operations that support 8 bit quantized inference.

Their intrinsic descriptions show that the int8 multiplies go to intermediate 16 bit results before they are accumulated (in the fma) to int32 registers.

Would it be worth considering how tvm can detect sequences that can be substituted with these dlboost/vnni simd intrinsic operations? If tvm supported operations with 16 bit accumulators, then you could conceivably specify sequences of operations that would exactly match the intrinsic descriptions. Perhaps that would make the intrinsic's pattern easier to detect.

From the publicity, I think these dlboost/vnni avx512 simd operations are considered important for the new xeon ai inference support.

I'm providing the expanded intrinsic description for one of the ops as an example

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#avx512techs=AVX512_VNNI&expand=2202

jnorwood on 25 Jul 2019

@jnorwood We are using intrinsics for Skylake (not VNNI) and there is already a PR to take advantage of VNNI intrinsics for Cascade Lake #3388

anijain2305 on 25 Jul 2019

@jackwish, i want to get my understanding correct, when you say

I was looking into PR #3531 and #3512 , and noticed that the PRs are going to support 32 bits quantization.

are you talking about the inputs or outputs of quantize/dequantize ops being int32? Because, the current implementation for

Quantize - limits the inputs to be float32 and output to be (u)i8
Dequantize - The input to be (u)int8 and output to be float32

Or are you suggesting we should support higher number of bits (>16) for these ops?

shoubhik on 25 Jul 2019

@jackwish, i want to get my understanding correct, when you say

I was looking into PR #3531 and #3512 , and noticed that the PRs are going to support 32 bits quantization.

are you talking about the inputs or outputs of quantize/dequantize ops being int32? Because, the current implementation for

Quantize - limits the inputs to be float32 and output to be (u)i8

Dequantize - The input to be (u)int8 and output to be float32

Or are you suggesting we should support higher number of bits (>16) for these ops?

@shoubhik I was saying to limit to int8. I know your PR only restricts to int8, while PR #3531 seems trying to enable int8/16/32. I move to here because I saw the two PRs share same code but seems are not consistent in quantization approach. Thanks for helping to clarify that.

jackwish on 26 Jul 2019

Thanks @jackwish
This is a very good analysis. Everything makes sense. I upvote for restricting to (u)int8 for now for Quantize and Dequantize.

If in future, we see (u)int16, we can tackle then. int32 is highly unlikely (why not just go to FP32 as you say).

I am very glad that it helped :)

jackwish on 26 Jul 2019

@jackwish, i want to get my understanding correct, when you say

I was looking into PR #3531 and #3512 , and noticed that the PRs are going to support 32 bits quantization.

are you talking about the inputs or outputs of quantize/dequantize ops being int32? Because, the current implementation for

Quantize - limits the inputs to be float32 and output to be (u)i8

Dequantize - The input to be (u)int8 and output to be float32

Or are you suggesting we should support higher number of bits (>16) for these ops?

@shoubhik I was saying to limit to int8. I know your PR only your PR restricts to int8, while PR #3531 seems trying to enable int8/16/32. I move to here because I saw the two PRs share same code but seems are not consistent in quantization approach. Thanks for helping to clarify.

Thanks for the explaination.

shoubhik on 26 Jul 2019

close as QNN has become part of the repo

tqchen on 27 Nov 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[RFC] Relay IR Text Format

joshpoll · 28Comments

[RFC] Data-flow Analysis Functionality on TVM IR

DKXXXL · 24Comments

[RFC][Quantization] Support quantized models from TensorflowLite

FrozenGene · 119Comments

[RFC] v0.7 Release Planning

tqchen · 25Comments

[RFC] NMS API Change

kevinthesun · 42Comments