Glow: Need to implement channel-wise quantized fully connected node.

Created on 3 Aug 2018 · 22 Comments · Source: pytorch/glow

One technique to improve the accuracy of quantized fully connected (FC) operations is channel-wise quantization. In this technique the activations remain regular quantized activations, but the FC weights are quantized with special per-channel information. PyTorch has support for this operator [1].
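To illustrate why per-channel information can help, here is a minimal sketch that compares the rounding error of one shared (min, max) range against one range per row. The weight values and the helper `max_quant_error` are made up for illustration and are not taken from Glow or Caffe2.

```python
def max_quant_error(values, lo, hi, levels=256):
    """Largest absolute error when `values` are rounded onto
    `levels` evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    return max(abs(v - (lo + step * round((v - lo) / step))) for v in values)


# Hypothetical FC weights: two output channels with very different ranges.
weights = [
    [-0.01, 0.02, 0.005],
    [-50.0, 100.0, 25.0],
]
flat = [v for row in weights for v in row]

# One (min, max) for the whole tensor versus one (min, max) per row.
per_tensor_err = [max_quant_error(row, min(flat), max(flat)) for row in weights]
per_row_err = [max_quant_error(row, min(row), max(row)) for row in weights]

for i in range(len(weights)):
    print(f"row {i}: per-tensor error {per_tensor_err[i]:.6f}, "
          f"per-row error {per_row_err[i]:.6f}")
```

With a shared range, the small-valued row is rounded on a grid sized for the large-valued row, so its error is far larger than with its own range.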

To implement this you will need to do the following:

Add a new target-independent node to the Interpreter and CPU backends.
Implement the IR lowering for the new target-independent FC.
Implement a special kind of FC quantization and lowering that lowers the FC into a row-wise FC. This step will need to inspect the weights and extract the channel-specific information.

[1] - https://github.com/caffe2/caffe2/blob/master/caffe2/operators/fused_rowwise_8bit_conversion_ops.cc

@rdzhabarov @jfix71

nadavrot

👍4

All 22 comments

The plan looks good to me, I think we'd need to have a way to control channel-wise quantization through some CLI flag as well. We could also extend this issue to cover Conv, but FC would have higher priority.

rdzhabarov on 6 Aug 2018

👍1

@beicy FYI

nadavrot on 6 Aug 2018

@beicy feel free to grab this and I'm happy to help with anything related to this issue. Otherwise, I'll tackle this some time in between ONNXIFI work.

rdzhabarov on 7 Aug 2018

👍1

The design for this issue is described as follows:

1. Row-wise quantization
Since row-wise quantization has so far been used only for constant weights, we do this quantization at compile time. The weights are a 2-D tensor (i.e. N * M) of float elements. To do the row-wise quantization:
1) For each row of the input tensor, choose a _scale(float)_ and an _offset(int32_t)_ based on the min and max of that row. The 1-D tensors _scales_ and _offsets_ store the scale and offset of each row.
2) Use the scales and offsets to quantize the input data.
All three tensors (quantized weights, scales, and offsets) will be passed to the RowWiseFullyConnected node later.

2. RowwiseQuantizedFullyConnected Node
Previously, if an FC node was quantizable, the whole node was quantized and then lowered into _matmul_ + _add_. In this PR, if an FC node is quantizable (for the Interpreter and CPU backends), we generate the RowwiseQuantizedFullyConnected node instead and apply no further lowering.

3. Interpreter Backend support.

4. CPU Backend support.

5. Add a flag to control row-wise quantization.
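Step 1 of this design can be sketched roughly as follows. This is an illustration and not the code from the PR: it assumes the convention real = scale * (q - offset) with int8 storage, and the helper names (`choose_params`, `quantize_rowwise`, `dequantize_rowwise`) are hypothetical.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # int8 storage range (assumed)


def choose_params(lo: float, hi: float) -> Tuple[float, int]:
    """Pick (scale, offset) for the convention real = scale * (q - offset)."""
    # Widen the range to contain 0.0 so that zero maps onto an integer.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    if lo == hi:  # all-zero row: any scale works
        return 1.0, 0
    scale = (hi - lo) / (QMAX - QMIN)
    offset = round(QMIN - lo / scale)
    return scale, max(QMIN, min(QMAX, offset))


def quantize_rowwise(weights: List[List[float]]):
    """Quantize each row with its own (scale, offset) pair."""
    qweights, scales, offsets = [], [], []
    for row in weights:
        scale, offset = choose_params(min(row), max(row))
        qrow = [max(QMIN, min(QMAX, round(x / scale) + offset)) for x in row]
        qweights.append(qrow)
        scales.append(scale)
        offsets.append(offset)
    return qweights, scales, offsets


def dequantize_rowwise(qweights, scales, offsets):
    """Map the quantized rows back to floats, for checking the error."""
    return [[s * (q - o) for q in qrow]
            for qrow, s, o in zip(qweights, scales, offsets)]


# Hypothetical 2 x 3 weight tensor with rows of very different magnitude.
weights = [[0.1, -0.2, 0.05], [10.0, -40.0, 25.0]]
qweights, scales, offsets = quantize_rowwise(weights)
restored = dequantize_rowwise(qweights, scales, offsets)
```

The three results correspond to the quantized weights, scales, and offsets that the design passes to the row-wise FC node.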

beicy on 14 Sep 2018

Thank you for the great writeup, @beicy.

I have a question about FloatToFused8BitRowwiseQuantize. Why do we need to implement this as a new node? I assumed that we could just replace FC with RWFC and along the way convert the weights into some opaque tensor format, similar to what we do with our DKKC convolutions. Why do we need a multi-node solution? In any case we would need to implement the quantization procedure in the optimizer, because we don't want to quantize the weights at runtime. Is there a strong requirement to support dynamically computed weights? @jspark1105

nadavrot on 14 Sep 2018

@nadavrot For a regular quantization node, the scale and offset are given and carried by the "struct Type" of the quantized tensor. In row-wise quantization, each row of the tensor to be quantized needs its own (scale, offset) pair, calculated from that row's max and min. Since I don't want to modify "struct Type", "FloatToFused8BitRowwiseQuantize" is used to generate the scales, offsets, and quantized data.
Do you mean we should row-wise quantize the weights at compile time and pass the quantized weights directly to RWFC?

beicy on 14 Sep 2018

@beicy My question was about the decision to implement two nodes instead of one. It sounds like we should be able to have one node, RWFC, and skip the FloatToFused8BitRowwiseQuantize node by performing the transformation on the weights at compile time, when we compile the program and mutate the graph. What do you think?

nadavrot on 14 Sep 2018

👍1

@nadavrot I see. I will double check where row-wise quantization could be used. If it is only used for constants like weights, we don't need this quantization at runtime :)

beicy on 14 Sep 2018

Row-wise quantization has been used for constant weights in the cases I've seen.

jspark1105 on 14 Sep 2018

👍2

Thanks @jspark1105. Then the FloatToFused8BitRowwiseQuantize node is not necessary in Glow.

beicy on 14 Sep 2018

@beicy since we have control over the new FC node with row-wise quantization, have you considered supplying the tensors of scales and offsets as separate inputs to the new row-wise FC node?

I think mixing int8 data with some metadata {s,o} stored all together looks a bit hacky (looking at the output of the row-wise quantization node).

Note, we might want to enhance the loader to handle ONNX/C2 models with row-wise quantization at some point later (they use [S,O] as a separate arg to the op).
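A rough sketch of what a row-wise FC with separate scale and offset inputs could compute is shown below. It is not Glow's implementation: the function name and argument order are hypothetical, the convention real = scale * (q - offset) is assumed, and the result is left in float with a float bias to keep the example short.

```python
def rowwise_quantized_fc(x_q, x_scale, x_offset, w_q, w_scales, w_offsets, bias):
    """One input row times row-wise quantized weights.

    x_q       : quantized activations sharing a single (x_scale, x_offset)
    w_q       : quantized weights, one row per output channel
    w_scales  : one scale per weight row
    w_offsets : one offset per weight row
    bias      : one float bias per output channel (simplification)
    """
    out = []
    for row, w_scale, w_offset, b in zip(w_q, w_scales, w_offsets, bias):
        # Integer accumulation over the shared dimension.
        acc = sum((xq - x_offset) * (wq - w_offset) for xq, wq in zip(x_q, row))
        # A single multiply per output channel restores the real value.
        out.append(x_scale * w_scale * acc + b)
    return out


# Hypothetical, hand-picked quantized inputs.
x_q, x_scale, x_offset = [10, -20, 30], 0.5, 0
w_q = [[2, 4, 6], [1, 0, -1]]
w_scales, w_offsets = [0.25, 2.0], [0, 1]
bias = [1.0, -1.0]

result = rowwise_quantized_fc(x_q, x_scale, x_offset, w_q, w_scales, w_offsets, bias)
print(result)  # [16.0, -41.0]
```

Keeping the scales and offsets as separate arguments means the int8 weight tensor stays a plain tensor, with no metadata packed into its rows.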

rdzhabarov on 18 Sep 2018

@rdzhabarov Thanks for letting me know about the ONNX/C2 loader requirement. If the separate format is convenient for further work, let's separate the quantized data from the scales and offsets :)

beicy on 18 Sep 2018

@beicy they do not have that format for row-wise quantization, but looking at general quantization support, that's the direction they are moving in. Even for normal quantization, scale and offset are coming from separate params to the Op.

rdzhabarov on 18 Sep 2018

I'm wondering if we can have a better name for the RowWiseFullyConnected node. We should not capitalize the W, to match Caffe2's row-wise op. And it's not clear from the name that it has something to do with quantization. How about RowwiseQuantizedFullyConnected?

artemrakhov-glow on 18 Sep 2018

👍1

@beicy is there anything left here to fix?

rdzhabarov on 7 Nov 2018

@rdzhabarov Our quantization procedure doesn't support it yet -- it just does the regular FC quantization. We should enable row-wise quantization under the control of some flag.

beicy on 7 Nov 2018

👍1

Can we close this issue?

nadavrot on 21 Nov 2018

I was looking at the implementation of this. For weights, scale/offset are computed per filter/channel, but for bias only one pair is computed for the entire vector (all channels). Is there a reason scale/offset wasn't computed per channel for biases? Was it not worth it in terms of improved accuracy, or a design decision to limit memory usage for the constant tensor?
This paper, https://arxiv.org/pdf/1806.08342.pdf, mentions that per-channel weight quantization can improve accuracy. It doesn't really mention biases. I would venture a guess that since each entry in the bias vector corresponds to a particular channel, it should also be quantized per channel?

ayermolo on 29 Nov 2018

Bias will be one vector for one batch. If we had per filter/channel quantization for bias, we would end up with each bias parameter having its own scale/offset parameters, which doesn't make much sense. Since we do int32 accumulation, we can also increase the precision of the accumulation with bias if the quantized bias is not accurate enough.

csummersea on 29 Nov 2018

👍1

@ayermolo in addition to the overhead explained by Summer, per-channel quantization for bias also doesn't help accuracy much because bias is usually quantized in int32 (it's small, so it's OK to store it in 32 bits). I believe this is also the case in Glow.
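The point about int32 can be illustrated with a small sketch that quantizes one bias vector with a single shared scale at 8 and at 32 bits. The symmetric scheme, the helper names, and the bias values are assumptions chosen for illustration; Glow's actual bias quantization may differ in detail.

```python
def quantize_symmetric(values, bits):
    """Quantize with one shared scale and no offset (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def max_error(values, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    q, scale = quantize_symmetric(values, bits)
    return max(abs(v - scale * qi) for v, qi in zip(values, q))


# Hypothetical bias vector whose entries span several orders of magnitude.
bias = [0.001, -0.5, 12.0, 3.25]

err_int8 = max_error(bias, 8)
err_int32 = max_error(bias, 32)
print(f"int8 max error:  {err_int8:.6f}")
print(f"int32 max error: {err_int32:.12f}")
```

With 32 bits, a single scale already resolves the smallest entries, so a per-channel scale would add storage without a meaningful accuracy gain.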

jspark1105 on 29 Nov 2018

Yes, in Glow, the bias is quantized in int32.

beicy on 29 Nov 2018

Ah yeah makes sense, and in retrospect quite obvious. Thanks.

ayermolo on 30 Nov 2018
