Glow: Need to implement channel-wise quantized fully connected node.

Created on 3 Aug 2018 · 22 comments · Source: pytorch/glow

One technique to improve the accuracy of quantized fully connected (FC) operations is channel-wise quantization. In this technique the activations remain regular quantized activations, but the FC weights are quantized with special per-channel information. PyTorch has support for this operator [1].

To implement this you will need to do the following:

  1. Add a new target-independent node to the Interpreter and CPU backends.
  2. Implement the IR-lowering for the new target-independent FC.
  3. Implement a special kind of FC-quantization and lowering that lowers the FC into row-wise FC. This method will need to inspect the weights and extract the channel-specific information.

[1] - https://github.com/caffe2/caffe2/blob/master/caffe2/operators/fused_rowwise_8bit_conversion_ops.cc
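
To illustrate why per-channel (row-wise) parameters can help, here is a minimal Python sketch, not Glow code. It assumes an int8 range of [-128, 127] and the affine mapping `real = scale * (q - offset)`; the helper names (`choose_params`, `roundtrip`, `max_abs_error`) and the weight values are made up for the example. It quantizes a row of small weights once with parameters chosen for the whole tensor and once with parameters chosen for that row alone.

```python
def choose_params(values, qmin=-128, qmax=127):
    """Pick (scale, offset) so that [min, max] of `values` maps onto [qmin, qmax].

    The range is widened to include 0.0 so that zero stays representable
    (an assumption of this sketch).
    """
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    offset = int(round(qmin - lo / scale))
    return scale, offset


def roundtrip(values, scale, offset, qmin=-128, qmax=127):
    """Quantize to int8 and dequantize again, returning the restored floats."""
    restored = []
    for x in values:
        q = max(qmin, min(qmax, int(round(x / scale)) + offset))
        restored.append(scale * (q - offset))
    return restored


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical 2 x 3 weight matrix: row 0 is tiny, row 1 is large.
weights = [
    [0.01, -0.02, 0.03],
    [5.0, -8.0, 10.0],
]

# One (scale, offset) for the whole tensor: row 0 is crushed by row 1's range.
flat = [x for row in weights for x in row]
t_scale, t_offset = choose_params(flat)
per_tensor_err = max_abs_error(weights[0], roundtrip(weights[0], t_scale, t_offset))

# One (scale, offset) chosen for row 0 only.
r_scale, r_offset = choose_params(weights[0])
per_row_err = max_abs_error(weights[0], roundtrip(weights[0], r_scale, r_offset))

print(f"row 0 max error, per-tensor params: {per_tensor_err:.6f}")
print(f"row 0 max error, per-row params:    {per_row_err:.6f}")
```

With a single scale for the whole tensor, every value in the small row collapses to the same quantized level, whereas the row's own parameters keep its values distinct.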

@rdzhabarov @jfix71

All 22 comments

The plan looks good to me. I think we'd also need a way to control channel-wise quantization through some CLI flag. We could also extend this issue to cover Conv, but FC would have higher priority.

@beicy FYI

@beicy feel free to grab this and I'm happy to help with anything related to this issue. Otherwise, I'll tackle this some time in between ONNXIFI work.

The design for this issue is described as follows:

1. Row-wise quantization
Since row-wise quantization has so far been used only for constant weights, we do this quantization at compile time. The weights are a 2-D tensor (i.e. N * M) of float elements. To do the row-wise quantization:
1) For each row of the input tensor, choose a _scale (float)_ and an _offset (int32_t)_ based on the min and max of that row. The 1-D tensors _scales_ and _offsets_ store the scale and offset of each row.
2) Use the scales and offsets to quantize the input data.
All three tensors (quantized weights, scales, and offsets) will be passed to the RowWiseFullyConnected node later.
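
The two steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the Glow implementation: it assumes an int8 target range, the mapping `real = scale * (q - offset)`, and a range widened to include zero; the function names `choose_params`, `rowwise_quantize`, and `dequantize_row` are hypothetical.

```python
QMIN, QMAX = -128, 127  # int8 range (assumed for this sketch)


def choose_params(row):
    """Step 1: derive (scale, offset) from the min and max of one row."""
    lo = min(min(row), 0.0)
    hi = max(max(row), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # guard against an all-zero row
    offset = int(round(QMIN - lo / scale))
    return scale, offset


def rowwise_quantize(weights):
    """Quantize an N x M float matrix row by row.

    Returns three separate results: the N x M quantized weights,
    the N scales, and the N offsets.
    """
    quantized, scales, offsets = [], [], []
    for row in weights:
        scale, offset = choose_params(row)
        # Step 2: quantize the row with its own parameters.
        quantized.append(
            [max(QMIN, min(QMAX, int(round(x / scale)) + offset)) for x in row]
        )
        scales.append(scale)
        offsets.append(offset)
    return quantized, scales, offsets


def dequantize_row(q_row, scale, offset):
    """Inverse mapping, useful for checking the quantization error."""
    return [scale * (q - offset) for q in q_row]


weights = [[0.1, -0.2, 0.3], [4.0, -6.0, 8.0]]
w_q, w_scales, w_offsets = rowwise_quantize(weights)
for q_row, s, o in zip(w_q, w_scales, w_offsets):
    print(q_row, s, o)
```

Because the weights are constants, all of this can run once at compile time, and only the three resulting tensors need to reach the node.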

2. RowwiseQuantizedFullyConnected Node
Previously, if an FC node was quantizable, the whole node was quantized and then lowered into _matmul_ + _add_. In this PR, if an FC node is quantizable (for the Interpreter and CPU backends), we generate the RowwiseQuantizedFullyConnected node instead, with no further lowering.
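
As a rough reference for what such a node computes, here is a hedged Python sketch. It is not the Interpreter or CPU kernel: the function name, argument order, and the layout (one weight row per output channel) are assumptions, and the final rescale is done in floating point for readability, whereas a real kernel would likely use integer arithmetic.

```python
def clamp_int8(v):
    return max(-128, min(127, v))


def rowwise_quantized_fc(x_q, x_scale, x_offset,
                         w_q, w_scales, w_offsets,
                         b_q, b_scale, b_offset,
                         out_scale, out_offset):
    """Compute one quantized output vector.

    x_q:  quantized input vector of length K (one scale/offset for all of it)
    w_q:  N x K quantized weights, row j belongs to output channel j
    w_scales, w_offsets: per-row parameters of length N
    b_q:  quantized bias of length N with a single scale/offset
    """
    out = []
    for j, w_row in enumerate(w_q):
        # Integer accumulation of the dot product for output channel j.
        acc = sum((xq - x_offset) * (wq - w_offsets[j])
                  for xq, wq in zip(x_q, w_row))
        # The accumulator's scale depends on the row: x_scale * w_scales[j].
        real = x_scale * w_scales[j] * acc + b_scale * (b_q[j] - b_offset)
        out.append(clamp_int8(int(round(real / out_scale)) + out_offset))
    return out


# Hypothetical values chosen so every intermediate result is exact.
result = rowwise_quantized_fc(
    x_q=[13, -17], x_scale=0.5, x_offset=3,          # x = [5.0, -10.0]
    w_q=[[2, 4], [6, 2]], w_scales=[0.25, 0.125],    # w = [[0.5, 1.0], [1.0, 0.5]]
    w_offsets=[0, -2],
    b_q=[4, -8], b_scale=0.25, b_offset=0,           # b = [1.0, -2.0]
    out_scale=0.5, out_offset=0,
)
print(result)
```

The point of the sketch is that the only difference from a regular quantized FC is the per-row lookup of `w_scales[j]` and `w_offsets[j]` inside the loop over output channels.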

3. Interpreter Backend support.

4. CPU Backend support.

5. Adding a flag to control row-wise quantization.

Thank you for the great writeup @beicy .

I have a question about FloatToFused8BitRowwiseQuantize. Why do we need to implement this as a new node? I assumed that we could just replace FC with RWFC and along the way convert the weights into some opaque tensor format, similar to how we do with our DKKC convolutions. Why do we need a multi-node solution? In any case we would need to implement the quantization procedure in the optimizer because we won't want to quantize the weights at runtime. Is there a strong requirement to support dynamically computed weights? @jspark1105

@nadavrot For the regular quantization node, the scale and offset are given and carried by the "struct Type" of the quantized tensor. In row-wise quantization, each row of the tensor to be quantized has its own (scale, offset) pair, calculated from that row's max and min. Since I don't want to modify "struct Type", "FloatToFused8BitRowwiseQuantize" is used to generate the scales, offsets, and quantized data.
Do you mean we should row-wise quantize the weights at compile time and pass the quantized weights directly to RWFC?

@beicy My question was about the decision to implement two nodes instead of one. It sounds like we should be able to have one node: RWFC, and skip the node FloatToFused8BitRowwiseQuantize by performing the transformation to the weight at compile time, when we compile the program and mutate the graph. What do you think?

@nadavrot I see. I will double-check where row-wise quantization could be used. If it is only used for constants such as weights, we don't need this quantization at runtime :)

Row-wise quantization has been used for constant weights in the cases I've seen.

Thanks @jspark1105 . Then this FloatToFused8BitRowwiseQuantize node is not necessary in Glow.

@beicy since we have control over the new FC node with row-wise quantization, have you considered supplying a tensor of scales and offsets as a separate input to the new row-wise FC node?

I think mixing int8 data with some metadata {s,o} stored all together looks a bit hacky (looking at Row-wise quantization node output).

Note, we might want to enhance the loader to handle ONNX/C2 models with row-wise quantization at some point later (they use [S,O] as a separate arg to the op).
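
To make the two layouts under discussion concrete, here is a small Python sketch. The fused byte layout shown (int8 values followed by a float32 scale and an int32 offset in each row) is a hypothetical illustration, not the exact format of any Caffe2 or Glow operator, and the helper names are made up.

```python
import struct


def pack_fused_row(q_row, scale, offset):
    """Fused layout: the row's int8 data and its {scale, offset} share one buffer."""
    # '<' = little-endian without padding, 'b' = int8, 'f' = float32, 'i' = int32.
    return struct.pack(f"<{len(q_row)}bfi", *q_row, scale, offset)


def unpack_fused_row(blob):
    """A consumer must know the trailing 8 bytes are metadata, not weights."""
    n = len(blob) - 8
    *q_row, scale, offset = struct.unpack(f"<{n}bfi", blob)
    return q_row, scale, offset


# Fused: one opaque blob per row.
fused = pack_fused_row([1, -2, 3], 0.5, -7)
print(len(fused), unpack_fused_row(fused))

# Separate: three plainly typed inputs, as suggested for the row-wise FC node.
weights_q = [[1, -2, 3]]   # int8 data only
scales = [0.5]             # one float per row
offsets = [-7]             # one int32 per row
```

With separate inputs each tensor keeps a single element type, so neither the node nor a model loader needs to know about a byte layout convention.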

@rdzhabarov Thanks for letting me know the ONNX/C2 loader requirement. If the separate format is convenient for further work, let's separate the quantized data from the scales and offsets :)

@beicy they do not have that format for row-wise quantization, but looking at general quantization support, that's the direction they are moving in. Even for normal quantization, scale and offset come from separate params to the op.

I'm wondering if we can have a better name for the RowWiseFullyConnected node. We should not capitalize the W, to match Caffe2's row-wise op. And it's not clear from the name that it has something to do with quantization. How about RowwiseQuantizedFullyConnected?

@beicy is there anything left here to fix?

@rdzhabarov Our quantization procedure doesn't support it yet; it just does the regular FC quantization. We should enable rowwise quantization under the control of some flag.

Can we close this issue?

I was looking at the implementation of this. For weights, scale/offset are computed per filter/channel, but for bias only one pair is computed for the entire vector (all channels). Is there a reason scale/offset wasn't computed per channel for biases? Is it not worth it in terms of improved accuracy, or is it a design decision to limit memory usage for the constant tensor?
This paper, https://arxiv.org/pdf/1806.08342.pdf, mentions that per-channel weight quantization can improve accuracy. It doesn't really mention biases. I would venture a guess that since each entry in the bias vector corresponds to a particular channel, it should also be quantized per channel?

Bias will be one vector for one batch. If we had per-filter/channel quantization for bias, each bias parameter would end up with its own scale/offset parameters, which doesn't make much sense. Since we do int32 accumulation, we can also increase the precision of the accumulation with bias if the quantized bias is not accurate enough.

@ayermolo in addition to the overhead explained by Summer, per-channel quantization for bias also doesn't help accuracy much because bias is usually quantized in int32 (it's small, so it's ok to store it in 32-bit). I believe this is also the case in Glow.

Yes, in Glow, the bias is quantized in int32.
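
A small Python sketch of the argument: with 32 bits, a single scale for the whole bias vector already leaves a negligible rounding error, so per-channel parameters have little left to improve. The symmetric scheme and the helper names below are assumptions made for the illustration; the way Glow actually picks the bias scale may differ.

```python
def symmetric_quantize(values, bits):
    """Quantize all values with one shared scale and a zero offset."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard against all zeros
    q = [int(round(v / scale)) for v in values]
    return q, scale


def max_error(values, bits):
    """Largest absolute round-trip error over the vector."""
    q, scale = symmetric_quantize(values, bits)
    return max(abs(v - scale * qi) for v, qi in zip(values, q))


# Hypothetical bias vector whose entries span several orders of magnitude.
bias = [0.001, -0.5, 12.0]

err_int8 = max_error(bias, 8)
err_int32 = max_error(bias, 32)
print(f"max error with one int8 scale:  {err_int8:.3e}")
print(f"max error with one int32 scale: {err_int32:.3e}")
```

In 8 bits the small entries would suffer from sharing a scale with the large one, which is the situation per-channel quantization addresses for weights; in 32 bits the shared scale is fine enough that the problem does not arise.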

Ah yeah makes sense, and in retrospect quite obvious. Thanks.
