Glow: More general implementation of Row/ChannelWiseWQuantization

Created on 4 Mar 2019 · 13Comments · Source: pytorch/glow

I would like to propose a generic approach for ‘RowWise’ quantization, that gives in addition the possibility for the backend to perform some customizations. I’ve created a WIP PR with a prototype implementation for review/discussion https://github.com/pytorch/glow/pull/2480.

Currently a specific ‘RowWise’ node type variant has been created for each operation concerned by rowwise quantization: RowwiseQuantizedFullyConnected/RowwiseQuantizedSparseLengthsWeightedSum. Going by this approach if we want to apply it to regular Convolution we would create ChannelWiseQuantizedConvolution, or something like that.

Our proposal is to add a generic ChannelWiseQuantization node instead of creating a specific node variant each time we want to support rowwise quantization. The input of ChannelWiseQuantization would be quantized weights, original floating point weight, scales array, and offsets array. The latter two can either be in tensor or internal vector type to the node.

We keep the original floating point weight tensor since some backend may want to do a custom quantization. Keeping original weights around can be removed down the road if we change infrastracture of glow to allow backends hooks to quantize on node level.

I included Interpreter and CPU backend implementations for illustrative purposes.

During LLVM generation it would allow:

backends to use this information to select appropriate function for the target device, and pass relevant information in to it.
various optimizations to deal with only one node type, vs bunch of different node types of RowwiseQuantized variety

Source

ayermolo

👍1

Most helpful comment

@jfix71 I agree with this approach. This is I believe a pragmatic approach to improve the quantization flow in Glow.

Keeping Quantize nodes until the end of Backend::optimizeFunction() would solve 2 aspects we need to deal with:

enable a lossless (re)quantization of constants in the backends (constants quantized with standard Quantize nodes), what is not possible today because Quantize nodes are optimized out too early in the flow.
make ChannelWiseQuantization nodes less hacky by only keeping the float input (removing the quantized constant input of the current proposal)

tlepley-cadence on 11 Mar 2019

👍2

All 13 comments

@ayermolo Thanks for the proposal and example PR. I agree we could do with a redesign on supporting channel/row-wise quantization.

The proposed approach feels a bit hacky to me. There's the replaceAllUsesOfWith() at IRGen time. And iterating over Instruction users at LLVM generation time, and during the execution of the Interpreter kernel. And then using the Instruction as basically a pointer to the weights/scales/offsets, which otherwise does nothing.

I wonder if instead of adding new a Node/Instruction, why can't we just handle row-wise quantization via another ElemKind? E.g. we would have a new ElemKind::Int8RowwiseQTy. Currently, Types with our other quantized ElemKinds (e.g. Int8QTy) have the scale and offset as scalars built into the Type. Here we could add scales/offsets vectors to the Type instead

Do you see some benefits to your proposal compared to adding the new ElemKind?

jfix71 on 5 Mar 2019

Umm yeah on a hacky side for sure. Kind of tried to work around current limitations of GLOW.

The IRGen code with replaceALlUsesOfWith. I added it to prevent memory allocation for the node, but there is probably a better way of doing it. I can revisit it.

Yeah I thought about extending type, and did a bit or proto typing, but this seemed like a much more fundamental change to glow and might be more fragile. There are a few places where it just uses scale/offset and assumes they are scalar.

Wit this approach I tried to balance the number of changes, and quickness of implementation while working with current fundamental framework of GLOW. For example on design decision to keep original weights around was because currently there is no hook on Node IR level that would allow custom quantization. Even if we go with extending type, that problem is still there.

ayermolo on 5 Mar 2019

there is no hook on Node IR level that would allow custom quantization

Why can't this be done via Backend::transformPostLowering() or Backend::compile()? In both situations you have the Function with Node IR. Can you provide more details on requirements to perform this custom quantization?

jfix71 on 6 Mar 2019

Well by the time it gets to this point everything is already quantized. I guess we can "re-quantize" in transformPostLowering, but then original weight needs to be kept around until then not to loose accuracy. So the node will still have an extra input with original weights.
Basically what would be nice is to delay quantization to transformPostLowering, or even provide a hook after it to allow backends to do custom quantization. For example factor out the optimization in to a it's own function and then various backends can override it and do their own thing.

ayermolo on 7 Mar 2019

What if we just delay the merging of Quantize nodes into Constants until the end of Backend::optimizeFunction()? This way the Constant payloads stay intact, we can add Quantize nodes whenever we want in the optimization pipeline, and backends can still change/customize quantization during Backend::transformPostLowering().

jfix71 on 7 Mar 2019

👍1

Yeah that should work also. Isn't this what @tlepley proposed a while back?

ayermolo on 7 Mar 2019

@jfix71 I agree with this approach. This is I believe a pragmatic approach to improve the quantization flow in Glow.

Keeping Quantize nodes until the end of Backend::optimizeFunction() would solve 2 aspects we need to deal with:

enable a lossless (re)quantization of constants in the backends (constants quantized with standard Quantize nodes), what is not possible today because Quantize nodes are optimized out too early in the flow.
make ChannelWiseQuantization nodes less hacky by only keeping the float input (removing the quantized constant input of the current proposal)

tlepley-cadence on 11 Mar 2019

👍2

So putting delayed quantization a side. A more general question is having such a node between Convolution/FC/whatever and Weights/Filter is the right approach?
On one hand it's a more generic approach then adding specialized rowwise quantized nodes. It allows BE to map it to whatever they want.
On the other hand it does add extra complications. For example memory will be allocated for it when not necessary, with current implementation. There is some type of pattern matching needs to be done during LLVM IR gen.

ayermolo on 12 Mar 2019

@ayermolo I still feel the ChannelWiseWeightQuantization approach is a bit too hacky. In addition to the points I previously raised, it artificially increases the number of users of the node that has a rowwise-quantized input. And it's not explicitly clear that an input is actually rowwise-quantized -- you have to know to iterate over the users to check. BTW, you also can't rowwise-quantize multiple inputs with its current formulation -- you'd probably want to add some integer to the ChannelWiseWeightQuantization to specify which input is channelwise-quantized.

Assuming we use the delayed quantization, I still think I'd prefer extending the type like I mentioned previously:

I wonder if instead of adding new a Node/Instruction, why can't we just handle row-wise quantization via another ElemKind? E.g. we would have a new ElemKind::Int8RowwiseQTy. Currently, Types with our other quantized ElemKinds (e.g. Int8QTy) have the scale and offset as scalars built into the Type. Here we could add scales/offsets vectors to the Type instead

What pros/cons do you see for your proposed approach over extending the type? Would it satisfy all of your use cases?

jfix71 on 13 Mar 2019

@jfix71 Well if we do delayed quantization and you guys are OK with extending type that would be the better approach for sure. My current approach just tried to work around current implementation of quantization in glow. I wasn't sure if delayed quantization proposed by @tlepley would go anywhere, and if extending type was on the table.

ayermolo on 13 Mar 2019

@ayermolo Yeah I'd prefer extending the type and delaying quantization. And I think even if we decide to not extend the type, we should still delay quantization -- so it would be a good place to start here. Are you all interested in working on this?

jfix71 on 14 Mar 2019

@jfix71 As much as I would like to, I don't think I will have time to look in to extending type in near future. I think @tlepley-cadence was looking in to delayed quantization.

ayermolo on 15 Mar 2019

OK, I'm finally finding the time to upstream the constant quantization delaying :-)
I'll create a PR rapidly and I'll continue the discussion in my original issue : #2165

tlepley-cadence on 23 Jan 2020

Was this page helpful?

0 / 5 - 0 ratings