Right now we profile the graph before lowering. I will use FullyConnected nodes as a running example. They are lowered into a MatMul node and a BatchedAdd node. However, a FullyConnected node has only its output profiled, which corresponds to the output of the last lowered node (the BatchedAdd). We therefore have no profiling information for the MatMul, and when lowering the quantized FullyConnected node we are forced to reuse the BatchedAdd profile for the MatMul. This could lose some precision unnecessarily.
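To make the precision concern concrete, here is a minimal sketch of a simplified asymmetric int8 scheme. The `Range`, `QuantParams`, `chooseParams`, and `roundTripError` names and the example ranges are made up for illustration and are not Glow's actual quantization code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical profile: the observed float range of one node's output.
struct Range {
  float min;
  float max;
};

// Hypothetical quantization parameters: real = scale * (q - offset).
struct QuantParams {
  float scale;
  int32_t offset;
};

// Pick int8 parameters so that [r.min, r.max] spans the 256 levels.
// Simplified: a real scheme may also force the range to include zero.
QuantParams chooseParams(Range r) {
  assert(r.max > r.min);
  float scale = (r.max - r.min) / 255.0f;
  int32_t offset = static_cast<int32_t>(std::lround(-128.0f - r.min / scale));
  return QuantParams{scale, offset};
}

// Quantize with saturation to the int8 range.
int8_t quantize(float v, QuantParams p) {
  long q = std::lround(v / p.scale) + p.offset;
  q = std::max(-128L, std::min(127L, q));
  return static_cast<int8_t>(q);
}

float dequantize(int8_t q, QuantParams p) {
  return p.scale * static_cast<float>(static_cast<int32_t>(q) - p.offset);
}

// Absolute error after quantizing v with parameters chosen for range r.
float roundTripError(float v, Range r) {
  QuantParams p = chooseParams(r);
  return std::fabs(v - dequantize(quantize(v, p), p));
}
```

If the MatMul output lies in [-1, 1] but is quantized with a BatchedAdd range of [-8, 8], the step size grows from 2/255 to 16/255, i.e. 8 times coarser. In the opposite case, where the MatMul range is wider than the BatchedAdd range, MatMul values would saturate instead.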
We could instead lower all nodes before profiling. We would then get profiles for both the MatMul and the BatchedAdd, so we would not lose any information about the components of the lowered FullyConnected node. After profiling, we load this profile information back in for quantization, and it now includes the profiles of all component nodes.
At this point, one option is to apply quantization to the unlowered graph based on the lowered component profiles (e.g. still quantize the FullyConnected node using the BatchedAdd profile). This could be necessary because we currently quantize prior to compiling, which is where we do optimizations and lowering. We would then need to additionally keep track of the quantization parameters of each unlowered node's component nodes, so that we can use these parameters when the nodes are lowered (e.g. make sure that when we lower the quantized FullyConnected node we can find and use its component MatMul node's quantization parameters).
So it seems we would need to create a map from unlowered nodes to their lowered component profiles/quantization parameters, e.g. map from the FullyConnected Node to the BatchedAdd and MatMul profiles/quantization parameters. This map would need to be passed into lower().
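A rough sketch of what such a map could look like follows. All names here (`LoweredInfoMap`, `ComponentParams`, `lowerFullyConnected`) are hypothetical and only illustrate the idea; the real lower() has a different signature.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical quantization parameters for one tensor.
struct QuantParams {
  float scale;
  int offset;
};

// Hypothetical: parameters of each component a node lowers into,
// keyed by the component's role (e.g. "MatMul").
using ComponentParams = std::unordered_map<std::string, QuantParams>;

// Hypothetical: keyed by the name of the unlowered node.
using LoweredInfoMap = std::unordered_map<std::string, ComponentParams>;

struct LoweredNode {
  std::string kind;
  QuantParams params;
};

// Toy lowering of a quantized FullyConnected node. If the map holds
// parameters for the MatMul component, use them; otherwise fall back
// to the FullyConnected output parameters (the current behavior).
std::vector<LoweredNode> lowerFullyConnected(const std::string &fcName,
                                             QuantParams fcOutput,
                                             const LoweredInfoMap &info) {
  QuantParams matMul = fcOutput;
  auto nodeIt = info.find(fcName);
  if (nodeIt != info.end()) {
    auto compIt = nodeIt->second.find("MatMul");
    if (compIt != nodeIt->second.end()) {
      matMul = compIt->second;
    }
  }
  std::vector<LoweredNode> out;
  out.push_back(LoweredNode{"MatMul", matMul});
  out.push_back(LoweredNode{"BatchedAdd", fcOutput});
  assert(out.size() == 2);
  return out;
}
```

The fallback keeps nodes without recorded component parameters working exactly as they do today, so the map could be populated incrementally.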
I believe other options would require a redesign of the order in which we apply quantization vs. optimizations and lowering.
Should we shoot for a redesign instead of patching the current design?
So I think the main redesign option would be to lower before quantizing. I believe this would involve lowering everything and quantizing it all, then pattern matching groups of nodes back to unlowered nodes if the backend requests it. This means we would need to do this "unlowering" before any optimizations occur, to ensure they do not prevent any unlowering that the backend wants.
However, after thinking about this more I don't think this redesign option would help very much with respect to our future direction with ONNXIFI and receiving pre-quantized graphs. I'm assuming* ONNXIFI will pass us unlowered graphs (e.g. a quantized FullyConnected node), not lowered graphs (e.g. quantized MatMul and quantized BatchedAdd nodes from the FullyConnected node).
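As a toy illustration of the "unlowering" step, the sketch below fuses an adjacent MatMul and BatchedAdd back into a FullyConnected. It treats the graph as a flat list of operator names, which is a strong simplification: a real matcher would check the dataflow between nodes. The `unlower` function is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Replace each adjacent MatMul + BatchedAdd pair with FullyConnected.
// Operates on a linear list of op names for illustration only.
std::vector<std::string> unlower(const std::vector<std::string> &ops) {
  std::vector<std::string> out;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    bool isPair = i + 1 < ops.size() && ops[i] == "MatMul" &&
                  ops[i + 1] == "BatchedAdd";
    if (isPair) {
      out.push_back("FullyConnected");
      ++i; // Skip the BatchedAdd that was just fused.
    } else {
      out.push_back(ops[i]);
    }
  }
  assert(out.size() <= ops.size());
  return out;
}
```

If an optimization moves or inserts a node between the MatMul and the BatchedAdd, the pattern no longer matches, which is why the unlowering would have to run before any optimizations.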
This means we have a similar issue here to our current issue with profiling/quantizing unlowered graphs. We will need ONNXIFI to pass us quantization parameters for its component MatMul and the BatchedAdd in order to not lose precision after lowering. The same goes for more complex nodes that we lower, e.g. BatchNormalization or LSTMs, where it's much more important that we have quantization information for each component node. This means we would need something similar to the first option I described initially, where we are able to pass component quantization parameters down to the lower() function.
*I could be off base here -- perhaps ONNXIFI will pass us fully lowered quantized graphs, and it's up to us to do this unlowering? In which case we should focus on the redesign option. CC: @rdzhabarov
Thanks @jfix71 for writing this up.
I would also assume ONNXIFI would give us unlowered graphs, so mapping high-level nodes to their lower-level components makes sense.
One thing I was thinking: instead of mapping the quantization parameters, could we add some kind of annotation directly on the node so that the information lives with it (like what LLVM does with debug information)?
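A minimal sketch of the annotation idea, assuming a simplified `Node` type that is not Glow's real node class; the `componentParams` field and `paramsFor` helper are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical quantization parameters for one tensor.
struct QuantParams {
  float scale;
  int offset;
};

// Simplified stand-in for a graph node.
struct Node {
  std::string name;
  std::string kind;
  // Hypothetical annotation: quantization parameters of the components
  // this node lowers into, keyed by component role (e.g. "MatMul").
  std::map<std::string, QuantParams> componentParams;
};

// Return the annotated parameters for a component, or the fallback
// when the node carries no annotation for it.
QuantParams paramsFor(const Node &n, const std::string &component,
                      QuantParams fallback) {
  assert(!component.empty());
  auto it = n.componentParams.find(component);
  return it == n.componentParams.end() ? fallback : it->second;
}
```

Because the annotation is a member of the node, it follows the node through copies and renames, whereas a side map keyed by node would have to be kept in sync separately.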
Regarding your comment on lowered graphs from ONNXIFI, I don't get why we would need to unlower them. Could you elaborate on that aspect?
There are some backends which prefer having nodes unlowered, for example if they have optimized implementations for them. The simplest example is (surprise!) FullyConnected -- many have found better performance from executing it as a single unit with the bias BatchedAdd tagged onto the end of the MatMul, instead of doing them separately. So in this case, a backend may currently return false for the FullyConnectedNode in shouldLower() to prevent lower() from lowering it. The backend then would provide its own backend-specific FullyConnectedInst, which FullyConnectedNode is IRGen'd to, and which the backend knows how to execute.
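The mechanism can be sketched with a self-contained mock. The `Backend`, `Node`, and `lower` definitions below are simplified stand-ins written for this example and do not match Glow's real class definitions or signatures; only the shouldLower() idea comes from the description above.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Kind { FullyConnected, MatMul, BatchedAdd };

// Simplified stand-in for a graph node.
struct Node {
  Kind kind;
  std::string name;
};

// Mock backend interface: by default every node may be lowered.
struct Backend {
  virtual ~Backend() = default;
  virtual bool shouldLower(const Node &) const { return true; }
};

// Mock backend with its own fused FullyConnected implementation.
struct FusedFCBackend : Backend {
  bool shouldLower(const Node &n) const override {
    return n.kind != Kind::FullyConnected;
  }
};

// Toy lowering pass: split FullyConnected into MatMul + BatchedAdd
// unless the backend asks to keep it.
std::vector<Node> lower(const std::vector<Node> &graph,
                        const Backend &backend) {
  std::vector<Node> out;
  for (const Node &n : graph) {
    if (n.kind == Kind::FullyConnected && backend.shouldLower(n)) {
      out.push_back(Node{Kind::MatMul, n.name + ".matmul"});
      out.push_back(Node{Kind::BatchedAdd, n.name + ".bias"});
    } else {
      out.push_back(n);
    }
  }
  assert(out.size() >= graph.size());
  return out;
}
```

With the default backend the FullyConnected node becomes two nodes; with `FusedFCBackend` it passes through untouched and would be handed to the backend-specific instruction.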
Makes sense.
It sounds like we would need some kind of canonicalization pass to get to this "higher" level representation.
Anyway, all-in-all, it looks like we want to patch the current model instead of a redesign.
Waiting on @rdzhabarov for confirmation!
@jfix71 I am not sure if FC is a very good example because we know how to lower FC while preserving the quantization information between the MM and the Add. Are you worried about more complex nodes, such as LSTM cells?
@nadavrot Yes, I was using FC as a simple example. But if we receive a quantized LSTM or GRU etc. from ONNXIFI, we'll need the quantization information of their lowered internal components.
I agree with @jfix71 that we should focus on the production use case, where an ONNX model is supplied to Glow for inference execution.
perhaps ONNXIFI will pass us fully lowered quantized graphs, and it's up to us to do this unlowering?
This seems to be a key question. The current quantized FC situation does not seem like a concern. The main question is the quantized representation of complex nodes (and maybe we can get away with the right representation by ONNX).
@Maratyszcza @jspark1105 any ideas how complex quantized cells are going to be represented by C2/ONNX? Ideally that should be a graph of simple ops, but I would like to get your insights here.
Current situation with ONNX quantization is as follows (keep in mind quantization spec is not finalized and can change):
MatMulInteger, ConvInteger, ReduceSum integer to represent flexible quantization schemes, and quantized ops QConv/QConvTranspose/QFC to represent static quantization scheme with 8-bit unsigned quantized elements, zero point & scale (or equivalent schemes, e.g. 9-bit signed + scale).
MatMulInteger, ConvInteger, ReduceSum: see onnx/onnx#1219 for the spec.
QConv/QConvTranspose/QFC (i.e. MatMul + Bias) are work-in-progress in onnx/onnx#1264.
QCompose and QDecompose ops to convert between the two representations (TBD).
With MatMulInteger-type ops, quantized networks are represented via integer arithmetics.
With QConv-type ops, quantized networks are represented via special types, quint8 (quantized 8-bit unsigned integer tensor with static scalar zero point and static scalar scale parameters, which are recorded in the model for each tensor, including intermediate activations, used for input/output/weights) and qint32 (for bias).
@Maratyszcza Thanks for the info. So let's say we're loading a pre-quantized LSTM from ONNXIFI, where the quantization parameters have already been determined and applied to the LSTM. I imagine we will receive a QLSTM, with a specific static quantization scheme and associated parameters (e.g. scale/offset).
Would we also receive the static quantization scheme and associated parameters for the subcomponents of the QLSTM? I.e. would we receive the quantized subcomponent operators (QFCs, QAdds, QSigmoids or other activations, etc.) that make up the QLSTM? Or, if not the quantized subcomponent operators themselves, at least their quantization scheme and associated parameters?
@jfix71 We don't have plans to add QLSTM or QGRU at the moment. Currently, the plan is to support the following quantized operators: