Create a UInt8QTy ElemKind and load uint8 quantized Tensors directly in model loaders instead of converting them to Int8QTy.
Right now Glow has Int8QTy, and when it loads uint8 Tensors from ONNXIFI, each loaded tensor is copied and its values are shifted by -128 so that they fit the range expected for Int8QTy. If Glow had a UInt8QTy ElemKind, this copy and shift could be avoided entirely.
Furthermore, Glow has a UInt8FusedQTy, which some operators support. When an Int8QTy weight is loaded for one of these operators, it is shifted back into the range expected by UInt8FusedQTy by adding 128 to every value in the Tensor, which is additional, unnecessary overhead.
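To make the overhead concrete, here is a minimal Python sketch of the two shifts, assuming the affine convention real = scale * (q - offset). The helper names are hypothetical (not Glow APIs), and the sketch ignores the per-row scale/offset layout of UInt8FusedQTy. It only shows that the -128 shift at load time and the +128 shift afterwards cancel out, while each one allocates and fills a full copy of the Tensor.

```python
# Hypothetical helpers; assumes real = scale * (q - offset).

def uint8_to_int8(data, scale, offset):
    """Loader-style conversion: new buffer, values and offset shifted by -128."""
    shifted = [q - 128 for q in data]   # first full copy
    return shifted, scale, offset - 128

def int8_to_uint8(data, scale, offset):
    """Shift back for a uint8 consumer: new buffer, values and offset +128."""
    shifted = [q + 128 for q in data]   # second full copy
    return shifted, scale, offset + 128

def dequantize(data, scale, offset):
    return [scale * (q - offset) for q in data]

weights_u8 = [0, 17, 128, 200, 255]
scale, offset_u8 = 0.5, 128

w_s8, s, off_s8 = uint8_to_int8(weights_u8, scale, offset_u8)
w_back, _, off_back = int8_to_uint8(w_s8, s, off_s8)

# The int8 copy represents the same real values as the uint8 original...
assert dequantize(w_s8, s, off_s8) == dequantize(weights_u8, scale, offset_u8)
# ...and the round trip reproduces the original bytes exactly, so for a
# uint8 consumer both passes are pure overhead.
assert w_back == weights_u8 and off_back == offset_u8
```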
Many of Glow's CPU and Interpreter op implementations support quantized tensors, but only Int8QTy. These could be rewritten to support UInt8QTy, but it may be easier to implement a graph transformation step (perhaps using FunctionConverter) that converts UInt8QTy Tensors to Int8QTy during compilation, only where the Backend needs it.
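A rough sketch of such an on-demand conversion pass, written in Python for brevity. `Tensor`, `Node`, `convert_uint8_inputs` and the `supports_uint8` predicate are hypothetical stand-ins, not Glow's FunctionConverter API, and the op names are illustrative only. The idea: a UInt8QTy weight is converted (once) only when a node consuming it lacks uint8 support on the backend; every other consumer keeps the original, uncopied Tensor.

```python
from dataclasses import dataclass, field

@dataclass
class Tensor:
    name: str
    kind: str      # hypothetical tag: "UInt8QTy" or "Int8QTy"
    data: list
    scale: float
    offset: int

@dataclass
class Node:
    op: str
    inputs: list = field(default_factory=list)

def convert_uint8_inputs(nodes, supports_uint8):
    """Rewrite UInt8QTy inputs to Int8QTy only for nodes whose op the backend
    cannot run on uint8. Each shared tensor is converted at most once and the
    original tensor is left untouched."""
    converted = {}
    for node in nodes:
        if supports_uint8(node.op):
            continue  # keep the zero-copy uint8 tensor as-is
        new_inputs = []
        for t in node.inputs:
            if t.kind == "UInt8QTy":
                if t.name not in converted:
                    converted[t.name] = Tensor(
                        t.name + "_s8", "Int8QTy",
                        [q - 128 for q in t.data],  # shift values by -128
                        t.scale, t.offset - 128)     # and the offset with them
                t = converted[t.name]
            new_inputs.append(t)
        node.inputs = new_inputs
    return converted

# Usage: one weight shared by an op that needs int8 and one that accepts uint8.
w = Tensor("w", "UInt8QTy", [0, 255], 0.1, 128)
fc = Node("FullyConnected", [w])
lookup = Node("UInt8Lookup", [w])
converted = convert_uint8_inputs([fc, lookup], lambda op: op == "UInt8Lookup")
```

Only `fc` ends up reading a converted copy; `lookup` still points at the original weight, so the copy cost is paid only where a backend actually requires it.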
Unsigned int8 has until now been modelled in Glow as int8 with offset -128. I find this elegant: with the offset, we can model symmetric signed, symmetric unsigned and asymmetric quantization with a single element kind, and we avoid a proliferation of element kinds. I have the feeling that introducing an actual UINT8 element kind is going to bring confusion and complexity, since unsigned int8 will then have two different representations in the IR instead of one. The IR would then no longer be canonical with regard to unsigned values.
I think that when the target HW requires an actual uint8_t (which is our case, by the way), we can still keep the s8/-128 modelling throughout the compiler flow and simply do the (lossless) conversion of constant tensors from s8/-128 to u8 at the very end, at code generation time in the backend.
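The s8/-128 argument can be checked numerically. The sketch below assumes real = scale * (q - offset) and uses hypothetical helper names. It shows that one int8 kind covers the symmetric signed, unsigned and asymmetric cases purely through the offset, and that the codegen-time conversion to u8 is lossless: adding 128 to a two's-complement int8 value is the same as flipping the top bit of its byte, with the offset moving by +128.

```python
# Hypothetical helpers; assumes real = scale * (q - offset).

def real_range_int8(scale, offset):
    """Real-value range covered by an int8 tensor with this scale/offset."""
    return scale * (-128 - offset), scale * (127 - offset)

# One element kind, three flavours, distinguished only by the offset:
assert real_range_int8(1.0, 0) == (-128.0, 127.0)    # symmetric signed
assert real_range_int8(1.0, -128) == (0.0, 255.0)    # unsigned, modelled as s8/-128
assert real_range_int8(1.0, -20) == (-108.0, 147.0)  # asymmetric

def s8_to_u8_at_codegen(data, offset):
    """Convert a constant s8 tensor to raw uint8 bytes for the target.
    q + 128 maps [-128, 127] onto [0, 255] one-to-one, so nothing is lost;
    the offset moves by +128 so the real values stay the same."""
    return bytes(q + 128 for q in data), offset + 128

u8, off = s8_to_u8_at_codegen([-128, -1, 0, 127], -128)
# Same bytes as flipping the sign bit of each two's-complement int8 byte.
assert u8 == bytes((q & 0xFF) ^ 0x80 for q in [-128, -1, 0, 127])
```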
@tlepley That's a valid point. But the main scenario we'd like to unblock is avoiding double memory allocation in the ONNXIFI scenario. This requires relying on the underlying memory provided by PyTorch. In this case we simply cannot copy weights in the loaders or change the data in the weight buffers.
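Why the ONNXIFI path rules out rewriting weights can be sketched with a Python `memoryview`, which wraps an existing buffer without copying it. Here `external` stands in for weight memory owned by PyTorch, `UInt8Tensor` is a hypothetical wrapper (not a Glow class), and real = scale * (q - offset) is assumed.

```python
class UInt8Tensor:
    """Hypothetical wrapper over caller-owned uint8 memory; never copies it."""
    def __init__(self, buf, scale, offset):
        self.view = memoryview(buf)  # shares memory with `buf`, no copy
        self.scale = scale
        self.offset = offset

    def dequantize(self, i):
        return self.scale * (self.view[i] - self.offset)

external = bytearray([0, 128, 255])  # stands in for framework-owned weights
t = UInt8Tensor(external, scale=0.5, offset=128)

assert t.view.obj is external        # same underlying buffer, nothing copied
external[0] = 10                     # a write through the owner's buffer...
assert t.dequantize(0) == 0.5 * (10 - 128)  # ...is visible through the tensor
```

Loading the same bytes as Int8QTy would instead need either a second buffer holding the shifted values or an in-place write into `external`, which the framework still owns.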
I do not think this introduces too much complexity, as you can see in the PR.
@rdzhabarov do you still plan on loading ONNXIFI_DATATYPE_UINT8 as UInt8QTy to finish removing the copies in loaders?
Yup, it's in my backlog. This/next week should be done.
Finished in https://github.com/pytorch/glow/pull/3193