Glow: Unexpected results with quantization

Created on 11 May 2020 · 2 Comments · Source: pytorch/glow

Using a simple fully-connected network (784*128*10) on the MNIST dataset, I tested the quantization method proposed in this document. I made the quantization profile using the MNIST training dataset and used it to create quantized bundles targeting different architectures. I then ran the networks on the first 100 images of the MNIST testing dataset and measured the mean time required to do one inference. Here are the results:
Architecture | Not quantized | Quantized
--- | --- | ---
ARM Cortex-A9 | 3304 µs | 9713 µs
ARM Cortex-M7 | 10729 µs | 3725 µs
ARM Cortex-M4 | 20506 µs | 22713 µs

One would expect inference to be faster for the quantized versions of the network, but as you can see, this is not the case for the Cortex-A9 and the Cortex-M4. This is particularly strange for the Cortex-M4, which uses a soft-float ABI.

Is this a bug or am I doing something wrong?

Most helpful comment

@byersin This might surprise you, but a quantized model is not guaranteed to run faster than a floating-point one. The only direct and controlled impact is on memory footprint: a quantized model uses roughly 4x less memory (int8 vs. float32).
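As a rough check of the 4x figure, here is a minimal sketch that counts parameter bytes for the 784-128-10 network from the question. It assumes int8 weights and int32 biases (the bias width is an assumption made for illustration, not something stated in this thread), and it ignores activations and the per-tensor scale/offset values.

```python
# Parameter storage for a 784-128-10 fully-connected network.
layers = [(784, 128), (128, 10)]  # (inputs, outputs) per layer

weights = sum(n_in * n_out for n_in, n_out in layers)
biases = sum(n_out for _, n_out in layers)

# float32 model: every parameter takes 4 bytes.
float32_bytes = 4 * (weights + biases)

# Quantized model. Assumption: 1-byte (int8) weights, 4-byte (int32) biases.
quantized_bytes = 1 * weights + 4 * biases

ratio = float32_bytes / quantized_bytes
print(f"float32: {float32_bytes} bytes")
print(f"quantized: {quantized_bytes} bytes")
print(f"ratio: {ratio:.2f}x")
```

Under these assumptions the ratio comes out just under 4x, because the weights dominate the parameter count.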

What does quantization mean? It means using integer operations to approximate floating-point operations. There are multiple quantization schemas that accomplish this with different tradeoffs between accuracy (of the approximation) and run-time complexity. For example:

  • asymmetric schema (default) - best accuracy, worst performance
  • symmetric schema - moderate accuracy, moderate performance
  • symmetric_with_power2_scale - worst accuracy, best performance
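To make the tradeoff concrete, here is a minimal sketch of how the scale and offset could be chosen under each of the three schemas for an int8 target. The function names and the exact rounding choices are illustrative assumptions; this is not Glow's implementation.

```python
import math

QMIN, QMAX = -128, 127  # int8 range

def asymmetric_params(lo, hi):
    """Spread the profiled range [lo, hi] over all of int8; offset is free."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 inside the range
    scale = (hi - lo) / (QMAX - QMIN)
    offset = round(QMIN - lo / scale)
    return scale, offset

def symmetric_params(lo, hi):
    """Widen the range to [-m, m], which forces the offset to 0."""
    m = max(abs(lo), abs(hi))
    return m / QMAX, 0

def power2_params(lo, hi):
    """Symmetric, with the scale rounded up to a power of two
    so that rescaling can be done with a bit shift."""
    scale, _ = symmetric_params(lo, hi)
    return 2.0 ** math.ceil(math.log2(scale)), 0

def quantize(x, scale, offset):
    q = round(x / scale) + offset
    return max(QMIN, min(QMAX, q))  # saturate to int8

def dequantize(q, scale, offset):
    return scale * (q - offset)

# Hypothetical profiled range, e.g. the output of a ReLU layer.
# Assumes a non-degenerate range (lo != hi).
lo, hi = 0.0, 6.0
for name, fn in [("asymmetric", asymmetric_params),
                 ("symmetric", symmetric_params),
                 ("symmetric_with_power2_scale", power2_params)]:
    scale, offset = fn(lo, hi)
    print(f"{name}: step={scale:.5f} offset={offset}")
```

For a one-sided range like this one, the step size grows from asymmetric to symmetric to power-of-two scale, which is the accuracy cost; in exchange, a zero offset and a power-of-two scale remove work from the inner loop and the rescaling step.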

Just to give you an example, this is what the math looks like for a floating-point dot product:
[image: floating-point dot-product formula]
Simple, right? This is what it looks like for a quantized implementation using an asymmetric schema:
[image: quantized dot-product formula, asymmetric schema]
Not so simple, right? We end up with more integer operations to approximate a simple floating-point expression. Some quantization schemas get rid of some of the computations by forcing some parameters to 0. The only way to get a faster quantized implementation is to use HW acceleration features such as integer SIMD arithmetic, which means you would need specialized kernels (Glow uses generic implementations that are not optimized for any particular HW).
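The formulas from the images are not reproduced in this text. As a sketch of the idea, assume the usual affine mapping, where a real value is approximated as `scale * (q - offset)`. The float dot product is `y = sum(x[i] * w[i])`, while the quantized one accumulates `(qx[i] - ox) * (qw[i] - ow)` and then rescales by `sx * sw / sy` and adds the output offset. All names below are hypothetical, and the rescale uses a float multiply for readability where an integer-only kernel would use a fixed-point multiply and shift.

```python
def float_dot(x, w):
    """Reference: one multiply-add per element."""
    return sum(xi * wi for xi, wi in zip(x, w))

def quantized_dot(qx, qw, sx, ox, sw, ow, sy, oy):
    """Asymmetric quantized dot product, returning an int8 result."""
    acc = 0  # would be an int32 accumulator in a real kernel
    for a, b in zip(qx, qw):
        acc += (a - ox) * (b - ow)  # two extra subtractions per element
    qy = round(acc * (sx * sw / sy)) + oy  # rescale to the output scale
    return max(-128, min(127, qy))  # saturate to int8

def expanded_acc(qx, qw, ox, ow):
    """Same accumulator, expanded: the three correction terms
    disappear when both offsets are 0 (symmetric schemas)."""
    n = len(qx)
    return (sum(a * b for a, b in zip(qx, qw))
            - ow * sum(qx) - ox * sum(qw) + n * ox * ow)

# Hypothetical quantized inputs and parameters (powers of two keep it exact).
qx, sx, ox = [10, -20, 30], 0.5, -2
qw, sw, ow = [5, 7, -1], 0.25, 1
sy, oy = 0.25, 0

x = [sx * (q - ox) for q in qx]  # dequantized inputs
w = [sw * (q - ow) for q in qw]  # dequantized weights

qy = quantized_dot(qx, qw, sx, ox, sw, ow, sy, oy)
print(float_dot(x, w), sy * (qy - oy))
```

The expanded form shows why schemas that force the offsets to 0 are cheaper: only the plain integer multiply-accumulate term remains.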

Bottom line: quantization only guarantees a lower memory footprint. Better performance needs special intervention.
Try other quantization schemas and compare the results.

Thank you for your explanation! I still need to do more testing but indeed things are not as straightforward as I expected them to be.
