Glow: [Quantization] Interpreter quantization should approximate realistic backend quantization

Created on 11 Oct 2018 · 12 Comments · Source: pytorch/glow

Currently the Interpreter's quantization implementation uses floating point arithmetic in many places to ensure that we do not lose precision (see fwdElementAddInst_I8Impl() as an example).

This is different than what we see in our CPU backend, and what we expect to see in other backends/accelerators -- floating point arithmetic is avoided as much as possible.

Given that the Interpreter is supposed to be a reference for the other backends, I believe it makes more sense for the Interpreter to follow other backends here, and not use floating point arithmetic to improve its quantization accuracy.
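To make the difference concrete, here is a minimal Python sketch (not Glow code) of an int8 element-wise add done two ways: through floating point, and with integer arithmetic only at run time. The helper names, the `(scale, offset)` tuples, and the deliberately coarse 8-bit fixed-point multiplier are assumptions chosen to make the rounding gap visible; a real backend would likely use more precision and possibly different rounding.

```python
def clip_i8(v):
    """Saturate an integer to the int8 range."""
    return max(-128, min(127, v))


def quantize(x, scale, offset):
    return clip_i8(round(x / scale) + offset)


def dequantize(q, scale, offset):
    return scale * (q - offset)


def add_float_path(a, b, pa, pb, pout):
    """Dequantize both operands, add in floating point, requantize once."""
    real = dequantize(a, *pa) + dequantize(b, *pb)
    return quantize(real, *pout)


def add_integer_path(a, b, pa, pb, pout, shift=8):
    """Rescale each operand with a fixed-point multiplier, integers only."""
    # The multipliers would be computed once, ahead of time. Only the lines
    # below them represent the run-time path, which uses no floating point.
    ma = round(pa[0] / pout[0] * (1 << shift))
    mb = round(pb[0] / pout[0] * (1 << shift))
    acc = ma * (a - pa[1]) + mb * (b - pb[1])
    rounded = (acc + (1 << (shift - 1))) >> shift  # rounding right shift
    return clip_i8(rounded + pout[1])


# Made-up (scale, offset) parameters for the two inputs and the output.
pa, pb, pout = (0.03, 0), (0.02, 0), (0.05, 0)
print(add_float_path(127, -127, pa, pb, pout))    # 25
print(add_integer_path(127, -127, pa, pb, pout))  # 26
```

With these made-up parameters the two paths differ by one quantization step on this input. A single step is small, but it is the kind of per-operation difference that can accumulate across many layers.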

I have created a test case in OperatorTest on my fork that displays the issue, where the CPU Backend fails a test that the Interpreter test passes, because the Interpreter test has better accuracy.

jfix71

Most helpful comment

Unfortunately these small differences in quantization implementations appear to be compounding into large errors in our BackendCorrectnessTests and make it hard to verify their correctness across different quantization implementations. For example, comparing the Interpreter to the CPU Backend, here are some errors I'm seeing when comparing outputs:

quantized complexNet1, comparing dequantized final output:
   interpreterData: 0.681679
   cpuBackendData: 0.613511
   delta = 0.068168
   relative error: 0.10

quantized tinyResnet, comparing dequantized final output:
   interpreterData: 20.589876
   cpuBackendData: 5.147469
   delta: 15.442408
   relative error: 0.75



Interpreter, quantized Resnet50:
   Label 285: probability: 0.469509
   Label 281: probability: 0.469509
   Label 282: probability: 0.048737



CPUBackend, quantized Resnet50:
   Label 285: probability: 0.544301
   Label 281: probability: 0.398239
   Label 282: probability: 0.044697

This seems like a really big difference -- and similar to before, it might be due to bugs and not differences in quantization implementation. So like @opti-mix said, it seems like we really should have "quantization implementation" as a configurable parameter of the Interpreter if we intend to use it to compare against the backends, like we're doing in the BackendCorrectnessTests.

Also, we really should be measuring and unit testing top-K label probabilities and ensuring they aren't changing significantly. Otherwise we could end up with many small bugs that compound and eventually break the compiler in a really ugly way. It's one of the reasons we never noticed the bug that caused us to skip Y-values in the DKKC8 Conv (related to https://github.com/pytorch/glow/issues/1689#issuecomment-424046660 and https://github.com/pytorch/glow/issues/1687).

jfix71 on 11 Oct 2018

👍2

All 12 comments

@jfix71 I see the point you are trying to make.

But I'd also like to mention that the difference in quantization approaches, i.e. FP vs. integer computations for intermediate results, is not the only difference. Even if the interpreter avoided FP arithmetic when performing quantized operations, there could still be differences between the interpreter and other backends supporting quantization. E.g. some backends use (fp scale, int offset) as quantization parameters. Others use (int scale, int right_shift), and the bit-width (i.e. the value range) of the scale and shift could be target dependent, and so on. It almost seems like it would be nice to have the "quantization scheme" as a configurable parameter of the interpreter.
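As a rough illustration of the (int scale, int right_shift) form, the sketch below approximates a floating-point scale as `multiplier / 2**shift` with the multiplier limited to a given bit-width. `to_fixed_point` is a hypothetical helper, not a Glow API, and the normalization it uses is only one of several plausible conventions.

```python
import math


def to_fixed_point(scale, bits):
    """Approximate a positive float scale as multiplier / 2**shift.

    The multiplier is normalized to use at most `bits` bits.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    # frexp gives scale == mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(scale)
    multiplier = round(mantissa * (1 << bits))
    shift = bits - exponent
    if multiplier == (1 << bits):  # rounding overflowed the bit budget
        multiplier >>= 1
        shift -= 1
    return multiplier, shift


for bits in (8, 16):
    multiplier, shift = to_fixed_point(0.6, bits)
    approx = multiplier / 2 ** shift
    print(bits, multiplier, shift, abs(approx - 0.6))
```

For a scale of 0.6 this yields a multiplier of 154 with an 8-bit budget and 39322 with a 16-bit budget, so two targets with different bit-widths effectively apply slightly different scales to the same tensor.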

Besides quantization issues, there are other aspects where the interpreter differs from other backends:

- Some backends may dictate a certain way of performing certain operations, e.g. matmuls or convolutions. For example, some of the backends may want to implement a convolution not as a series of nested loops, but as im2col followed by a matmul. This imposes a certain order of numeric operations, which in turn affects the numeric results.
- The interpreter does not (statically) allocate any memory and simply reuses the memory already allocated for the tensors, whereas many backends would model the device memory hierarchy and perform a static memory allocation for it (e.g. OpenCL and AOT use this approach and most HW backends would use it as well).

I'm sure there are other aspects where we could see differences as well. Of course, not all of them affect the numeric results as much as a difference in quantization approaches.
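The point about operation order can be shown without any convolution code. The sketch below sums the same four values sequentially and then in two interleaved lanes, as a vectorized or matmul-based kernel might. The values are contrived to exaggerate the effect, and both function names are made up for this example.

```python
def sum_sequential(values):
    """Accumulate left to right, like a plain nested-loop kernel."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def sum_strided(values, lanes):
    """Accumulate into `lanes` interleaved partial sums, then combine them."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum_sequential(partial)


values = [1e16, 1.0, -1e16, 1.0]
print(sum_sequential(values))   # 1.0
print(sum_strided(values, 2))   # 2.0
```

Both orders are valid ways to add the same numbers, yet they round differently. Real networks show far smaller gaps per operation, but the mechanism is the same.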

opti-mix on 11 Oct 2018

👍1

@jfix71 @opti-mix

I agree with the points that Roman mentioned. There is no one standard that we can follow when implementing the reference implementation for quantized operations. In this case, using the most accurate implementation (using floating point) is probably the best default.

nadavrot on 11 Oct 2018

@jspark1105 @Maratyszcza

nadavrot on 11 Oct 2018

"Also, we really should be measuring and unit testing top-K label probabilities and ensuring they aren't changing significantly."

I don't think that's achievable. As soon as we allow operations to be reordered differently from one backend to another, we will see differences. Quantization just adds one more dimension to the problem.

The one thing we could check is if the result is reasonable and that's hard to define.

qcolombet on 16 Oct 2018

"The one thing we could check is if the result is reasonable and that's hard to define."

True, but we have currently defined it as "check the label of the top prediction" and I think we could/should do better. It doesn't need to be bit-wise equality; it could be equality to expected probabilities given some allowed error/threshold we define, so that we are at least aware when major changes occur. Then we can verify the change is sane and the difference in probability is reasonable or expected, and update the expected probabilities.
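A threshold-based check of this kind might look like the following sketch. `compare_top_k` and the tolerance values are hypothetical; the probabilities are the quantized Resnet50 numbers quoted earlier in this thread, with the Interpreter output standing in for the "expected" values.

```python
def compare_top_k(expected, actual, tolerance):
    """Compare two top-K lists of (label, probability) pairs.

    Returns a list of mismatch descriptions. An empty list means every
    expected label is present and within `tolerance` of its expected value.
    """
    actual_by_label = dict(actual)
    mismatches = []
    for label, expected_prob in expected:
        if label not in actual_by_label:
            mismatches.append(f"label {label}: missing from top-K")
            continue
        delta = abs(actual_by_label[label] - expected_prob)
        if delta > tolerance:
            mismatches.append(f"label {label}: off by {delta:.6f}")
    return mismatches


interpreter = [(285, 0.469509), (281, 0.469509), (282, 0.048737)]
cpu_backend = [(285, 0.544301), (281, 0.398239), (282, 0.044697)]

for line in compare_top_k(interpreter, cpu_backend, tolerance=0.05):
    print(line)
```

With a tolerance of 0.05 the check flags labels 285 and 281 and accepts 282. With a tolerance of 0.1 it accepts all three, which shows how much the outcome depends on the threshold that is chosen.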

jfix71 on 16 Oct 2018

"it could be equality to expected probabilities given some allowed error/threshold we define"

That's the thing: when you start to reassociate things around, merge operators and so on, all bets are off. I believe we could only define such a threshold if we were to account for all the precision changes induced by all optimizations, including the ones in the backend.

The bottom line is when using "fast-math" the only thing that you can test is that it looks reasonable :(.

qcolombet on 19 Oct 2018

@jfix71 @beicy could you update this issue, whenever you have time, with the latest that was discussed?

rdzhabarov on 30 Oct 2018

Based on the discussion with Summer: for correctness, always compare the top-1 and top-5 accuracy of the quantized model against the original fp32 model, and accept the result if the diff is less than 0.5%. That is, instead of comparing the quantization results between the CPU and Interpreter backends, we should compare the top-1 and top-5 accuracy between the quantized and original fp32 models for each backend separately, and check the diff. Therefore, it looks to me like we don't need to use the Interpreter quantization result as a reference.
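A minimal sketch of that acceptance rule, assuming each model's output is available as a ranked list of labels per image and that "0.5%" means an absolute difference of 0.005 in accuracy (the thread does not say whether it is absolute or relative). Both function names are hypothetical.

```python
def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of samples whose true label is among the first k predictions."""
    hits = sum(
        1 for ranked, label in zip(ranked_predictions, labels) if label in ranked[:k]
    )
    return hits / len(labels)


def quantization_accepted(fp32_ranked, quant_ranked, labels, ks=(1, 5), max_diff=0.005):
    """Accept if, for every k, quantized accuracy is within max_diff of fp32."""
    for k in ks:
        fp32_acc = top_k_accuracy(fp32_ranked, labels, k)
        quant_acc = top_k_accuracy(quant_ranked, labels, k)
        if abs(fp32_acc - quant_acc) >= max_diff:
            return False
    return True
```

The check would be run once per backend, each time against the same fp32 baseline, so no backend's quantized output is treated as the reference for another.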

"This is different than what we see in our CPU backend, and what we expect to see in other backends/accelerators -- floating point arithmetic is avoided as much as possible."
--- @jfix71 I am not sure if every backend can run original fp32 model. If so, I guess we can use that result as a reference ?

beicy on 31 Oct 2018

👍1

We discussed modifying the Interpreter's kernels. One consideration here was to make their quantization schemes configurable so that the backend being tested can tune them to use the same quantization implementation details. Another consideration was to always stay in integer (instead of moving to float for internal calculations) to increase its numerical stability. This would more closely mirror realistic quantization on accelerators/in real usage.

We discussed using something like the debug instrumentation to compare accuracy after every node/instruction, instead of just the final results of the network.
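The per-node idea could be sketched roughly as below. This does not model Glow's debug instrumentation; it assumes the intermediate outputs have already been dumped as flat lists of floats keyed by node name, in execution order, and `first_divergent_node` is a hypothetical name.

```python
def first_divergent_node(reference, candidate, threshold):
    """Return (node_name, worst_abs_diff) for the first node whose outputs
    differ by more than `threshold`, or None if every node is within it.

    Both arguments map node names to flat lists of floats. Iteration follows
    the insertion order of `reference`, taken here to be execution order.
    """
    for name, ref_values in reference.items():
        cand_values = candidate[name]
        worst = max(abs(r - c) for r, c in zip(ref_values, cand_values))
        if worst > threshold:
            return name, worst
    return None
```

Reporting the first node that diverges, rather than only the final output, narrows down where two backends start to disagree before later layers amplify the gap.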

We discussed adding relative error instead of absolute error to our tensor equality checks.
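For reference, a small sketch of a relative-error measure applied to the numbers reported earlier in this thread. The function name and the `eps` guard against division by zero are assumptions, and the error is taken relative to the Interpreter value, which matches the figures quoted above.

```python
def relative_error(reference, value, eps=1e-12):
    """Absolute difference scaled by the magnitude of the reference value."""
    return abs(reference - value) / max(abs(reference), eps)


# (interpreterData, cpuBackendData) pairs quoted earlier in the thread.
print(round(relative_error(0.681679, 0.613511), 2))    # 0.1  (complexNet1)
print(round(relative_error(20.589876, 5.147469), 2))   # 0.75 (tinyResnet)
```

An absolute threshold of, say, 0.1 would accept the complexNet1 result even though it is 10% off, while the same threshold would be far too strict or too loose for tensors with a very different value range. A relative measure avoids having to pick a threshold per tensor.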

"I am not sure if every backend can run original fp32 model. If so, I guess we can use that result as a reference ?"

@beicy Correct, not all backends will support fp32. The fp32 top-k accuracy on the Interpreter/CPUBackend should match other frameworks (e.g. Caffe2), so technically we don't even need to run it on our Interpreter/CPUBackend to get this measurement, though it's a good idea to validate them too.

Also, note that top-k is specific to image classification -- we will need to do the same for other domains with their own accuracy metrics, e.g. BLEU scores for text translation.

I have a top-k script that I will put up soon. Once we support directly loading quantized models, it will be useful for us to compare against fp32.

jfix71 on 31 Oct 2018

@jfix71 do you envision any follow-ups on this issue?

rdzhabarov on 6 Feb 2019

We have the top-k script in place at this point since https://github.com/pytorch/glow/pull/2009. We should set up tracking top-1/top-5/etc. over time IMO to make sure any unusual/unexpected changes don't slip through the cracks.

In terms of changing the Interpreter -- it kind of depends on what we philosophically see the point of the Interpreter as.

I think it would be useful to be able to try to use the Interpreter as a sort of reference implementation to aid in debugging different backends. If we go in this direction, we'd want to try to make the quantization implementation configurable in the Interpreter to simulate the backend.

However, if we view the Interpreter just as another backend useful for quickly bringing up new operators and for quantization profiling, I don't think there's much else for us to do here.

jfix71 on 6 Feb 2019

👍1
