Incubator-mxnet: topk regression

Created on 16 Aug 2018 · 17Comments · Source: apache/incubator-mxnet

Description

https://github.com/apache/incubator-mxnet/pull/12085/ introduced a bug into the topk operator. Below code example will output [ 274232. 179574. 274233. 274231.] with mxnet-cu90==1.3.0b20180810 but [ 274232. 179574. 274232. 274232.] with mxnet-cu90==1.3.0b20180814.

Note that both CPU and GPU versions of the operator are affected.

@szha @ciyongch @eric-haibin-lin

Steps to reproduce

(Paste the commands you ran that produced the error.)

wget https://s3.amazonaws.com/lllausen-public/topk/array.npy
python3 -c 'import mxnet as mx;import numpy as np; print(mx.nd.topk(mx.nd.array(np.load("array.npy")), k=4)[80])'
python3 -c 'import mxnet as mx;import numpy as np; print(mx.nd.topk(mx.nd.array(np.load("array.npy"), ctx=mx.gpu(0)), k=4)[80])'

Bug Operator Python

Source

leezu

Most helpful comment

@leezu @eric-haibin-lin

The mismatch result was caused by the data type conversion from int to real_t(float in this case).

Actually, this kind of error is also existing in previous version of MXNet (say, MXNet before this PR https://github.com/apache/incubator-mxnet/pull/12085) when the value of indices is large enough, and it's very easily to reproduce via below command:
python -c 'import mxnet as mx; import numpy as np; print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1)), k=4))'

For the case used by @leezu , the input shape of ndarray is (256, 300096), so the initial flattened indices would be in the range of [0, 76,824,575].

MXNet without PR https://github.com/apache/incubator-mxnet/pull/12085, first do the modulo to the flattened indices, then get the desired number of indices, and do the conversion from int to real_t, it works well since the the maximum indices is 300095 during conversion, no conversion error in this case.
MXNet with PR https://github.com/apache/incubator-mxnet/pull/12085, first get the desired number of indices, then do the conversion from int to real_t, and finally the modulos calculation. The conversion error happened in this step due to the value of indices exceed the expressive range of float (such as integer: 24281911, 24281912, 24281913 will be 24281912 after conversion), that's why we see the mismatch result mentioned above.

Once there's a data type conversion (int to real_t) before returning indices, there could be such kind of error.

Will work on this to fix this issue.

ciyongch on 16 Aug 2018

👍4

All 17 comments

@leezu Thanks for reporting the issue, I'm looking into this now.

ciyongch on 16 Aug 2018

@ciyongch thanks for looking into this. Your help is much appreciated

eric-haibin-lin on 16 Aug 2018

@mxnet-label-bot: [Python, Operator, Bug]

vdantu on 16 Aug 2018

@leezu @eric-haibin-lin

The mismatch result was caused by the data type conversion from int to real_t(float in this case).

For the case used by @leezu , the input shape of ndarray is (256, 300096), so the initial flattened indices would be in the range of [0, 76,824,575].

MXNet without PR https://github.com/apache/incubator-mxnet/pull/12085, first do the modulo to the flattened indices, then get the desired number of indices, and do the conversion from int to real_t, it works well since the the maximum indices is 300095 during conversion, no conversion error in this case.
MXNet with PR https://github.com/apache/incubator-mxnet/pull/12085, first get the desired number of indices, then do the conversion from int to real_t, and finally the modulos calculation. The conversion error happened in this step due to the value of indices exceed the expressive range of float (such as integer: 24281911, 24281912, 24281913 will be 24281912 after conversion), that's why we see the mismatch result mentioned above.

Once there's a data type conversion (int to real_t) before returning indices, there could be such kind of error.

Will work on this to fix this issue.

ciyongch on 16 Aug 2018

👍4

@eric-haibin-lin @szha @pengzhao-intel

The current design of ret_indices requires data type conversion, since the real_t(float) type could not express the large integer precisely compared to int, this kind of error still be there unless the upper API was changed to use int type. It might introduce compatibility issue to current framework/topology due to API change(changing the type of ret_indices from real_t to int).

Another solution is to keep the functionality behavior same as before (workable in this case), which could still be benefit from the PR https://github.com/apache/incubator-mxnet/pull/12085. Firstly, get the desired number of indices (slicing), then do the modulo calculation and finally convert the ret_indices from int to float.

I have submit another PR https://github.com/apache/incubator-mxnet/pull/12202 to fix this issue, please take a time to review it.

ciyongch on 16 Aug 2018

Indeed. Let's address that in 2.0 due to API changes.

szha on 16 Aug 2018

@ciyongch for now please add a follow-up PR to add test for #12202. Thanks a lot for the quick fix.

szha on 16 Aug 2018

@szha Fix has been merged, should be good to close

haojin2 on 16 Aug 2018

@haojin2 not really. See comments above. In addition tests need to be added.

szha on 16 Aug 2018

👍1

@szha sure, I will add two kinds of test cases to topk op, one is similar to the case used by @leezu , the other one is what I mentioned above (Since it will fail unless API change, so this one will be skipped by default for now). Any comments?

ciyongch on 17 Aug 2018

Hi @szha , I submit a PR https://github.com/apache/incubator-mxnet/pull/12217 to cover this case, and this is only for the first option mentioned above. For the second one, we can add it after the API was changed.
BTW, any plan to support int type of ndarray as a input to topk?

ciyongch on 17 Aug 2018

I think https://github.com/apache/incubator-mxnet/pull/12250 has solved this problem. Now we can use mx.nd.topk(..., dtype=np.int32) to set the dtype of the output indices.

import mxnet as mx
import numpy as np
print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1), dtype=np.int32), k=4, dtype=np.int32))

Result:

[[9603071 9603070 9603069 9603068]
 [9603071 9603070 9603069 9603068]
 [9603071 9603070 9603069 9603068]
 [9603071 9603070 9603069 9603068]
 [9603071 9603070 9603069 9603068]
 [9603071 9603070 9603069 9603068]
 [9603071 9603070 9603069 9603068]
 [9603071 9603070 9603069 9603068]]
<NDArray 8x4 @cpu(0)>

sxjscience on 21 Aug 2018

🎉3

@sxjscience I agree that the problem seems to be fixed now, please feel free to close the issue or we can wait for @leezu to close it.

haojin2 on 21 Aug 2018

Thanks to everyone for fixing this

leezu on 21 Aug 2018

@leezu Quick question: shall we make dtype=np.int64 a default? I cannot see any reason we should hold on to the float32 default -- the previous defaults may have been an implicit bug in the first place.

yifeim on 22 Aug 2018

@yifeim this may break backwards compatibility of the topk operator. If you feel it is worth to break the compatibility for this change I suggest to discuss with @sxjscience about it.

Nevertheless it seems that #12250 got merged without the documentation on the website being updated about the recommendation to use dtype=np.int64.

leezu on 25 Aug 2018

@yifeim this is tracked in #9686.

szha on 25 Aug 2018

👍1

Was this page helpful?

0 / 5 - 0 ratings