https://github.com/apache/incubator-mxnet/pull/12085/ introduced a bug into the topk operator. Below code example will output [ 274232. 179574. 274233. 274231.] with mxnet-cu90==1.3.0b20180810 but [ 274232. 179574. 274232. 274232.] with mxnet-cu90==1.3.0b20180814.
Note that both CPU and GPU versions of the operator are affected.
@szha @ciyongch @eric-haibin-lin
(Paste the commands you ran that produced the error.)
wget https://s3.amazonaws.com/lllausen-public/topk/array.npypython3 -c 'import mxnet as mx;import numpy as np; print(mx.nd.topk(mx.nd.array(np.load("array.npy")), k=4)[80])'python3 -c 'import mxnet as mx;import numpy as np; print(mx.nd.topk(mx.nd.array(np.load("array.npy"), ctx=mx.gpu(0)), k=4)[80])' @leezu Thanks for reporting the issue, I'm looking into this now.
@ciyongch thanks for looking into this. Your help is much appreciated
@mxnet-label-bot: [Python, Operator, Bug]
@leezu @eric-haibin-lin
The mismatch result was caused by the data type conversion from int to real_t(float in this case).
Actually, this kind of error is also existing in previous version of MXNet (say, MXNet before this PR https://github.com/apache/incubator-mxnet/pull/12085) when the value of indices is large enough, and it's very easily to reproduce via below command:
python -c 'import mxnet as mx; import numpy as np; print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1)), k=4))'
For the case used by @leezu , the input shape of ndarray is (256, 300096), so the initial flattened indices would be in the range of [0, 76,824,575].
modulo to the flattened indices, then get the desired number of indices, and do the conversion from int to real_t, it works well since the the maximum indices is 300095 during conversion, no conversion error in this case.int to real_t, and finally the modulos calculation. The conversion error happened in this step due to the value of indices exceed the expressive range of float (such as integer: 24281911, 24281912, 24281913 will be 24281912 after conversion), that's why we see the mismatch result mentioned above.Once there's a data type conversion (int to real_t) before returning indices, there could be such kind of error.
Will work on this to fix this issue.
@eric-haibin-lin @szha @pengzhao-intel
The current design of ret_indices requires data type conversion, since the real_t(float) type could not express the large integer precisely compared to int, this kind of error still be there unless the upper API was changed to use int type. It might introduce compatibility issue to current framework/topology due to API change(changing the type of ret_indices from real_t to int).
Another solution is to keep the functionality behavior same as before (workable in this case), which could still be benefit from the PR https://github.com/apache/incubator-mxnet/pull/12085. Firstly, get the desired number of indices (slicing), then do the modulo calculation and finally convert the ret_indices from int to float.
I have submit another PR https://github.com/apache/incubator-mxnet/pull/12202 to fix this issue, please take a time to review it.
Indeed. Let's address that in 2.0 due to API changes.
@ciyongch for now please add a follow-up PR to add test for #12202. Thanks a lot for the quick fix.
@szha Fix has been merged, should be good to close
@haojin2 not really. See comments above. In addition tests need to be added.
@szha sure, I will add two kinds of test cases to topk op, one is similar to the case used by @leezu , the other one is what I mentioned above (Since it will fail unless API change, so this one will be skipped by default for now). Any comments?
Hi @szha , I submit a PR https://github.com/apache/incubator-mxnet/pull/12217 to cover this case, and this is only for the first option mentioned above. For the second one, we can add it after the API was changed.
BTW, any plan to support int type of ndarray as a input to topk?
I think https://github.com/apache/incubator-mxnet/pull/12250 has solved this problem. Now we can use mx.nd.topk(..., dtype=np.int32) to set the dtype of the output indices.
import mxnet as mx
import numpy as np
print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1), dtype=np.int32), k=4, dtype=np.int32))
Result:
[[9603071 9603070 9603069 9603068]
[9603071 9603070 9603069 9603068]
[9603071 9603070 9603069 9603068]
[9603071 9603070 9603069 9603068]
[9603071 9603070 9603069 9603068]
[9603071 9603070 9603069 9603068]
[9603071 9603070 9603069 9603068]
[9603071 9603070 9603069 9603068]]
<NDArray 8x4 @cpu(0)>
@sxjscience I agree that the problem seems to be fixed now, please feel free to close the issue or we can wait for @leezu to close it.
Thanks to everyone for fixing this
@leezu Quick question: shall we make dtype=np.int64 a default? I cannot see any reason we should hold on to the float32 default -- the previous defaults may have been an implicit bug in the first place.
@yifeim this may break backwards compatibility of the topk operator. If you feel it is worth to break the compatibility for this change I suggest to discuss with @sxjscience about it.
Nevertheless it seems that #12250 got merged without the documentation on the website being updated about the recommendation to use dtype=np.int64.
@yifeim this is tracked in #9686.
Most helpful comment
@leezu @eric-haibin-lin
The mismatch result was caused by the data type conversion from
inttoreal_t(float in this case).Actually, this kind of error is also existing in previous version of MXNet (say, MXNet before this PR https://github.com/apache/incubator-mxnet/pull/12085) when the value of indices is large enough, and it's very easily to reproduce via below command:
python -c 'import mxnet as mx; import numpy as np; print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1)), k=4))'For the case used by @leezu , the input shape of ndarray is (256, 300096), so the initial flattened indices would be in the range of
[0, 76,824,575].moduloto the flattened indices, then get the desired number of indices, and do the conversion frominttoreal_t, it works well since the the maximum indices is 300095 during conversion, no conversion error in this case.inttoreal_t, and finally the modulos calculation. The conversion error happened in this step due to the value of indices exceed the expressive range offloat(such as integer: 24281911, 24281912, 24281913 will be 24281912 after conversion), that's why we see the mismatch result mentioned above.Once there's a data type conversion (
inttoreal_t) before returning indices, there could be such kind of error.Will work on this to fix this issue.