Incubator-mxnet: [1.4.0] unravel_index no longer works with magic '-1' in shape parameter as in 1.3.1

Created on 12 Jan 2019 · 7Comments · Source: apache/incubator-mxnet

Description

The unravel_index op seems to no longer correctly work with 'magic' shape values, such as '-1's. The following example still works with mxnet 1.3.1, but does not on latest master (it returns all zeros in the result without throwing an error) or 1.4.0.
We have a use case for this in Sockeye.

Environment info (Required)

macOs

MXNet commit hash:
pip-installed:
mxnet 1.5.0b20190111

Minimum reproducible example

Input data taken from Sockeye unit tests.

x = mx.nd.array([335, 620, 593, 219, 36], dtype='int32')
mx.nd.unravel_index(x, shape=(-1, 200))

With mxnet==1.5.0b20190111, the result is incorrect:

[[  0   0   0   0   0]
 [135  20 193  19  36]]
<NDArray 2x5 @cpu(0)>

With mxnet==1.3.1, the result is correct:

[[  1   3   2   1   0]
 [135  20 193  19  36]]
<NDArray 2x5 @cpu(0)>

However, if the shape parameter is fully specified (shape=(5,200)), mxnet==1.5.0b20190111 returns the correct values.

Bug NDArray Python

Source

fhieber

👍1

Most helpful comment

PR #14356 will fix this.

asmushetzel on 7 Mar 2019

🎉2 👍2

All 7 comments

Thank you for submitting the issue! I'm labeling it so the MXNet community members can help resolve it. @mxnet-label-bot add [Python, Bug, NDArray]

piyushghai on 14 Jan 2019

Could someone test this also with the 1.4 release candidate? If this is present there as well, I'd appreciate a last-minute fix very much! :)

fhieber on 16 Jan 2019

Hoping to get some activity on this.
I bisected the nightly builds to figure out, when this bug was introduced.

mxnet==1.3.1b20181005 is fine
mxnet==1.3.1b20181010 contains the bug
Assuming that builds are roughly reflecting merge dates of PRs, the bug must be introduced some time between October 5th and October 10th, and is hence also part of 1.4.0.

There is no code change in the ravel.* files or the unravel_index op itself since its addition, so I would guess that some other type of change caused this unwanted regression; maybe unravel_index uses some outdated interface?

The only commits where I could guess that they could have some effect on this:

Unfortunately I don't have a way to bisect commits directly through source compilation right now.

I would appreciate some support on this. Thanks!

fhieber on 27 Jan 2019

I can confirm that the problem exists in 1.4.0

fhieber on 4 Mar 2019

Pretty easy to diagnose.

The type for mshadow::index_t got changed from "unsigned" to "int64_t" between both versions. This type is used to encode dimensions of shapes internally.
In 1.3.1, a "-1" therefore was interpreted as unsigned(-1) which is basically max_int. In 1.4.0 and later it is interpreted as "-1" and that changed the behaviour of the operator.

It is debatable whether we ever wanted to explicitly allow the use of "-1" in the shape argument of the operator. At least I as the original author had this not in mind and also the documentation does not say anything about magic numbers. In fact, the only case where the operator did work (and could ever work) with magic numbers is when "-1" is specified as first coordinate. Nothing else is possible.
Two solutions:

We document that it is o.k. to use "-1" as first shape coordinate in ravel/unravel_index. And enhance the operator to support this. That does not require any change for "ravel" and a simple change in the code shown below.
We leave it as is (the code simply does not support this). A potential workaround for customers would be to specify a huge number except "-1" for the first dimension which has essentially the same effect.

struct unravel_index {
  template<typename DType>
  MSHADOW_XINLINE static void Map(int i, index_t N, index_t ndim, index_t *shape,
                                  DType *unravelled, DType *ravelled) {
    index_t idx(ravelled[i]);
    #pragma unroll
    for (int j = ndim; j--; ) {
      index_t tmp = idx / shape[j];
      unravelled[i+j*N] = idx - tmp*shape[j];
      idx = tmp;
    }
  }
};

asmushetzel on 6 Mar 2019

Oh thank you so much! This is quite hilarious - we relied on exploiting this undocumented behavior without knowing :)
For Sockeye this means that we can update to 1.4.0 immediately and make this exploit of range overflow explicit by using shape=(sys.maxsize, y) in our code instead of shape=(-1, y). We rely on this functionality because we cannot infer the first shape value (batch size) in a HybridBlock in our case. As such, I would vote for going with option 1, explicitly supporting it in the future.

fhieber on 6 Mar 2019

PR #14356 will fix this.

asmushetzel on 7 Mar 2019

🎉2 👍2

Was this page helpful?

0 / 5 - 0 ratings