Incubator-mxnet: MKL Bug

Created on 10 Oct 2017 · 4Comments · Source: apache/incubator-mxnet

@ykim362

Environment info

Operating System: OSX, pip install --pre mxnet-mkl

MXNet commit hash (git rev-parse HEAD): b6b8da0ac2b1ef8b84089458c757ce8b19aab0d7

Python version and distribution: Python 2.7.13

Minimum reproducible example

from __future__ import print_function
import mxnet as mx
import skimage.io as io
import numpy as np

def print_activation(net, url):
    I = io.imread(url)
    if I.shape[2] == 4:
        I = rgba2rgb(I)
    image = mx.nd.array(I).astype(np.uint8)
    image = mx.image.resize_short(image, 256)
    image, _ = mx.image.center_crop(image, (224, 224))
    image = mx.image.color_normalize(image.astype(np.float32)/255,
                                     mean=mx.nd.array([0.485, 0.456, 0.406]),
                                     std=mx.nd.array([0.229, 0.224, 0.225]))
    image = mx.nd.transpose(image.astype('float32'), (2,1,0))
    image = mx.nd.expand_dims(image, axis=0)
    out = net(image)
    return (out - out.max()).exp().sum().asscalar(), mx.nd.log_softmax(out), out
act1, log_softmax_out1, out1 = print_activation(mx.gluon.model_zoo.vision.squeezenet1_1(pretrained=True), 'https://github.com/zackchase/mxnet-the-straight-dope/raw/master/img/real_hotdog.jpg')
act2, log_softmax_out2, out2 = print_activation(mx.gluon.model_zoo.vision.squeezenet1_1(pretrained=True), 'https://github.com/zackchase/mxnet-the-straight-dope/raw/master/img/real_hotdog.jpg')
print(act1, act2)
print((out1-out2).max(), (out1-out2).min())

Expected outputs should be

1.16364 1.16364

[ 0. ]
<NDArray 1 @cpu(0)>
[ 0. ]
<NDArray 1 @cpu(0)>

Instead I got

MKL Build:20170720
3.0 nan

[ 0.]
<NDArray 1 @cpu(0)>
[ 0.]
<NDArray 1 @cpu(0)>

And the first two values are different every time I run it.

% python bug.py
MKL Build:20170720
1.0 nan

[ 0.]
<NDArray 1 @cpu(0)>
[ 0.]
<NDArray 1 @cpu(0)>

% python bug.py
MKL Build:20170720
nan 718.584

[ 0.]
<NDArray 1 @cpu(0)>
[ 0.]
<NDArray 1 @cpu(0)>

Steps to reproduce

or if you are running standard examples, please provide the commands you have run that lead to the error.

pip install --pre mxnet-mkl
pip install scikit-image
Save the above code snippet into file bug.py
python bug.py

What have you tried to solve it?

Calling mx.nd.log_softmax(out) before the term (out - out.max()).exp().sum().asscalar() in print_activation function seems to make the results right.
Results are correct for variants that don't involve MKL.

Bug

Source

szha

Most helpful comment

My MKLDNN code seems to solve the problem. The output is very deterministic. It's always

1.1635 1.1635

[ 0.]
<NDArray 1 @cpu(0)> 
[ 0.]
<NDArray 1 @cpu(0)>

zheng-da on 19 Nov 2017

👍2

All 4 comments

is there an update on this issue? we are keen to include MKL in an upcoming project.

jesterhazy on 27 Oct 2017

@ykim362 stated that he started working on this a while back, but there's no update from him yet.

Help on finding the root cause is welcome.

szha on 27 Oct 2017

My MKLDNN code seems to solve the problem. The output is very deterministic. It's always

1.1635 1.1635

[ 0.]
<NDArray 1 @cpu(0)> 
[ 0.]
<NDArray 1 @cpu(0)>

zheng-da on 19 Nov 2017

👍2

@zheng-da Thanks! I have been also assuming the NDArray re-factoring solves this problem. Does the NDArrady refactoring also enable gluon APIs not currently available for CPU? (Inplace operations)