Incubator-mxnet: Bug or highly unintuitive and undocumented behaviour corrupting NDArray data (possible race condition) [MKLDNN][C++]

Created on 16 Jul 2019  ·  14 Comments  ·  Source: apache/incubator-mxnet

Copying from NDArray a to NDArray b and then back from b to a (saving network weights for early stopping) causes data corruption. It is not 100% consistent and is probably hard to reproduce (though I had it reproducing semi-consistently; it must be sensitive to execution conditions, which is why I mention the possibility of a race). Possibly the execution engine provides only a weak guarantee and executes a->b and b->a out of order? (https://mxnet.incubator.apache.org/versions/master/architecture/overview.html#execution-engine "The execution of any two functions that modify a common variable is serialized in their push order.")

Using MXNet with MKLDNN v0.20 beta from C++, built from source downloaded from GitHub (apache-mxnet-src-1.4.0.rc3-incubating). MXNet was built with CMake and MSVC (VS 2019).

```
void Test()
{
    using namespace mxnet::cpp;
    using ArgMap = map<string, NDArray>;
    ArgMap args;
    ArgMap auxiliaryState;
    ArgMap bestModelArgs;
    ArgMap bestModelAux;
    auto AssignNDArrayMap = []( ArgMap &lhs, const ArgMap &rhs )
    {
        for ( auto &[k, arr] : rhs )
            arr.CopyTo( &lhs.at( k ) );
    };
    auto WaitMap = []( auto &map )
    {
        return; //Workaround - commenting out this return statement fixes the problem
        for ( auto &[k, v] : map )
        {
            v.WaitAll();
            v.WaitToRead();
            v.WaitToWrite();
        }
    };
    auto SaveAsBestModel = [ &WaitMap, &bestModelArgs, &bestModelAux, &AssignNDArrayMap, &args, &auxiliaryState ]()
    {
        AssignNDArrayMap( bestModelArgs, args );
        AssignNDArrayMap( bestModelAux, auxiliaryState );
        WaitMap( bestModelArgs );
        WaitMap( bestModelAux );
    };
    auto LoadBestModel = [ &WaitMap, &bestModelArgs, &bestModelAux, &AssignNDArrayMap, &args, &auxiliaryState ]()
    {
        AssignNDArrayMap( args, bestModelArgs );
        AssignNDArrayMap( auxiliaryState, bestModelAux );
        WaitMap( args );
        WaitMap( auxiliaryState );
    };
    //copy
    SaveAsBestModel(); //a -> b
    LoadBestModel();   //b -> a

    //Data in args/auxiliaryState is now corrupt!
}
```
If this is the intended behaviour, there need to be several big, bold warnings, for example on the NDArray API docs page.
Waiting with WaitAll() works around the problem (see the code above).

Bug C++


All 14 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug, C++

@mxnet-label-bot add [Bug, C++]

@ZhennanQin could you help take a look at this?

@kostayScr I can't reproduce the crash on my linux box. Probably because:

    ArgMap args;
    ArgMap auxiliaryState;
    ArgMap bestModelArgs;
    ArgMap bestModelAux;

are all empty in your case, so no copy actually happens. Can you provide a real case that reproduces this issue?

Hello, thanks for looking into it.
Here is more complete code:

void TestWaitMapMXNetBug()
{
    using namespace mxnet::cpp;
    using ArgMap = map<string, NDArray>;

    const size_t testDataSz = 10 * 1024 * 1024;

    float *testDataOnes = new float[testDataSz];
    float *testDataZeroes = new float[testDataSz];
    for ( size_t i = 0; i < testDataSz; i++ )
    {
        testDataOnes[i] = 1.0f;
    }
    for ( size_t i = 0; i < testDataSz; i++ )
    {
        testDataZeroes[i] = 0.0f;
    }
    //NDArray testOnes( testDataOnes, testDataSz ); //crash
    //NDArray testZeroes( testDataZeroes, testDataSz );
    NDArray testOnes( testDataOnes, Shape( testDataSz ), Context::cpu() );
    NDArray testZeroes( testDataZeroes, Shape( testDataSz ), Context::cpu() );

    auto CheckNDArrayValues = []( const NDArray &arr, const float val ) //check that all values in array equal to val
    {
        auto arrData = arr.GetData();
        for ( size_t i = 0; i < arr.Size(); i++ )
        {
            if ( arrData[i] != val )
                return false;
        }
        return true;
    };

    ArgMap args;
    ArgMap auxiliaryState;
    ArgMap bestModelArgs;
    ArgMap bestModelAux;

    auto testName = "testName"s;
    args[testName] = testOnes;
    bestModelArgs[testName] = testZeroes;

    //initial test
    std::cout << CheckNDArrayValues( args[testName], 1.0f );
    std::cout << CheckNDArrayValues( bestModelArgs[testName], 0.0f );

    auto AssignNDArrayMap = []( decltype( args ) &lhs, const decltype( args ) &rhs )
    {
        for ( auto &[k, arr] : rhs )
            arr.CopyTo( &lhs.at( k ) );
    };
    auto WaitMap = []( auto &map )
    {
        return; //Workaround - commenting out this return statement fixes the problem
        for ( auto &[k, v] : map )
        {
            v.WaitAll();
            v.WaitToRead();
            v.WaitToWrite();
        }
    };
    auto SaveAsBestModel = [ &WaitMap, &bestModelArgs, &bestModelAux, &AssignNDArrayMap, &args, &auxiliaryState ]()
    {
        AssignNDArrayMap( bestModelArgs, args );
        AssignNDArrayMap( bestModelAux, auxiliaryState );
        WaitMap( bestModelArgs );
        WaitMap( bestModelAux );
    };
    auto LoadBestModel = [ &WaitMap, &bestModelArgs, &bestModelAux, &AssignNDArrayMap, &args, &auxiliaryState ]()
    {
        AssignNDArrayMap( args, bestModelArgs );
        AssignNDArrayMap( auxiliaryState, bestModelAux );
        WaitMap( args );
        WaitMap( auxiliaryState );
    };
    //copy
    SaveAsBestModel();  //a -> b
    LoadBestModel();    //b -> a

    //Data in args/auxiliaryState is now corrupt!
    //test for corruption, now both must contain ones
    std::cout << CheckNDArrayValues( bestModelArgs[testName], 1.0f );
    std::cout << CheckNDArrayValues( args[testName], 1.0f );

    //bestModelArgs got corrupt, maybe it's all zeroes?
    std::cout << CheckNDArrayValues( bestModelArgs[testName], 0.0f );
}

It reproduces (fails the check) only in Release build mode for me.
Expected output:

true
true
true
true
false

Actual output:

true
true
false
true
false

Enabling the waiting on the NDArrays fixes it.
I also want to clarify that there is no crash - just data corruption.

@kostayScr Thanks for providing this case. I confirm that this is reproducible on my linux box. I'm trying to root-cause this problem. I also found that the naive engine doesn't have this problem, so as a workaround you can switch to the naive engine with:

export MXNET_ENGINE_TYPE=NaiveEngine

@kostayScr After some analysis, I think it's not a bug but an engine feature. The order of a->b and b->a is in fact guaranteed by the engine, but since CopyTo executes asynchronously, you shouldn't access an NDArray until all operations pending on it are done. So inside CheckNDArrayValues, you need to add WaitAll() for the checked NDArray. The code would look like:

  auto CheckNDArrayValues = [](const NDArray &arr, const float val) {
    arr.WaitAll();
    auto arrData = arr.GetData();
    for (size_t i = 0; i < arr.Size(); i++) {
      if (arrData[i] != val) return false;
    }
    return true;
  };

With this change, you can remove WaitMap and get expected result.

In summary, you don't need to wait between a->b and b->a, since the engine makes that guarantee. But you do need to wait until all pending operations have finished before accessing a.

@ZhennanQin I see. But the original problem was with Forward/SyncCopyToCpu() rather than GetData(). I was under the impression that GetData() waits as well, as I believe it does in Python. So I just used it instead without checking (it was shorter). It probably needs a large-font warning about waiting before calling it.

I'm currently trying to make a reproduction snippet. It might be more complex than this; in the original code, the 'b' bunch of NDArrays went out of scope at the end, maybe that's it.

@ZhennanQin I think I managed to pin it down:

auto ConvBlock = []( const Symbol &data, int num, int filterCount = 16, int dilation = 1 )
{
    auto convWeight = Symbol::Variable( "weight_conv" + std::to_string( num ) );
    auto convBias = Symbol::Variable( "bias_conv" + std::to_string( num ) );
    //stride can help perf w/o hurting performance much
    Symbol net = Convolution( data, convWeight, convBias, Shape( 1, 5 ), filterCount, Shape(), Shape( 1, dilation ) );
    net = Activation( net, ActivationActType::kRelu );
    return net;
};
void TestWaitMapMXNetBug()
{
    using namespace mxnet::cpp;
    using ArgMap = map<string, NDArray>;

    const size_t testDataSz = 1 * 1024 * 1024;

    float *testDataFirst = new float[testDataSz];
    float *testDataSecond = new float[testDataSz];
    for ( size_t i = 0; i < testDataSz; i++ ) //fill test data with pattern to later verify
    {
        testDataFirst[i] = i;
    }
    for ( size_t i = 0; i < testDataSz; i++ )
    {
        testDataSecond[i] = testDataSz - i - 1;
    }
    //NDArray testOnes( testDataOnes, testDataSz ); //crash
    //compare array to the first arr.Size() elements of pattern
    auto CheckNDArrayValuesPattern = []( NDArray &arr, const float *pattern, size_t patternSz )
    {
        //auto arrData = arr.GetData();
        std::unique_ptr<float[]> arrData( new float[arr.Size()] );
        arr.SyncCopyToCPU( arrData.get(), arr.Size() ); //Synchronous copy - calls WaitToRead on NDArray,
        assert( arr.Size() <= patternSz );
        auto res = memcmp( arrData.get(), pattern, arr.Size() * sizeof( float ) );
        return res == 0;
    };
    auto CheckNDArrMapValsPattern = [ &CheckNDArrayValuesPattern ]( auto &map_, const float *pattern, size_t patternSz )
    {
        for ( auto &[k, arr] : map_ )
        {
            if ( !CheckNDArrayValuesPattern( arr, pattern, patternSz ) )
                return false;
        }
        return true;
    };

    ArgMap args;
    ArgMap auxiliaryState;
    ArgMap bestModelArgs;
    ArgMap bestModelAux;


    auto SetMapPattern = []( auto &map_, const float *pattern, size_t patternSz )
    {
        for ( auto &[k, arr] : map_ )
        {
            if ( patternSz < arr.Size() )
                throw std::logic_error( __FUNCSIG__ );
            arr.SyncCopyFromCPU( pattern, arr.Size() );
        }
    };

    auto ctx = Context::cpu();
    args["X"] = NDArray( Shape( 200, 1, 1, 1024 * 1), ctx ); //dummy, (batch_size, channel, height, width)
    Symbol data = Symbol::Variable( "X" );
    auto net = ConvBlock( data, 0, 16 );
    net = ConvBlock( net, 1, 16 );
    net = ConvBlock( net, 2, 16 );
    net.InferArgsMap( ctx, &args, args );
    auto exec = net.SimpleBind( ctx, args );
    SetMapPattern( args, testDataFirst, testDataSz );

    for ( auto &[k, v] : args )
        bestModelArgs[k] = NDArray( v.GetShape(), ctx );
    SetMapPattern( bestModelArgs, testDataSecond, testDataSz );

    qDebug() << CheckNDArrMapValsPattern( args, testDataFirst, testDataSz );
    qDebug() << CheckNDArrMapValsPattern( bestModelArgs, testDataSecond, testDataSz );

    auto AssignNDArrayMap = []( decltype( args ) &lhs, const decltype( args ) &rhs )
    {
        for ( auto &[k, arr] : rhs )
            arr.CopyTo( &lhs.at( k ) );
    };
    auto WaitMap = []( auto &map )
    {
        return; //Workaround - commenting out this return statement fixes the problem
        for ( auto &[k, v] : map )
        {
            v.WaitAll();
            v.WaitToRead();
            v.WaitToWrite();
        }
    };
    auto SaveAsBestModel = [ &WaitMap, &bestModelArgs, &bestModelAux, &AssignNDArrayMap, &args, &auxiliaryState ]()
    {
        AssignNDArrayMap( bestModelArgs, args );
        AssignNDArrayMap( bestModelAux, auxiliaryState );
        WaitMap( bestModelArgs );
        WaitMap( bestModelAux );
    };
    auto LoadBestModel = [ &WaitMap, &bestModelArgs, &bestModelAux, &AssignNDArrayMap, &args, &auxiliaryState ]()
    {
        AssignNDArrayMap( args, bestModelArgs );
        AssignNDArrayMap( auxiliaryState, bestModelAux );
        WaitMap( args );
        WaitMap( auxiliaryState );
    };

    for ( int i = 0; i < 1000; ++i )
        exec->Forward( false );
    {
        //to make the NDArray change its storage type to the MKLDNN one
        vector<float> dummy( exec->outputs[0].Size() );
        exec->outputs[0].SyncCopyToCPU(&dummy);
    }

    //copy
    SaveAsBestModel();  //a -> b


    for ( auto &[k, v] : args )
    {
        //uncommenting this line must cause an exception:
        //qDebug() << k.c_str() << v.GetData();

        //exception info:
        //ndarray.cc:751
        //#if MXNET_USE_MKLDNN == 1
        //CHECK(!IsMKLDNNData()) << "We can't generate TBlob for MKLDNN data. "
        //<< "Please use Reorder2Default() to generate a new NDArray first";
        //#endif
    }
    //auto dataPtr1 = bestModelArgs.at( "0" ).GetData();
    LoadBestModel();    //b -> a

    //Data in args/auxiliaryState is now corrupt!
    //test for corruption
    qDebug() << CheckNDArrMapValsPattern( args, testDataFirst, testDataSz );
    qDebug() << CheckNDArrMapValsPattern( bestModelArgs, testDataFirst, testDataSz );
}

Output should be all true, but it's:

true
true
false
false

It seems to be related to the data format (storage type) of the NDArrays. See the comment in the code about the exception - it's what gave me the hint on how to reproduce it. I checked quite a few things before I got it reproducing; here is the uncleaned code, maybe it'll save some time: https://pastebin.com/RcHb6kdb (the commented-out code didn't reproduce the bug).

@kostayScr Thanks for providing such a detailed case. I can reproduce it locally. Some experimenting shows that it's another unexpected behavior of MKLDNNDataReorderAsync. Commenting it out in mkldnn_convolution.cc fixes this issue, but inference speed drops, especially for small batch sizes. I'm trying to fix this without the performance impact. It may take a few more days.

@ZhennanQin Yeah, it was a slippery one, with many pre-conditions (that's why the code is so long). The WaitMap function works fine as a workaround for me. It doesn't affect performance, and results seem to be stable.

@kostayScr I filed a PR to fix this: https://github.com/apache/incubator-mxnet/pull/15637. The local test works for me. But as you know, the problem is a bit complicated and random, so I hope you can give it a try with your real case to see whether the problem is really resolved. Thanks in advance.

@ZhennanQin I applied the patch manually, then tried to reproduce the bug in my code (ran it 5 times). The bug didn't pop up, so hopefully it is fixed now. Should I close the issue?

@kostayScr Thanks for trying it out. It's up to you whether you want to keep this open to track the fix landing on master :)
