Incubator-mxnet: MKLDNN RNN seg fault

Created on 1 Oct 2020 · 10 comments · Source: apache/incubator-mxnet

A customer is experiencing a seg fault when feeding a large input to the MKL LSTM. I have reduced the code to this:

import mxnet as mx
from mxnet import gluon, nd, autograd
from mxnet.gluon import nn, rnn, Trainer

hidden_size = 30
num_embed = 100
vocab_size = 13028  # len(vocab.token_to_idx.keys())

inp = nd.random.uniform(0, vocab_size, (16758,500))
print(inp)

context = mx.cpu()

model = nn.Sequential()
model.add(nn.Embedding(vocab_size, num_embed), # Embedding layer
          rnn.LSTM(hidden_size, num_layers=1, bidirectional=True),  # Recurrent layer (bidirectional)
          nn.Dense(3))  # Output layer

model.collect_params().initialize(mx.init.Xavier(), ctx=context)

val_predictions = model(inp)
nd.waitall()
print(val_predictions)

I think this is some sort of out-of-memory issue, because the seg fault goes away if we shrink the input (the first dim of inp). Still, shall we add an error message here so that users are told to reduce the input size?
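To illustrate the kind of error message I mean, here is a minimal sketch (purely illustrative; MAX_SAFE_SEQ_LEN and checked_forward are made-up placeholders, not values or APIs from MXNet or oneDNN):

# Hypothetical guard: fail fast with a readable error instead of a seg fault.
# MAX_SAFE_SEQ_LEN is an assumption and would need to be tuned for the available RAM.
MAX_SAFE_SEQ_LEN = 8192

def checked_forward(model, inp):
    if inp.shape[0] > MAX_SAFE_SEQ_LEN:
        raise ValueError(
            "Input first dim %d exceeds %d; the MKLDNN LSTM workspace may not "
            "fit in memory. Reduce the input size or split it into batches."
            % (inp.shape[0], MAX_SAFE_SEQ_LEN))
    return model(inp)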

I also noticed that the same input runs fine with export MXNET_USE_MKLDNN_RNN=0, but that path is about 3x slower than the MKLDNN implementation. Another suggestion I made to the customer was to pick a magic number below the seg-fault threshold and run multiple batches smaller than that (the customer was trying to forward-pass the entire validation set), but this is also a pretty hacky solution. So maybe, better yet, we can optimize the MKLDNN implementation to handle data that is currently too large?
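For reference, this is roughly the chunked forward pass I suggested to the customer (a sketch only, assuming the first axis of inp indexes independent validation samples rather than time steps; predict_in_chunks is hypothetical and chunk_size is the "magic number" below the seg-fault threshold):

from mxnet import nd

def predict_in_chunks(model, inp, chunk_size=1024):
    # Run the forward pass in slices along the first axis so the MKLDNN LSTM
    # workspace stays small, then stitch the outputs back together.
    outs = []
    for start in range(0, inp.shape[0], chunk_size):
        out = model(inp[start:start + chunk_size])
        out.wait_to_read()  # keep only one chunk in flight at a time
        outs.append(out)
    return nd.concat(*outs, dim=0)

val_predictions = predict_in_chunks(model, inp, chunk_size=1024)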

@PatricZhao

Bug MKLDNN RNN


All 10 comments

seg fault:

Segmentation fault: 11

terminate called without an active exception
Aborted (core dumped)

GDB:


Thread 9 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffbac26700 (LWP 18164)]
bt
0x00007fff9c0743f0 in ?? ()
(gdb) bt
#0  0x00007fff9c0743f0 in ?? ()
#1  0x00007fffe5e905ec in float** dnnl::impl::memory_tracking::grantor_t::get<float*>(unsigned int const&) const
    () from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#2  0x00007fffe5e93697 in dnnl::impl::cpu::_ref_rnn_common_t<(dnnl_prop_kind_t)64, (dnnl_data_type_t)3, (dnnl_data_type_t)3, (dnnl_data_type_t)3>::execute_(dnnl::impl::exec_ctx_t const&) const ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#3  0x00007fffe5d05de9 in dnnl::impl::cpu::_ref_rnn_common_t<(dnnl_prop_kind_t)64, (dnnl_data_type_t)3, (dnnl_data_type_t)3, (dnnl_data_type_t)3>::execute(dnnl::impl::exec_ctx_t const&) const ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#4  0x00007fffe5890788 in dnnl_primitive_execute ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#5  0x00007fffe0a5eb1a in mxnet::MKLDNNStream::Submit(bool) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#6  0x00007fffe0b13343 in mxnet::op::MKLDNNRnnOp::Forward(mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#7  0x00007fffe5306633 in mxnet::op::RNNStatefulComputeExCPU(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#8  0x00007fffe4f503fd in mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const () from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#9  0x00007fffe4f506cd in std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::O---Type <return> to continue, or q <return> to quit---
pStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#2}>::_M_invoke(std::_Any_data const&, mxnet::RunContext) () from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#10 0x00007fffe501d754 in std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext, mxnet::engine::CallbackOnComplete) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#11 0x00007fffe50180a5 in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) () from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#12 0x00007fffe502a294 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#13 0x00007fffe5016934 in std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so
#14 0x00007fffded79421 in std::execute_native_thread_routine_compat (__p=<optimized out>)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
#15 0x00007ffff7bbd6db in start_thread (arg=0x7fffbac26700) at pthread_create.c:463
#16 0x00007ffff78e6a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@TaoLv @ciyongch @PatricZhao - Hello guys, can you please help with this issue? We have seen at least 2 production users impacted by this, and USE_MKLDNN=0 was a temporary fix, but performance is really bad, as expected. This is a blocker.

@anko-intel

Thanks, @Zha0q1 @sandeep-krishnamurthy! I'll have a look at this issue.

@Zha0q1 Could you please give me a few more details about this issue, such as the branch name and its commit SHA, and which version of MKLDNN you have (commit SHA)? Thanks!

I am using mxnet 1.7 (https://github.com/apache/incubator-mxnet/releases/tag/1.7.0) from pip install mxnet. The machine was a C5.9xlarge DLAMI Ubuntu 18 EC2 instance.

Hi,

Well,
When running our pre-model (a simple imitation of the LSTM model) as a test, I created a large LSTM tensor, for example (20758, 500). It turns out that ~170 GB of memory is allocated for scratchpad computations (the global scratchpad is always used). As a result, depending on the oneDNN version, I got the following error messages:

  1. With mkldnn v1.3: Segmentation fault: 11
  2. With mkldnn v1.6: mxnet.base.MXNetError: MXNetError: could not create a primitive

This error only shows up for large LSTM tensors, and a step-by-step reproduction casts light on the issue. Looking at the code, the standard vanilla-LSTM algorithm in MKLDNN allocates a workspace block based on the equation sizeof(float) * work_space, where work_space is an offset (in bytes). For the given test (input: 20758, 500) we can see that ~170 GB of memory is allocated for the scratchpad computation: work_space = 47952392192, times sizeof(float), gives 191809568768 bytes ~ 170 GB. If you do not have enough memory, you get one of the two errors above (1 or 2). MKLDNN primitives can use either individual memory or a global buffer for intermediate computations. The first might give better performance, since the memory will most likely stay attached to a thread; the second can save a lot of memory.
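As a quick sanity check of that arithmetic (plain Python, nothing MXNet-specific):

# Reproduce the scratchpad numbers quoted above.
work_space = 47952392192                 # the work_space value quoted above
bytes_needed = work_space * 4            # * sizeof(float)
print(bytes_needed)                      # 191809568768
print(bytes_needed / 2 ** 30)            # ~178.6 GiB, i.e. the ~170 GB figure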

For brevity:
The input tensor is T x N x C; for the given example (10758, 500), T is 10758 and C is 500. That means we need at least 4 * 10758 * 500 * 500 * 4 bytes ~ 40 GB, or maybe more. Basically, the workspace is comparable with the grid size n_layers * mb * n_time_stamps * 4 (gates) * max(sic, slc, dhsc)^2. For both oneDNN versions (1.3 and 1.6) the size of the workspace (i.e. the LSTM space) is booked as book<float>(num_elems, ...), i.e. ~40 GB * sizeof(T) = 40 GB * 4 ~ 160 GB. An upper bound on the input tensor size has not been clearly defined; the effective upper bound is the physical amount of memory. So the size of the buffer needed for the LSTM tensor is determined as 4 * 10758 * 500 * 500 * 4 bytes ~ 40 GB, yet this value is then multiplied by the size of its element type (in this case float).
Approximately, it should be defined as follows:

  1. The size should be work_space * sizeof(T), where sizeof(T) is ~1 byte [potentially].
  2. The workspace is only limited by the total number of elements of a given tensor.

The upper bound of a given tensor (the upper bound for the LSTM) is:
n^2 * m = memory_space / (16 bytes)
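If I read the above correctly, the numbers line up as follows (a plain-Python check; my interpretation of which factor is the superfluous one is an assumption):

# Intended workspace for (10758, 500): 4 (gates) * T * C * C * 4 bytes ~ 40 GB.
intended_bytes = 4 * 10758 * 500 * 500 * 4      # 43032000000
# The buggy path books that figure as a float element count, i.e. multiplies by
# sizeof(float) once more, which yields the ~160-170 GB actually allocated:
actual_bytes = intended_bytes * 4               # 172128000000
# The n^2 * m = memory_space / (16 bytes) relation gives the same intended size:
print(500 ** 2 * 10758 * 16 == intended_bytes)  # True
print(intended_bytes / 1e9, actual_bytes / 1e9) # ~43 GB vs ~172 GB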

Hi @Zha0q1
There's a bug in oneDNN LSTM forward inference that results in using ~4x more memory for the LSTM workspace in inference cases.
Could you please tell me whether this fix (see the table below) is acceptable and whether it resolves the issue for you?

(dim: 20756, 500) | Before | After
-- | -- | --
The total size of memory needed to allocate the LSTM tensor | 230 GB (~4x more memory) | 56 GB (~4x less memory)

@mozga-intel Thanks for your investigation! Yes, this improvement is huge and will help our users who run inference tasks on pre-trained models. It would be great to include this fix in the next oneDNN release.

> @TaoLv @ciyongch @PatricZhao - Hello guys, can you please help with this issue? We have seen at least 2 production users impacted by this, and USE_MKLDNN=0 was a temporary fix, but performance is really bad, as expected. This is a blocker.

Sorry about that; the team is working on fixing any possible issues. Feel free to ping us about any issue :)
