Incubator-mxnet: Current MXNet-Dev master breaks loading of certain models

Created on 24 Jun 2019 · 9 Comments · Source: apache/incubator-mxnet

Description

The current MXNet master dev branch (PyPI version 1.5.0b20190623) breaks the loading of certain MXNet models, in both mxnet-mkl and mxnet-cu100, that previously loaded successfully with mxnet==1.4.1.
The model uses grouped depthwise (a.k.a. depthwise separable) convolutions, which could be the cause of this issue, because other models (e.g. CrazyAraFish_0.5.0_RiseV1.zip) still work as usual.

Environment info

I'm using python, but the same problem also occurs when building the MXNET-CPP package from source.

Error Message:

isready
self.symbol_path: /home/queensgambit/Programming/Deep_Learning/models/risev2/symbol/model-1.19246-0.603-symbol.json
self.params_path: /home/queensgambit/Programming/Deep_Learning/models/risev2/params/model-1.19246-0.603-0223.params
[00:35:51] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[00:35:51] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
Traceback (most recent call last):
  File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1623, in simple_bind
    ctypes.byref(exe_handle)))
  File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator value: [00:35:51] include/mxnet/./tuple.h:202: Check failed: i >= 0 && i < ndim(): index = 0 must be in range [0, -1)
Stack trace:
  [bt] (0) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b3ab) [0x7f186bc433ab]
  [bt] (1) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2c5343) [0x7f186bcad343]
  [bt] (2) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x298bf82) [0x7f186e373f82]
  [bt] (3) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2471ee2) [0x7f186de59ee2]
  [bt] (4) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2474794) [0x7f186de5c794]
  [bt] (5) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x355) [0x7f186de48455]
  [bt] (6) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Executor::SimpleBind(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*)+0x8a8) [0x7f186de49688]
  [bt] (7) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBindEx+0x221b) [0x7f186dd9884b]
  [bt] (8) /home/queensgambit/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1872e3eec0]



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "crazyara.py", line 668, in main
    self.setup_network()
  File "crazyara.py", line 166, in setup_network
    model_weights_dir=self.settings["model_weights_dir"]))
  File "/home/queensgambit/Programming/Deep_Learning/CrazyAra/DeepCrazyhouse/src/domain/agent/neural_net_api.py", line 95, in __init__
    force_rebind=True,
  File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1629, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 34, 8, 8)
force_rebind: True
Error in operator value: [00:35:51] include/mxnet/./tuple.h:202: Check failed: i >= 0 && i < ndim(): index = 0 must be in range [0, -1)
Stack trace:
  [bt] (0) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b3ab) [0x7f186bc433ab]
  [bt] (1) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2c5343) [0x7f186bcad343]
  [bt] (2) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x298bf82) [0x7f186e373f82]
  [bt] (3) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2471ee2) [0x7f186de59ee2]
  [bt] (4) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2474794) [0x7f186de5c794]
  [bt] (5) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x355) [0x7f186de48455]
  [bt] (6) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Executor::SimpleBind(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*)+0x8a8) [0x7f186de49688]
  [bt] (7) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBindEx+0x221b) [0x7f186dd9884b]
  [bt] (8) /home/queensgambit/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1872e3eec0]

Minimum reproducible example

Steps to reproduce

Download release CrazyAra_0.5.0_RiseV2_mobile.zip at:

pip install python-chess

Extract CrazyAra_0.5.0_RiseV2_mobile.zip and run

$ python crazyara.py
$ uci
$ isready

from the commandline.
More details for install instructions can be found here:

Alternatively, you can load the mxnet model from the model/ directory manually in python.

Does anyone have an idea which recent change causes this?
Could you add more automated unit tests to MXNet to ensure that loading of different model types keeps working across version updates?

Backend Bug

All 9 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@mxnet-label-bot add [Bug]

@mxnet-label-bot add [Backend]

This issue might also be related to https://github.com/apache/incubator-mxnet/issues/15281.

Hi @QueensGambit, I'm getting a file-not-found error when following the steps to reproduce.

I do have model-1.19246-0.603-0223.params under model/params.

uciok

isready
info string The given batch_size 8 is higher than the number of threads 4. The maximum legal batch_size is the same as the number of threads (here: 4) 
info string The batch_size was reduced to 4
Traceback (most recent call last):
  File "crazyara.py", line 734, in main
    self.setup_network()
  File "crazyara.py", line 169, in setup_network
    model_weights_dir=self.settings["model_weights_dir"]))
  File "/Users/lawei/Downloads/CrazyAra_0.5.0_RiseV2_mobile/DeepCrazyhouse/src/domain/agent/neural_net_api.py", line 60, in __init__
    + '. Please make sure that the path has a "/" at the end of the path.'
Exception: No params file (.params) was found in your given model_weights_dir: ./model/params/. Please make sure that the path has a "/" at the end of the path.

I'm also getting a parameter-not-found error when trying to load the symbol and params directly:

>>> gluon.nn.SymbolBlock.imports("model-1.19246-0.603-symbol.json", ['data'], "model-1.19246-0.603-0223.params")

[13:25:51] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[13:25:51] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py:1159: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
    data: None
  input_sym_arg_type = in_param.infer_type()[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 1037, in imports
    ret.collect_params().load(param_file, ctx=ctx, cast_dtype=True, dtype_source='saved')
  File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/parameter.py", line 960, in load
    ignore_extra, restore_prefix, filename, cast_dtype, dtype_source)
  File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/parameter.py", line 995, in load_dict
    name[lprefix:], error_str, _brief_print_list(arg_dict.keys()))
AssertionError: Parameter 'value_label' is missing in file: model-1.19246-0.603-0223.params, which contains parameters: 'stem_conv0_weight', 'stem_bn0_gamma', 'stem_bn0_beta', ..., 'value_bn0_moving_mean', 'value_bn0_moving_var', 'policy_bn0_moving_mean', 'policy_bn0_moving_var'. Please make sure source and target networks have the same prefix.
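
The assertion above names a variable, value_label, that the symbol expects but the params file does not contain. The following is a minimal sketch, not part of MXNet, for listing such variables before binding. It assumes the usual symbol JSON layout in which variables are stored as nodes with "op": "null"; the function name find_unbound_variables and the toy graph (including the choice of LinearRegressionOutput as the output operator) are made up for illustration.

```python
import json


def find_unbound_variables(symbol_json_text, param_names, input_names=("data",)):
    """List variable nodes that are neither a known input nor a saved parameter.

    Assumes variables appear in the symbol JSON as nodes with "op": "null".
    """
    graph = json.loads(symbol_json_text)
    known = set(param_names) | set(input_names)
    return [
        node["name"]
        for node in graph["nodes"]
        if node.get("op") == "null" and node["name"] not in known
    ]


# Hypothetical toy graph; a real file would be read from disk instead.
example_symbol = json.dumps({
    "nodes": [
        {"op": "null", "name": "data", "inputs": []},
        {"op": "null", "name": "stem_conv0_weight", "inputs": []},
        {"op": "Convolution", "name": "stem_conv0", "inputs": [[0, 0, 0], [1, 0, 0]]},
        {"op": "null", "name": "value_label", "inputs": []},
        {"op": "LinearRegressionOutput", "name": "value", "inputs": [[2, 0, 0], [3, 0, 0]]},
    ],
    "arg_nodes": [0, 1, 3],
    "heads": [[4, 0, 0]],
})

unbound = find_unbound_variables(example_symbol, {"stem_conv0_weight"})
print(unbound)  # ['value_label']
```

Any name reported this way is an input that has to be supplied (for example a label) or cut out of the graph before the model can be loaded for inference.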

@roywei Thank you for the reply.
Sorry for the inconvenience; there was apparently a / missing in the relative path of the config files which I released. I just updated the .zip release files, and it should work again for MXNet 1.4.1.
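
As a side note on the missing "/": the error message suggests that the directory and file name are joined as plain strings, which is an assumption about neural_net_api.py rather than something shown in this thread. A minimal sketch of a lookup that works with or without the trailing slash, using only the standard library (the helper name find_params_file is hypothetical):

```python
import tempfile
from pathlib import Path


def find_params_file(model_weights_dir):
    """Return the first *.params file in the directory.

    Path() normalises the directory, so a trailing "/" is optional.
    """
    directory = Path(model_weights_dir)
    matches = sorted(directory.glob("*.params"))
    if not matches:
        raise FileNotFoundError("No .params file found in %s" % directory)
    return matches[0]


# Build a throwaway directory that mimics the model/params layout.
with tempfile.TemporaryDirectory() as tmp:
    params_dir = Path(tmp) / "model" / "params"
    params_dir.mkdir(parents=True)
    (params_dir / "model-1.19246-0.603-0223.params").touch()

    with_slash = find_params_file(str(params_dir) + "/")
    without_slash = find_params_file(str(params_dir))
    same_result = with_slash == without_slash
```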

This is how the model is currently loaded:
https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/agent/neural_net_api.py#L66

        sym = mx.sym.load(self.symbol_path)
        # https://github.com/apache/incubator-mxnet/issues/6951
        save_dict = mx.nd.load(self.params_path)
        arg_params = {}
        aux_params = {}
        for key, val in save_dict.items():
            param_type, name = key.split(":", 1)
            if param_type == "arg":
                arg_params[name] = val
            if param_type == "aux":
                aux_params[name] = val
        # set the context on CPU, switch to GPU if there is one available
        if ctx == "cpu":
            self.ctx = mx.cpu()
        elif ctx == "gpu":
            self.ctx = mx.gpu()
        else:
            raise Exception("Unavailable ctx mode given %s. You must either select 'cpu' or 'gpu'" % ctx)
        # define batch_size executor objects which are used for inference
        # one executor object is used for each possible batch length
        # the requested batch length is variable and at most the given batch_size
        self.executors = []
        for i in range(batch_size):
            executor = sym.simple_bind(
                ctx=self.ctx,
                # add a new length for each size starting with 1
                data=(i + 1, NB_CHANNELS_FULL, BOARD_HEIGHT, BOARD_WIDTH),
                grad_req="null",
                force_rebind=True,
            )
            executor.copy_params_from(arg_params, aux_params)
            self.executors.append(executor)
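
The prefix handling at the top of this snippet can be tried without MXNet. Below is a minimal sketch, assuming the checkpoint keys follow the "arg:name" / "aux:name" convention used above; the helper name split_checkpoint_params and the toy values are made up, and real values would be NDArrays returned by mx.nd.load. Unlike the loop above, this version raises on keys with an unexpected prefix instead of skipping them, which is a design choice of the sketch.

```python
def split_checkpoint_params(save_dict):
    """Split 'arg:name' / 'aux:name' keys into two dicts keyed by the bare name."""
    arg_params, aux_params = {}, {}
    for key, value in save_dict.items():
        param_type, sep, name = key.partition(":")
        if not sep or param_type not in ("arg", "aux"):
            raise ValueError("unexpected checkpoint key: %r" % key)
        if param_type == "arg":
            arg_params[name] = value
        else:
            aux_params[name] = value
    return arg_params, aux_params


# Toy stand-in for the dict returned by mx.nd.load(params_path).
toy_checkpoint = {
    "arg:stem_conv0_weight": "weights",
    "aux:value_bn0_moving_mean": "running mean",
}
arg_params, aux_params = split_checkpoint_params(toy_checkpoint)
```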

I think I know why the loading fails; thank you for the help, @roywei. It's because I ported the training code from Gluon to MXNet for this model. The reason for this was that I experienced long delays during training due to MXNET_CUDNN_AUTOTUNE_DEFAULT calls:

Apparently, in MXNet 1.4.1 the code above works and ignores the missing label information, whereas version 1.5.0 rejects it, which is a behaviour I appreciate.

Using this code, I'm able to load the model successfully in both MXNet 1.4.1 and MXNet 1.5.0:

import mxnet as mx

model_arch_path = 'model-1.19246-0.603-symbol.json'
model_params_path = 'model-1.19246-0.603-0223.params'
ctx = mx.cpu()
symbol = mx.sym.load(model_arch_path)
inputs = mx.sym.var('data', dtype='float32')
# pick the value and policy outputs from the graph internals
value_out = symbol.get_internals()['value_tanh0_output']
policy_out = symbol.get_internals()['flatten0_output']
sym = mx.symbol.Group([value_out, policy_out])
# wrap the grouped outputs in a Gluon block and load the trained weights
net = mx.gluon.SymbolBlock(sym, inputs)
net.collect_params().load(model_params_path, ctx)

Consequently, this issue can be closed.
