The current MXNet master dev branch (PyPI version 1.5.0b20190623) breaks the loading of certain MXNet models (both in mxnet-mkl and mxnet-cu100) which previously loaded successfully with mxnet==1.4.1.
The model uses grouped depthwise (a.k.a. depthwise separable) convolutions, which could be the cause of this issue, because other models (e.g. CrazyAraFish_0.5.0_RiseV1.zip) still work correctly as usual.
I'm using Python, but the same problem also occurs when building the MXNet C++ package from source.
isready
self.symbol_path: /home/queensgambit/Programming/Deep_Learning/models/risev2/symbol/model-1.19246-0.603-symbol.json
self.params_path: /home/queensgambit/Programming/Deep_Learning/models/risev2/params/model-1.19246-0.603-0223.params
[00:35:51] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[00:35:51] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
Traceback (most recent call last):
File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1623, in simple_bind
ctypes.byref(exe_handle)))
File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator value: [00:35:51] include/mxnet/./tuple.h:202: Check failed: i >= 0 && i < ndim(): index = 0 must be in range [0, -1)
Stack trace:
[bt] (0) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b3ab) [0x7f186bc433ab]
[bt] (1) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2c5343) [0x7f186bcad343]
[bt] (2) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x298bf82) [0x7f186e373f82]
[bt] (3) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2471ee2) [0x7f186de59ee2]
[bt] (4) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2474794) [0x7f186de5c794]
[bt] (5) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x355) [0x7f186de48455]
[bt] (6) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Executor::SimpleBind(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*)+0x8a8) [0x7f186de49688]
[bt] (7) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBindEx+0x221b) [0x7f186dd9884b]
[bt] (8) /home/queensgambit/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1872e3eec0]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "crazyara.py", line 668, in main
self.setup_network()
File "crazyara.py", line 166, in setup_network
model_weights_dir=self.settings["model_weights_dir"]))
File "/home/queensgambit/Programming/Deep_Learning/CrazyAra/DeepCrazyhouse/src/domain/agent/neural_net_api.py", line 95, in __init__
force_rebind=True,
File "/home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1629, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 34, 8, 8)
force_rebind: True
Error in operator value: [00:35:51] include/mxnet/./tuple.h:202: Check failed: i >= 0 && i < ndim(): index = 0 must be in range [0, -1)
Stack trace:
[bt] (0) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b3ab) [0x7f186bc433ab]
[bt] (1) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2c5343) [0x7f186bcad343]
[bt] (2) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x298bf82) [0x7f186e373f82]
[bt] (3) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2471ee2) [0x7f186de59ee2]
[bt] (4) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2474794) [0x7f186de5c794]
[bt] (5) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x355) [0x7f186de48455]
[bt] (6) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Executor::SimpleBind(nnvm::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::vector<mxnet::Context, std::allocator<mxnet::Context> > const&, std::unordered_map<std::string, mxnet::TShape, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::TShape> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, int> > > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::unordered_map<std::string, mxnet::NDArray, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, mxnet::NDArray> > >*, mxnet::Executor*)+0x8a8) [0x7f186de49688]
[bt] (7) /home/queensgambit/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBindEx+0x221b) [0x7f186dd9884b]
[bt] (8) /home/queensgambit/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1872e3eec0]
Download release CrazyAra_0.5.0_RiseV2_mobile.zip at:
pip install python-chess
Extract CrazyAra_0.5.0_RiseV2_mobile.zip and run
$ python crazyara.py
$ uci
$ isready
from the command line.
More details on the installation instructions can be found here:
Alternatively, you can load the MXNet model from the model/ directory manually in Python.
Does someone have an idea what recent change causes this?
Could you include more automated unit tests for MXNet to ensure that loading of different model types is preserved across version updates?
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug
@mxnet-label-bot add [Bug]
@mxnet-label-bot add [Backend]
This issue might also be related to https://github.com/apache/incubator-mxnet/issues/15281.
Hi @QueensGambit, I'm getting a file-not-found error when following the steps to reproduce.
I do have model-1.19246-0.603-0223.params under model/params.
uciok
isready
info string The given batch_size 8 is higher than the number of threads 4. The maximum legal batch_size is the same as the number of threads (here: 4)
info string The batch_size was reduced to 4
Traceback (most recent call last):
File "crazyara.py", line 734, in main
self.setup_network()
File "crazyara.py", line 169, in setup_network
model_weights_dir=self.settings["model_weights_dir"]))
File "/Users/lawei/Downloads/CrazyAra_0.5.0_RiseV2_mobile/DeepCrazyhouse/src/domain/agent/neural_net_api.py", line 60, in __init__
+ '. Please make sure that the path has a "/" at the end of the path.'
Exception: No params file (.params) was found in your given model_weights_dir: ./model/params/. Please make sure that the path has a "/" at the end of the path.
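As an aside, the trailing-slash requirement mentioned in the error above can be avoided by joining paths with `os.path.join` and discovering the `.params` file with `glob`. This is only a sketch, not the code the project actually ships; `find_params_file` is a hypothetical helper name:

```python
import glob
import os

def find_params_file(model_weights_dir):
    """Return the first .params file found in model_weights_dir.

    os.path.join inserts the separator itself, so it does not matter
    whether model_weights_dir ends with a "/" or not.
    (Hypothetical helper, for illustration only.)
    """
    matches = sorted(glob.glob(os.path.join(model_weights_dir, "*.params")))
    if not matches:
        raise FileNotFoundError(
            "No params file (.params) was found in your given "
            "model_weights_dir: %s" % model_weights_dir
        )
    return matches[0]
```

With this approach, `find_params_file("model/params")` and `find_params_file("model/params/")` resolve to the same file, so the config no longer needs a trailing "/".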
Also, I'm getting a parameter-not-found error when trying to load the symbol and params directly:
>>> gluon.nn.SymbolBlock.imports("model-1.19246-0.603-symbol.json", ['data'], "model-1.19246-0.603-0223.params")
[13:25:51] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[13:25:51] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py:1159: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
data: None
input_sym_arg_type = in_param.infer_type()[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 1037, in imports
ret.collect_params().load(param_file, ctx=ctx, cast_dtype=True, dtype_source='saved')
File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/parameter.py", line 960, in load
ignore_extra, restore_prefix, filename, cast_dtype, dtype_source)
File "/Users/lawei/anaconda3/lib/python3.6/site-packages/mxnet/gluon/parameter.py", line 995, in load_dict
name[lprefix:], error_str, _brief_print_list(arg_dict.keys()))
AssertionError: Parameter 'value_label' is missing in file: model-1.19246-0.603-0223.params, which contains parameters: 'stem_conv0_weight', 'stem_bn0_gamma', 'stem_bn0_beta', ..., 'value_bn0_moving_mean', 'value_bn0_moving_var', 'policy_bn0_moving_mean', 'policy_bn0_moving_var'. Please make sure source and target networks have the same prefix.
@roywei Thank you for the reply.
Sorry for the inconvenience; apparently a "/" was missing in the relative path of the config files which I released. I just updated the .zip release files and it should work again with MXNet 1.4.1.
This is how the model is currently loaded:
https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/agent/neural_net_api.py#L66
sym = mx.sym.load(self.symbol_path)
# https://github.com/apache/incubator-mxnet/issues/6951
save_dict = mx.nd.load(self.params_path)
arg_params = {}
aux_params = {}
for key, val in save_dict.items():
    param_type, name = key.split(":", 1)
    if param_type == "arg":
        arg_params[name] = val
    if param_type == "aux":
        aux_params[name] = val
# set the context on CPU, switch to GPU if there is one available
if ctx == "cpu":
    self.ctx = mx.cpu()
elif ctx == "gpu":
    self.ctx = mx.gpu()
else:
    raise Exception("Unavailable ctx mode given %s. You must either select 'cpu' or 'gpu'" % ctx)
# define batch_size times executor objects which are used for inference
# one executor object is used for the currently requested batch length
# the requested batch length is variable and at maximum the given batch_size
self.executors = []
for i in range(batch_size):
    executor = sym.simple_bind(
        ctx=self.ctx,
        # add a new length for each size starting with 1
        data=(i + 1, NB_CHANNELS_FULL, BOARD_HEIGHT, BOARD_WIDTH),
        grad_req="null",
        force_rebind=True,
    )
    executor.copy_params_from(arg_params, aux_params)
    self.executors.append(executor)
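For illustration, the "arg:"/"aux:" key-splitting convention used above can be sketched without MXNet, using a simulated save_dict (the values are NDArrays in practice; plain strings stand in here, and only the parameter names are taken from the error output above):

```python
# Sketch of splitting mx.nd.load-style checkpoint keys into arg/aux dicts.
save_dict = {
    "arg:stem_conv0_weight": "weight-tensor",
    "aux:stem_bn0_moving_mean": "running-mean-tensor",
}

arg_params = {}
aux_params = {}
for key, val in save_dict.items():
    # keys look like "<param_type>:<name>", e.g. "arg:stem_conv0_weight"
    param_type, name = key.split(":", 1)
    if param_type == "arg":
        arg_params[name] = val
    elif param_type == "aux":
        aux_params[name] = val

print(sorted(arg_params))  # ['stem_conv0_weight']
print(sorted(aux_params))  # ['stem_bn0_moving_mean']
```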
I think I know why the loading fails; thank you for the help @roywei. It's because I ported the training code from Gluon to MXNet for this model. The reason for this was that I experienced long delays during training due to MXNET_CUDNN_AUTOTUNE_DEFAULT calls:
Apparently, in MXNet 1.4.1 the code above works successfully and ignores the missing label information, whereas 1.5.0 blocks it, which is a behaviour I appreciate.
Using this code, I'm able to load the model successfully in both MXNet 1.4.1 and MXNet 1.5.0:
model_arch_path = 'model-1.19246-0.603-symbol.json'
model_params_path = 'model-1.19246-0.603-0223.params'
ctx = mx.cpu()
symbol = mx.sym.load(model_arch_path)
inputs = mx.sym.var('data', dtype='float32')
value_out = symbol.get_internals()['value_tanh0_output']
policy_out = symbol.get_internals()['flatten0_output']
sym = mx.symbol.Group([value_out, policy_out])
net = mx.gluon.SymbolBlock(sym, inputs)
net.collect_params().load(model_params_path, ctx)
Consequently, this issue can be closed.
See insightface #764