We are observing crashes with Sockeye transformer models using the latest build from mxnet master (1.2.0b20180411). The error we are seeing is this:
Error in positional_encodings.infer_type: Traceback (most recent call last):
  File "/Users/fhieber/miniconda3/lib/python3.6/site-packages/mxnet/operator.py", line 831, in infer_storage_type_entry
    ret = op_prop.infer_storage_type(stypes)
  File "/Users/fhieber/miniconda3/lib/python3.6/site-packages/mxnet/operator.py", line 556, in infer_storage_type
    return in_stype, [in_stype[0]]*len(self.list_outputs()), \
IndexError: list index out of range
[ERROR:__main__] Uncaught exception
Traceback (most recent call last):
  File "/Users/fhieber/miniconda3/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1513, in simple_bind
    ctypes.byref(exe_handle)))
  File "/Users/fhieber/miniconda3/lib/python3.6/site-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator target_pos_embed_positional_encodings: [14:33:09] src/operator/custom/custom.cc:523: Check failed: reinterpret_cast<CustomOpInferStorageTypeFunc>( params.info->callbacks[kCustomOpPropInferStorageType])( stypes.size(), stypes.data(), params.info->contexts[kCustomOpPropInferStorageType])
The corresponding Sockeye issue is here: https://github.com/awslabs/sockeye/issues/354
I believe this might be related to PR #10374, as the crash is not observed with mxnet==1.2.0b20180410.
Note that the custom op in Sockeye does not use input data (https://github.com/awslabs/sockeye/blob/master/sockeye/layers.py#L644).
We observed an issue with this type of op before in an earlier version of mxnet: https://github.com/apache/incubator-mxnet/pull/7967 (this PR also contains a minimal reproducible example).
It'd be great to fix this before the 1.2 RC.
The same problem occurs with mxnet_cu90-1.2.0b20180411 and with the latest master branch.
My (uneducated) guess would be that the new code for storage type inference needs to guard against in_stype being an empty list to avoid the index error when trying to access in_stype[0].
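To illustrate the kind of guard suggested above, here is a minimal sketch in plain Python. It is not MXNet's code and not necessarily the fix that was merged: `SketchProp` is a hypothetical stand-in for a custom op property class, and falling back to the `"default"` (dense) storage type when there are no inputs is an assumption made for this example.

```python
# Assumption for this sketch: use dense ("default") storage when the op has no inputs.
DEFAULT_STYPE = "default"


class SketchProp:
    """Hypothetical stand-in for a custom op property; not MXNet's class."""

    def __init__(self, outputs, aux_states=()):
        self._outputs = list(outputs)
        self._aux_states = list(aux_states)

    def list_outputs(self):
        return self._outputs

    def list_auxiliary_states(self):
        return self._aux_states

    def infer_storage_type(self, in_stype):
        # Guard: in_stype is an empty list for ops without inputs,
        # so only read in_stype[0] when it exists.
        stype = in_stype[0] if in_stype else DEFAULT_STYPE
        return (in_stype,
                [stype] * len(self.list_outputs()),
                [stype] * len(self.list_auxiliary_states()))


# An op with no inputs no longer raises IndexError.
no_input_result = SketchProp(["output"]).infer_storage_type([])
```

With inputs present, the sketch behaves like the line shown in the traceback and propagates the first input's storage type to all outputs.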
@fhieber Thanks a lot for bringing this up. You are right, the use case for empty input lists was not considered. Working on a fix now.
@fhieber the fix is now merged. Can you please check whether this issue can be closed?
In commit dbe5c14ca6daa3e607ef8293735d33106f47a697 I'm seeing the same crash in sockeye.
Error in operator source_pos_embed_positional_encodings: [10:49:15] /home/kellen/Development/incubator-mxnet/src/operator/custom/custom.cc:523: Check failed: reinterpret_cast<CustomOpInferStorageTypeFunc>( params.info->callbacks[kCustomOpPropInferStorageType])( stypes.size(), stypes.data(), params.info->contexts[kCustomOpPropInferStorageType])
I've verified that updating to ceb810ccc17a712c375d55418a0ba45ae91714b5 fixes the issue. I think we can consider this one fixed, thanks for the quick work. @fhieber are you OK with closing?
Yes. I wanted to test it with the nightly mac build (as I don't have compilation set up on my laptop) when available, but if you already tested it, I can close it.
Thanks again everyone for the super fast turnaround and making sure this fix gets into the next release!