I encounter the exception "Exception: unknown storage type: -1" when I use my focal loss custom operator.
The shape of out_data[0] is (batch_size, 2, anchor_num), and the shape of in_data[1] is (batch_size, anchor_num).
import mxnet as mx
import numpy as np


class FocalLossOperator(mx.operator.CustomOp):
    def __init__(self, gamma, alpha):
        super(FocalLossOperator, self).__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, is_train, req, in_data, out_data, aux):
        # softmax over the class axis (axis=1), with max-subtraction for numerical stability
        y = mx.nd.exp(in_data[0] - mx.nd.max_axis(in_data[0], axis=1).reshape((in_data[0].shape[0], 1, -1)))
        y /= mx.nd.sum(y, axis=1).reshape((in_data[0].shape[0], 1, -1))
        self.assign(out_data[0], req[0], y)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # flatten predictions to (batch_size * anchor_num, 2) and labels to (batch_size * anchor_num,)
        y_numpy = out_data[0].asnumpy().transpose((0, 2, 1))
        label_numpy = in_data[1].asnumpy()
        y_numpy = y_numpy.reshape((-1, 2))
        label_numpy = label_numpy.reshape((-1))
        # anchors labelled -1 are ignored: remap them to class 0 and zero their gradient later
        indices = np.where(label_numpy == -1)[0]
        label_numpy[indices] = 0
        self.pro_truth = mx.nd.array(y_numpy[np.arange(y_numpy.shape[0]), label_numpy.astype(np.int)])
        print(len(indices))
        # i != j
        pro_truth = (self.pro_truth + 1e-14).reshape((self.pro_truth.shape[0], 1))
        grad = self.alpha * mx.nd.power(1 - pro_truth, self.gamma - 1) * \
            (self.gamma * (-1 * pro_truth * mx.nd.array(y_numpy)) * mx.nd.log(pro_truth) +
             mx.nd.array(y_numpy) * (1 - pro_truth))
        # i == j
        pro_truth = self.pro_truth + 1e-14
        grad_numpy = grad.asnumpy()
        grad_numpy[np.arange(y_numpy.shape[0]), label_numpy.astype(np.int)] = (
            self.alpha * mx.nd.power(1 - pro_truth, self.gamma) * (
                self.gamma * pro_truth * mx.nd.log(pro_truth) + pro_truth - 1)).asnumpy()
        grad_numpy /= label_numpy.shape[0]
        grad_numpy[indices, :] = 0
        grad = mx.nd.array(grad_numpy)
        grad = grad.reshape(out_data[0].shape[0], -1, out_data[0].shape[1]).transpose((0, 2, 1))
        self.assign(in_grad[0], req[0], grad)


@mx.operator.register('FocalLoss')
class FocalLossProp(mx.operator.CustomOpProp):
    def __init__(self, gamma, alpha):
        super(FocalLossProp, self).__init__(need_top_grad=False)
        self.gamma = float(gamma)
        self.alpha = float(alpha)

    def list_arguments(self):
        return ['data', 'labels']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        data_shape = in_shape[0]
        labels_shape = in_shape[1]
        out_shape = data_shape
        return [data_shape, labels_shape], [out_shape], []

    def create_operator(self, ctx, shapes, dtypes):
        return FocalLossOperator(self.gamma, self.alpha)
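(For reference, the backward pass above implements the gradient of the focal loss FL = -alpha * (1 - p_t)^gamma * log(p_t) with respect to the softmax inputs x, where p = softmax(x) along the class axis and t is the ground-truth class; this derivation matches the two branches in the code:

dFL/dx_t = alpha * (1 - p_t)^gamma * (gamma * p_t * log(p_t) + p_t - 1)                              # the "i == j" branch
dFL/dx_j = alpha * (1 - p_t)^(gamma - 1) * (p_j * (1 - p_t) - gamma * p_t * p_j * log(p_t)),  j != t  # the "i != j" branch

The gradient is then divided by the total number of anchor entries and zeroed for anchors labelled -1.)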
Error in CustomOp.backward: Traceback (most recent call last):
File "/home/anaconda2/lib/python2.7/site-packages/mxnet/operator.py", line 1020, in backward_entry
stype=stype))
File "/home/anaconda2/lib/python2.7/site-packages/mxnet/ndarray/sparse.py", line 1187, in _ndarray_cls
raise Exception("unknown storage type: %s"%stype)
Exception: unknown storage type: -1
terminate called after throwing an instance of 'dmlc::Error'
what(): [12:17:03] src/operator/custom/custom.cc:418: Check failed: reinterpret_cast
Stack trace returned 8 entries:
[bt] (0) /home/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40b29a) [0x7feccd0c829a]
[bt] (1) /home/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40b8b1) [0x7feccd0c88b1]
[bt] (2) /home/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x6c6239) [0x7feccd383239]
[bt] (3) /home/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x6e1020) [0x7feccd39e020]
[bt] (4) /home/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x6c7078) [0x7feccd384078]
[bt] (5) /home/anaconda2/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7fed70a50c5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fed78a076ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fed7802d41d]
I have this problem too. It may be a compatibility problem.
I think it's a serious bug, as most Python custom operators encounter this error.
So how can I use this custom operator? I really need focal loss for my experiment.
I can use other custom operators; they don't have this problem. @chinakook
You can define the storage type yourself by overriding the parent class CustomOpProp, maybe as ['default'].
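A minimal sketch of one way to do that, using the infer_storage_type / infer_storage_type_backward hooks of mx.operator.CustomOpProp (assumption: this reuses the FocalLossProp class defined above; 'FocalLossDense' is a hypothetical registration name chosen for this example):

import mxnet as mx

# Force dense ('default') storage for every input, output and gradient so the
# custom-op backward entry never receives an undefined storage type (-1).
@mx.operator.register('FocalLossDense')
class FocalLossDenseProp(FocalLossProp):
    def infer_storage_type(self, in_stype):
        # inputs: [data, labels] -> outputs: [output], no aux states
        return ['default', 'default'], ['default'], []

    def infer_storage_type_backward(self, ograd_stype, in_stype, out_stype,
                                    igrad_stype, aux_stype):
        # out_grad, in_data, out_data, in_grad and aux are all dense
        return ['default'], ['default', 'default'], ['default'], \
               ['default', 'default'], []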
I could not reproduce this exception
import mxnet as mx
import numpy as np


class FocalLossOperator(mx.operator.CustomOp):
    def __init__(self, gamma, alpha):
        super(FocalLossOperator, self).__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, is_train, req, in_data, out_data, aux):
        y = mx.nd.exp(in_data[0] - mx.nd.max_axis(in_data[0], axis=1).reshape((in_data[0].shape[0], 1, -1)))
        y /= mx.nd.sum(y, axis=1).reshape((in_data[0].shape[0], 1, -1))
        self.assign(out_data[0], req[0], y)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        y_numpy = out_data[0].asnumpy().transpose((0, 2, 1))
        label_numpy = in_data[1].asnumpy()
        y_numpy = y_numpy.reshape((-1, 2))
        label_numpy = label_numpy.reshape((-1))
        indices = np.where(label_numpy == -1)[0]
        label_numpy[indices] = 0
        self.pro_truth = mx.nd.array(y_numpy[np.arange(y_numpy.shape[0]), label_numpy.astype(np.int)])
        # i != j
        pro_truth = (self.pro_truth + 1e-14).reshape((self.pro_truth.shape[0], 1))
        grad = self.alpha * mx.nd.power(1 - pro_truth, self.gamma - 1) * \
            (self.gamma * (-1 * pro_truth * mx.nd.array(y_numpy)) * mx.nd.log(pro_truth) +
             mx.nd.array(y_numpy) * (1 - pro_truth))
        # i == j
        pro_truth = self.pro_truth + 1e-14
        grad_numpy = grad.asnumpy()
        grad_numpy[np.arange(y_numpy.shape[0]), label_numpy.astype(np.int)] = (
            self.alpha * mx.nd.power(1 - pro_truth, self.gamma) * (
                self.gamma * pro_truth * mx.nd.log(pro_truth) + pro_truth - 1)).asnumpy()
        grad_numpy /= label_numpy.shape[0]
        grad_numpy[indices, :] = 0
        grad = mx.nd.array(grad_numpy)
        grad = grad.reshape(out_data[0].shape[0], -1, out_data[0].shape[1]).transpose((0, 2, 1))
        self.assign(in_grad[0], req[0], grad)


@mx.operator.register('FocalLoss')
class FocalLossProp(mx.operator.CustomOpProp):
    def __init__(self, gamma, alpha):
        super(FocalLossProp, self).__init__(need_top_grad=False)
        self.gamma = float(gamma)
        self.alpha = float(alpha)

    def list_arguments(self):
        return ['data', 'labels']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        data_shape = in_shape[0]
        labels_shape = in_shape[1]
        out_shape = data_shape
        return [data_shape, labels_shape], [out_shape], []

    def create_operator(self, ctx, shapes, dtypes):
        return FocalLossOperator(self.gamma, self.alpha)


class FocalLossGluon(mx.gluon.nn.HybridBlock):
    def hybrid_forward(self, F, x, label):
        return F.Custom(x, label, gamma=1, alpha=1, op_type='FocalLoss')


if __name__ == '__main__':
    batch_size = 3
    num_anchor = 4
    x = mx.nd.zeros((batch_size, 2, num_anchor))
    label = mx.nd.zeros((batch_size, num_anchor))
    x.attach_grad()

    # imperative NDArray call
    with mx.autograd.record():
        y = mx.nd.Custom(x, label, gamma=1, alpha=1, op_type='FocalLoss')
    y.backward()
    print(y)
    print(x.grad)

    # hybridized Gluon block
    block = FocalLossGluon()
    block.hybridize()
    for _ in range(2):
        with mx.autograd.record():
            y = block(x, label)
        y.backward()
        print(y)
        print(x.grad)
@jiashu-zhu Please paste your model here.
I just use this FocalLoss to replace SoftmaxOutput in RetinaFace. @chinakook
Does this FocalLoss work in your code? @wkcn
@jiashu-zhu Yes, it works in my code.
All FPN ops in this repo get this error. It may be a bug in the custom op mechanism.
Thanks a lot. Do you have any idea how to make it work? I think I can try your suggestions. @chinakook
Use an older MXNet version.
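For example, pinning an older version with pip (assuming a CPU pip install; the GPU package name, e.g. mxnet-cu90, depends on your CUDA version):

pip install mxnet==1.1.0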
Many thanks, I will try it. @chinakook
Could you please tell me which SoftmaxOutput is replaced with FocalLoss?
A minimal reproducible example would help.
I replaced the SoftmaxOutput in line 403 of rcnn/symbol/symbol_common, and I use ResNet-152 as my backbone, which you can download from the RetinaFace homepage; other settings remain at their defaults. @wkcn
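For context, a rough sketch of what such a replacement can look like (hypothetical symbol names and hyperparameters; the actual arguments at that line of rcnn/symbol/symbol_common may differ):

import mxnet as mx

# hypothetical stand-ins for the classification symbols used at that point
cls_score = mx.symbol.Variable('cls_score')   # shape (batch_size, 2, anchor_num)
cls_label = mx.symbol.Variable('cls_label')   # shape (batch_size, anchor_num)

# before: cls_prob = mx.symbol.SoftmaxOutput(data=cls_score, label=cls_label, ...)
# after: route the same inputs through the registered custom op instead
cls_prob = mx.symbol.Custom(cls_score, cls_label, gamma=2.0, alpha=0.25,
                            op_type='FocalLoss', name='cls_prob')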
I got the same problem. I tried MXNet versions 1.5.0, 1.4.1, 1.3.1, 1.2.1, and 1.1.0, and only 1.1.0 works for me.
I need a minimal reproducible example to check the bug, since I am busy and have little time for it.
Does it work in the MNIST classification example? https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/symbols/lenet.py
Thanks for reporting this @jiashu-zhu @cccorn @chinakook. Thanks a lot @wkcn for offering to help. It is possible that this issue was introduced with the sparse tensor support for custom ops. Have you tried commenting out the declare_backward_dependency in CustomOpProp https://github.com/dingjiansw101/RoITransformer_DOTA/blob/master/fpn/operator_py/fpn_psroi_rotatedpooling.py#L128 to see if that fixes the issue? Sorry, I am a little pressed for time right now and won't be able to dig into the issue currently. Can you try this workaround for now?
I met the same problem in Deformable ConvNets when I set ENABLE_OHEM: false :(
https://github.com/msracver/Deformable-ConvNets/blob/master/experiments/fpn/cfgs/resnet_v1_101_coco_trainval_fpn_dcn_end2end_ohem.yaml#L82
I tried to address the problem, and I found that some NDArrays are not initialized.
https://github.com/apache/incubator-mxnet/blob/master/src/c_api/c_api.cc#L588
I commented out the declare_backward_dependency, but it did not work.
A temporary solution: comment out all need_top_grad=False and declare_backward_dependency.
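To make that concrete, a minimal sketch of FocalLossProp with both pieces removed (this only mirrors the workaround above; it is not an upstream fix):

@mx.operator.register('FocalLoss')
class FocalLossProp(mx.operator.CustomOpProp):
    def __init__(self, gamma, alpha):
        # workaround: do not pass need_top_grad=False; keep the default
        super(FocalLossProp, self).__init__()
        self.gamma = float(gamma)
        self.alpha = float(alpha)

    # workaround: no declare_backward_dependency override here either,
    # so the default dependency declaration is used.
    # list_arguments, list_outputs, infer_shape and create_operator
    # stay the same as in the original definition above.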
Thanks for your kind help! @wkcn
Using an older version (like MXNet v1.1.0) is also a temporary workaround.