incubator-mxnet: how to debug a network layer by layer

Created on 14 Jun 2017 · 12 comments · Source: apache/incubator-mxnet

First of all, thank you very much for providing such a powerful tool.

I have been trying to combine NCE loss with the bucketing LSTM in Python for a few days. The network now builds successfully and runs without complaint, and the loss decreases as batches are fed in (very slowly), but the PPL measured on the resulting model is huge. So I want to debug the model step by step, printing the weights and each layer's forward/backward tensors, to find where the problem is.

What I have done so far:

  1. installed a monitor (and found and fixed a tiny bug when using the monitor with a bucketing module)
  2. fed one sample per batch, over and over
  3. saved the weights and forward/backward tensors to files

After that, I do have all the weights and each layer's forward/backward tensors saved in files.
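
For context, here is a minimal, self-contained sketch of the kind of monitoring setup described above. It is not the exact code from this issue; the tiny network, the dummy batch, and the statistic function are made up for illustration.

import mxnet as mx

# Tiny symbolic network, only to demonstrate attaching a monitor.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=8, name='fc1')
net = mx.sym.SoftmaxOutput(net, name='softmax')

mod = mx.mod.Module(net, data_names=['data'], label_names=['softmax_label'])
mod.bind(data_shapes=[('data', (4, 16))], label_shapes=[('softmax_label', (4,))])
mod.init_params()
mod.init_optimizer()

# stat_func is applied to every array the monitor sees (weights and outputs).
mon = mx.mon.Monitor(interval=1,
                     stat_func=lambda a: mx.nd.mean(mx.nd.abs(a)),
                     pattern='.*', sort=True)
mod.install_monitor(mon)

batch = mx.io.DataBatch(data=[mx.nd.ones((4, 16))], label=[mx.nd.zeros((4,))])
mon.tic()
mod.forward(batch, is_train=True)
mod.backward()
mod.update()
for step, name, stat in mon.toc():
    print(step, name, stat)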

But while analyzing those files, I found that some tensors behave as defined in the network symbol, while some do not. What I found:

  1. some layers appear to be merged; for example, for an element-wise multiply layer followed by a softmax layer, the two layers' forward output tensors are exactly the same
  2. in an mx.sym.Reshape layer, the input tensor and the output tensor look unrelated to each other, which confuses me
  3. the same problem appears in the backward pass

So, does the underlying graph executor optimize the symbol graph and change the behavior of some layers?

  1. If so, is it possible to disable this behavior? (See the note after this list.)
  2. If not, is there any documentation about this?
  3. Or which source code should I read?
  4. Or, how do you debug a model?
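
One knob that may be relevant here (my assumption, not something confirmed in this thread): the symbolic executor performs in-place memory optimization, so an operator's output can share a buffer with its input, which would explain identical-looking tensors. MXNet exposes the MXNET_EXEC_ENABLE_INPLACE environment variable to control this; setting it before importing mxnet should make each operator keep its own output buffer, at the cost of extra memory.

import os

# Assumption: turn off the executor's in-place optimization so that dumped
# tensors are not aliased between operators. Must be set before `import mxnet`.
os.environ['MXNET_EXEC_ENABLE_INPLACE'] = '0'

import mxnet as mx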

Thanks for any advice.

Below is my environment:

  1. OS: centos 7.0, X86_64
  2. Python: Python 2.7.5
  3. Mxnet: da08c9203ecd1d8e3cd6a29ecf1a9238521f2351
    Author: ziheng <[email protected]> Date: Sun Jun 4 17:43:47 2017 -0700

    • USE_PROFILE=1 (profiler not enabled at runtime)

    • USE_CUDA = 1

    • USE_CUDNN = 1

    • USE_BLAS=mkl

    • USE_MKLML2017 = 1

    • USE_MKL2017_EXPERIMENTAL = 0

    • USE_OPENMP = 1

    • USE_NNPACK = 0

  4. gcc: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)

All 12 comments

A trick I use is to create a "debug" custom op. It simply copies the input to the output. With it, you can access all the internal outputs of a network.

@winstywang, thanks, could you give more detail about your method?

Something I used before. FYI.

import mxnet as mx

class Debug(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # Dump the first input to a text file, one value per line.
        with open('s.txt', 'w') as f:
            value = in_data[0].asnumpy()
            for i in range(value.size):
                f.write('%1.6f\n' % value.flat[i])
        # Pass the input through unchanged.
        self.assign(out_data[0], req[0], in_data[0])

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # Pass the gradient through unchanged.
        self.assign(in_grad[0], req[0], out_grad[0])

I think I have found the problem: monitor.tic() is called before the forward call, but the results are collected only after the backward, update, and update_metric calls, which can change the tensors computed during forward. I changed it to collect results after each call, and the tensors now flow as defined.

But the backward gradients still confuse me; I'm trying to understand them...
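
Roughly, the changed collection loop described above looks like this (my own sketch of the fix, where mod, mon, and batch stand for an already-bound module, an installed monitor, and a data batch):

def run_batch_with_dumps(mod, mon, batch):
    # Collect monitor results right after forward, and again after backward,
    # instead of once after the whole forward/backward/update sequence.
    mon.tic()
    mod.forward(batch, is_train=True)
    forward_stats = mon.toc()

    mon.tic()
    mod.backward()
    backward_stats = mon.toc()

    mod.update()
    return forward_stats, backward_stats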

@Godricly Can you print the name of the Debug symbol?
And can you show me the DebugProp class's code? I think I have a problem with it.
I have tried to implement the debug op in C++, but it crashes on a KNL machine.

@BiranLi Just follow the custom op introduction in Python.
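
For reference, the property class for such a pass-through op would look roughly like this, following the standard Python custom-op pattern (my own sketch, not @Godricly's original code; it assumes the Debug class from the comment above):

import mxnet as mx

@mx.operator.register('debug')
class DebugProp(mx.operator.CustomOpProp):
    def __init__(self):
        # need_top_grad=True: the op expects a gradient from the layer above.
        super(DebugProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # Pass-through: output shape equals input shape, no auxiliary states.
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        # Debug is the CustomOp class defined earlier in this thread.
        return Debug()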

@Godricly Is it OK to just drop the label_shape?

@BiranLi Why do you need a label shape? This op acts like a shortcut in ResNet. You can write a simple test for it.
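
A minimal test along those lines might look like this (again my own sketch; it assumes the Debug/DebugProp pair above has been registered under op_type='debug'):

import mxnet as mx
import numpy as np

data = mx.sym.Variable('data')
sym = mx.sym.Custom(data=data, op_type='debug', name='dbg')

x = mx.nd.array(np.arange(12).reshape(3, 4))
exe = sym.bind(mx.cpu(), {'data': x})
out = exe.forward(is_train=True)[0]

# The op is a pass-through: the output equals the input,
# and s.txt now contains the dumped values.
print(np.allclose(out.asnumpy(), x.asnumpy()))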

@Godricly I thought of it as an all-pass layer. This layer does not need the label param. I will try it. Thx.

It's the same thing. Forget the dimension-match case.


This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!

@Godricly Thanks for your advice. But when I use the code, I get an error: Check failed: reinterpret_cast(params.info->callbacks[kCustomOpBackward])( ptrs.size(), ptrs.data(), tags.data(), reinterpret_cast(req.data()), static_cast(ctx.is_train), params.info->contexts[kCustomOpBackward]). Any advice would be appreciated. Thanks.
