incubator-mxnet: how to debug a network layer by layer

Created on 14 Jun 2017 · 12 comments · Source: apache/incubator-mxnet

First of all, thank you very much for providing such a powerful tool.

I have been trying to combine NCE loss with the bucketing LSTM in Python for a few days. The network now builds successfully and runs without complaint, and the loss decreases as batches are fed in (very slowly), but the PPL measured on the resulting model is huge. So I want to debug the model step by step, printing the weights and each layer's forward/backward tensors, to find where the problem is.

What I have done so far:

  1. installed a monitor (and found and fixed a tiny bug when using the monitor with a bucketing module)
  2. fed one sample per batch, over and over
  3. saved the weights and forward/backward tensors to files

After that, I do have all the weights and each layer's forward/backward tensors saved in files.
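
For context, here is a minimal, self-contained sketch of the kind of monitoring setup described above. It is not the exact code from this issue; the tiny network, the dummy batch, and the statistic function are made up for illustration.

import mxnet as mx

# Tiny symbolic network, only to demonstrate attaching a monitor.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=8, name='fc1')
net = mx.sym.SoftmaxOutput(net, name='softmax')

mod = mx.mod.Module(net, data_names=['data'], label_names=['softmax_label'])
mod.bind(data_shapes=[('data', (4, 16))], label_shapes=[('softmax_label', (4,))])
mod.init_params()
mod.init_optimizer()

# stat_func is applied to every array the monitor sees (weights and outputs).
mon = mx.mon.Monitor(interval=1,
                     stat_func=lambda a: mx.nd.mean(mx.nd.abs(a)),
                     pattern='.*', sort=True)
mod.install_monitor(mon)

batch = mx.io.DataBatch(data=[mx.nd.ones((4, 16))], label=[mx.nd.zeros((4,))])
mon.tic()
mod.forward(batch, is_train=True)
mod.backward()
mod.update()
for step, name, stat in mon.toc():
    print(step, name, stat)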

But while analyzing those files, I found that some tensors behave as defined in the network symbol, while some do not. What I found:

  1. some layers appear to be merged; for example, for an element-wise multiply layer followed by a softmax layer, the two layers' forward output tensors are exactly the same
  2. in an mx.sym.Reshape layer, the input tensor and the output tensor look unrelated to each other, which confuses me
  3. the same problem appears in the backward pass

So, does the underlying graph executor optimize the symbol graph and change the behavior of some layers?

  1. If so, is it possible to disable this behavior? (See the note after this list.)
  2. If not, is there any documentation about this?
  3. Or which source code should I read?
  4. Or, how do you debug a model?
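
One knob that may be relevant here (my assumption, not something confirmed in this thread): the symbolic executor performs in-place memory optimization, so an operator's output can share a buffer with its input, which would explain identical-looking tensors. MXNet exposes the MXNET_EXEC_ENABLE_INPLACE environment variable to control this; setting it before importing mxnet should make each operator keep its own output buffer, at the cost of extra memory.

import os

# Assumption: turn off the executor's in-place optimization so that dumped
# tensors are not aliased between operators. Must be set before `import mxnet`.
os.environ['MXNET_EXEC_ENABLE_INPLACE'] = '0'

import mxnet as mx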

Thanks for any advice.

Below is my environment:

  1. OS: centos 7.0, X86_64
  2. Python: Python 2.7.5
  3. Mxnet: da08c9203ecd1d8e3cd6a29ecf1a9238521f2351
    Author: ziheng <[email protected]> Date: Sun Jun 4 17:43:47 2017 -0700

    • USE_PROFILE=1 (profiler not enabled at runtime)

    • USE_CUDA = 1

    • USE_CUDNN = 1

    • USE_BLAS=mkl

    • USE_MKLML2017 = 1

    • USE_MKL2017_EXPERIMENTAL = 0

    • USE_OPENMP = 1

    • USE_NNPACK = 0

  4. gcc: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)

All 12 comments

A trick I use is to create a "debug" custom op. It simply copies the input to the output. With it, you can access all the internal outputs of a network.

@winstywang, thanks, could you give more detail about your method?

Something I used before. FYI.

import mxnet as mx

class Debug(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # Dump the first input to a text file, one value per line.
        with open('s.txt', 'w') as f:
            value = in_data[0].asnumpy()
            for i in range(value.size):
                f.write('%1.6f\n' % value.flat[i])
        # Pass the input through unchanged.
        self.assign(out_data[0], req[0], in_data[0])

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # Pass the gradient through unchanged.
        self.assign(in_grad[0], req[0], out_grad[0])

I think I have found the problem: monitor.tic() is called before the forward call, but the results are collected only after the backward, update, and update_metric calls, which can change the tensors computed during forward. I changed it to collect results after each call, and the tensors now flow as defined.

But the backward gradients still confuse me; I'm trying to understand them...
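
Roughly, the changed collection loop described above looks like this (my own sketch of the fix, where mod, mon, and batch stand for an already-bound module, an installed monitor, and a data batch):

def run_batch_with_dumps(mod, mon, batch):
    # Collect monitor results right after forward, and again after backward,
    # instead of once after the whole forward/backward/update sequence.
    mon.tic()
    mod.forward(batch, is_train=True)
    forward_stats = mon.toc()

    mon.tic()
    mod.backward()
    backward_stats = mon.toc()

    mod.update()
    return forward_stats, backward_stats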

@Godricly Can you print the name of the Debug symbol?
And can you show me the DebugProp class's code? I think I have a problem with it.
I have tried to implement the debug op in C++, but it crashes on a KNL machine.

@BiranLi Just follow the custom op introduction in Python.
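
For reference, the property class for such a pass-through op would look roughly like this, following the standard Python custom-op pattern (my own sketch, not @Godricly's original code; it assumes the Debug class from the comment above):

import mxnet as mx

@mx.operator.register('debug')
class DebugProp(mx.operator.CustomOpProp):
    def __init__(self):
        # need_top_grad=True: the op expects a gradient from the layer above.
        super(DebugProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # Pass-through: output shape equals input shape, no auxiliary states.
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        # Debug is the CustomOp class defined earlier in this thread.
        return Debug()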

@Godricly Is it OK to just drop the label_shape?

@BiranLi Why do you need a label shape? This op acts like a shortcut in ResNet. You can write a simple test for it.
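
A minimal test along those lines might look like this (again my own sketch; it assumes the Debug/DebugProp pair above has been registered under op_type='debug'):

import mxnet as mx
import numpy as np

data = mx.sym.Variable('data')
sym = mx.sym.Custom(data=data, op_type='debug', name='dbg')

x = mx.nd.array(np.arange(12).reshape(3, 4))
exe = sym.bind(mx.cpu(), {'data': x})
out = exe.forward(is_train=True)[0]

# The op is a pass-through: the output equals the input,
# and s.txt now contains the dumped values.
print(np.allclose(out.asnumpy(), x.asnumpy()))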

@Godricly I thought of it as an all-pass layer. This layer does not need the label param. I will try it. Thx.

It's the same thing. Forget the dimension-match case.


This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!

@Godricly Thanks for your advice. But when I use the code, I get an error: Check failed: reinterpret_cast(params.info->callbacks[kCustomOpBackward])( ptrs.size(), ptrs.data(), tags.data(), reinterpret_cast(req.data()), static_cast(ctx.is_train), params.info->contexts[kCustomOpBackward]). Any advice would be appreciated. Thanks.
