Incubator-mxnet: Gradient checkpointing in the Gluon interface

Created on 12 Jun 2020 · 7 comments · Source: apache/incubator-mxnet

The backward mirroring functionality, which can greatly reduce the memory required to train deep networks, is missing from the Gluon interface.

Backward mirroring is currently supported in the symbolic API but not in Gluon; see https://github.com/apache/incubator-mxnet/pull/18228. We may want to add the same support to Gluon.
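
For context, a minimal sketch of how backward mirroring is enabled with the existing symbolic executor (assuming MXNet 1.x; the network below is illustrative, not from this issue). It is controlled by the `MXNET_BACKWARD_DO_MIRROR` environment variable rather than by anything in the user-facing API:

```python
# Sketch only: mirroring is toggled via an environment variable that the
# executor reads at bind time, so it must be set before binding.
import os
os.environ['MXNET_BACKWARD_DO_MIRROR'] = '1'  # trade recompute for memory

import mxnet as mx

data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=128, name='fc1')
net = mx.sym.Activation(net, act_type='relu', name='relu1')
net = mx.sym.FullyConnected(net, num_hidden=10, name='fc2')
net = mx.sym.SoftmaxOutput(net, name='softmax')

# The mirror pass runs when the executor is bound; selected forward
# activations are then recomputed during backward instead of being kept.
exe = net.simple_bind(mx.cpu(), data=(32, 784))
```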

Labels: Feature request, Gluon


All 7 comments

@szha @yzhliu @eric-haibin-lin @leezu Do you think this is a valid 2.0 item? PyTorch has good support for gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html
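
For reference, a minimal sketch of the PyTorch API linked above: `torch.utils.checkpoint.checkpoint` drops a segment's intermediate activations in the forward pass and recomputes them during backward. The model here is illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A segment whose activations we choose not to store.
segment = torch.nn.Sequential(
    torch.nn.Linear(784, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
)
head = torch.nn.Linear(1024, 10)

x = torch.randn(32, 784, requires_grad=True)
out = head(checkpoint(segment, x))  # segment is re-run during backward
out.sum().backward()
```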

May be enabled in the unified executor

This is urgent because MXNet 2.0 removes the symbolic executor API.

@eric-haibin-lin @ArmageddonKnight could you give an update on this?

I'm still looking into this. Currently, the mirror pass requires shape/type information that is not available at the point where the pass is invoked.

Is it possible to implement MXNET_BACKWARD_DO_MIRROR when BatchNorm is used? It seems that the parameters of BatchNorm and SyncBatchNorm are updated by the operators themselves rather than by the trainer (I'm not sure about this). If so, we would need some extra memory to store the old parameters, which is still worthwhile since these parameters can be assumed to be small. However, in the current implementation, the updating of BatchNorm parameters appears to be disabled when MXNET_BACKWARD_DO_MIRROR is set.
See https://github.com/apache/incubator-mxnet/blob/beafba76395e75c093f99d20ac62e38f48e91012/src/operator/nn/cudnn/cudnn_batch_norm-inl.h#L131.
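
To illustrate the point about BatchNorm updating its own state (a small sketch, assuming MXNet 1.x Gluon): the running statistics are parameters with `grad_req='null'`, written in-place during the forward pass rather than by the trainer, which is why naively re-running the forward pass during backward would advance them a second time.

```python
import mxnet as mx
from mxnet import gluon, autograd

bn = gluon.nn.BatchNorm()
bn.initialize()

x = mx.nd.random.uniform(shape=(2, 3, 4, 4))
with autograd.record():
    y = bn(x)  # running_mean / running_var are updated inside this call

for name, param in bn.collect_params().items():
    print(name, param.grad_req)
# gamma and beta report 'write' (owned by the trainer); running_mean and
# running_var report 'null' (written in-place by the forward pass itself).
```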
