Incubator-mxnet: Gradient checkpointing in the Gluon interface

Created on 12 Jun 2020 · 7 comments · Source: apache/incubator-mxnet

The backward mirroring functionality, which can greatly reduce the memory required to train deep networks, is missing from the Gluon interface.

Backward mirroring is currently supported in the symbolic API but not in Gluon; see https://github.com/apache/incubator-mxnet/pull/18228. We may want to add the same support to Gluon.
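
For context, a minimal sketch of how backward mirroring is enabled with the existing symbolic executor (assuming MXNet 1.x; the network below is illustrative, not from this issue). It is controlled by the `MXNET_BACKWARD_DO_MIRROR` environment variable rather than by anything in the user-facing API:

```python
# Sketch only: mirroring is toggled via an environment variable that the
# executor reads at bind time, so it must be set before binding.
import os
os.environ['MXNET_BACKWARD_DO_MIRROR'] = '1'  # trade recompute for memory

import mxnet as mx

data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=128, name='fc1')
net = mx.sym.Activation(net, act_type='relu', name='relu1')
net = mx.sym.FullyConnected(net, num_hidden=10, name='fc2')
net = mx.sym.SoftmaxOutput(net, name='softmax')

# The mirror pass runs when the executor is bound; selected forward
# activations are then recomputed during backward instead of being kept.
exe = net.simple_bind(mx.cpu(), data=(32, 784))
```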

Labels: Feature request, Gluon


All 7 comments

@szha @yzhliu @eric-haibin-lin @leezu Do you think this is a valid 2.0 item? PyTorch has good support for gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html
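
For reference, a minimal sketch of the PyTorch API linked above: `torch.utils.checkpoint.checkpoint` drops a segment's intermediate activations in the forward pass and recomputes them during backward. The model here is illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A segment whose activations we choose not to store.
segment = torch.nn.Sequential(
    torch.nn.Linear(784, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
)
head = torch.nn.Linear(1024, 10)

x = torch.randn(32, 784, requires_grad=True)
out = head(checkpoint(segment, x))  # segment is re-run during backward
out.sum().backward()
```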

May be enabled in the unified executor

This is urgent because MXNet 2.0 removes the symbolic executor API.

@eric-haibin-lin @ArmageddonKnight could you give an update on this?

I'm still looking into this. Currently, the mirror pass requires shape/type information that is not available at the point where the pass is invoked.

Is it possible to implement MXNET_BACKWARD_DO_MIRROR when BatchNorm is used? It seems that the parameters of BatchNorm and SyncBatchNorm are updated by the operators themselves rather than by the trainer (I'm not sure about this). If so, we would need some extra memory to store the old parameters, which is still worthwhile since these parameters can be assumed to be small. However, in the current implementation, the updating of BatchNorm parameters appears to be disabled when MXNET_BACKWARD_DO_MIRROR is set.
See https://github.com/apache/incubator-mxnet/blob/beafba76395e75c093f99d20ac62e38f48e91012/src/operator/nn/cudnn/cudnn_batch_norm-inl.h#L131.
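
To illustrate the point about BatchNorm updating its own state (a small sketch, assuming MXNet 1.x Gluon): the running statistics are parameters with `grad_req='null'`, written in-place during the forward pass rather than by the trainer, which is why naively re-running the forward pass during backward would advance them a second time.

```python
import mxnet as mx
from mxnet import gluon, autograd

bn = gluon.nn.BatchNorm()
bn.initialize()

x = mx.nd.random.uniform(shape=(2, 3, 4, 4))
with autograd.record():
    y = bn(x)  # running_mean / running_var are updated inside this call

for name, param in bn.collect_params().items():
    print(name, param.grad_req)
# gamma and beta report 'write' (owned by the trainer); running_mean and
# running_var report 'null' (written in-place by the forward pass itself).
```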
