Backward mirroring, which can greatly reduce the memory required to train deep networks, is missing from the Gluon interface.
The symbolic API currently supports it (see https://github.com/apache/incubator-mxnet/pull/18228), but Gluon does not. We may want to add this support to Gluon.
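For reference, in the symbolic API mirroring is toggled through the `MXNET_BACKWARD_DO_MIRROR` environment variable. A minimal sketch (the model below is purely illustrative):

```python
import os
# Must be set before the executor is created
os.environ['MXNET_BACKWARD_DO_MIRROR'] = '1'

import mxnet as mx

# Illustrative symbolic model
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=128, name='fc1')
net = mx.sym.Activation(net, act_type='relu', name='relu1')
net = mx.sym.FullyConnected(net, num_hidden=10, name='fc2')
loss = mx.sym.SoftmaxOutput(net, name='softmax')

# simple_bind plans memory for the executor; with mirroring enabled,
# selected activations are recomputed during the backward pass rather
# than kept alive from the forward pass
exe = loss.simple_bind(mx.cpu(), data=(32, 64))
```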
@szha @yzhliu @eric-haibin-lin @leezu Do you think this is a valid 2.0 item? PyTorch has good support for gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html
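For comparison, a minimal sketch of the PyTorch API (the model here is just for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Illustrative model
block = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
x = torch.randn(32, 64, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed
# during the backward pass
y = checkpoint(block, x)
y.sum().backward()
```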
May be enabled in the unified executor
This is urgent because MXNet 2.0 removes the symbolic executor API.
@eric-haibin-lin @ArmageddonKnight could you give an update on this?
I'm still looking into this. Currently the mirror pass requires some shape/dtype information which is missing at the point where the pass is invoked.
We mainly need to pass shape & dtype info here: https://github.com/apache/incubator-mxnet/blob/master/src/imperative/cached_op.h#L252-L255
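For context, in the symbolic API this information can be inferred up front; a small illustration (the symbol and shapes are made up):

```python
import numpy as np
import mxnet as mx

# Illustrative symbol
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=128)

# The mirror pass needs per-node shape and dtype attributes like these,
# which the CachedOp does not currently provide when the pass runs
arg_shapes, out_shapes, aux_shapes = net.infer_shape(data=(32, 64))
arg_dtypes, out_dtypes, aux_dtypes = net.infer_type(data=np.float32)
```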
Is it possible to make MXNET_BACKWARD_DO_MIRROR work when BatchNorm is used? It seems that the parameters of BatchNorm and SyncBatchNorm are updated internally by the operators themselves rather than by the trainer (I'm not sure about this). If so, we may need some extra memory to store the old parameter values, presumably so that the recomputed forward pass sees the same statistics. This seems worthwhile, since we can assume these parameters are small. However, in the current implementation, the update of the BatchNorm parameters appears to be disabled when MXNET_BACKWARD_DO_MIRROR is set.
See https://github.com/apache/incubator-mxnet/blob/beafba76395e75c093f99d20ac62e38f48e91012/src/operator/nn/cudnn/cudnn_batch_norm-inl.h#L131.
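One possible direction (a hedged sketch; the helper names and the snapshot/restore approach are my own assumptions, not existing MXNet API) is to snapshot the running statistics before the forward pass and restore them before the mirrored recomputation:

```python
import mxnet as mx
from mxnet.gluon import nn

def snapshot_bn_stats(block):
    # Hypothetical helper (not part of MXNet): copy the running
    # mean/var of every BatchNorm layer in `block`
    params = block.collect_params('.*running_(mean|var)')
    return {name: p.data().copy() for name, p in params.items()}

def restore_bn_stats(block, saved):
    # Hypothetical helper (not part of MXNet): write the snapshotted
    # statistics back before recomputing the forward pass
    params = block.collect_params()
    for name, data in saved.items():
        params[name].set_data(data)

# Illustrative usage
net = nn.HybridSequential()
net.add(nn.Dense(128), nn.BatchNorm(), nn.Dense(10))
net.initialize()
with mx.autograd.record():  # training-mode forward updates running stats
    y = net(mx.nd.random.uniform(shape=(32, 64)))

saved = snapshot_bn_stats(net)
# If the backward pass later recomputes the forward graph,
# restore_bn_stats(net, saved) would let it see consistent statistics.
```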