The MIRROR feature has been broken for a while. Is there any plan to fix it?
Contribution is welcome!
@mxnet-label-bot add [Gluon, Bug]
Repro:
(venv) bingxu@bingxu-mbp:~/research/hiconv-tvm/resnet-50$ python mx_memory.py
3317 MB
(venv) bingxu@bingxu-mbp:~/research/hiconv-tvm/resnet-50$ MXNET_BACKWARD_DO_MIRROR=1 python mx_memory.py
4496 MB
(venv) bingxu@bingxu-mbp:~/research/hiconv-tvm/resnet-50$
Turning mirroring on actually costs more memory.
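(The mx_memory.py script itself isn't shown here. For anyone who wants to reproduce the comparison, a rough sketch along the following lines should work; the toy network and the use of the executor's debug_str() allocation summary are my own stand-ins for whatever the original script does, not a copy of it.)

```python
import os
# The graph executor reads MXNET_BACKWARD_DO_MIRROR when the executor is
# constructed, so set it before bind time (or export it in the shell).
os.environ.setdefault('MXNET_BACKWARD_DO_MIRROR', '0')

import mxnet as mx

def build_toy_net(num_layers=20, num_filter=64):
    """Small stand-in for ResNet-50 (the original script bound a real ResNet-50)."""
    body = mx.sym.Variable('data')
    for i in range(num_layers):
        body = mx.sym.Convolution(data=body, num_filter=num_filter,
                                  kernel=(3, 3), pad=(1, 1), name='conv%d' % i)
        body = mx.sym.BatchNorm(data=body, name='bn%d' % i)
        body = mx.sym.Activation(data=body, act_type='relu', name='relu%d' % i)
    body = mx.sym.Pooling(data=body, global_pool=True, pool_type='avg',
                          kernel=(1, 1), name='pool')
    body = mx.sym.FullyConnected(data=body, num_hidden=1000, name='fc')
    return mx.sym.SoftmaxOutput(data=body, name='softmax')

net = build_toy_net()
# simple_bind plans memory for forward plus backward (grad_req='write').
exe = net.simple_bind(mx.gpu(0), data=(32, 3, 224, 224), grad_req='write')
# The tail of the executor debug string summarizes the planned allocation
# (something like "Total NNNN MB allocated" in the versions I have used).
print('\n'.join(exe.debug_str().split('\n')[-5:]))
```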
This seems to be incorrectly labelled as Gluon. Removing the gluon label.
@mxnet-label-bot Update [Bug, Environment Variables]
mxnet version: 1.4.0, model: resnext101
I think the following code for MXNET_BACKWARD_DO_MIRROR, taken from ./src/executor/graph_executor.cc, has a problem:
it does not mirror a node, even when MXNET_BACKWARD_DO_MIRROR is on, unless __force_mirroring__ is set on it,
and it actually uses more GPU memory when MXNET_BACKWARD_DO_MIRROR is set.
int do_mirror = dmlc::GetEnv("MXNET_BACKWARD_DO_MIRROR", 0);
auto need_mirror = [do_mirror](const nnvm::Node& node) -> int {
  if (node.is_variable()) return 0;
  const std::string& type = node.attrs.op->name;
  if (type == "Dropout") return false;
  if (get_node_attr(node, "__force_mirroring__", false)) return true;
  if (do_mirror == 0) return false;
  if (type == "Convolution") return false;
  if (type == "FullyConnected") return false;
  if (type == "Concat") return false;
  if (type == "SoftmaxOutput") return false;
  if (type == "BatchNorm") return false;
  if (type == "CuDNNBatchNorm") return false;
  return true;
};
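To make the effect of this filter concrete: only nodes that survive all of these checks are candidates for mirroring, and Convolution, FullyConnected, Concat, BatchNorm and SoftmaxOutput are all excluded, so on a typical CNN only cheap ops such as activations remain. The small sketch below (my own illustration, not MXNet code) replays the same name-based policy over a symbol's JSON to count how many nodes would actually be mirrored:

```python
import json
import mxnet as mx

# Op types excluded by need_mirror() above (assuming MXNET_BACKWARD_DO_MIRROR=1
# and no __force_mirroring__ attribute on the node).
EXCLUDED = {'Dropout', 'Convolution', 'FullyConnected', 'Concat',
            'SoftmaxOutput', 'BatchNorm', 'CuDNNBatchNorm'}

def count_mirror_candidates(sym):
    """Replay the name-based policy over the graph JSON and return
    (nodes that would be mirrored, total non-variable nodes)."""
    nodes = json.loads(sym.tojson())['nodes']
    ops = [n['op'] for n in nodes if n['op'] != 'null']   # 'null' nodes are variables
    return sum(op not in EXCLUDED for op in ops), len(ops)

# Tiny example block; on a real ResNeXt-101 the picture is the same:
# every memory-heavy op falls into the excluded set.
data = mx.sym.Variable('data')
conv = mx.sym.Convolution(data=data, num_filter=64, kernel=(3, 3), pad=(1, 1))
act = mx.sym.Activation(data=mx.sym.BatchNorm(data=conv), act_type='relu')
out = mx.sym.SoftmaxOutput(data=mx.sym.FullyConnected(data=act, num_hidden=10))
print(count_mirror_candidates(out))   # -> (1, 5): only the Activation qualifies
```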
I can't stress enough how important this feature is. Please fix it so that the memory cost is back to what it was in version 0.7. https://github.com/dmlc/mxnet-memonger https://arxiv.org/abs/1604.06174
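(As a usage reference for the mxnet-memonger link above: that repo exposes a search_plan helper that marks nodes for mirroring and picks a checkpointing plan, and its README suggests wiring it in roughly as below. The helper name and arguments are taken from that repo and may have drifted, and the tiny network is only for illustration, so treat this as a sketch rather than a verified recipe.)

```python
import mxnet as mx
import memonger   # https://github.com/dmlc/mxnet-memonger

# Any ordinary symbolic network works; a tiny one for illustration.
data = mx.sym.Variable('data')
body = mx.sym.Activation(mx.sym.FullyConnected(data, num_hidden=512), act_type='relu')
net = mx.sym.SoftmaxOutput(mx.sym.FullyConnected(body, num_hidden=10), name='softmax')

# Let memonger search for a memory plan; input shapes are needed for shape inference.
net_planned = memonger.search_plan(net, data=(32, 784))

# Then train with the planned symbol as usual, e.g. through the Module API.
mod = mx.mod.Module(symbol=net_planned, context=mx.gpu(0))
```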
@antinucleon Below are the outputs that I got from the LSTM-based NMT model (from the Sockeye toolkit):
Baseline (no mirroring):
[INFO:root] Global Step[50] Epoch[0] Batch [50] Speed: 261.10 samples/sec perplexity=921.880075 Memory Usage (MB): pid_21652=6179, PE Usage (W, J): dev_0=78.96,1935.41,
[INFO:root] Global Step[100] Epoch[0] Batch [100] Speed: 285.67 samples/sec perplexity=616.976020 Memory Usage (MB): pid_21652=6179, PE Usage (W, J): dev_0=82.94,3793.57,
[INFO:root] Global Step[150] Epoch[0] Batch [150] Speed: 294.01 samples/sec perplexity=496.731365 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=95.77,5878.26,
[INFO:root] Global Step[200] Epoch[0] Batch [200] Speed: 310.77 samples/sec perplexity=424.378748 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=77.22,7468.53,
[INFO:root] Global Step[250] Epoch[0] Batch [250] Speed: 282.17 samples/sec perplexity=369.385264 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=87.60,9455.43,
[INFO:root] Global Step[300] Epoch[0] Batch [300] Speed: 294.64 samples/sec perplexity=321.364135 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=82.16,11240.07,
With MXNET_BACKWARD_DO_MIRROR=1:
[INFO:root] Global Step[50] Epoch[0] Batch [50] Speed: 151.09 samples/sec perplexity=949.961463 Memory Usage (MB): pid_20928=2425, PE Usage (W, J): dev_0=84.69,3587.42,
[INFO:root] Global Step[100] Epoch[0] Batch [100] Speed: 170.88 samples/sec perplexity=625.173421 Memory Usage (MB): pid_20928=2425, PE Usage (W, J): dev_0=76.74,6461.51,
[INFO:root] Global Step[150] Epoch[0] Batch [150] Speed: 178.00 samples/sec perplexity=499.439886 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=84.37,9494.95,
[INFO:root] Global Step[200] Epoch[0] Batch [200] Speed: 195.40 samples/sec perplexity=426.799941 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=79.16,12087.66,
[INFO:root] Global Step[250] Epoch[0] Batch [250] Speed: 169.05 samples/sec perplexity=371.365061 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=81.68,15179.92,
[INFO:root] Global Step[300] Epoch[0] Batch [300] Speed: 180.27 samples/sec perplexity=323.268620 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=73.94,17805.00,
We can see from the output logs that, with the same number of global steps, both runs reach roughly the same training quality. The memory footprint with backward mirroring is around 1/3 of the baseline, but this comes with around a 40% drop in throughput.
I am still investigating the cause of such a large performance drop. In the meantime, if you have any specific benchmark of interest (preferably a small one, like the one you mentioned above, since we are still in the debugging phase), please let me know. Thanks.
My implementation is available here: https://github.com/UofT-EcoSystem/nnvm/blob/cce19c328427de4eacd51178d233ce23a3e3d79d/src/pass/gradient.cc#L144
Any feedback or comment is much appreciated. Thanks.