The MIRROR feature has been broken for a while. Is there any plan to fix it?
Contribution is welcome!
@mxnet-label-bot add [Gluon, Bug]
Repro:
(venv) bingxu@bingxu-mbp:~/research/hiconv-tvm/resnet-50$ python mx_memory.py
3317 MB
(venv) bingxu@bingxu-mbp:~/research/hiconv-tvm/resnet-50$ MXNET_BACKWARD_DO_MIRROR=1 python mx_memory.py
4496 MB
(venv) bingxu@bingxu-mbp:~/research/hiconv-tvm/resnet-50$
Turning mirroring on actually costs more memory.
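(The mx_memory.py script itself isn't shown here. For anyone who wants to reproduce the comparison, a rough sketch along the following lines should work; the toy network and the use of the executor's debug_str() allocation summary are my own stand-ins for whatever the original script does, not a copy of it.)

```python
import os
# The graph executor reads MXNET_BACKWARD_DO_MIRROR when the executor is
# constructed, so set it before bind time (or export it in the shell).
os.environ.setdefault('MXNET_BACKWARD_DO_MIRROR', '0')

import mxnet as mx

def build_toy_net(num_layers=20, num_filter=64):
    """Small stand-in for ResNet-50 (the original script bound a real ResNet-50)."""
    body = mx.sym.Variable('data')
    for i in range(num_layers):
        body = mx.sym.Convolution(data=body, num_filter=num_filter,
                                  kernel=(3, 3), pad=(1, 1), name='conv%d' % i)
        body = mx.sym.BatchNorm(data=body, name='bn%d' % i)
        body = mx.sym.Activation(data=body, act_type='relu', name='relu%d' % i)
    body = mx.sym.Pooling(data=body, global_pool=True, pool_type='avg',
                          kernel=(1, 1), name='pool')
    body = mx.sym.FullyConnected(data=body, num_hidden=1000, name='fc')
    return mx.sym.SoftmaxOutput(data=body, name='softmax')

net = build_toy_net()
# simple_bind plans memory for forward plus backward (grad_req='write').
exe = net.simple_bind(mx.gpu(0), data=(32, 3, 224, 224), grad_req='write')
# The tail of the executor debug string summarizes the planned allocation
# (something like "Total NNNN MB allocated" in the versions I have used).
print('\n'.join(exe.debug_str().split('\n')[-5:]))
```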
This seems to be incorrectly labelled as Gluon. Removing the gluon label.
@mxnet-label-bot Update [Bug, Environment Variables]
mxnet version: 1.4.0, model: resnext101
I think the following code for MXNET_BACKWARD_DO_MIRROR, taken from ./src/executor/graph_executor.cc, has a problem:
it does not mirror a node, even when MXNET_BACKWARD_DO_MIRROR is on, unless __force_mirroring__ is set on it,
and it actually uses more GPU memory when MXNET_BACKWARD_DO_MIRROR is set.
int do_mirror = dmlc::GetEnv("MXNET_BACKWARD_DO_MIRROR", 0);
auto need_mirror = [do_mirror](const nnvm::Node& node) -> int {
  if (node.is_variable()) return 0;
  const std::string& type = node.attrs.op->name;
  if (type == "Dropout") return false;
  if (get_node_attr(node, "__force_mirroring__", false)) return true;
  if (do_mirror == 0) return false;
  if (type == "Convolution") return false;
  if (type == "FullyConnected") return false;
  if (type == "Concat") return false;
  if (type == "SoftmaxOutput") return false;
  if (type == "BatchNorm") return false;
  if (type == "CuDNNBatchNorm") return false;
  return true;
};
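To make the effect of this filter concrete: only nodes that survive all of these checks are candidates for mirroring, and Convolution, FullyConnected, Concat, BatchNorm and SoftmaxOutput are all excluded, so on a typical CNN only cheap ops such as activations remain. The small sketch below (my own illustration, not MXNet code) replays the same name-based policy over a symbol's JSON to count how many nodes would actually be mirrored:

```python
import json
import mxnet as mx

# Op types excluded by need_mirror() above (assuming MXNET_BACKWARD_DO_MIRROR=1
# and no __force_mirroring__ attribute on the node).
EXCLUDED = {'Dropout', 'Convolution', 'FullyConnected', 'Concat',
            'SoftmaxOutput', 'BatchNorm', 'CuDNNBatchNorm'}

def count_mirror_candidates(sym):
    """Replay the name-based policy over the graph JSON and return
    (nodes that would be mirrored, total non-variable nodes)."""
    nodes = json.loads(sym.tojson())['nodes']
    ops = [n['op'] for n in nodes if n['op'] != 'null']   # 'null' nodes are variables
    return sum(op not in EXCLUDED for op in ops), len(ops)

# Tiny example block; on a real ResNeXt-101 the picture is the same:
# every memory-heavy op falls into the excluded set.
data = mx.sym.Variable('data')
conv = mx.sym.Convolution(data=data, num_filter=64, kernel=(3, 3), pad=(1, 1))
act = mx.sym.Activation(data=mx.sym.BatchNorm(data=conv), act_type='relu')
out = mx.sym.SoftmaxOutput(data=mx.sym.FullyConnected(data=act, num_hidden=10))
print(count_mirror_candidates(out))   # -> (1, 5): only the Activation qualifies
```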
I can't stress enough how important this feature is. Please fix it so that the memory cost is back to what it was in version 0.7. https://github.com/dmlc/mxnet-memonger https://arxiv.org/abs/1604.06174
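(As a usage reference for the mxnet-memonger link above: that repo exposes a search_plan helper that marks nodes for mirroring and picks a checkpointing plan, and its README suggests wiring it in roughly as below. The helper name and arguments are taken from that repo and may have drifted, and the tiny network is only for illustration, so treat this as a sketch rather than a verified recipe.)

```python
import mxnet as mx
import memonger   # https://github.com/dmlc/mxnet-memonger

# Any ordinary symbolic network works; a tiny one for illustration.
data = mx.sym.Variable('data')
body = mx.sym.Activation(mx.sym.FullyConnected(data, num_hidden=512), act_type='relu')
net = mx.sym.SoftmaxOutput(mx.sym.FullyConnected(body, num_hidden=10), name='softmax')

# Let memonger search for a memory plan; input shapes are needed for shape inference.
net_planned = memonger.search_plan(net, data=(32, 784))

# Then train with the planned symbol as usual, e.g. through the Module API.
mod = mx.mod.Module(symbol=net_planned, context=mx.gpu(0))
```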
@antinucleon Below are the outputs that I got from the LSTM-based NMT model (from the Sockeye toolkit):
Baseline (no mirroring):
[INFO:root] Global Step[50] Epoch[0] Batch [50] Speed: 261.10 samples/sec perplexity=921.880075 Memory Usage (MB): pid_21652=6179, PE Usage (W, J): dev_0=78.96,1935.41,
[INFO:root] Global Step[100] Epoch[0] Batch [100] Speed: 285.67 samples/sec perplexity=616.976020 Memory Usage (MB): pid_21652=6179, PE Usage (W, J): dev_0=82.94,3793.57,
[INFO:root] Global Step[150] Epoch[0] Batch [150] Speed: 294.01 samples/sec perplexity=496.731365 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=95.77,5878.26,
[INFO:root] Global Step[200] Epoch[0] Batch [200] Speed: 310.77 samples/sec perplexity=424.378748 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=77.22,7468.53,
[INFO:root] Global Step[250] Epoch[0] Batch [250] Speed: 282.17 samples/sec perplexity=369.385264 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=87.60,9455.43,
[INFO:root] Global Step[300] Epoch[0] Batch [300] Speed: 294.64 samples/sec perplexity=321.364135 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=82.16,11240.07,
With MXNET_BACKWARD_DO_MIRROR=1:
[INFO:root] Global Step[50] Epoch[0] Batch [50] Speed: 151.09 samples/sec perplexity=949.961463 Memory Usage (MB): pid_20928=2425, PE Usage (W, J): dev_0=84.69,3587.42,
[INFO:root] Global Step[100] Epoch[0] Batch [100] Speed: 170.88 samples/sec perplexity=625.173421 Memory Usage (MB): pid_20928=2425, PE Usage (W, J): dev_0=76.74,6461.51,
[INFO:root] Global Step[150] Epoch[0] Batch [150] Speed: 178.00 samples/sec perplexity=499.439886 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=84.37,9494.95,
[INFO:root] Global Step[200] Epoch[0] Batch [200] Speed: 195.40 samples/sec perplexity=426.799941 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=79.16,12087.66,
[INFO:root] Global Step[250] Epoch[0] Batch [250] Speed: 169.05 samples/sec perplexity=371.365061 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=81.68,15179.92,
[INFO:root] Global Step[300] Epoch[0] Batch [300] Speed: 180.27 samples/sec perplexity=323.268620 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=73.94,17805.00,
We can see from the output logs that, with the same number of global steps, both runs reach roughly the same training quality. The memory footprint with backward mirroring is around 1/3 of the baseline, but this comes with around a 40% drop in throughput.
I am still investigating the cause of such a large performance drop. In the meantime, if you have any specific benchmark of interest (preferably a small one, like the one you mentioned above, since we are still in the debugging phase), please let me know. Thanks.
My implementation is available here: https://github.com/UofT-EcoSystem/nnvm/blob/cce19c328427de4eacd51178d233ce23a3e3d79d/src/pass/gradient.cc#L144
Any feedback or comment is much appreciated. Thanks.