Incubator-mxnet: Gluon RNN memory leaks with extra variables

Created on 21 Jan 2019 · 10 Comments · Source: apache/incubator-mxnet


Description

Gluon allows one to pass extra input variables that do not contribute to the model output. However, passing such unused variables may cause a memory leak.
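
A minimal sketch of what such an unused extra input looks like in a Gluon block (hypothetical class and layer names, not the attached repro script):

import mxnet as mx
from mxnet import gluon

class ToyRNNModel(gluon.Block):
    def __init__(self, **kwargs):
        super(ToyRNNModel, self).__init__(**kwargs)
        with self.name_scope():
            self.rnn = gluon.rnn.LSTM(hidden_size=200)
            self.decoder = gluon.nn.Dense(10000, flatten=False)  # placeholder vocab size

    def forward(self, inputs, hidden, extra):
        # `extra` is accepted but never used, so it is disconnected from the
        # computation graph that produces `output`.
        output, hidden = self.rnn(inputs, hidden)
        return self.decoder(output), hidden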

Environment info (Required)

----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash   : 19c501680183237d52a862e6ae1dc4ddc296305b
----------System Info----------
Platform     : Linux-4.14.77-70.82.amzn1.x86_64-x86_64-with-glibc2.9
system       : Linux
node         : ip-172-16-95-144
release      : 4.14.77-70.82.amzn1.x86_64
version      : #1 SMP Mon Dec 3 20:01:27 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2706.669
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 1.0198 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0912 sec, LOAD: 0.1530 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.5845 sec, LOAD: 0.1434 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0089 sec, LOAD: 0.1170 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0100 sec, LOAD: 0.3888 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0104 sec, LOAD: 0.0782 sec.

Package used (Python/R/Scala/Julia): Python

Error Message:

If you run watch -n0.1 nvidia-smi, you can observe GPU memory usage growing by about 2 MB every few seconds.
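
As an alternative to watch, a small sketch (assuming nvidia-smi is on the PATH) that logs used GPU memory from Python so the growth can be recorded alongside the training output:

import subprocess
import time

def gpu_mem_used_mib():
    # Query the currently used memory, in MiB, for every visible GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"])
    return [int(v) for v in out.decode().split()]

while True:
    print(gpu_mem_used_mib())
    time.sleep(0.1)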

Minimum reproducible example

See mxnet-memory-leak.tar.gz
The main differences between the attachment and examples/gluon/language_model/ are: an extra argument is added on Line 56 in model.py, and mx.nd.array([], ctx=context) is passed on Lines 166 and 183 in train.py.

Steps to reproduce


1. python train.py --cuda --tied --nhid 200 --emsize 200 --epochs 20 --dropout 0.2 &
2. watch -n0.1 nvidia-smi

What have you tried to solve it?

  1. Add a dummy link between all inputs and outputs (see the sketch after this list). However, this is not always possible, convenient, or readable.
  2. I previously filed a feature request to allow None input types in Gluon models. @szha indicated this would not be fundamentally challenging, but it has not been acted upon and may be low-hanging fruit alongside the memory leak fix.
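
A hedged sketch of workaround 1 above, written as a hypothetical forward method (names assumed, not taken from the attachment): the unused extra input is tied into the output through a numerically inert term so that it no longer dangles off the computation graph.

def forward(self, inputs, hidden, extra):
    output, hidden = self.rnn(inputs, hidden)
    output = self.decoder(output)
    if extra.size > 0:
        # Dummy link: sum the extra input and multiply by zero, which leaves
        # the output unchanged but connects `extra` to the graph.
        output = output + extra.sum() * 0
    return output, hidden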

Related: https://github.com/apache/incubator-mxnet/issues/13247

Labels: Backend, Bug, CUDA, Gluon, Performance

Most helpful comment

The memory leak is related to the extra unused variable you passed into your RNN model, but it is NOT specific to RNN. In your repro script, you created a size-zero ndarray in every loop iteration, which caused the memory leak.

for epoch in range(args.epochs):
    ...
    for i, (data, target) in enumerate(train_data):
        ...
        with autograd.record():
            ....
            output, hidden = model(data, hidden, mx.nd.array([], ctx=context))

However, since the size-zero ndarray is never actually used, it is better practice to create it once outside the loop and reuse it throughout training. The same change applies to the eval() function in your repro script.

extra = mx.nd.array([], ctx=context)
for epoch in range(args.epochs):
    ...
    for i, (data, target) in enumerate(train_data):
        ...
        with autograd.record():
            ....
            output, hidden = model(data, hidden, extra)

With this change, I ran your repro script for 10 epochs with the mxnet_cu90mkl 1.3.1 and 1.4.0 packages and did not see a memory leak.

That said, there is indeed an underlying memory leak, which is the root cause here. Please refer to #14358 for more details.

All 10 comments

@mxnet-label-bot Add [Gluon, Performance]

@mxnet-label-bot add [backend, cuda]

@yifeim I am looking into this issue.

@apeforest Why is this not a bug?

@yifeim Sorry, got too busy and haven't had a chance to dive deep into this. Yes, I think it's a bug. @mxnet-label-bot add [Bug]


@yifeim After a little bit more digging, I think the issue is specifically related to the use of a size-zero ndarray as your extra variable. If you instead use mx.nd.array([1], ctx=context) as the extra variable in the loop of your repro script, you will not observe any memory leak. The real problem is creating a size-zero ndarray in a loop.
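
Based on that diagnosis, a hedged, RNN-free sketch (not taken from the thread) that should reproduce the same growth on the affected versions: allocate a size-zero ndarray on the GPU in a tight loop while watching nvidia-smi.

import mxnet as mx

context = mx.gpu(0)
for step in range(100000):
    _ = mx.nd.array([], ctx=context)  # size-zero allocation every iteration
    if step % 10000 == 0:
        mx.nd.waitall()               # drain the async engine before printing
        print("step", step)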

Very interesting. Thanks a lot for the insights!

Thanks for handling @yuxihu!

@anirudh2290 Could you please reopen this? The original fix has been reverted due to test flakiness. I am working on an alternative fix.
