ml-agents: Memory Leak on Linux

Created on 5 Nov 2018 · 22 comments · Source: Unity-Technologies/ml-agents

Hi,

I have been using Unity ml-agents on my Mac to create environments. I then build the environment for Linux and upload it to git in order to pull the environment down on a Linux machine. When I use the environment on the Linux machine and run the learn.py script, it continues to use more and more memory until it runs out and crashes.

I am running on Ubuntu 16.04 with ml-agents 0.4.0b and Unity 2017.2.0f1.

I know this was an issue with textures in the past, but the version of ml-agents that I have is after this issue was fixed.

bug

Most helpful comment

Located the problem. Will see if I can get to the bottom of this and submit a PR.

All 22 comments

My ml-agents version 0.5 also has this problem on Linux. The memory usage keeps increasing and increasing...

This is a known issue that is on our to fix bug list.

Resources.UnloadUnusedAssets();

@xiaomaogy has this bug been fixed in version 0.6?

@arixlin Not yet.

@xiaomaogy Thanks!

Any pointers to what the cause of this is?

In my opinion it could either be an issue on Unity's end with the visual observations, or an issue with TensorFlow itself. I have seen a few reports saying that TensorFlow has a memory leak like this. I don't have the issue with vector observations though, so I'm not sure.

@atapley Thanks for the fast reply. Yes, I suspected TF too, but will need to do some profiling to confirm that. The unity process (executable) itself seems stable, the python side is doing horrible things. Will post updates.

Using ml-agents 0.7 with TF 1.13.1 results in the same memory leak. Could it be a TF usage error: memory not being freed from the session when it should be?

Onwards...

The reason I say it may be a usage error is that using Rainbow (Dopamine) with the same environment does not produce the memory leak. So it can't be TF alone; well, it could be, but that seems unlikely.

@xiaomaogy please let me know what is already known, including where this bug is being tracked; I would like to keep an eye on progress and provide input.

From what I can tell, it has to do with the trainer's accumulation of experiences. The trainer is the only thing I can see that is holding onto memory; I haven't located its limit yet, or where it bounds the amount of memory it uses.

It doesn't matter what dimensions the observations are, that just determines how quickly you'll run out of memory.
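To illustrate why observation size only changes how fast memory runs out, here is a minimal Python sketch (hypothetical names, not the actual ml-agents code) of an experience buffer with no cap: each step retains a full copy of the observation, so memory grows linearly with the number of steps kept.

```python
# Hypothetical sketch of an unbounded experience buffer. This is NOT the
# ml-agents implementation; it only demonstrates the growth pattern
# described above.
class ExperienceBuffer:
    """Accumulates (observation, action, reward) tuples with no size limit."""

    def __init__(self):
        self.experiences = []

    def append(self, observation, action, reward):
        # Every call retains a full copy of the observation; nothing is
        # ever evicted, so memory grows linearly with steps taken.
        self.experiences.append((observation, action, reward))


buffer = ExperienceBuffer()
obs = [0.0] * (84 * 84 * 3)  # e.g. one 84x84 RGB visual observation

for step in range(1000):
    buffer.append(list(obs), 0, 0.0)

# Retained memory is roughly (observation size) x (steps retained):
# larger observations just hit the same wall sooner.
print(len(buffer.experiences))
```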

Located the problem. Will see if I can get to the bottom of this and submit a PR.

This bug is mainly being tracked on this issue. We also have an internal Trello board, which just links to this issue.

@xiaomaogy It's not the most desirable fix, but it does the job.

Ideally the training and eval would be completely decoupled, but that's a larger task than I have time for at present.

This PR ultimately stops the buffer from being filled at all during evaluation, hence no leak. I believe this is also desirable for efficiency.
Alternatively, control flow could have been added to clear the buffer during eval after it fills, but that would be fruitless.

Ah, ignore the first commit (be5e0ea). I split some coupled changes, the patch is in 6fc56f7

Hence no leak during evaluation. But training will still have the memory leak, am I correct? @tjad

Training won't have a leak, as far as I remember; I'll confirm that. The way training works clears the buffer, but during eval it wasn't training, and hence wasn't clearing the buffer.
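The fix described above can be sketched as follows. This is a hypothetical Python illustration (the class and method names are illustrative, not the actual ml-agents API): experiences are only stored while training, and the training update consumes and clears the buffer, which is why the leak only surfaced during long evaluation-only runs.

```python
# Hypothetical sketch of the described fix: skip buffer writes entirely
# when not training. Names are illustrative, not the ml-agents API.
class Trainer:
    def __init__(self, is_training):
        self.is_training = is_training
        self.buffer = []

    def add_experience(self, experience):
        if not self.is_training:
            return  # eval mode: nothing stored, so nothing can leak
        self.buffer.append(experience)

    def update_policy(self):
        # Training consumes and clears the buffer after each update,
        # which is why long training runs were fine while long
        # evaluation runs leaked.
        self.buffer.clear()


eval_trainer = Trainer(is_training=False)
for step in range(10_000):
    eval_trainer.add_experience({"obs": step})

# The eval-mode buffer stays empty no matter how many steps run.
print(len(eval_trainer.buffer))
```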

Probably why it's not been such a major issue :-)

When I looked into this issue, I suspected there may have been a lack of direction as to where the actual memory leak was. I trained fine for hours on end; evaluation wasn't happening as easily.

It looks like this issue has been solved, so I'm going to close it. Thanks, and feel free to reopen if needed.
