Glow: [GLOW] Memory leaking in opencl and build issues

Created on 30 Nov 2018 · 12Comments · Source: pytorch/glow

I’m having memory issues running the mnist and cifar10 examples, and I had problems compiling which might be related.

The memory issue is that when I run e.g. the cifar10 training example with -opencl, the GPU memory usage grows at a rate of 200MB per second (looking at nvidia-smi) until the lack of memory crashes the program.

I tried updating my nvidia libraries and opencl libraries but the behaviour is the same.
Any ideas how to debug this problem? The mnist example actually makes it to the final predictions and they are right, so the program is running correctly in that respect.

The building issues I had might be related, firstly for some reason llvm-link-6.0 was found by cmake while the rest finds llvm-7.0, so I had to manually set that to llvm-link-7 (I’m running debian btw.)
Then I had to make sure the code within the FACEBOOK_INTERNAL && LLVM_VERSION_PATCH < 20181009 is run, and not the regular llvm-7 code. It seems that the LLVM_VERSION_PATCH variable is not set while I apparently need it.

Any help would be appreciated! The glow library seems otherwise perfect to use!

Source

marijnfs

Most helpful comment

@gcatron You recently touched the related memory allocation code in the OpenCL backend. Could you have a look?

opti-mix on 30 Nov 2018

👍2

All 12 comments

@gcatron You recently touched the related memory allocation code in the OpenCL backend. Could you have a look?

opti-mix on 30 Nov 2018

👍2

I'll try it on another system with different gpu cards to see if it's related to that. For now I tried it on a GTX1080 with recent drivers (on a Debian system).

marijnfs on 30 Nov 2018

When running on my laptop with opencl, I also get growing memory usage in my regular memory.
OpenCL seems to use the Intel HD graphics as the compute units which from what I can tell uses global memory. It does run much faster than the CPU backend and leads to the correct result.

So this doesn't seem to be a problem related to a specific driver or device.

marijnfs on 30 Nov 2018

Sounds like it's something I may have introduced. Can you provide the commands you are running to replicate it? I'll take a look and see if I can find the leak.

gcatron on 30 Nov 2018

Ok it seems to be the OpenCLFunction::setupRuns() function. It is the only function that calls the allocdevicebuffer function that in turn is the only one that makes clbuffers.

The function allocates a new buffer every time it is called, which is only free'd in the destructor of that class. Although setupRuns sounds like it shouldn't be called much it seems it is, and it might make sense to setup the function more times as input sizes change, while keeping the function alive. Anyway I now hacked in a delete if the pointer is not null and it doesn't leak anymore.

This is a hacky way, probably we should at least check if the new needed buffer size actually is different (at least larger) and only then create a new buffer, and reuse the old one otherwise.

void OpenCLFunction::setupRuns() {
  if (deviceBuffer_) {
    freeDeviceBuffer(deviceBuffer_);;
    deviceBuffer_ = nullptr;
  }

  deviceBuffer_ = allocDeviceBuffer(bundle_.getConstantWeightSize() +
                                    bundle_.getMutableWeightSize() +
                                    bundle_.getActivationsSize());

marijnfs on 30 Nov 2018

That makes sense, when I did the port I should have freed the buffer in tearDownRuns. I'll put up a PR shortly. Basically, setupRuns and tearDownRuns are really only meant to be run once per set of executions but that logic isn't in the execution engine yet.

gcatron on 1 Dec 2018

@marijnfs Thank you for debugging and finding the issue. Feel free to submit a PR that fixes the problem. We would love to work with you on this and other issues/features/topics.

nadavrot on 1 Dec 2018

👍1

Logic added! @marijnfs could you take a look at #2107 and let me know if that works for you?
I'm not as sure about the build issues but this should fix the memory leak.

gcatron on 1 Dec 2018

👍1

Awesome, I'll try it out!

marijnfs on 1 Dec 2018

Ok it works for me!
On another note, what I really need are 3D convolutions (I work in medical), is there an issue for that? Would it be hard to implement? I'm not sure how strong the assumption of 2D data is present in the code. I would be happy to contribute to this.

marijnfs on 1 Dec 2018

@marijnfs Thank you for working with us. Yes, we should add support for 3D convolutions. Could you please open a new 'issue' where we can discuss the design? It would be great working with you on this.

And thank you @gcatron for fixing the current issue.

nadavrot on 3 Dec 2018

cc: @artemrakhov who might look into 3d convolutions in the past ^