Glow: [GLOW] Memory leaking in opencl and build issues

Created on 30 Nov 2018  路  12Comments  路  Source: pytorch/glow

I鈥檓 having memory issues running the mnist and cifar10 examples, and I had problems compiling which might be related.

The memory issue is that when I run e.g. the cifar10 training example with -opencl, the GPU memory usage grows at a rate of 200MB per second (looking at nvidia-smi) until the lack of memory crashes the program.

I tried updating my nvidia libraries and opencl libraries but the behaviour is the same.
Any ideas how to debug this problem? The mnist example actually makes it to the final predictions and they are right, so the program is running correctly in that respect.

The building issues I had might be related, firstly for some reason llvm-link-6.0 was found by cmake while the rest finds llvm-7.0, so I had to manually set that to llvm-link-7 (I鈥檓 running debian btw.)
Then I had to make sure the code within the FACEBOOK_INTERNAL && LLVM_VERSION_PATCH < 20181009 is run, and not the regular llvm-7 code. It seems that the LLVM_VERSION_PATCH variable is not set while I apparently need it.

Any help would be appreciated! The glow library seems otherwise perfect to use!

Most helpful comment

@gcatron You recently touched the related memory allocation code in the OpenCL backend. Could you have a look?

All 12 comments

@gcatron You recently touched the related memory allocation code in the OpenCL backend. Could you have a look?

I'll try it on another system with different gpu cards to see if it's related to that. For now I tried it on a GTX1080 with recent drivers (on a Debian system).

When running on my laptop with opencl, I also get growing memory usage in my regular memory.
OpenCL seems to use the Intel HD graphics as the compute units which from what I can tell uses global memory. It does run much faster than the CPU backend and leads to the correct result.

So this doesn't seem to be a problem related to a specific driver or device.

Sounds like it's something I may have introduced. Can you provide the commands you are running to replicate it? I'll take a look and see if I can find the leak.

Ok it seems to be the OpenCLFunction::setupRuns() function. It is the only function that calls the allocdevicebuffer function that in turn is the only one that makes clbuffers.

The function allocates a new buffer every time it is called, which is only free'd in the destructor of that class. Although setupRuns sounds like it shouldn't be called much it seems it is, and it might make sense to setup the function more times as input sizes change, while keeping the function alive. Anyway I now hacked in a delete if the pointer is not null and it doesn't leak anymore.

This is a hacky way, probably we should at least check if the new needed buffer size actually is different (at least larger) and only then create a new buffer, and reuse the old one otherwise.

void OpenCLFunction::setupRuns() {
  if (deviceBuffer_) {
    freeDeviceBuffer(deviceBuffer_);;
    deviceBuffer_ = nullptr;
  }

  deviceBuffer_ = allocDeviceBuffer(bundle_.getConstantWeightSize() +
                                    bundle_.getMutableWeightSize() +
                                    bundle_.getActivationsSize());

That makes sense, when I did the port I should have freed the buffer in tearDownRuns. I'll put up a PR shortly. Basically, setupRuns and tearDownRuns are really only meant to be run once per set of executions but that logic isn't in the execution engine yet.

@marijnfs Thank you for debugging and finding the issue. Feel free to submit a PR that fixes the problem. We would love to work with you on this and other issues/features/topics.

Logic added! @marijnfs could you take a look at #2107 and let me know if that works for you?
I'm not as sure about the build issues but this should fix the memory leak.

Awesome, I'll try it out!

Ok it works for me!
On another note, what I really need are 3D convolutions (I work in medical), is there an issue for that? Would it be hard to implement? I'm not sure how strong the assumption of 2D data is present in the code. I would be happy to contribute to this.

@marijnfs Thank you for working with us. Yes, we should add support for 3D convolutions. Could you please open a new 'issue' where we can discuss the design? It would be great working with you on this.

And thank you @gcatron for fixing the current issue.

cc: @artemrakhov who might look into 3d convolutions in the past ^

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tlepley-cadence picture tlepley-cadence  路  4Comments

artemrakhov-glow picture artemrakhov-glow  路  4Comments

wayneshawn picture wayneshawn  路  3Comments

mciprian13 picture mciprian13  路  3Comments

qcolombet picture qcolombet  路  5Comments