Dlib: Calling `.clean()` between forward-passes

Created on 26 Jun 2018 · 35 Comments · Source: davisking/dlib

  • Version: v19.13
  • Where did you get dlib: github
  • Platform: Ubuntu 16.04.4 with CUDA 8.0 and cuDNN7, as well as running via docker (base image: nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04)
  • Compiler: Built with:
cmake .. -DDLIB_NO_GUI_SUPPORT=ON -DDLIB_USE_CUDA=ON && cmake --build . --config Release

Using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

Watching the GPU memory usage via watch -n0.5 nvidia-smi while using dlib, it seems like if I deserialize a trained model from disk to memory and use it for inference, over time the GPU memory usage trends up (it fluctuates up and down but overall it slowly rises). However, I think if I first call net.clean() before each inference, the memory usage remains more stable and no longer gradually climbs.

I'm not yet sure if there is a leak somewhere or not, but I'd be interested in knowing if calling .clean() before each inference is the suggested use when executing inferences on the same model iteratively.

Some sample code to show what my usage is like (not exact, but should at least show the gist of it):

std::vector<dlib::matrix<dlib::rgb_pixel>> inputImages = {.......}; // Assume there are somewhere say 1-50 images stored here
int batchSize = 50; // Just as an example, likely irrelevant for this situation

net_type_faceDetector _faceDetector;
dlib::deserialize(modelfile) >> _faceDetector;

for(int i=0; i < 10000000; i++) {
  _faceDetector.clean(); // This is the line in question
  auto detections = _faceDetector(inputImages, batchSize);
}

I guess the tl;dr is should .clean() be necessary (or even a good idea) between inference calls? The network architectures of the models I'm using are the same architectures as mmod_front_and_rear_end_vehicle_detector or mmod_human_face_detector.

Thanks!

inactive

All 35 comments

You shouldn't be calling clean() in a loop, that's just going to slow down
processing since it then deallocates and reallocates GPU resources over and
over.

I doubt there is a memory leak, but you never know.

Thanks, that's what I was suspecting about clean().

To add more detail to my case, I found that this seems to be happening when I deserialize two separate models into memory and perform forward passes through both of them in the same application. I have a standalone app that I'm attempting to tweak so that it exhibits this same behavior, but it seems to have a bug where it occasionally dies and appears defunct:

$ ps aux | grep dnn
cchadow+  5625 44.4  0.0      0     0 pts/1    Zl+  12:16   0:28 [dnn_mmod_face_d] <defunct>

nvidia-smi shows that it is still holding onto GPU memory, as well:

| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      5625    C   ./dnn_mmod_face_detection_ex_test              291MiB |
+-----------------------------------------------------------------------------+

It's essentially just a stripped down version of the dnn_mmod_face_detection_ex.cpp example, with some additions: https://gist.github.com/cchadowitz-pf/196f3f3cc102787adaf7b2ca167393a8

I'll continue to see if I can reproduce the memory situation in a standalone application, but this defunct state is strange. Is there any reason to believe that multiple models cannot be used in the same application with CUDA?

Thanks again!

You should be able to have as many models as you want, subject to limits on
GPU memory.

Yes, and I have no problems with doing that in general. I'm not sure what's going on with the zombie process, but that's less important.

I'm wondering if there may be some issue with releasing GPU memory from the CUDA stream context that would account for the overall memory usage climbing slowly. It's clear that the GPU memory usage climbs with the batch size, as it should, but I've found that if I do a forward pass on one model with (for example) batch size 50 and then do forward passes on a second model with smaller, but increasing, batch sizes (1, 2, 5, 10, 20, 50, etc.), the overall GPU memory usage of the app as shown in nvidia-smi increases. Eventually, if I continue similar patterns, it runs out of GPU memory:

Error while calling cudaMalloc(&data, n) in file dlib/cuda/cuda_data_ptr.cpp:28. code: 2, reason: out of memory

This happens even when I use clean() in between, so it appears that there's some situation with GPU memory management where releasing memory either isn't happening when necessary, or at all, for some portion of the used memory. Or, there isn't sufficient contiguous memory to reallocate on the GPU so more memory is allocated, which could also explain the slow increase.

Essentially, a single forward pass for a given model and a given batch size of images uses a particular amount of GPU memory the first time, and after some additional forward passes with the model and a second model (in the same application), that identical forward pass with the same model, batch size, and images appears to require more GPU memory than the first time. I can't say whether it's because memory isn't being released when it should be, or because of issues with contiguous memory allocation, or something else.

Huh. You should put some logging on the cudaMalloc and cudaFree calls and see if they pair up. They should since they are all held by smart pointers. There are only 3 places these routines are called so it's easy to log.

Thanks, I could only find two places cudaMalloc and cudaFree are called:

https://github.com/davisking/dlib/search?q=cudaMalloc&unscoped_q=cudaMalloc returns these two cudaMalloc:
https://github.com/davisking/dlib/blob/24ac9bc43f62e412bfc132ba2966b7c7f99e5349/dlib/cuda/cuda_data_ptr.cpp#L28
https://github.com/davisking/dlib/blob/24ac9bc43f62e412bfc132ba2966b7c7f99e5349/dlib/cuda/gpu_data.cpp#L195

https://github.com/davisking/dlib/search?q=cudaFree&unscoped_q=cudaFree returns these two cudaFree:
https://github.com/davisking/dlib/blob/24ac9bc43f62e412bfc132ba2966b7c7f99e5349/dlib/cuda/cuda_data_ptr.cpp#L30
https://github.com/davisking/dlib/blob/24ac9bc43f62e412bfc132ba2966b7c7f99e5349/dlib/cuda/gpu_data.cpp#L197

So I added some logging before each cudaMalloc and cudaFree call and re-ran my test that exercises this issue. From my logs, it does not appear that cudaMalloc and cudaFree pair up. I've put my full log in this gist: https://gist.github.com/cchadowitz-pf/488d0f5ed38d4a8309d1cc006477aad7

Even if you discount the logging after the first cudaMalloc error, there are more cudaMalloc calls than cudaFree calls.
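For anyone wanting to check this kind of pairing mechanically, here is a minimal sketch that tallies the logged calls per source file. It assumes log lines of the form "<call> called from <file> line <n>"; the cudaFree lines quoted later in this thread look like that, but the cudaMalloc format and the sample log below are made up for illustration. The helper names (`count_calls`, `outstanding`) are hypothetical.

```python
import re
from collections import Counter

# Assumed log format: "<call> called from <file> line <line>:<col>".
# The cudaMalloc variant of this format is a guess.
LOG_LINE = re.compile(r"^(cudaMalloc|cudaFree) called from (\S+) line (\d+)")

def count_calls(log_text):
    """Return a Counter keyed by (call, file) for every matching log line."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

def outstanding(counts):
    """Per source file: number of cudaMalloc calls minus cudaFree calls."""
    files = sorted({f for (_, f) in counts})
    return {f: counts[("cudaMalloc", f)] - counts[("cudaFree", f)] for f in files}

# Made-up sample log, only to exercise the functions.
sample_log = """\
cudaMalloc called from gpu_data.cpp line 197:8
cudaMalloc called from gpu_data.cpp line 197:8
cudaFree called from gpu_data.cpp line 199:8
cudaMalloc called from cuda_data_ptr.cpp line 29:8
cudaFree called from cuda_data_ptr.cpp line 31:8
"""

balance = outstanding(count_calls(sample_log))
print(balance)
```

A positive number for a file means more allocations than frees were logged so far. That alone does not prove a leak, since live objects may still legitimately hold their buffers.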

Did you let the program run to completion and terminate itself properly?

The overall program that I'm using dlib with is a server/service that remains running until I choose to terminate it, so that's not really possible :-/ That's also why this has come up because I would often have it up and running and over time the memory usage will have grown - I've now been able to narrow it down to exhibit the issue in a known way.

Well, you have to test it in the context of a program that runs to termination. Maybe there are still legitimate objects existing holding onto those smart pointers. Also, I regularly run the dlib code for days at a time and don't observe memory growth. So just running it for a long time shouldn't matter.

Can you post a self contained program that exhibits the problem?

I'm still working on seeing if I can put together a self-contained program for this issue. I understand your point about maybe legitimate objects exist holding onto those smart pointers, but the way that I'm using the dlib models/calls is pretty straightforward and so I don't really know why things would be held longer than necessary. The fact that I can essentially run a handful of calls in my service that causes an out of memory error seems to indicate that along the way memory isn't being released when it possibly should be.... The only dlib-related objects that I believe are being held in memory perpetually are the net objects themselves, so perhaps something internal to those isn't getting cleaned up?

I'll continue to try to put together that self-contained program in the meanwhile. Thanks again!

For what it's worth, when I terminate my program that produced the log in the gist above, it printed these additional cudaFree lines:

cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from cuda_data_ptr.cpp line 31:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from cuda_data_ptr.cpp line 31:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from gpu_data.cpp line 199:8
cudaFree called from cuda_data_ptr.cpp line 31:8
cudaFree called from gpu_data.cpp line 199:8

So it's not that I believe memory isn't being freed, I'm just not sure if it's being freed at the right time, if that makes sense.

Still working on the self-contained program :)

Still working on that self-contained program.... At the very least, I've found this doesn't require two or more models to be used, one model can cause this as well.

The key seems to be to first find the largest batch size your particular GPU can handle without running out of memory in a one-off use, say N.
Then if I do a few forward passes with batch size N, it completes successfully and fills the GPU memory almost all the way.
I then vary the batch size between 1 and N and do a few more forward passes, the memory usage drops and climbs as expected.
A handful of forward passes with batch size 1 just to give it a chance to release memory, if it matters.
And finally, a sequence of forward passes with batch sizes climbing from 1 to N (e.g. 1, 2, 10, 20, 50, 100, ..., N). The last forward pass of batch size N is where it fails with the cuda out of memory error.

This seems to be similar (at least in terms of the symptoms) to #890, though I am still working on replicating it outside of my particular codebase. #890 does explain pretty much the same thing I'm seeing, though, so perhaps there's another case where memory is being allocated for a tensor before freeing an old one?

Basing this off of the script in #890, I think I have a python-based script that reproduces this behavior. The only caveat is that the max_batch_size needs to be set to the largest batch size your GPU can handle in a single call without running out of memory. For my GPU (GeForce GTX 1070 w/8gb memory) and the attached image, it was 38. The key in this case seems to be after a number of max_batch_size detection calls, alternate batch size 1 and max_batch_size. Mine crashes at the first max_batch_size after batch size 1.

import dlib

face_detector_path = 'mmod_human_face_detector.dat'
face_detector = dlib.cnn_face_detection_model_v1(face_detector_path)

f = 'grace_hopper.jpg'
def create_img_batch(batch_size):
  return [dlib.load_rgb_image(f) for i in range(batch_size)]

# Test detector with a constant batch size
max_batch_size = 38
img_batch = create_img_batch(max_batch_size)
n_iters = 5
print("Constant batch size:")
for i in range(n_iters):
    print("\t%d/%d" % (i, n_iters))
    face_detector(img_batch, 1)

# Variable batch size
print("Varying batch sizes")

batch_sizes = [1, max_batch_size, 1, max_batch_size, 1, max_batch_size]
for i in range(len(batch_sizes)):
  batch_size = batch_sizes[i]
  print("\t%d/%d (size: %d)" % (i, len(batch_sizes), batch_size))
  face_detector(create_img_batch(batch_size), 1)

(Attached image: grace_hopper)

That doesn't sound like a memory leak. You are just running out of ram. You shouldn't be surprised if you run right up to the limit and then do a bunch of reallocations if you get a failed allocation request.

Hmm....

After GPU memory is allocated and used in one forward-pass, is there anything kept in memory (I guess cached...?) for future forward-pass calls, or is the allocated memory just held for future calls to avoid releasing/allocating GPU memory too eagerly?

I guess I don't fully understand at what point the memory is released (whether by the smart pointers in dlib, or the CUDA context itself).

It's a little complicated since the cuda runtime has some persistent
storage associated with it. But the vast majority of the GPU RAM is freed
upon destruction of the net object.

And if the net object is held in scope and not destroyed, is the vast majority of the GPU RAM still kept allocated/in use? It seems like there is some amount that is freed, or is that strictly the GPU RAM allocated to handle the (batched) input images?

Calling clean() removes some things, but many parts are still allocated.
You shouldn't be running over and over with a changing batch size anyway
because reallocating GPU memory is a slow and expensive operation in the
CUDA runtime. You should instead batch things together into consistent
blocks for processing.

Right - I generally do batch in consistent blocks, but occasionally I'll have a somewhat smaller batch than usual. The exercises I've been doing above were to emphasize the memory fluctuation/growth that I seem to occasionally be seeing longer-term.

Thanks for the info, though - it sounds like I'll have to keep an eye on it and see if it generally is because I'm running right at the edge of the memory constraints, or because of something else. If I get any more info, I'll post it here :)

Warning: this issue has been inactive for 64 days and will be automatically closed on 2018-09-07 if there is no further activity.

If you are waiting for a response but haven't received one it's likely your question is somehow inappropriate. E.g. you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's documentation, or a Google search.

Notice: this issue has been closed because it has been inactive for 68 days. You may reopen this issue if it has been closed in error.

I have been experiencing similar issues when processing image sets of different sizes. I was getting a malloc error when doing a forward on a small image after a series of larger images. Seems like CUDA tries to optimize memory allocation but doesn't free all the memory when moving to a smaller tensor.

My solution so far has been to use an input buffer of constant size (which prevents CUDA from reallocating), set to the largest size in my dataset. Each image is then inserted in with zero-padding on the sides (when smaller than buffer size):
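dlib::matrix<T> input_buffer(maxNr, maxNc);  // T: your image element type (the template argument is missing from the original post)
dlib::set_all_elements(input_buffer, 0);     // zero-fill the constant-size buffer
// copy the image into the top-left corner; everything outside it stays zero
dlib::set_subm(input_buffer, dlib::range(0, image.nr()-1), dlib::range(0, image.nc()-1)) = image;

// ...then do to_tensor() and forward()...
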
This solves the malloc issue. Note that in cases where the largest size of the input images is not known, the input_buffer could be set to an arbitrary large size that fits in your GPU memory.

In general though, it would be useful to _have the possibility to enforce a complete clear of the memory allocated by CUDA_, even if it requires a reallocation at the next call.

I'm not really sure what can be done. cudaFree() is called on all the relevant cuda buffers when you run the clean() routine. As far as I am aware, there isn't anything in the cuda API like cudaFreeNoReallyDoIt(). Or is there some way to accomplish this?

So, my only other experience with this has been with Caffe and Tensorflow.

For Tensorflow, my personal experience shows it allocating as much memory as possible on the available GPU(s) up front and utilizing it as it needs to throughout, only releasing it when the task is complete.

For Caffe, it seems to behave more like @ThomasGuerneve described, where it allocates a buffer up front (but not all possible memory) but doesn't seem to free the memory after each use. And if it needs more memory, it just allocates additional memory and continues to hold onto that. I'm not sure if it then releases it afterwards to maintain its "default" amount of allocated memory, or if it doesn't release it until the task is complete.

That may or may not be helpful, but it is two other examples of how it could be handled I suppose...

Dlib works like caffe more or less. When a tensor object is created it allocates a buffer of the appropriate size. Then when you change its size, it either reallocates if you are asking to make it bigger or if you are making it smaller it just keeps the same buffer. To say that another way, it's like std::vector. It holds the memory forever so that resizes are fast and you don't have these problems. But this means you have to begin the use of a network by running the maximum sized image you will use through it. From then on you can send smaller images through and everything will be fine and there shouldn't be any reallocations.

However, don't call .clean() if you care about these issues as clean() forces deallocation. Really, people shouldn't be calling clean(). The main point of clean() is just to drop unneeded stuff prior to serialization to disk.
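To make the policy described above concrete, here is a toy model of a grow-only buffer, with a clean() that drops it. This is only a sketch of the behaviour as described in this comment, not dlib's implementation; the class `GrowOnlyBuffer` and its fields are hypothetical, and "allocations" simply counts how often a real implementation would have had to call the allocator.

```python
class GrowOnlyBuffer:
    """Toy stand-in for a tensor's device buffer (not dlib code)."""

    def __init__(self):
        self.capacity = 0      # elements currently allocated
        self.size = 0          # elements currently in use
        self.allocations = 0   # how many times a (re)allocation was needed

    def resize(self, n):
        # Reallocate only when the request exceeds what is already held;
        # shrinking just reuses the existing buffer.
        if n > self.capacity:
            self.capacity = n
            self.allocations += 1
        self.size = n

    def clean(self):
        # Drop the buffer entirely; the next resize must allocate again.
        self.capacity = 0
        self.size = 0


batch_sizes = [50, 1, 20, 50, 5]   # largest batch goes first

buf = GrowOnlyBuffer()
for n in batch_sizes:
    buf.resize(n)
print("without clean():", buf.allocations)

churn = GrowOnlyBuffer()
for n in batch_sizes:
    churn.clean()                  # clean() before every pass
    churn.resize(n)
print("with clean():", churn.allocations)
```

In this model, running the largest batch first means a single allocation serves every later pass, while calling clean() before each pass forces one allocation per pass.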

That makes sense, but if I understand correctly, it sounds like you're speaking only about the individual image size, not the batch size. Does this hold true for batch sizes as well?

A tensor is a collection of images. So a batch is always one tensor object. So yes, it holds for batch sizes.

This week, I stumbled upon what looks like a somewhat similar issue. It seems that I found a workaround, though.

I have a service that is always on. It receives tasks from a load-balancing message queue. The service basically never closes itself (which makes the setup quite different from the example dlib programs, for example). I have written the service in C++; there's no Python or similar. The training process runs in a thread that is spun for each new task; the thread is terminated, when the training is complete. During training, the program instance in question does not pull new tasks from the message queue, so each instance is processing max one task at a time.

Generally this approach works great, but I noticed that in between tasks, not all GPU memory was freed up. Even worse, it appears that a tiny chunk of memory leaked after every task. Eventually, all GPU memory would be taken it seems. Verified this by looking at GPU memory consumption while feeding the service an endless stream of very short tasks: the consumption grew slowly, but steadily.

The cudaMalloc and cudaFree calls do appear to line up, so I had trouble figuring out where I should look.

Fortunately, it seems that calling cuDevicePrimaryCtxReset between tasks helps. However, if I do this before destroying my cuDNN handles, then cudnnDestroy never returns when it tries to do its thing. Because in stock dlib the handles are auto-destroyed at thread exit and there's no way to do it explicitly, I had to make minor changes so I can explicitly destroy the cuDNN handles when I want (specifically before I call cuDevicePrimaryCtxReset).

@reunanen Don't make new threads. The CUDA runtime allocates resources for every thread. Use a thread pool instead so you aren't creating new threads over time.
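As a language-neutral illustration of the pattern being suggested (the service in question is C++, so this is only a sketch of the idea), the snippet below creates one pool at start-up and runs every task on it instead of starting a thread per task. The function `run_task` and the pool name are hypothetical, and nothing here touches CUDA; it only shows that the same worker thread is reused.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id):
    # Placeholder for real work; reports which worker thread ran the task.
    return threading.current_thread().name

# One pool, created once and kept for the life of the process.
pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gpu-worker")

# Every task is handed to the same pool rather than to a fresh thread.
workers = [pool.submit(run_task, i).result() for i in range(5)]
print(set(workers))

pool.shutdown()
```

With a single worker, all five tasks report the same thread, so any per-thread state set up by a runtime would be created once rather than once per task.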

@davisking Thanks for the tip!

I tried it right away, but unfortunately switching to use a thread pool alone doesn't help (if I don't also call cuDevicePrimaryCtxReset).

Although, doesn't dnn_trainer anyway instantiate its own thread pool(s)? And this is what matters, no? So should I actually keep even the trainer instance alive across calls? Doing so may get a bit tricky though, because in my setup, different tasks may be related to training very different-looking neural networks (e.g. different depth, and/or different type – object detection or semantic segmentation or instance segmentation). So yes, I compile different C++ types, and then choose on the fly (and yes: this partly contributes to why my compile times are long, although basically the deepest architectures dominate this and the more shallow ones don't matter all that much).

So probably I'd need to pass a long-living thread pool to the trainer, instead of letting each trainer instance construct its own pool. Looks like this isn't really supported today, but I think I can at least test if doing so would help.

Just to test this, I made trainer's thread pool static, and the problem went away.

So, what now? Should I try to prepare a PR that allows passing a thread pool reference from outside dnn_trainer?

Oh right, I forgot about those thread pools in the trainer. Yeah, to work around this you would need to make those things have a lifetime equal to your process. I guess make a type like

using threads = std::vector<std::shared_ptr<thread_pool>>;

Inside the dnn_trainer and make it an optional constructor argument and use that internally instead of the one that's already there.
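The shape of that suggestion, an optional constructor argument carrying a pool that outlives the trainer, can be sketched as follows. This is not dlib's dnn_trainer or any existing dlib API; the `Trainer` class, its `pool` parameter, and the placeholder "training" are all hypothetical and only illustrate the ownership rule.

```python
from concurrent.futures import ThreadPoolExecutor

class Trainer:
    """Hypothetical stand-in for a trainer; not dlib's dnn_trainer."""

    def __init__(self, pool=None):
        # Use the caller's long-lived pool when one is given; otherwise fall
        # back to a private pool owned by this trainer.
        self._owns_pool = pool is None
        self.pool = pool if pool is not None else ThreadPoolExecutor(max_workers=2)

    def train(self, values):
        # Placeholder "training": square each value on the pool's threads.
        return list(self.pool.map(lambda v: v * v, values))

    def close(self):
        # Only shut down a pool this trainer created itself.
        if self._owns_pool:
            self.pool.shutdown()


shared_pool = ThreadPoolExecutor(max_workers=2)   # lives as long as the process

results = []
for task in ([1, 2], [3, 4]):
    trainer = Trainer(pool=shared_pool)   # new trainer per task, same threads
    results.append(trainer.train(task))
    trainer.close()                       # leaves shared_pool running

print(results)
shared_pool.shutdown()
```

The design choice is that the trainer never shuts down a pool it did not create, so trainers can come and go per task while the worker threads persist.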

It's kind of gross that this sort of thing is needed to work around this problem with CUDA though. You can't set up your code so you don't need to make new dnn_trainer instances?

Well, I have a great many different types, more than a dozen (for different kinds of tasks, i.e. different network architectures). I'm afraid I may not be able to keep an instance of each in memory. Or at least I think it's a much bigger change for me than adding an optional constructor argument to dnn_trainer (as you advise), or simply calling cuDevicePrimaryCtxReset in between tasks (which appears to work just fine).

Yeah, that's understandable. The constructor option seems like the best thing. It's just a bummer that CUDA behaves the way it does.
