Dlib: Calling `.clean()` between forward-passes

Created on 26 Jun 2018 · 35 Comments · Source: davisking/dlib

Version: v19.13
Where did you get dlib: github
Platform: Ubuntu 16.04.4 with CUDA 8.0 and cuDNN7, as well as running via docker (base image: nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04)
Compiler: Built with:

cmake .. -DDLIB_NO_GUI_SUPPORT=ON -DDLIB_USE_CUDA=ON && cmake --build . --config Release

Using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

Watching GPU memory usage via watch -n0.5 nvidia-smi while using dlib, I see that when I deserialize a trained model from disk and use it for inference, GPU memory usage trends upward over time (it fluctuates, but overall it slowly rises). However, if I call net.clean() before each inference, memory usage appears to stay more stable and no longer gradually climbs.

I'm not yet sure if there is a leak somewhere or not, but I'd be interested in knowing if calling .clean() before each inference is the suggested use when executing inferences on the same model iteratively.

Some sample code to show what my usage is like (not exact, but should at least show the gist of it):

std::vector<dlib::matrix<dlib::rgb_pixel>> inputImages = {.......}; // Assume somewhere between 1 and 50 images are stored here
int batchSize = 50; // Just as an example, likely irrelevant for this situation

net_type_faceDetector _faceDetector;
dlib::deserialize(modelfile) >> _faceDetector;

for(int i=0; i < 10000000; i++) {
  _faceDetector.clean(); // This is the line in question
  auto detections = _faceDetector(inputImages, batchSize);
}

I guess the tl;dr is: should calling .clean() between inference calls be necessary (or even a good idea)? The network architectures of the models I'm using are the same as those of mmod_front_and_rear_end_vehicle_detector or mmod_human_face_detector.

Thanks!

cchadowitz-pf

All 35 comments

You shouldn't be calling clean() in a loop. That's just going to slow down processing, since it then deallocates and reallocates GPU resources over and over.

I doubt there is a memory leak, but you never know.

davisking on 26 Jun 2018
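
Editorial sketch: the cost described in the reply above can be illustrated with a toy model. This is not dlib code. ToyNet, forward(), allocations(), and count_allocations() are hypothetical names, and a std::vector stands in for GPU scratch memory. The assumption is only that a network caches its buffers between forward passes and that clean() throws that cache away.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a network that caches scratch memory between forward passes.
// NOT dlib code: every name here is hypothetical and for illustration only.
class ToyNet {
public:
    // Drop the cached buffer, loosely mirroring what clean() is described as doing.
    void clean() {
        std::vector<float>().swap(scratch_);  // swap with an empty vector frees the storage
    }

    // Pretend forward pass that needs `n` floats of scratch space.
    void forward(std::size_t n) {
        assert(n > 0);
        if (scratch_.capacity() < n) {
            scratch_.reserve(n);  // the expensive (re)allocation
            ++allocations_;
        }
        scratch_.assign(n, 0.0f);  // reuses the existing storage
    }

    int allocations() const { return allocations_; }

private:
    std::vector<float> scratch_;
    int allocations_ = 0;
};

// Run `iters` forward passes, optionally calling clean() before each one,
// and return how many buffer allocations were needed in total.
int count_allocations(int iters, bool clean_each_time) {
    ToyNet net;
    for (int i = 0; i < iters; ++i) {
        if (clean_each_time) net.clean();
        net.forward(1024);
    }
    return net.allocations();
}
```

In this toy, count_allocations(5, true) performs one allocation per iteration, while count_allocations(5, false) allocates once and reuses the buffer afterwards. That is the slowdown being warned about, under the stated assumption.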

Thanks, that's what I was suspecting about clean().

To add more detail to my case, I found that this seems to be happening when I deserialize two separate models into memory and perform forward passes through both of them in the same application. I have a standalone app that I'm attempting to tweak so that it exhibits this same behavior, but it seems to have a bug where it occasionally dies and appears defunct:

$ ps aux | grep dnn
cchadow+  5625 44.4  0.0      0     0 pts/1    Zl+  12:16   0:28 [dnn_mmod_face_d] <defunct>

nvidia-smi shows that it is still holding onto GPU memory, as well:

| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      5625    C   ./dnn_mmod_face_detection_ex_test              291MiB |
+-----------------------------------------------------------------------------+

It's essentially just a stripped down version of the dnn_mmod_face_detection_ex.cpp example, with some additions: https://gist.github.com/cchadowitz-pf/196f3f3cc102787adaf7b2ca167393a8

I'll continue to see if I can reproduce the memory situation in a standalone application, but this defunct state is strange. Is there any reason to believe that multiple models cannot be used in the same application with CUDA?

Thanks again!

cchadowitz-pf on 27 Jun 2018

You should be able to have as many models as you want, subject to limits on GPU memory.

davisking on 27 Jun 2018

Yes, and I have no problems with doing that in general. I'm not sure what's going on with the zombie process, but that's less important.

I'm wondering if there may be some issue with releasing GPU memory from the CUDA stream context that would account for the overall memory usage climbing slowly. It's clear that GPU memory usage climbs with the batch size, as it should. However, I've found that after doing a forward pass on one model with (for example) batch size 50, then doing forward passes on a second model with smaller but increasing batch sizes (1, 2, 5, 10, 20, 50, etc.), the overall GPU memory usage of the app as shown in nvidia-smi increases. Eventually, if I continue similar patterns, it runs out of GPU memory:

Error while calling cudaMalloc(&data, n) in file dlib/cuda/cuda_data_ptr.cpp:28. code: 2, reason: out of memory

This happens even when I use clean() in between, so it appears that there's some situation with GPU memory management where releasing memory either isn't happening when necessary, or at all, for some portion of the used memory. Or, there isn't sufficient contiguous memory to reallocate on the GPU so more memory is allocated, which could also explain the slow increase.

Essentially, a single forward pass for a given model and a given batch size of images uses a particular amount of GPU memory the first time, and after some additional forward passes with the model and a second model (in the same application), that identical forward pass with the same model, batch size, and images appears to require more GPU memory than the first time. I can't say whether it's because memory isn't being released when it should be, or because of issues with contiguous memory allocation, or something else.

cchadowitz-pf on 27 Jun 2018
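
Editorial sketch: the "insufficient contiguous memory" hypothesis in the comment above can be illustrated with a toy first-fit allocator. It is purely a model of fragmentation in general. It makes no claim about how the CUDA driver or dlib actually manage memory, and ToyHeap and fragmented_alloc_fails() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Toy first-fit allocator over a fixed-size address range.
// Illustrative only: it does not model the CUDA driver's real allocator.
class ToyHeap {
public:
    explicit ToyHeap(std::size_t capacity) : capacity_(capacity) {}

    // Returns the offset of the new block, or -1 if no contiguous gap is large enough.
    long alloc(std::size_t size) {
        assert(size > 0);
        std::size_t cursor = 0;
        for (const auto& blk : used_) {             // blocks are ordered by offset
            if (blk.first - cursor >= size) break;  // the gap before this block fits
            cursor = blk.first + blk.second;        // otherwise skip past the block
        }
        if (cursor + size > capacity_) return -1;
        used_[cursor] = size;
        return static_cast<long>(cursor);
    }

    // Free the block that starts at `offset` (unknown offsets are ignored).
    void release(long offset) { used_.erase(static_cast<std::size_t>(offset)); }

    // Total free space, regardless of whether it is contiguous.
    std::size_t free_total() const {
        std::size_t used = 0;
        for (const auto& blk : used_) used += blk.second;
        return capacity_ - used;
    }

private:
    std::size_t capacity_;
    std::map<std::size_t, std::size_t> used_;  // offset -> size
};

// Fill a 100-unit heap with ten 10-unit blocks, free every other one,
// then request a single 20-unit block.
bool fragmented_alloc_fails() {
    ToyHeap heap(100);
    long offsets[10];
    for (int i = 0; i < 10; ++i) offsets[i] = heap.alloc(10);
    for (int i = 0; i < 10; i += 2) heap.release(offsets[i]);
    // 50 units are free in total, but the largest contiguous gap is only 10.
    return heap.free_total() == 50 && heap.alloc(20) == -1;
}
```

Here half of the heap is free, yet the 20-unit request fails because no single gap is large enough. Interleaving allocations of different sizes, as with forward passes at varying batch sizes, is the kind of pattern that can produce such gaps in an allocator of this type.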

Huh. You should put some logging on the cudaMalloc and cudaFree calls and see if they pair up. They should, since they are all held by smart pointers. There are only 3 places these routines are called, so it's easy to log.

davisking on 28 Jun 2018
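
Editorial sketch: one way to check whether allocations and frees pair up is to key the log on the pointer rather than only counting calls, so that any allocation still live at the end can be listed. The helper below is hypothetical and is not part of dlib. std::malloc and std::free stand in for cudaMalloc and cudaFree so the example runs without a GPU. In a real build, the logging would sit next to the actual CUDA calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <map>

// Tracks live allocations by pointer so unmatched ones can be reported.
// Hypothetical helper: std::malloc/std::free stand in for cudaMalloc/cudaFree.
class AllocLog {
public:
    void* logged_alloc(std::size_t n) {
        assert(n > 0);
        void* p = std::malloc(n);
        if (p) {
            live_[p] = n;
            ++num_allocs_;
            std::fprintf(stderr, "alloc %p (%zu bytes)\n", p, n);
        }
        return p;
    }

    void logged_free(void* p) {
        if (live_.erase(p) == 1) {  // only free pointers this log handed out
            ++num_frees_;
            std::fprintf(stderr, "free  %p\n", p);
            std::free(p);
        } else {
            std::fprintf(stderr, "free of untracked pointer %p ignored\n", p);
        }
    }

    int num_allocs() const { return num_allocs_; }
    int num_frees() const { return num_frees_; }
    std::size_t live_count() const { return live_.size(); }

    // Bytes that were allocated but never freed so far.
    std::size_t live_bytes() const {
        std::size_t total = 0;
        for (const auto& kv : live_) total += kv.second;
        return total;
    }

private:
    std::map<void*, std::size_t> live_;  // pointer -> size of each live allocation
    int num_allocs_ = 0;
    int num_frees_ = 0;
};
```

If live_count() is nonzero at a point where every buffer should have been released, the remaining entries identify which allocations were never freed. A bare count of log lines cannot show this.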

Thanks, I could only find two places cudaMalloc and cudaFree are called:

https://github.com/davisking/dlib/search?q=cudaMalloc&unscoped_q=cudaMalloc returns these two cudaMalloc:
https://github.com/davisking/dlib/blob/24ac9bc43f62e412bfc132ba2966b7c7f99e5349/dlib/cuda/cuda_data_ptr.cpp#L28
https://github.com/davisking/dlib/blob/24ac9bc43f62e412bfc132ba2966b7c7f99e5349/dlib/cuda/gpu_data.cpp#L195

https://github.com/davisking/dlib/search?q=cudaFree&unscoped_q=cudaFree returns these two cudaFree:
https://github.com/davisking/dlib/blob/24ac9bc43f62e412bfc132ba2966b7c7f99e5349/dlib/cuda/cuda_data_ptr.cpp#L30
https://github.com/davisking/dlib/blob/24ac9bc43f62e412bfc132ba2966b7c7f99e5349/dlib/cuda/gpu_data.cpp#L197

So I added some logging before each cudaMalloc and cudaFree call and re-ran my test that exercises this issue. From my logs, it does not appear that cudaMalloc and cudaFree pair up. I've put my full log in this gist: https://gist.github.com/cchadowitz-pf/488d0f5ed38d4a8309d1cc006477aad7

Even if you discount the logging after the first cudaMalloc error, there are more cudaMalloc calls than cudaFree calls.