Dlib: Memory error on detecting different batch sizes

Created on 18 Oct 2017 · 10 Comments · Source: davisking/dlib

Hello! First of all, thanks for this library. It has been really useful and insightful.

This issue is closely related to an already closed one. When detecting with the CNN face detector, everything runs well as long as the batch size is constant. However, if the batch size shrinks at some point (even by a small amount), two problems arise:

1. It eventually crashes with a RuntimeError: Error while calling cudaMalloc(&data, new_size*sizeof(float)) in file /tmp/pip-build-stoi524l/dlib/dlib/dnn/gpu_data.cpp:191. code: 2, reason: out of memory.
2. If I lower the batch size, GPU memory usage creeps up, very subtly. For example, if I run a constant batch size, the memory used on the GPU is 2727 MB. If every now and then the batch is slightly smaller, this happens:
   - Memory usage drops to 2585 MB with the smaller batch (makes sense, since the batch is smaller).
   - When I process the original batch size again, memory usage goes up to 2780 MB, not 2727 MB. This adds up until, given enough time, it causes the out-of-memory error mentioned in point 1. Therefore, using a smaller batch size doesn't solve the problem; it only delays it.
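
To put rough numbers on that drift, here is a back-of-the-envelope sketch. The two readings come from the report above; the 8192 MB capacity (for an 8 GB card) and the idea that every shrink/regrow cycle costs the same amount are assumptions for illustration, not measurements.

```python
# Readings quoted in the report (MB of GPU memory in use).
baseline_mb = 2727       # steady state with a constant batch size
after_regrow_mb = 2780   # after one smaller batch, then the original size again

# Assumptions (not from the report): an 8 GB card exposes 8192 MB,
# and each shrink/regrow cycle adds the same amount of memory.
capacity_mb = 8192

growth_per_cycle_mb = after_regrow_mb - baseline_mb
cycles_until_full = (capacity_mb - baseline_mb) // growth_per_cycle_mb

print(growth_per_cycle_mb, cycles_until_full)
```

Under those assumptions the growth is 53 MB per cycle, so the card would fill up after roughly a hundred cycles. With a batch close to the GPU's limit, the failure can of course arrive much sooner.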

This is a small piece of code that replicates the problem. You might need to adjust the parameter max_batch_size to fit the memory available on your GPU:

import dlib
import numpy as np

face_detector_path = 'mmod_human_face_detector.dat'
face_detector = dlib.cnn_face_detection_model_v1(face_detector_path)

def create_img_batch(batch_size):
    # dlib expects 8-bit images, so build blank uint8 RGB frames.
    return [np.zeros((200, 200, 3), dtype=np.uint8) for _ in range(batch_size)]

# Test detector with a constant batch size
max_batch_size = 256
img_batch = create_img_batch(max_batch_size)
n_iters = 5
print("Constant batch size:")
for i in range(n_iters):
    print("\t%d/%d" % (i, n_iters))
    face_detector(img_batch, 1)

# Variable batch size
print("(Almost) constant batch size")
for i in range(n_iters):
    batch_size = max_batch_size
    if i % 2:
        batch_size = max_batch_size - 1
    print("\t%d/%d (size: %d)" % (i, n_iters, batch_size))
    face_detector(create_img_batch(batch_size), 1)

I'm currently using Dlib 19.7.0 (Python 3.5) on Ubuntu 16. My GPU is an Nvidia GTX 1080 with 8 GB. The CUDA version is 8.0, with cuDNN v6.
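
Since the report says constant batch sizes run fine, one possible stopgap is to pad every batch up to a fixed size and discard the extra results. This is only a sketch: `pad_batch` and `detect_fixed_size` are hypothetical helper names, not dlib functions, and the filler image costs wasted compute.

```python
def pad_batch(items, target_size, filler):
    """Return (padded_list, original_count), padding with `filler` up to target_size."""
    n = len(items)
    if n > target_size:
        raise ValueError("batch is larger than target_size")
    return list(items) + [filler] * (target_size - n), n


def detect_fixed_size(detector, images, target_size, filler, upsample=1):
    """Call `detector` on a batch that always has target_size entries.

    Only the results for the real (unpadded) images are returned.
    """
    padded, n = pad_batch(images, target_size, filler)
    results = detector(padded, upsample)
    return [results[i] for i in range(n)]
```

With the script above, this could be called as `detect_fixed_size(face_detector, create_img_batch(255), 256, create_img_batch(1)[0])`, so the detector always sees 256 images.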

As mentioned in the closed issue, one should try to use set_dnn_prefer_smallest_algorithms(). However, the 19.5 Dlib update seems to already address this:

The way cuDNN work buffers are managed has been improved, leading to less GPU RAM usage. Therefore, users should not need to call set_dnn_prefer_smallest_algorithms() anymore.

I'm assuming, then, that this doesn't solve this problem. I'd really appreciate it if somebody could shed some light on this :)

Thanks guys!

GuimExc

Most helpful comment

Oh yeah, duh, this is happening because the tensor allocates a new memory block before it frees the old one. I just changed it to the other way around. If you grab the latest dlib from GitHub this should go away.

davisking on 19 Oct 2017

🎉2
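
The effect described in that comment can be illustrated with a toy model. The sketch below is not dlib's code: `ToyPool` is a hypothetical byte counter standing in for GPU memory, and it only shows why allocating the new block before freeing the old one needs room for both at once.

```python
class ToyPool:
    """Toy fixed-capacity pool; it only tracks sizes, not real GPU memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.peak = 0

    def alloc(self, size):
        if self.used + size > self.capacity:
            raise MemoryError("out of memory")
        self.used += size
        self.peak = max(self.peak, self.used)

    def free(self, size):
        self.used -= size


def resize_alloc_first(pool, old, new):
    pool.alloc(new)  # old and new blocks are alive at the same time
    pool.free(old)


def resize_free_first(pool, old, new):
    pool.free(old)   # release the old block before asking for the new one
    pool.alloc(new)


def try_resize(resize, capacity, old, new):
    """Return the peak usage of a resize, or None if it runs out of memory."""
    pool = ToyPool(capacity)
    pool.alloc(old)
    try:
        resize(pool, old, new)
    except MemoryError:
        return None
    return pool.peak


print(try_resize(resize_alloc_first, 100, 60, 59))
print(try_resize(resize_free_first, 100, 60, 59))
```

In this model, shrinking a 60-unit block to 59 units in a 100-unit pool fails when the new block is allocated first, and succeeds with a peak of 60 when the old block is freed first. That would match a failure appearing the first time the batch size changes on a nearly full GPU.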

All 10 comments

Thanks for the detail, this is an excellent bug report :)

I just ran this python script with the latest dlib for a while and it held constant at 6624MiB. This is with a TITAN X, CUDA 8, cuDNN 6, and Ubuntu 16.04. I've also run this face detector a lot on variable sized batches and haven't seen any odd memory leaks.

Are you sure this python script exercises the problem?

davisking on 19 Oct 2017

👍1

Thanks for the quick reply :)

I have only tested this on a single Linux machine. After further testing, I've seen that, on my machine, it starts failing when max_batch_size >= 122. The first loop, with a constant batch size, always runs fine. It fails in the second loop, specifically in its 2nd iteration, which is the very first time the batch size changes.

I'll try to get my hands on another Linux machine and see if I can replicate the problem there. It seems that this might be a hardware-dependent issue.

GuimExc on 19 Oct 2017

👍1

I have executed the script on a Linux laptop (Ubuntu 16.04 LTS, CUDA 8, cuDNN 6, Nvidia GTX 765M 2GB) and I can replicate the problem there as well.

In order to make it fail, you need to find a max_batch_size value small enough that the first loop runs without problems (i.e. the batch fits in the GPU), but big enough that it fails in the second loop. In other words, I'd recommend finding the maximum batch size your GPU can handle and seeing whether it fails in the 2nd loop. Could you try that and see if you are able to replicate it? :)

GuimExc on 19 Oct 2017
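
The search for the largest batch size that still fits can be sketched as a binary search. `largest_passing` and the `fits` predicate are hypothetical names, and the sketch assumes that if a batch size fails, every larger one fails too.

```python
def largest_passing(fits, lo, hi):
    """Return the largest n in [lo, hi] for which fits(n) is True, or None.

    Assumes fits is monotone: once a size fails, all larger sizes fail.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    return best


# Hypothetical predicate built around the script's detector:
# def fits(n):
#     try:
#         face_detector(create_img_batch(n), 1)
#         return True
#     except RuntimeError:  # the out-of-memory error surfaces as RuntimeError
#         return False
```

Whether the GPU is left in a clean state after an out-of-memory error is not something this thread establishes, so running each probe in a fresh process may be the safer choice.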

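Oh yeah, duh, this is happening because the tensor allocates a new memory block before it frees the old one. I just changed it to the other way around. If you grab the latest dlib from GitHub this should go away.

davisking on 19 Oct 2017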

🎉2

That completely solved the problem. Thanks man, you are amazing. I owe you a lot of beers.

GuimExc on 19 Oct 2017

👍1

Ha, no problem. Thanks for pointing out this flaw in dlib.

davisking on 19 Oct 2017

Oops, this change broke everything CUDA related. I just pushed the fix.

davisking on 20 Oct 2017

👍1

Greetings. Sorry, I can't tell whether this issue is resolved or not. I'm getting similar errors (with code cloned just now), like:

encoding = face_recognition.face_encodings(image)[index]
Exception Type: IndexError at /v1/face/
Exception Value: list index out of range

or (I'm not sure the first one is a dlib problem; it's included just in case)

return cnn_face_detector(img, number_of_times_to_upsample)
Exception Type: RuntimeError at /v1/face/
Exception Value: Error while calling cudaMalloc(&data, n) in file /building/dlib/dlib/dnn/cuda_data_ptr.cpp:28. code: 2, reason: out of memory

Should I open a new issue (and add more details)? Thanks.
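
On the first error: `face_recognition.face_encodings` returns a list with one encoding per detected face, so one plausible (unconfirmed) cause of the IndexError is that fewer faces were found than `index` expects. A minimal guard, with `encoding_at` as a hypothetical helper name, might look like this:

```python
def encoding_at(encodings, index):
    """Return encodings[index], or None if the list has no such entry."""
    if 0 <= index < len(encodings):
        return encodings[index]
    return None
```

For example, `encoding_at(face_recognition.face_encodings(image), index)` would return None instead of raising when no face is detected. This does nothing for the cudaMalloc error, which is a separate problem.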