Hello! First of all, thanks for this library. It has been really useful and insightful.
This issue is closely related to this already closed one. When detecting with the CNN face detector, everything runs well if the batch size is constant. However, if the batch size becomes smaller at some point (even by a small amount), two problems arise:
RuntimeError: Error while calling cudaMalloc(&data, new_size*sizeof(float)) in file /tmp/pip-build-stoi524l/dlib/dlib/dnn/gpu_data.cpp:191. code: 2, reason: out of memory.
This is a small piece of code that replicates the problem. You might need to adjust the parameter max_batch_size to fit the memory available on your GPU:
import dlib
import numpy as np

face_detector_path = 'mmod_human_face_detector.dat'
face_detector = dlib.cnn_face_detection_model_v1(face_detector_path)

def create_img_batch(batch_size):
    return [np.zeros((200, 200, 3)) for i in range(batch_size)]

# Test detector with a constant batch size
max_batch_size = 256
img_batch = create_img_batch(max_batch_size)
n_iters = 5
print("Constant batch size:")
for i in range(n_iters):
    print("\t%d/%d" % (i, n_iters))
    face_detector(img_batch, 1)

# Variable batch size
print("(Almost) constant batch size")
for i in range(n_iters):
    batch_size = max_batch_size
    if i % 2:
        batch_size = max_batch_size - 1
    print("\t%d/%d (size: %d)" % (i, n_iters, batch_size))
    face_detector(create_img_batch(batch_size), 1)
I'm currently using Dlib 19.7.0 (Python 3.5) on Ubuntu 16. My GPU is an Nvidia GTX 1080 with 8 GB. The CUDA version is 8.0, with cuDNN v6.
As mentioned in the closed issue, one should try to use set_dnn_prefer_smallest_algorithms(). However, the 19.5 Dlib update seems to already address this:
The way cuDNN work buffers are managed has been improved, leading to less GPU RAM usage. Therefore, users should not need to call set_dnn_prefer_smallest_algorithms() anymore.
I'm assuming, then, that this doesn't solve this problem. I'd really appreciate it if somebody could shed some light on this :)
Thanks guys!
Thanks for the detail, this is an excellent bug report :)
I just ran this python script with the latest dlib for a while and it held constant at 6624MiB. This is with a TITAN X, CUDA 8, cuDNN 6, and Ubuntu 16.04. I've also run this face detector a lot on variable sized batches and haven't seen any odd memory leaks.
Are you sure this python script exercises the problem?
Thanks for the quick reply :)
I have only tested this on a single Linux machine. I have tested it further and I've seen that, on my machine, it starts failing when max_batch_size >= 122. The first loop with a constant batch size always runs fine. It fails in the second loop, specifically in the 2nd iteration, which is the very first time that the batch size changes.
I'll try to get my hands on another Linux machine and see if I can replicate the problem there. It seems that this might be a hardware-dependent issue.
I have executed the script on a Linux laptop (Ubuntu 16.04 LTS, CUDA 8, cuDNN 6, Nvidia GTX 765M 2GB) and I can replicate the problem there as well.
In order to make it fail, you need to find a max_batch_size value small enough so it runs the first loop without problems (i.e. batch fits in GPU), but big enough that it fails on the second loop. In other words, I'd recommend finding the maximum batch size value available for your GPU, and see if it fails on the 2nd loop. Could you try that and see if you are able to replicate it? :)
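The search for the largest batch size that still fits can be automated with a binary search. This is only a sketch: `largest_passing_batch_size` and `fake_probe` are names made up for illustration, and it assumes that if a batch of size n fits, every smaller batch fits too. With dlib, the probe would be a function that runs the detector on a batch of the given size and returns False when a RuntimeError is raised.

```python
def largest_passing_batch_size(runs_ok, upper_bound):
    """Return the largest n in [1, upper_bound] with runs_ok(n) True, or 0 if none.

    Assumes runs_ok is monotonic: once a size fails, every larger size fails too.
    """
    lo, hi = 0, upper_bound  # lo is known to pass (or is 0); sizes above hi fail
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if runs_ok(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def fake_probe(batch_size):
    """Stand-in for a real GPU run; the limit of 200 is an arbitrary made-up value."""
    return batch_size <= 200


print(largest_passing_batch_size(fake_probe, 256))
```

A real probe may need to run in a fresh process each time, since it is not clear from this thread what state the detector is left in after a failed allocation.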
Oh yeah, duah, this is happening because the tensor allocates a new memory
block before it frees the old one. I just changed it to the other way
around. If you grab the latest dlib from github this should go away.
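To see why the order matters, here is a toy model of the two resize strategies. It is not dlib's actual code: `ToyGpu` only counts bytes, and all names in it are hypothetical. It shows that allocating the new block first needs room for both blocks at once, while freeing first only needs room for the larger of the two.

```python
class ToyGpu:
    """Toy memory pool that only tracks byte counts; not a real CUDA allocator."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.peak = 0

    def malloc(self, size):
        if self.used + size > self.capacity:
            raise MemoryError("out of memory")
        self.used += size
        self.peak = max(self.peak, self.used)
        return size  # the "handle" is just the block size

    def free(self, handle):
        self.used -= handle


def resize_alloc_first(gpu, old, new_size):
    """Allocate the new block while the old one is still alive, then free the old one."""
    new = gpu.malloc(new_size)
    gpu.free(old)
    return new


def resize_free_first(gpu, old, new_size):
    """Free the old block first, so only one block is alive at any time."""
    gpu.free(old)
    return gpu.malloc(new_size)


gpu = ToyGpu(capacity=100)
block = gpu.malloc(60)
try:
    resize_alloc_first(gpu, block, 59)  # needs 60 + 59 bytes at once
except MemoryError as err:
    print("alloc-first failed:", err)

block = resize_free_first(gpu, block, 59)  # never needs more than 60 bytes
print("free-first peak usage:", gpu.peak)
```

Freeing first discards the old contents, so it only works when the data does not need to survive the resize.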
That completely solves the problem. Thanks man, you are amazing. I owe you a lot of beers.
Ha, no problem. Thanks for pointing out this flaw in dlib.
Oops, this change broke everything CUDA related. I just pushed the fix.
Greetings. Sorry, I can't tell whether this issue is resolved or not. I'm getting similar errors (with code cloned just now), for example:
encoding = face_recognition.face_encodings(image)[index]
Exception Type: IndexError at /v1/face/
Exception Value: list index out of range
or (I'm not sure the first one is a dlib problem, I'm including it just in case)
return cnn_face_detector(img, number_of_times_to_upsample)
Exception Type: RuntimeError at /v1/face/
Exception Value: Error while calling cudaMalloc(&data, n) in file /building/dlib/dlib/dnn/cuda_data_ptr.cpp:28. code: 2, reason: out of memory
Should I open a new issue (and add more details)? Thanks.
The issue is solved and closed. The script I sent in the first post no longer fails.
You should open another issue for that, but first make sure that you are not trying to allocate more memory than you have available.
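If a batch turns out to be too large for the GPU, one option is to split it into smaller chunks and run them one after another. The sketch below assumes the detector is a callable that takes a list of images plus an upsample count and returns one result per image, as in the script from the first post; `iter_chunks`, `detect_in_chunks` and `fake_detector` are hypothetical helper names, not part of dlib.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of items, each with at most chunk_size elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def detect_in_chunks(detector, images, chunk_size, upsample=1):
    """Run detector on images chunk by chunk and collect the per-image results."""
    results = []
    for chunk in iter_chunks(images, chunk_size):
        # Assumption: the detector returns one result per image in the chunk.
        results.extend(detector(chunk, upsample))
    return results


def fake_detector(batch, upsample):
    """Stand-in detector that reports no detections for every image."""
    return [[] for _ in batch]


print(len(detect_in_chunks(fake_detector, list(range(10)), chunk_size=4)))
```

The last chunk can be smaller than the others, so this produces a variable batch size, which is the situation this issue was about.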
Hmm, good point. I need to talk to the developer. Thanks, @GuimExc.