Keras: Generator memory leak bugs?

Created on 3 Dec 2017 · 47 comments · Source: keras-team/keras

I used 4 GPUs to train DenseNet(layer=190, k = 40).

  • tensorflow-gpu (1.4.0)
  • Keras (2.1.2)

It seems that during training via model.fit_generator(),
the system memory increases gradually until the program crashes.

The training runs for 250 epochs.
At the first epoch, memory used is [ 4.62/15.6 GB ];
after about 50 epochs it reaches [ 15.5/15.6 GB ],
and then the program crashes.

Most helpful comment

Something like:

img_paths = ...  # Paths to all the images
datagen = ImageDataGenerator(...)

def your_gen(paths):
    while True:
        # get_batches() is your own helper that yields batches of
        # loaded images and labels for the given paths.
        for X_batch, y_batch in get_batches(paths):
            X_batch = [datagen.random_transform(x) for x in X_batch]
            # additional preprocessing
            yield X_batch, y_batch

model.fit_generator(your_gen(img_paths), ...)

All 47 comments

https://github.com/fchollet/keras/pull/8666 should have fixed it. Please pull the latest version and retest.

I am having the same issue. I pulled your latest code with the "closing" statement in it and I am still having the issue.

@apedevmicrosoft I cloned the latest code and set it up; still having the issue.

@tRosenflanz I have a feeling that there is still something going on with the threads. If I play with the workers and max_queue_size parameters I can get into stable and unstable states. It feels like something is not able to consume the data from the queue fast enough and the memory fills up. I looked at the OrderedEnqueuer and I don't see anything obvious. I have scenarios where, the moment I increase workers by 1, the memory starts filling up, but if I keep it below a certain threshold it stays consistent. GPU utilization ranges from 70-85% depending on the number of workers. Any ideas?

@apedevmicrosoft I am just a user myself; there was an identical bug that got fixed by that update. @Dref360 I think this is up your alley.

Do any of you have a reproducible example? I'll be able to look more into it around X-mas. (Finals and stuff)

Some questions to guide me:

  1. OS and python version
  2. multiprocessing=True or False
  3. Value of workers and queue_size.
  4. Spec of your PC (RAM, Video card)
  1. OS and python version
    Win 10, Conda 4.3.27, Python 3.5.4, Tensorflow-GPU 1.3.0
  2. multiprocessing=True or False
    False
  3. Value of workers and queue_size.
    leaks: workers=6, max_queue_size=512
  4. Spec of your PC (RAM, Video card)
    RAM: 64 GB; GPU: Titan Xp 12 GB

I will create a sample in a bit.

This may be helpful: I haven't had any problems since you added the close statement, so it might indeed be a specs issue, although I do use multiprocessing.

  1. OS and python version
    Ubuntu 16.04, Python 2.7
  2. multiprocessing=True or False
    True
  3. Value of workers and queue_size.
    No leaks with queue_size within [5, 100] and workers within [5, 50]
  4. Spec of your PC (RAM, Video card)
    RAM: 1 TB; GPU: 4x P100 16 GB

If it's a Windows-only issue, I won't be able to work on it. Since there is no fork on Windows, Python does a lot of magic to make it work. Maybe that's the issue?

I don't think it's a Windows issue. @tRosenflanz is using multiprocessing=True, which follows a different code path.

Are you using a generator or a Sequence?
Also, using more than one worker with multiprocessing=False has no benefit because of the GIL.

I am using ImageDataGenerator.flow_from_directory which takes the code down the Sequence path internally (you would imagine it would go down the generator path, but it does not when I debug it).

As for the second part, I do see a significant gain when using multiprocessing=False and adding multiple workers as it does seem to create more threads to pre-process the data.

Sorry, Sequences work with multiprocessing=False, but generators don't.
I'm currently unable to reproduce your issue; maybe you could share a gist?

I'll work on getting you a sample. How large is your sample dataset? It's very prominent when using train and validation generators on a relatively large dataset, such as ImageNet.

100k 300x300 generated images
Also, what's your batch_size?

  1. NumTrainingImages: 1,281,167
  2. NumValImages: 500,000
  3. InputShape: (240,240)
  4. BatchSize: 64
  5. steps_per_epoch: 1,281,167/64
  6. validation_steps: 500,000/64

Seems like people are seeing the same issue in this thread: https://github.com/keras-team/keras/issues/5835#issuecomment-353629656

With your values, your queue can go up to about 40 GB of RAM:
2 queues (train/val) × 512 queue_size × (64 × 240 × 240 × 3) values × 4 bytes per float32.
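
A quick back-of-the-envelope check of that estimate (a sketch using only the batch shape, queue size and float32 size quoted above):

num_queues = 2            # train + validation enqueuers
max_queue_size = 512      # batches held per queue
batch_shape = (64, 240, 240, 3)
bytes_per_value = 4       # float32

batch_bytes = bytes_per_value
for dim in batch_shape:
    batch_bytes *= dim

total_bytes = num_queues * max_queue_size * batch_bytes
print("~%.1f GiB" % (total_bytes / 1024.0 ** 3))  # roughly 42 GiB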

Are you seeing any zombie threads again? Or it's just that the queue is filling up?

Thank you @apedevmicrosoft for pointing me here from #5835. I am having a similar issue. These are the system details:

  • OS: Ubuntu 16.04.2 LTS
  • RAM: 128 GB
  • GPU: NVIDIA GeForce GTX 1080 Ti 11GB
  • Python 2.7.12; Keras 2.1.2; Tensorflow 1.2.0

The data set is ImageNet ILSVRC 2012, with the following characteristics:

  • Number of training images: 1,281,167
  • Number of validation images: 50,000
  • Image shape: [150, 200, 3]
  • Encoding: Float 32
  • Batch size: 64

What I have observed so far: I try to train with fit_generator. Before training starts, the RAM gets filled slowly up to the 128 GB available, then the swap, and at some point (when swap usage is about 100 GB) I get a MemoryError.

This happens with the default values of max_queue_size=10, workers=1, use_multiprocessing=False. But I have also tried max_queue_size=1, workers=10, use_multiprocessing=True and other combinations, with identical results. I have also found the batch size irrelevant (I have tried 16 instead of 64).

Finally, the only way I have managed to train on ImageNet is by reducing the number of training images to 10% (128,116). The RAM gets slowly filled similarly, then the swap gets filled up to ~50 GB. At this point training starts, the swap empties, and RAM usage drops to a constant ~75 GB. With 50% (640,583) I get the memory error as well.

I am not sure about all the details of how fit_generator works, but my intuition is that it is trying to allocate in RAM at least as many images as are available for training, which does not make sense. If I am not wrong, this worked with a previous version of Keras, but I do not remember which one and I haven't tried again.

I am happy to help find out what's going on if you have some time too.

Are you too using ImageDataGenerator? If so, with flow or flow_from_directory?

Yes, I am using ImageDataGenerator with flow.

Could you give me a quick snippet of your generator?
flow takes 2 numpy arrays, so do you call flow on each batch?

# Create batch generators
batch_gen_tr = ImageDataGenerator(width_shift_range=0.1,
                                  height_shift_range=0.1,
                                  horizontal_flip=True)
batch_gen_val = ImageDataGenerator()
# Train model
model.fit_generator(generator=batch_gen_tr.flow(images_tr, labels_tr),
                    steps_per_epoch=20000,
                    epochs=25,
                    validation_data=batch_gen_val.flow(images_val, labels_val),
                    validation_steps=780)

images_tr, images_val, labels_tr and labels_val are dask arrays in my case.

So the shape of images_tr is ~ [1281167, 150, 200, 3]?

Yes, exactly.

Then your memory is already filled up with the images; it's not a bug. You load 1 million images into memory.
You need to use a generator to create your samples.

Sorry, I forgot the batch_size argument in the flow function. It should be:

model.fit_generator(generator=batch_gen_tr.flow(images_tr, labels_tr, batch_size=64),
                    steps_per_epoch=20000,
                    epochs=25)

So, if I am not wrong, fit_generator should be able to retrieve the data batch by batch, without filling up the memory.

images_tr can be [1281167, 150, 200, 3] because it is a dask array, so it is loaded on hard disk, not RAM.

ImageDataGenerator casts images_tr to a numpy array (here).
Feel free to submit a PR if you want to avoid this behavior, or use a true generator.
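
For context, this is roughly why the RAM fills up before training even starts (a sketch assuming dask inputs with the shapes reported above; the array here is a placeholder, not the actual ImageNet data):

import numpy as np
import dask.array as da

# Placeholder dask array with the training-set shape reported above.
images_tr = da.zeros((1281167, 150, 200, 3), dtype=np.float32,
                     chunks=(1024, 150, 200, 3))

# As noted above, ImageDataGenerator.flow() converts its input to a dense
# numpy array, which forces the whole dask array to be computed and held
# in RAM: 1281167 * 150 * 200 * 3 * 4 bytes ~= 460 GB.
# x = np.asarray(images_tr)   # <- this is where the allocation happens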

Oh I see, thanks! I think it would be better if Keras was able to handle large dask arrays. The API would be more transparent. Then, is it not possible to use fit_generator with large data sets? What do you mean by using a "true generator", @Dref360?

Something like:

img_paths = ...  # Paths to all the images
datagen = ImageDataGenerator(...)

def your_gen(paths):
    while True:
        # get_batches() is your own helper that yields batches of
        # loaded images and labels for the given paths.
        for X_batch, y_batch in get_batches(paths):
            X_batch = [datagen.random_transform(x) for x in X_batch]
            # additional preprocessing
            yield X_batch, y_batch

model.fit_generator(your_gen(img_paths), ...)

@Dref360

Are you seeing any zombie threads again? Or it's just that the queue is filling up?

I am actually not exactly sure where the problem is. Could you suggest a way to debug this?

Hi, I was looking for this exact issue and I found this one: #8677. Are you guys using a callback? It seems threads are started again both for validation and after callbacks. And I do have zombies.

These are the callbacks that I am using:

from datetime import datetime
import time

import numpy as np
from keras import backend as K
from keras.callbacks import (ReduceLROnPlateau, EarlyStopping,
                             ModelCheckpoint, TensorBoard)

string_start_time = datetime.fromtimestamp(time.time()).strftime('%Y-%m-%dT%H-%M-%SZ')

lr_reducer      = ReduceLROnPlateau(monitor='val_loss', factor=np.sqrt(0.1),
                                    cooldown=0, patience=20, min_lr=0.5e-6)
# LearningRateLogger is a custom callback defined elsewhere in my code;
# lr_file and model_file are paths defined earlier in the script.
lr_logger       = LearningRateLogger(lr_file)

early_stopper   = EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=50)
model_checkpoint = ModelCheckpoint(model_file, monitor="val_acc", save_best_only=True,
                                   save_weights_only=False)

callbacks = [lr_reducer, lr_logger, early_stopper, model_checkpoint]

if K.backend() == 'tensorflow':
    tensorboard = TensorBoard(log_dir='logs/{}'.format(string_start_time))  # , histogram_freq=5, batch_size=batch_size
    callbacks.append(tensorboard)

I tried to run the same model I had (Xception with transfer learning) but without callbacks, as suggested in #8677, and I still get a crash. I read in #8946 that it's been corrected on master (#8666), so I'll give it a new try.

Any updates on this ?

Yes, I just updated Keras and it works well.

Working with Keras 2.1.5 on Google Colab, it still crashes with a batch size of 100 article images and 3 layers of LSTM. When I downgrade to 2 layers, it starts working again. model.fit_generator seems to explode in memory.

If it works with 2 layers and not 3, it's your model that is too big. Not fit_generator related.

This issue might be caused by TensorFlow rather than Keras.
I was working on Keras 2.2.4 with a TensorFlow 1.14.0 (CPU) backend and had the same issue. Then I downgraded TensorFlow to 1.13.1 and found that no memory is leaking anymore. I haven't changed my Python script or Keras version.

I have a similar problem to the one discussed in this issue. Training the network increases the memory batch after batch, until the program gets killed because of memory exceeding the limits.

I'm using Keras 2.2.4 and TensorFlow 1.12.0. I currently have 128 GB of RAM and my dataset occupies 20 GB (compressed). I load the compressed dataset into RAM and then use a Sequence generator to uncompress batches of data to feed the network. Each of my samples is 100x100x100 pixels stored as float32, and I'm using a batch_size of 64, a queue_size of 5 and 27 workers with multiprocessing=True. In theory, I should have a total of 100 × 100 × 100 × 4 × 64 × 5 × 27 bytes ≈ 35 GB. However, when I run my script, it gets killed by the queuing system because of excessive memory usage:

slurmstepd: error: Job XXXX exceeded memory limit (1192359452 > 131072000), being killed

I've even tried a max_queue_size as small as 2, and the process still exceeds the maximum memory. To make things even harder to understand, sometimes, completely by chance, the process executes properly (even with a max_queue_size of 30!).

The only way to get the script to work is to set workers=2, although it becomes incredibly slow (my generator is computationally heavy; I need more workers to keep the GPUs from starving). If I check ps aux, I can see 4 different processes:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user  6524 96.3 16.3 54067464 21438188 ?   R    18:26  14:45 python3 main.py --experiment=0
user  6525  0.4 16.2 54004960 21395432 ?   S    18:26   0:04 python3 main.py --experiment=0
user  6528 96.8 16.3 54067464 21444344 ?   R    18:26  14:49 python3 main.py --experiment=0
user  6530  0.3 16.2 54007012 21395440 ?   S    18:26   0:03 python3 main.py --experiment=0

The memory allocated to each process is 21 GB, which explains why I run out of memory with more than 2 workers. Why is this happening? Isn't the memory meant to be shared between workers?

I'm really going crazy with this issue...
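
One way to check whether those 21 GB per worker are truly duplicated or mostly copy-on-write pages shared with the parent (a sketch; psutil is assumed to be installed, and you could call this e.g. from inside the Sequence's __getitem__ so it runs in the worker processes):

import os
import psutil

proc = psutil.Process(os.getpid())
mem = proc.memory_full_info()
# uss = memory unique to this process; shared = pages shared with other processes.
print("pid=%d rss=%.1f GB uss=%.1f GB shared=%.1f GB"
      % (os.getpid(), mem.rss / 1e9, mem.uss / 1e9, mem.shared / 1e9))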

Hi @Dref360,
I have the same problem. I am using TensorFlow 1.14 and Keras 2.2.4 on Ubuntu, with a GTX 1080 GPU, CUDA 9 and cuDNN 7.
I am using Keras fit_generator to train the model. The generator is built with Tensorpack (dataflow) and has been tested correctly for 2000 iterations.
batch_size = 4, image_size = 512x512x3, dataset size = 2000 images.
fit_generator with default parameters.
I see that the RAM keeps increasing while the model is training, from 50% to 75% in only 70 epochs!

Hi all, I recently had the same issue when using a generator in Keras. I'm not sure what's going on inside the fit_generator function, but when I used a Sequence instead it didn't crash anymore. If anyone has a clue about the fit_generator behaviour, I am more than interested. Thanks
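
For reference, a minimal sketch of the Sequence-based approach mentioned above (the load_image() helper and the paths/labels variables are hypothetical placeholders):

import numpy as np
from keras.utils import Sequence

class BatchSequence(Sequence):
    """Loads one batch at a time instead of holding the full dataset in RAM."""

    def __init__(self, image_paths, labels, batch_size):
        self.image_paths = image_paths
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.image_paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        # load_image() is a hypothetical helper that reads and preprocesses one file.
        X = np.array([load_image(p) for p in self.image_paths[start:end]])
        y = np.array(self.labels[start:end])
        return X, y

model.fit_generator(BatchSequence(paths, labels, batch_size=64),
                    epochs=10, workers=4, use_multiprocessing=True)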

I am facing the same issue with numpy arrays as input. The memory just increases with every batch and is not cleared. Is there a solution to the problem by now? Or do we all misunderstand the concept of fit_generator?
From what I understood, fit_generator is supposed to load a new sample for every batch. But after this data is processed by the model (forward and backward propagation), it should be released from memory. Is this correct?

Just wanted to mention that I'm facing the exact same problem while using fit_generator. The RAM usage keeps increasing and finally I get a ResourceExhaustedError at some point. It looks like memory from the previous batch is not released, or is released very slowly, leading to this. I have tried creating very small batches of only 8 images, so I get a numpy array that's not too large. Still, after some number of epochs memory is released very slowly and finally it just crashes. I'm running it on a Kaggle kernel with 16 GB of RAM.

I am using TF 1.15 to train an encoder-decoder-like model, and I run into the same problem when I call the .predict method of the encoder every epoch (GPU memory explodes after around 100 epochs). Simply changing .predict to .predict_on_batch with nothing else altered seems to fix my problem, and there seems to be no damage to the performance.
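
To illustrate the change described above (a minimal sketch; encoder and x_batch are placeholder names):

# Called once per epoch, e.g. from a callback:
# z = encoder.predict(x_batch)          # repeated calls appeared to grow GPU memory
z = encoder.predict_on_batch(x_batch)   # swapping to this avoided the growth here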

I found this problem too with Ubuntu 18.04 and Anaconda, but when I used the Python 3 that ships with Ubuntu, the bug went away.
