Keras: Generator memory leak bugs?

Created on 3 Dec 2017 · 47 comments · Source: keras-team/keras

I used 4 GPUs to train DenseNet(layer=190, k = 40).

  • tensorflow-gpu (1.4.0)
  • Keras (2.1.2)

It seems that during training via model.fit_generator(),
the system memory increases gradually until the program crashes.

The training runs for 250 epochs.
At the first epoch, memory used is [ 4.62/15.6 GB ];
after about 50 epochs it reaches [ 15.5/15.6 GB ],
and then the program crashes.

Most helpful comment

Something like:

img_paths = ...  # Paths to all the images
datagen = ImageDataGenerator(...)

def your_gen(paths):
    while True:
        # get_batches() is your own helper that yields batches of
        # loaded images and labels for the given paths.
        for X_batch, y_batch in get_batches(paths):
            X_batch = [datagen.random_transform(x) for x in X_batch]
            # additional preprocessing
            yield X_batch, y_batch

model.fit_generator(your_gen(img_paths), ...)

All 47 comments

https://github.com/fchollet/keras/pull/8666 should have fixed it. Please pull the latest version and retest.

I am having the same issue. I pulled your latest code with the "closing" statement in it and I am still having the issue.

@apedevmicrosoft I cloned the latest code and set it up; still having the issue.

@tRosenflanz I have a feeling that there is still something going on with the threads. If I play with the workers and max_queue_size parameters I can get into stable and unstable states. It feels like something is not able to consume the data from the queue fast enough and the memory fills up. I looked at the OrderedEnqueuer and I don't see anything obvious. I have scenarios where, the moment I increase workers by 1, the memory starts filling up, but if I keep it below a certain threshold it stays consistent. GPU utilization ranges from 70-85% depending on the number of workers. Any ideas?

@apedevmicrosoft I am just a user myself; there was an identical bug that got fixed by that update. @Dref360 I think this is up your alley.

Do any of you have a reproducible example? I'll be able to look more into it around X-mas. (Finals and stuff)

Some questions to guide me:

  1. OS and python version
  2. multiprocessing=True or False
  3. Value of workers and queue_size.
  4. Spec of your PC (RAM, Video card)
  1. OS and python version
    Win 10, Conda 4.3.27, Python 3.5.4, Tensorflow-GPU 1.3.0
  2. multiprocessing=True or False
    False
  3. Value of workers and queue_size.
    leaks: workers=6, max_queue_size=512
  4. Spec of your PC (RAM, Video card)
    RAM: 64 GB; GPU: Titan Xp 12 GB

I will create a sample in a bit.

This may be helpful: I haven't had any problems since you added the close statement, so it might indeed be a specs issue, although I do use multiprocessing.

  1. OS and python version
    Ubuntu 16.04, Python 2.7
  2. multiprocessing=True or False
    True
  3. Value of workers and queue_size.
    No leaks with queue_size within [5, 100] and workers within [5, 50]
  4. Spec of your PC (RAM, Video card)
    RAM: 1 TB; GPU: 4x P100 16 GB

If it's a Windows-only issue, I won't be able to work on it. Since there is no fork on Windows, Python does a lot of magic to make it work. Maybe that's the issue?

I don't think it's a Windows issue. @tRosenflanz is using multiprocessing=True, which follows a different code path.

Are you using a generator or a Sequence?
Also, using more than one worker with multiprocessing=False has no benefit because of the GIL.

I am using ImageDataGenerator.flow_from_directory which takes the code down the Sequence path internally (you would imagine it would go down the generator path, but it does not when I debug it).

As for the second part, I do see a significant gain when using multiprocessing=False and adding multiple workers as it does seem to create more threads to pre-process the data.

Sorry, Sequences work with multiprocessing=False, but generators don't.
I'm currently unable to reproduce your issue; maybe you could share a gist?

I'll work on getting you a sample. How large is your sample dataset? It's very prominent when using train and validation generators on a relatively large dataset, such as ImageNet.

100k 300x300 generated images
Also, what's your batch_size?

  1. NumTrainingImages: 1,281,167
  2. NumValImages: 500,000
  3. InputShape: (240,240)
  4. BatchSize: 64
  5. steps_per_epoch: 1,281,167/64
  6. validation_steps: 500,000/64

Seems like people are seeing the same issue in this thread: https://github.com/keras-team/keras/issues/5835#issuecomment-353629656

With your values, your queue can go up to about 40 GB of RAM:
2 queues (train/val) × 512 queue_size × (64 × 240 × 240 × 3) values × 4 bytes per float32.
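
A quick back-of-the-envelope check of that estimate (a sketch using only the batch shape, queue size and float32 size quoted above):

num_queues = 2            # train + validation enqueuers
max_queue_size = 512      # batches held per queue
batch_shape = (64, 240, 240, 3)
bytes_per_value = 4       # float32

batch_bytes = bytes_per_value
for dim in batch_shape:
    batch_bytes *= dim

total_bytes = num_queues * max_queue_size * batch_bytes
print("~%.1f GiB" % (total_bytes / 1024.0 ** 3))  # roughly 42 GiB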

Are you seeing any zombie threads again? Or it's just that the queue is filling up?

Thank you @apedevmicrosoft for pointing me here from #5835. I am having a similar issue. These are the system details:

  • OS: Ubuntu 16.04.2 LTS
  • RAM: 128 GB
  • GPU: NVIDIA GeForce GTX 1080 Ti 11GB
  • Python 2.7.12; Keras 2.1.2; Tensorflow 1.2.0

The data set is ImageNet ILSVRC 2012, with the following characteristics:

  • Number of training images: 1,281,167
  • Number of validation images: 50,000
  • Image shape: [150, 200, 3]
  • Encoding: Float 32
  • Batch size: 64

What I have observed so far: I try to train with fit_generator. Before training starts, the RAM gets filled slowly up to the 128 GB available, then the swap, and at some point (when swap usage is about 100 GB) I get a MemoryError.

This happens with the default values of max_queue_size=10, workers=1, use_multiprocessing=False. But I have also tried max_queue_size=1, workers=10, use_multiprocessing=True and other combinations, with identical results. I have also found the batch size irrelevant (I have tried 16 instead of 64).

Finally, the only way I have managed to train on ImageNet is by reducing the number of training images to 10% (128,116). The RAM gets slowly filled similarly, then the swap gets filled up to ~50 GB. At this point training starts, the swap empties, and RAM usage drops to a constant ~75 GB. With 50% (640,583) I get the memory error as well.

I am not sure about all the details of how fit_generator works, but my intuition is that it is trying to allocate in RAM at least as many images as are available for training, which does not make sense. If I am not wrong, this worked with a previous version of Keras, but I do not remember which one and I haven't tried again.

I am happy to help find out what's going on if you have some time too.

Are you too using ImageDataGenerator? If so, with flow or flow_from_directory?

Yes, I am using ImageDataGenerator with flow.

Could you give me a quick snippet of your generator?
flow takes 2 numpy arrays, so do you call flow on each batch?

# Create batch generators
batch_gen_tr = ImageDataGenerator(width_shift_range=0.1,
                                  height_shift_range=0.1,
                                  horizontal_flip=True)
batch_gen_val = ImageDataGenerator()
# Train model
model.fit_generator(generator=batch_gen_tr.flow(images_tr, labels_tr),
                    steps_per_epoch=20000,
                    epochs=25,
                    validation_data=batch_gen_val.flow(images_val, labels_val),
                    validation_steps=780)

images_tr, images_val, labels_tr and labels_val are dask arrays in my case.

So the shape of images_tr is ~ [1281167, 150, 200, 3]?

Yes, exactly.

Then your memory is already filled up with the images; it's not a bug. You load 1 million images into memory.
You need to use a generator to create your samples.

Sorry, I forgot the batch_size argument in the flow function. It should be:

model.fit_generator(generator=batch_gen_tr.flow(images_tr, labels_tr, batch_size=64),
                    steps_per_epoch=20000,
                    epochs=25)

So, if I am not wrong, fit_generator should be able to retrieve the data batch by batch, without filling up the memory.

images_tr can be [1281167, 150, 200, 3] because it is a dask array, so it is loaded on hard disk, not RAM.

ImageDataGenerator casts images_tr to a numpy array (here).
Feel free to submit a PR if you want to avoid this behavior, or use a true generator.
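
For context, this is roughly why the RAM fills up before training even starts (a sketch assuming dask inputs with the shapes reported above; the array here is a placeholder, not the actual ImageNet data):

import numpy as np
import dask.array as da

# Placeholder dask array with the training-set shape reported above.
images_tr = da.zeros((1281167, 150, 200, 3), dtype=np.float32,
                     chunks=(1024, 150, 200, 3))

# As noted above, ImageDataGenerator.flow() converts its input to a dense
# numpy array, which forces the whole dask array to be computed and held
# in RAM: 1281167 * 150 * 200 * 3 * 4 bytes ~= 460 GB.
# x = np.asarray(images_tr)   # <- this is where the allocation happens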

Oh I see, thanks! I think it would be better if Keras was able to handle large dask arrays. The API would be more transparent. Then, is it not possible to use fit_generator with large data sets? What do you mean by using a "true generator", @Dref360?

Something like:

img_paths = ...  # Paths to all the images
datagen = ImageDataGenerator(...)

def your_gen(paths):
    while True:
        # get_batches() is your own helper that yields batches of
        # loaded images and labels for the given paths.
        for X_batch, y_batch in get_batches(paths):
            X_batch = [datagen.random_transform(x) for x in X_batch]
            # additional preprocessing
            yield X_batch, y_batch

model.fit_generator(your_gen(img_paths), ...)

@Dref360

Are you seeing any zombie threads again? Or it's just that the queue is filling up?

I am actually not exactly sure where the problem is. Could you suggest a way to debug this?

Hi, I was looking for this exact issue and I found this one: #8677. Are you guys using a callback? It seems threads are started again both for validation and after callbacks. And I do have zombies.

These are the callbacks that I am using:

from datetime import datetime
import time

import numpy as np
from keras import backend as K
from keras.callbacks import (ReduceLROnPlateau, EarlyStopping,
                             ModelCheckpoint, TensorBoard)

string_start_time = datetime.fromtimestamp(time.time()).strftime('%Y-%m-%dT%H-%M-%SZ')

lr_reducer      = ReduceLROnPlateau(monitor='val_loss', factor=np.sqrt(0.1),
                                    cooldown=0, patience=20, min_lr=0.5e-6)
# LearningRateLogger is a custom callback defined elsewhere in my code;
# lr_file and model_file are paths defined earlier in the script.
lr_logger       = LearningRateLogger(lr_file)

early_stopper   = EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=50)
model_checkpoint = ModelCheckpoint(model_file, monitor="val_acc", save_best_only=True,
                                   save_weights_only=False)

callbacks = [lr_reducer, lr_logger, early_stopper, model_checkpoint]

if K.backend() == 'tensorflow':
    tensorboard = TensorBoard(log_dir='logs/{}'.format(string_start_time))  # , histogram_freq=5, batch_size=batch_size
    callbacks.append(tensorboard)

I tried to run the same model I had (Xception with transfer learning) but without callbacks, as suggested in #8677, and I still get a crash. I read in #8946 that it's been corrected on master (#8666), so I'll give it a new try.

Any updates on this ?

Yes, I just updated Keras and it works well.

Working with Keras 2.1.5 on Google Colab, it still crashes with a batch size of 100 article images and 3 layers of LSTM. When I downgrade to 2 layers, it starts working again. model.fit_generator seems to explode in memory.

If it works with 2 layers and not 3, it's your model that is too big. Not fit_generator related.

This issue might be caused by TensorFlow rather than Keras.
I was working on Keras 2.2.4 with a TensorFlow 1.14.0 (CPU) backend and had the same issue. Then I downgraded TensorFlow to 1.13.1 and found that no memory is leaking anymore. I haven't changed my Python script or Keras version.

I have a similar problem to the one discussed in this issue. Training the network increases the memory batch after batch, until the program gets killed because of memory exceeding the limits.

I'm using Keras 2.2.4 and TensorFlow 1.12.0. I currently have 128 GB of RAM and my dataset occupies 20 GB (compressed). I load the compressed dataset into RAM and then use a Sequence generator to uncompress batches of data to feed the network. Each of my samples is 100x100x100 pixels stored as float32, and I'm using a batch_size of 64, a queue_size of 5 and 27 workers with multiprocessing=True. In theory, I should have a total of 100 × 100 × 100 × 4 × 64 × 5 × 27 bytes ≈ 35 GB. However, when I run my script, it gets killed by the queuing system because of excessive memory usage:

slurmstepd: error: Job XXXX exceeded memory limit (1192359452 > 131072000), being killed

I've even tried a max_queue_size as small as 2, and the process still exceeds the maximum memory. To make things even harder to understand, sometimes, completely by chance, the process executes properly (even with a max_queue_size of 30!).

The only way to get the script to work is to set workers=2, although it becomes incredibly slow (my generator is computationally heavy; I need more workers to keep the GPUs from starving). If I check ps aux, I can see 4 different processes:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user  6524 96.3 16.3 54067464 21438188 ?   R    18:26  14:45 python3 main.py --experiment=0
user  6525  0.4 16.2 54004960 21395432 ?   S    18:26   0:04 python3 main.py --experiment=0
user  6528 96.8 16.3 54067464 21444344 ?   R    18:26  14:49 python3 main.py --experiment=0
user  6530  0.3 16.2 54007012 21395440 ?   S    18:26   0:03 python3 main.py --experiment=0

The memory allocated to each process is 21 GB, which explains why I run out of memory with more than 2 workers. Why is this happening? Isn't the memory meant to be shared between workers?

I'm really going crazy with this issue...
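
One way to check whether those 21 GB per worker are truly duplicated or mostly copy-on-write pages shared with the parent (a sketch; psutil is assumed to be installed, and you could call this e.g. from inside the Sequence's __getitem__ so it runs in the worker processes):

import os
import psutil

proc = psutil.Process(os.getpid())
mem = proc.memory_full_info()
# uss = memory unique to this process; shared = pages shared with other processes.
print("pid=%d rss=%.1f GB uss=%.1f GB shared=%.1f GB"
      % (os.getpid(), mem.rss / 1e9, mem.uss / 1e9, mem.shared / 1e9))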

Hi @Dref360,
I have the same problem. I am using TensorFlow 1.14 and Keras 2.2.4 on Ubuntu, with a GTX 1080 GPU, CUDA 9 and cuDNN 7.
I am using Keras fit_generator to train the model. The generator is built with Tensorpack (dataflow) and has been tested correctly for 2000 iterations.
batch_size = 4, image_size = 512x512x3, dataset size = 2000 images.
fit_generator with default parameters.
I see that the RAM keeps increasing while the model is training, from 50% to 75% in only 70 epochs!

Hi all, I recently had the same issue when using a generator in Keras. I'm not sure what's going on inside the fit_generator function, but when I used a Sequence instead it didn't crash anymore. If anyone has a clue about the fit_generator behaviour, I am more than interested. Thanks
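
For reference, a minimal sketch of the Sequence-based approach mentioned above (the load_image() helper and the paths/labels variables are hypothetical placeholders):

import numpy as np
from keras.utils import Sequence

class BatchSequence(Sequence):
    """Loads one batch at a time instead of holding the full dataset in RAM."""

    def __init__(self, image_paths, labels, batch_size):
        self.image_paths = image_paths
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.image_paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        # load_image() is a hypothetical helper that reads and preprocesses one file.
        X = np.array([load_image(p) for p in self.image_paths[start:end]])
        y = np.array(self.labels[start:end])
        return X, y

model.fit_generator(BatchSequence(paths, labels, batch_size=64),
                    epochs=10, workers=4, use_multiprocessing=True)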

I am facing the same issue with numpy arrays as input. The memory just increases with every batch and is not cleared. Is there a solution to the problem by now? Or do we all misunderstand the concept of fit_generator?
From what I understood, fit_generator is supposed to load a new sample for every batch. But after this data is processed by the model (forward and backward propagation), it should be released from memory. Is this correct?

Just wanted to mention that I'm facing the exact same problem while using fit_generator. The RAM usage keeps increasing and finally I get a ResourceExhaustedError at some point. It looks like memory from the previous batch is not released, or is released very slowly, leading to this. I have tried creating very small batches of only 8 images, so I get a numpy array that's not too large. Still, after some number of epochs memory is released very slowly and finally it just crashes. I'm running it on a Kaggle kernel with 16 GB of RAM.

I am using TF 1.15 to train an encoder-decoder-like model, and I run into the same problem when I call the .predict method of the encoder every epoch (GPU memory explodes after around 100 epochs). Simply changing .predict to .predict_on_batch with nothing else altered seems to fix my problem, and there seems to be no damage to the performance.
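
To illustrate the change described above (a minimal sketch; encoder and x_batch are placeholder names):

# Called once per epoch, e.g. from a callback:
# z = encoder.predict(x_batch)          # repeated calls appeared to grow GPU memory
z = encoder.predict_on_batch(x_batch)   # swapping to this avoided the growth here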

I found this problem too with Ubuntu 18.04 and Anaconda, but when I used the Python 3 that ships with Ubuntu, the bug went away.
