[x] Check that you are up-to-date with the master branch of Keras. You can update with:
pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps
[x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.
[x] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).
Since TF 1.8, the memory leak is worse. It only happens when calling fit/evaluate multiple times, so it is not likely to happen in a real setting. But this may have played a role in https://github.com/keras-team/keras/pull/10674 .
For info, tf > 1.8 uses more memory than 1.7.
pip install psutil
import numpy as np
import tensorflow as tf
from keras import Sequential
from keras.layers import Dense

print("Tensorflow version", tf.__version__)

# Tiny model: memory usage is dominated by the generator workers, not the weights.
model1 = Sequential()
model1.add(Dense(10, input_shape=(1000,)))
model1.add(Dense(3, activation='relu'))
model1.compile('sgd', 'mse')

def gen():
    # Infinite generator yielding a constant batch of 10 samples.
    while True:
        yield np.zeros([10, 1000]), np.ones([10, 3])

import os
import psutil

process = psutil.Process(os.getpid())
g = gen()

while True:
    # Print the resident set size (in MiB) of the parent process, then run
    # fit/evaluate with multiprocessing workers.
    print(process.memory_info().rss / float(2 ** 20))
    model1.fit_generator(g, 100, 2, use_multiprocessing=True, verbose=0)
    model1.evaluate_generator(gen(), 100, use_multiprocessing=True, verbose=0)
('Tensorflow version', '1.7.0')
160.546875
176.2421875
177.70703125
177.8671875
177.87109375
177.87890625
177.88671875
177.890625
177.8984375
177.90234375
177.90234375
177.90234375
177.90234375
177.91015625
177.9140625
177.91796875
177.91796875
177.91796875
177.91796875
177.91796875
177.91796875
177.91796875
177.93359375
('Tensorflow version', '1.9.0')
166.3125
180.30078125
182.125
182.16015625
182.22265625
182.31640625
182.3203125
182.32421875
182.3359375
182.34375
182.34765625
182.35546875
182.35546875
182.359375
182.37109375
182.38671875
182.390625
182.40234375
182.40625
182.41015625
182.41796875
182.41796875
182.41796875
182.41796875
I can confirm this behaviour across several systems; sometimes the GPU memory also doesn't get cleaned up properly, causing OOMs. I have also noticed that the worker processes don't always terminate after fitting is done or interrupted.
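A quick way to check for leftover workers is to list the child processes of the training process right after fit_generator returns; a rough diagnostic sketch using psutil:

import os
import psutil

# Any entries printed here are worker processes that outlived fit_generator;
# ideally this list is empty once the enqueuer has stopped.
parent = psutil.Process(os.getpid())
for child in parent.children(recursive=True):
    print(child.pid, child.status(), child.memory_info().rss / float(2 ** 20))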
I don't think this is related to data_utils. See this script, which does exactly what fit_generator does:
import numpy as np
from keras.utils import GeneratorEnqueuer
import multiprocessing

def gen():
    # Same infinite generator as in the reproduction script above.
    while True:
        yield np.zeros([10, 1000]), np.ones([10, 3])

import os
import psutil

process = psutil.Process(os.getpid())
g = gen()

def k():
    for _ in range(200):
        # Start a multiprocessing enqueuer, pull a few batches, then stop it,
        # mimicking what fit_generator does internally.
        enqueuer = GeneratorEnqueuer(g, True)
        enqueuer.start(2)
        g1 = enqueuer.get()
        for _ in range(10):
            next(g1)
        enqueuer.stop()
        print(process.memory_info().rss / float(2 ** 20))
        print(len(multiprocessing.active_children()))

if __name__ == '__main__':
    k()
The memory is stable and there are no active children.
So was I at least partly right about this being linked to multiprocessing: https://github.com/keras-team/keras/pull/10674 ?
I might need to disable multiprocessing in some scripts until the issue is identified.
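For reference, the temporary workaround is just to flip the flags on the calls from the script above, e.g.:

# Same parameters as the reproduction script, but with thread-based workers
# instead of processes.
model1.fit_generator(gen(), steps_per_epoch=100, epochs=2,
                     workers=1, use_multiprocessing=False, verbose=0)
model1.evaluate_generator(gen(), steps=100,
                          workers=1, use_multiprocessing=False, verbose=0)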
From what I found, these are two distinct problems. I used pytest-memory-profiler.
In #10674, the processes take a lot of memory at the start (2 GB); this goes OOM and then the OS kills the processes silently.
I profiled the script above with memory_profiler and it seems that a lot of dicts have no owner, and they look like a bunch of logs. So maybe we're not cleaning them up correctly.
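A rough way to watch those objects accumulate is to count live objects by type between fit calls with the standard gc module (just a diagnostic sketch, not the memory_profiler output itself):

import gc
from collections import Counter

def count_live_types(top=5):
    # Count live Python objects by type name; compare the dict/list counts
    # before and after a fit_generator call to see what accumulates.
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(top)

print(count_live_types())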
Note that we don't see this behaviour with keras.utils.Sequence, so refactoring GeneratorEnqueuer to use a Pool may be the solution.
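For illustration, a pool-based variant could look roughly like the sketch below: the pool is created once and reused, instead of starting and joining fresh worker processes for every fit/evaluate call. The ProcessPoolEnqueuer class and its interface are hypothetical, not the actual Keras refactor.

import multiprocessing
import numpy as np

class ProcessPoolEnqueuer(object):
    # Hypothetical sketch: reuse one multiprocessing.Pool for the lifetime of
    # the enqueuer instead of spawning new worker processes per epoch.
    def __init__(self, make_batch, workers=2):
        self.pool = multiprocessing.Pool(workers)
        self.make_batch = make_batch

    def get(self, n_batches):
        # map() blocks until all batches have been produced by the pool workers.
        return self.pool.map(self.make_batch, range(n_batches))

    def stop(self):
        self.pool.close()
        self.pool.join()

def make_batch(_):
    # Constant batch, matching the generators used above.
    return np.zeros([10, 1000]), np.ones([10, 3])

if __name__ == '__main__':
    enqueuer = ProcessPoolEnqueuer(make_batch, workers=2)
    for _ in range(10):
        enqueuer.get(100)  # batches for one "epoch"
    enqueuer.stop()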
What's the status of this issue now? I have been having the same issue in recent days: I'm using a data generator inherited from keras.utils.Sequence with use_multiprocessing=True, and the memory keeps growing until the training process dies.
@soulmachine this is an issue when using generators, not Sequences.
Be sure to be on the latest version of Keras. If the issue persists, please submit a reproducible script.
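As a point of comparison, the generator from the reproduction script can be rewritten as a keras.utils.Sequence, which is the code path where the leak has not been observed; a minimal sketch:

import numpy as np
from keras.utils import Sequence

class ZerosSequence(Sequence):
    # Minimal Sequence equivalent of the gen() generator from the first script:
    # every index returns the same constant batch.
    def __init__(self, n_batches=100):
        self.n_batches = n_batches

    def __len__(self):
        return self.n_batches

    def __getitem__(self, idx):
        return np.zeros([10, 1000]), np.ones([10, 3])

# e.g. model1.fit_generator(ZerosSequence(100), epochs=2,
#                           use_multiprocessing=True, workers=2, verbose=0)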
Any update on this? I tried to reproduce this in an environment using CIFAR-100, but failed to reproduce the leak. However, in this other project of mine (https://github.com/adrianodennanni/furniture_classifier), I can reproduce the leak. Maybe it is related to image resizing.