I couldn't find a detailed explanation of max_queue_size (default = 10) and the mechanism behind it, along with the other generator-related parameters: workers and use_multiprocessing.
_Question 1:_
I might be totally wrong and want to get your feedback on my understanding of it. I thought that multiple generator instances (the producers) are launched to feed data into a queue that is created and maintained by the model.fit_generator() function, while data is grabbed from the queue and sent to the GPU for training (the consumer). If training on the GPU is not the bottleneck, then the more data the generator can yield, the faster the overall process would be. I learned that by default max_queue_size = 10; how should I choose a proper max_queue_size once the generator is thread-safe?
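To make the mental model above concrete, here is a minimal producer/consumer sketch; it is only an illustration of the idea, not Keras internals, and all names and numbers in it are made up.

```python
import queue
import threading

batch_queue = queue.Queue(maxsize=10)  # plays the role of max_queue_size

def producer(generator):
    # Generator side: keeps the queue topped up with batches.
    for batch in generator:
        batch_queue.put(batch)         # blocks while the queue is full

def consumer(steps):
    # Training side: pulls one batch per step, like the GPU in fit_generator.
    for _ in range(steps):
        batch = batch_queue.get()      # blocks while the queue is empty
        # ... the model would train on `batch` here ...

threading.Thread(target=producer, args=(iter(range(100)),), daemon=True).start()
consumer(steps=100)
```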
_Question 2:_
Also, is there a way to measure whether the bottleneck is the generator (producer) or GPU training (consumer)? I use verbose=1 to print the progress bar, and I also print how many rows the single-threaded generator has yielded. Right now it always looks like:
number of rows yielded = (max_queue_size + number of steps processed so far) * batch_size
So it seems like the queue is always full, which means the bottleneck is the consumer/GPU training?
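One simple way I could check this, sketched below; this is illustrative only, and it assumes the data_genereator defined further down with a batch_size of 512: time the producer on its own and compare it with the s/step shown in the progress bar.

```python
import time

gen = data_genereator(X_train, batch_size=512)    # the generator from the code below
start = time.time()
for _ in range(10):
    next(gen)                                     # produce 10 batches with no GPU involved
print("seconds per batch from the generator:", (time.time() - start) / 10)
# If this is much smaller than the s/step in the progress bar, the consumer (GPU) is
# the bottleneck; if it is comparable or larger, the producer (generator) is.
```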
_Question 3:_
The accuracy within Epoch 1 improves on almost every step, from 0.6719 to 0.87316. However, within Epoch 2, after about half the steps completed, the accuracy started to decline from 0.9551 to 0.8317. The trend continued in Epoch 3 and now the accuracy is 0.6633.
The input is a 3D array of shape (number of rows) x 1000 timesteps x 2400 (the length of the word2vec vectors). The word2vec model is trained from scratch and contains a 200k-word vocabulary.
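For scale, some quick back-of-the-envelope arithmetic, assuming float32 inputs and the batch_size of 512 used in the code below:

```python
# One batch of shape (512, 1000, 2400) stored as float32:
bytes_per_batch = 512 * 1000 * 2400 * 4
print(bytes_per_batch / 2**30, "GiB")  # ~4.6 GiB per batch, so a full queue of 10 such batches is ~46 GiB
```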
I only use one LSTM layer followed by a Dropout layer and then the output. The code and layer structure are as follows:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

data = pd.read_csv("data.csv", header=0, delimiter="\t", quoting=3, encoding="utf-8")
y = data.label
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2)

def data_genereator(data, batch_size):
    num_rows = int(data.shape[0])
    # Initialize a counter
    counter = 0
    while True:
        for content, label in zip(data['content'], data['label']):
            # transform() (defined elsewhere by the author) turns a text into its word2vec representation.
            # Note: this writes into the X_train / y_train objects created by train_test_split above.
            X_train[counter % batch_size] = transform(content)
            y_train[counter % batch_size] = np.asarray(label)
            counter = counter + 1
            if counter % batch_size == 0:
                yield X_train, y_train

model = Sequential()
model.add(LSTM(64, input_shape=(1000, 2400), return_sequences=False,
               kernel_initializer='he_normal', dropout=0.15,
               recurrent_dropout=0.15, implementation=2))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

training_generator = data_genereator(X_train, batch_size=512)
validation_generator = data_genereator(X_test, batch_size=512)

model.fit_generator(training_generator,
                    steps_per_epoch=8856,
                    validation_data=validation_generator,
                    epochs=10,
                    verbose=1,
                    workers=1,
                    use_multiprocessing=False,
                    validation_steps=2216)
Epoch 1/10
1/8856 [..............................] - ETA: 87:38:53 - loss: 0.6919 - acc: 0.6719 yield at counter 6144
...
8856/8856 [==============================] - 63350s 7s/step - loss: 0.3740 - acc: 0.8316 - val_loss: 0.1098 - val_acc: 0.9590
Epoch 2/10
1/8856 [..............................] - ETA: 13:53:45 - loss: 0.1892 - acc: 0.9297 yield at counter 4540180
...
4937/8856 [===============>..............] - ETA: 6:20:03 - loss: 0.1261 - acc: 0.9551 yield at counter 7067412
...
8856/8856 [==============================] - 60512s 7s/step - loss: 0.3364 - acc: 0.8317 - val_loss: 0.5985 - val_acc: 0.6876
Epoch 3/10
1/8856 [..............................] - ETA: 14:45:26 - loss: 0.6098 - acc: 0.6816 yield at counter 9074728
...
4681/8856 [==============>...............] - ETA: 6:45:02 - loss: 0.6130 - acc: 0.6633 yield at counter 11470888
I am also quite interested in Question 3; I had the same experience with one of my model trainings!
Question 1
P.S.: I found this answer about queue size, which totally makes sense!
https://stackoverflow.com/a/36989864/4496896
2 - max_queue_size is just the size of the multiprocessing queue.
The queue size can be sized according to Little's Law.
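As a rough illustration of applying Little's Law, L = lambda * W, to this queue; the numbers below are invented, not measured.

```python
# Little's Law: average queue occupancy L = arrival rate (lambda) * time spent in the queue (W).
batches_per_second = 4.0   # lambda: how fast the generator yields batches (assumed)
seconds_in_queue = 2.0     # W: how long a batch waits before the GPU consumes it (assumed)
steady_state_batches = batches_per_second * seconds_in_queue   # L = 8
print(steady_state_batches)
# A max_queue_size a bit above L (e.g. the default of 10 here) keeps the consumer fed
# without holding extra batches in memory for no benefit.
```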
Hey, I'm the one that built the V2 of this stuff, so I'll try to answer.
I'm starting a Wiki page for this; it's pretty basic, so feel free to propose changes:
https://github.com/keras-team/keras/wiki/Understanding-parallelism-in-Keras
Important: when using use_multiprocessing=False, you're blocked by the GIL 90% of the time.
Also, prefer keras.utils.Sequence (a sketch follows this reply).
Question 1: As @CMCDragonkai said, Little's Law is a good choice, but 2 times the number of workers is a great starting point.
Question 2: I would suggest looking at your GPU usage; if it's consistently at 100%, the GPU is the bottleneck.
Question 3: Your generator is weird; it mutates X_train and y_train in place. I'm pretty sure it's not doing what you want.
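Below is a minimal keras.utils.Sequence sketch of what this reply suggests. It is illustrative only: it assumes the same transform() helper and the content/label columns from the original code, and the worker/queue numbers below just apply the "2 times the workers" starting point, not a recommendation.

```python
import numpy as np
from keras.utils import Sequence

class ContentSequence(Sequence):
    """Batches are addressed by index, so it is safe with workers > 1 and multiprocessing."""

    def __init__(self, frame, batch_size):
        self.frame = frame.reset_index(drop=True)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.frame) / self.batch_size))

    def __getitem__(self, idx):
        chunk = self.frame.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        X = np.stack([transform(c) for c in chunk['content']])  # transform() as in the original post
        y = chunk['label'].values
        return X, y

train_seq = ContentSequence(X_train, batch_size=512)
workers = 4                                      # assumed; leave some CPU cores free
model.fit_generator(train_seq,
                    epochs=10,
                    workers=workers,
                    use_multiprocessing=True,    # safe because the Sequence is indexed, not stateful
                    max_queue_size=2 * workers)  # the "2x workers" starting point from above
```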
I've noticed that the Keras workers parameter is a bit strange. If you set it to 1 and enable multiprocessing, 2 child processes are created. One of the child processes does absolutely nothing: it has 0.0% CPU usage, and stracing it shows that it's stuck on a futex syscall.
So when I gave it workers = 12, it produced 24 child processes.
The extra worker process seems useless. What is it for? The docs only say that worker processes are used to generate "input" for the neural network. But the redundant child processes do absolutely nothing.
Are you sure those workers are not just the validation workers? During training, the worker used for validation will be idle.
Also, are you sure that they are processes and not threads?
Depending on the size of the queue, there are multiple threads per process.
You can see the actual processes with:
import multiprocessing
multiprocessing.active_children()
Those workers are definitely processes. I used htop to switch between showing and hiding user threads; in this case I had it set to not show them.
Hey @CMCDragonkai and @Dref360, I am new to DL and currently using Keras to build my first few models, and I am still confused about how to use queue size, workers, and use_multiprocessing. Can you please give me an example of how you would set them if you had 2x GPU (V100/P100) and an 8-core CPU? Or is it better to keep the default values?
Thanks!!
@Dref360
> Important: when using use_multiprocessing=False, you're blocked by the GIL 90% of the time.
So why is use_multiprocessing=False the default?
Using multiprocessing is advanced usage and brings some complications: you need to care about process-safe resources, the way processes are created on Windows, etc. So using threads is better from a UX point of view.
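One concrete example of such a complication, as an illustrative sketch: when worker processes are spawned rather than forked, e.g. on Windows, the main script is re-imported in each worker, so the training call has to sit behind a __main__ guard. The call below reuses the fit_generator arguments from the original post with made-up worker numbers.

```python
# Illustrative only: with use_multiprocessing=True and spawn-style process creation,
# each worker re-imports this module, so training must live under this guard.
if __name__ == "__main__":
    model.fit_generator(training_generator,
                        steps_per_epoch=8856,
                        epochs=10,
                        workers=4,               # made-up number, just for illustration
                        use_multiprocessing=True,
                        max_queue_size=8)
```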
@Dref360
What could be the reason that, with use_multiprocessing=False, I see a lower CPU load than with use_multiprocessing=True, using the same number of workers?
https://stackoverflow.com/questions/57381464/keras-using-multiply-cores-in-batch-generator
Because when it's set to False, we use threads, and there is no real thread parallelism in Python (because of the GIL), so you won't see much improvement.
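A tiny self-contained illustration of that point, nothing Keras-specific and with arbitrary workload sizes: pure-Python CPU-bound work does not get faster with threads, but it does with processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # Pure-Python CPU-bound loop; it holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.time()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(busy, [5_000_000] * 4))
        print(pool_cls.__name__, round(time.time() - start, 2), "seconds")
```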