I'm trying to run the LSTM network from the IMDB example on my own data.
I have more training examples (100k), longer sentences (maxlen of 300 instead of 100), and a slightly larger vocabulary (23,000 instead of 20,000).
Now I get the out-of-memory error below (although only after finishing a whole epoch).
This is on a 4GB GTX 760 card.
My question is: what determines the memory usage of this network? Obviously a larger vocabulary means a larger embedding layer.
Do longer sentences mean a bigger network? That is, are LSTM nodes unrolled in time for ALL words? Is there such a thing as a "window size"? Or is the back-propagation through time "virtual", so that it does not increase memory used?
Do more training examples mean more memory usage? I thought only one mini-batch was loaded at a time?
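For what it's worth, a back-of-the-envelope estimate of the embedding layer's footprint (the embedding dimension of 256 is an assumption here; substitute whatever the model actually uses):

```python
# Rough size of the embedding layer alone, in float32:
vocab_size, embedding_dim = 23000, 256  # embedding_dim is assumed
embed_bytes = vocab_size * embedding_dim * 4  # 4 bytes per float32
print(embed_bytes / 1e6)  # roughly 23.6 MB
```

So even the enlarged vocabulary only adds a few megabytes of parameters; the embedding table itself is unlikely to be the culprit on a 4 GB card.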
Otherwise thanks for a great library!
Train on 94990 samples, validate on 10555 samples
Epoch 0
WARNING: unused streams above 2048 (Tune GPU_mrg get_n_streams)
94976/94990 [============================>.] - ETA: 0s - loss: 0.2739 - acc.: 0.9235Error allocating 3242496000 bytes of device memory (out of memory). Driver report 876146688 bytes free and 4294246400 bytes total
Traceback (most recent call last):
File "nntrain.py", line 45, in <module>
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=5, validation_split=0.1, show_accuracy=True)
File "build/bdist.linux-x86_64/egg/keras/models.py", line 206, in fit
File "build/bdist.linux-x86_64/egg/keras/models.py", line 132, in test
File "/usr/local/lib/python2.7/dist-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 588, in __call__
self.fn.thunks[self.fn.position_of_error])
File "/usr/local/lib/python2.7/dist-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 579, in __call__
outputs = self.fn()
File "/usr/local/lib/python2.7/dist-packages/Theano-0.6.0-py2.7.egg/theano/gof/op.py", line 644, in rval
r = p(n, [x[0] for x in i], o)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.6.0-py2.7.egg/theano/sandbox/cuda/basic_ops.py", line 2223, in perform
out[0] = x.reshape(tuple(shp))
MemoryError: Error allocating 3242496000 bytes of device memory (out of memory).
Apply node that caused the error: GpuReshape{2}(GpuDimShuffle{1,0,2}.0, MakeVector.0)
Inputs types: [CudaNdarrayType(float32, 3D), TensorType(int64, vector)]
Inputs shapes: [(300, 10555, 256), (2,)]
Inputs strides: [(256, 76800, 1), (8,)]
Inputs scalar values: ['not scalar', 'not scalar']
Error allocating 3242496000 bytes of device memory (out of memory). Driver report 876146688 bytes free and 4294246400 bytes total
This tells you everything you need to know. Your GPU does not have enough memory for this task.
Things you can try:
Alternative solutions...
Obviously a larger vocab means a larger embed layer.
Yes.
Do longer sentences mean a bigger network?
No, the network size will be the same, but each sample will be larger, so you will be using more memory to load each batch.
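In fact, the failed allocation in the traceback can be reproduced arithmetically from the tensor shape it reports, assuming (300, 10555, 256) is (timesteps, validation samples, layer width) in float32:

```python
# Shape taken from the "Inputs shapes" line of the traceback:
seq_len, n_samples, width = 300, 10555, 256
bytes_needed = seq_len * n_samples * width * 4  # 4 bytes per float32
print(bytes_needed)  # 3242496000, exactly the failed allocation
```

Notably, 10555 is the size of the whole validation split, which suggests the validation pass here is being evaluated in one go rather than in mini-batches; that is why the error only appears at the end of the epoch.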
Thanks for the feedback. I just wonder if there may be a "memory leak" or something, since the network in theory fits on my GPU and it successfully trained one epoch before failing.
I reduced the sequence length to 100; then it trained two epochs before failing.
I am quite ignorant about GPU programming, but could there be memory that isn't "freed" after each epoch?
Would setting the truncate_gradient option for the LSTM layer reduce memory consumption?
100 seems large for an LSTM. Try 32...
Garbage collection is not instantaneous, so if you're working close to the memory limit you have a very high risk of running out of memory even though your work fits in memory "in theory".
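A rough sketch of why shorter sequences (and smaller batches) help; the shapes and the per-layer accounting here are illustrative assumptions, not the exact internals of Keras/Theano:

```python
def activation_bytes(batch_size, seq_len, width, dtype_bytes=4):
    """Approximate memory for one layer's activations, which must be
    kept for every timestep to back-propagate through time."""
    return batch_size * seq_len * width * dtype_bytes

# Activation memory scales linearly with sequence length:
print(activation_bytes(128, 100, 256) / 1e6)  # ~13.1 MB per layer
print(activation_bytes(128, 32, 256) / 1e6)   # ~4.2 MB per layer
```

Cutting the sequence length from 100 to 32 cuts this roughly threefold, which buys headroom against the garbage-collection timing issue described above.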
Would setting the truncate_gradient option for the LSTM layer reduce memory consumption?
No.
thank you