Hi, I tried keras.layers.CuDNNLSTM after seeing fchollet's tweet the other day. I have the latest Keras and TensorFlow, but TensorFlow fails with the Op 'CudnnRNN' not being registered.
Have I missed something?
Thanks.
OS: Windows 10
Keras version: master (as of today: 2.0.8+)
TensorFlow backend version: master (as of today: ~1.4rc0)
GPU: Geforce GTX 1080Ti (11GB)
Cuda version: v8.0
cuDNN version: cudnn-8.0-windows10-x64-v6.0
Code to reproduce the error:
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import CuDNNLSTM
from keras.optimizers import RMSprop


class TestCudnnLSTM():
    def __init__(self):
        self.max_length = 1000
        self.n_input_dim = 1
        self.model = []
        self.config()
        self.create_model()

    def config(self):
        print("Keras version: " + keras.__version__)
        print("Tensorflow version: " + tf.__version__)
        config = tf.ConfigProto()
        return config

    def create_model(self):
        print('Creating Model')
        model = Sequential()
        model.add(CuDNNLSTM(1,
                            return_sequences=True,
                            stateful=False,
                            kernel_initializer='he_normal',
                            input_shape=(self.max_length, self.n_input_dim)))
        print(model.summary())
        opt = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
        model.compile(loss='categorical_crossentropy',
                      optimizer=opt,
                      metrics=['accuracy'],
                      weighted_metrics=['accuracy'],
                      sample_weight_mode='temporal')
        print('Model compiled')
        self.model = model
        return self


if __name__ == "__main__":
    mt = TestCudnnLSTM()
Console Output:
Using TensorFlow backend.
2017-10-13 13:16:23.067049: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2017-10-13 13:16:23.742057: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0b:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2017-10-13 13:16:24.022971: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:a1:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2017-10-13 13:16:24.023675: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Device peer to peer matrix
2017-10-13 13:16:24.024187: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1051] DMA: 0 1
2017-10-13 13:16:24.024421: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 0: Y N
2017-10-13 13:16:24.024671: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 1: N Y
2017-10-13 13:16:24.025126: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0b:00.0, compute capability: 6.1)
2017-10-13 13:16:24.025743: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a1:00.0, compute capability: 6.1)
Keras version: 2.0.8
Tensorflow version: 1.4.0-dev20171010
Creating Model
Traceback (most recent call last):
File "D:\users\philip\RevCtrl\GIT_RD_Python\Ch2017\ch2017_train\testCuDnnLSTM.py", line 54, in <module>
mt = TestCudnnLSTM()
File "D:\users\philip\RevCtrl\GIT_RD_Python\Ch2017\ch2017_train\testCuDnnLSTM.py", line 19, in __init__
self.create_model()
File "D:\users\philip\RevCtrl\GIT_RD_Python\Ch2017\ch2017_train\testCuDnnLSTM.py", line 37, in create_model
input_shape=(self.max_length, self.n_input_dim)))
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\models.py", line 442, in add
layer(x)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\layers\recurrent.py", line 456, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\engine\topology.py", line 602, in __call__
output = self.call(inputs, **kwargs)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\layers\cudnn_recurrent.py", line 76, in call
output, states = self._process_batch(inputs, initial_state)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\layers\cudnn_recurrent.py", line 495, in _process_batch
is_training=True)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\contrib\cudnn_rnn\python\ops\cudnn_rnn_ops.py", line 1443, in __call__
input_data, input_h, input_c, params, is_training=is_training)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\contrib\cudnn_rnn\python\ops\cudnn_rnn_ops.py", line 1334, in __call__
seed=self._seed)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\contrib\cudnn_rnn\python\ops\cudnn_rnn_ops.py", line 823, in _cudnn_rnn
name=name)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\contrib\cudnn_rnn\ops\gen_cudnn_rnn_ops.py", line 104, in cudnn_rnn
is_training=is_training, name=name)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\ops.py", line 2958, in create_op
set_shapes_for_outputs(ret)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\ops.py", line 2209, in set_shapes_for_outputs
shapes = shape_func(op)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 627, in call_cpp_shape_fn
require_shape_fn)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 686, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status)
File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'CudnnRNN' in binary running on RD1080TI. Make sure the Op and Kernel are registered in the binary running in this process.
Please try with TF 1.3. This sounds like an issue with your TF installation (such issues are more likely on Windows).
It's also not entirely impossible that TF doesn't make CuDNN RNNs available on Windows.
Thanks for your reply. Actually, I first tried it with the official TF 1.3 release and got the same error, and thought maybe I needed a more recent version of TF.
Then it's definitely a TF Windows issue. Please open an issue on the TF GitHub repo.
OK, let's see what TF says.
Now fixed in the Windows CMake build (missing ops added). Will try it out shortly.
Works nicely now with the latest nightly build (tf_nightly_gpu-1.5.0.dev20171014-cp35-cp35m-win_amd64.whl)
Comparing the LSTM and CuDNNLSTM layers in terms of performance on my problem (1-D time series classification, ~8000 records, max 4500 timesteps, 1-dimensional feature):
For the same batch size (300), the epoch time dropped by a factor of 7.8 (from 94s to 12s). But I was also able to use a much larger batch size with CuDNNLSTM (1200), which reduced the epoch time further (to 5s, a factor of 19 faster). Indeed, when I watched nvidia-smi to check GPU utilization, it was peaking around 90%, something I have not seen with the regular TF LSTM (where 30-40% utilization was typical).
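For anyone who wants to reproduce a similar comparison, here is a minimal sketch (random stand-in data; the unit count, sequence length, data size, and batch size are illustrative and smaller than my actual problem):

import time
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM, Dense

def time_one_epoch(layer_cls, n=1000, timesteps=500, features=1, batch_size=300):
    # Random stand-in data; substitute the real time-series set here.
    x = np.random.rand(n, timesteps, features).astype('float32')
    y = np.random.randint(0, 2, size=(n, 1))

    model = Sequential()
    model.add(layer_cls(32, input_shape=(timesteps, features)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='rmsprop')

    start = time.time()
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    return time.time() - start

print('LSTM epoch time:      %.1fs' % time_one_epoch(LSTM))
print('CuDNNLSTM epoch time: %.1fs' % time_one_epoch(CuDNNLSTM))

On a GPU build, swapping the layer class is the only change needed for this fixed-length case (CuDNNLSTM does not support masking or CPU execution).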
One caveat is that without dropout available, some (not all) trained models turn out very bad, so the old-fashioned approach of training multiple models will likely be more important. (Could dropout eventually be part of the CuDNN LSTM functionality? Maybe this is now in the realm of NVIDIA's responsibility.)
Thanks fchollet for the help in making this possible: adding the new layer and getting the TF fix done quickly for the Windows build.
Way better performance now ... great work
The performance is great! Thank you, everyone! FYI, to install the Keras version that supports it, use this command:
pip install https://github.com/fchollet/keras/archive/cudnn.zip
Does the tf.keras LSTM in TF 1.4 implement this fast cuDNN version?
Thanks,
Dylan
Looks like the master changes that included the CuDNN LSTM implementations are in the new Keras 2.0.9:
https://github.com/fchollet/keras/releases/tag/2.0.9
Does the tf.keras LSTM in TF 1.4 implement this fast cuDNN version?
I can't find it.
tf.keras in TF 1.4 follows the Keras 2.0.8 API and thus doesn't contain these new layers. They will be in the next release.
I had the same problem on Windows with TensorFlow 1.3.0; after I updated to TensorFlow 1.4.0, it's working.
Is there a reason the TensorFlow dropout argument is not exposed in the Keras CuDNNLSTM interface? I quickly added it in keras.layers.cudnn_recurrent.py so that it is passed on to the TensorFlow layer tensorflow.contrib.cudnn_rnn.python.ops.cudnn_rnn_ops.CudnnLSTM, and it seemed to improve generalization, although I did not test extensively.
The regular Keras LSTM interface has both dropout and recurrent_dropout parameters, while the TensorFlow layer only provides a single dropout (with no special treatment of recurrent weights versus input weights?), which may be part of the reasoning for not exposing it.
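In the meantime, one workaround (a minimal sketch, not the patch described above, and only an approximation since it drops activations between layers rather than inside the recurrent step) is to place ordinary Dropout layers between stacked CuDNNLSTM layers; the unit counts, rates, and input shape below are illustrative:

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dropout, Dense

# Dropout between stacked layers, not true recurrent dropout inside the cell.
model = Sequential()
model.add(CuDNNLSTM(64, return_sequences=True, input_shape=(1000, 1)))
model.add(Dropout(0.3))  # drops activations fed to the next recurrent layer
model.add(CuDNNLSTM(64))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')

This is not equivalent to recurrent_dropout on the regular LSTM, but it provides some regularization until dropout is exposed on the CuDNN layer itself.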
Thanks
Hey @pawarrick,
I faced the same issue, but now it is fixed for me. Follow this link:
https://github.com/tensorflow/tensorflow/issues/13696#issuecomment-599179322