Keras: Problem using keras.layers.CuDNNLSTM: Op type not registered 'CudnnRNN'

Created on 13 Oct 2017 · 15 comments · Source: keras-team/keras

Hi, I tried keras.layers.CuDNNLSTM after seeing fchollet's tweet the other day. I have the latest Keras and TensorFlow, but there is a TensorFlow problem with the 'CudnnRNN' op not being registered.

Have I missed something?

Thanks.

OS: Windows 10
Keras version: master (as of today: 2.0.8+)
TensorFlow backend version: master (as of today: ~1.4.0rc0)

GPU: Geforce GTX 1080Ti (11GB)
CUDA version: v8.0
cuDNN version: cudnn-8.0-windows10-x64-v6.0

Code to reproduce error:

import tensorflow as tf

import keras
from keras.models import Sequential
from keras.layers import CuDNNLSTM
from keras.optimizers import RMSprop


class TestCudnnLSTM:

  def __init__(self):  
    self.max_length = 1000
    self.n_input_dim = 1    

    self.model = None  # placeholder; assigned in create_model()

    self.config()
    self.create_model()

  def config(self):
    print("Keras version: " + keras.__version__)
    print("Tensorflow version: " + tf.__version__)

    config = tf.ConfigProto()  # created here but never applied to a session
    return config

  def create_model(self):        

    print('Creating Model')
    model = Sequential()
    model.add(CuDNNLSTM(1,
                        return_sequences=True,
                        stateful=False,
                        kernel_initializer='he_normal',
                        input_shape=(self.max_length, self.n_input_dim)))
    print(model.summary())

    opt = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

    model.compile(loss='categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'],
                  weighted_metrics=['accuracy'],
                  sample_weight_mode='temporal')

    print('Model compiled')      
    self.model = model
    return self


if __name__ == "__main__":
  mt = TestCudnnLSTM()

Console Output:

    Using TensorFlow backend.
    2017-10-13 13:16:23.067049: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    2017-10-13 13:16:23.742057: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties: 
    name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
    pciBusID: 0000:0b:00.0
    totalMemory: 11.00GiB freeMemory: 9.10GiB
    2017-10-13 13:16:24.022971: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 1 with properties: 
    name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
    pciBusID: 0000:a1:00.0
    totalMemory: 11.00GiB freeMemory: 9.10GiB
    2017-10-13 13:16:24.023675: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Device peer to peer matrix
    2017-10-13 13:16:24.024187: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1051] DMA: 0 1 
    2017-10-13 13:16:24.024421: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 0:   Y N 
    2017-10-13 13:16:24.024671: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 1:   N Y 
    2017-10-13 13:16:24.025126: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0b:00.0, compute capability: 6.1)
    2017-10-13 13:16:24.025743: I C:\tf_jenkins\home\workspace\tf-nightly-windows\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a1:00.0, compute capability: 6.1)
    Keras version: 2.0.8
    Tensorflow version: 1.4.0-dev20171010
    Creating Model
    Traceback (most recent call last):
      File "D:\users\philip\RevCtrl\GIT_RD_Python\Ch2017\ch2017_train\testCuDnnLSTM.py", line 54, in <module>
        mt = TestCudnnLSTM()
      File "D:\users\philip\RevCtrl\GIT_RD_Python\Ch2017\ch2017_train\testCuDnnLSTM.py", line 19, in __init__
        self.create_model()
      File "D:\users\philip\RevCtrl\GIT_RD_Python\Ch2017\ch2017_train\testCuDnnLSTM.py", line 37, in create_model
        input_shape=(self.max_length, self.n_input_dim)))
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\models.py", line 442, in add
        layer(x)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\layers\recurrent.py", line 456, in __call__
        return super(RNN, self).__call__(inputs, **kwargs)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\engine\topology.py", line 602, in __call__
        output = self.call(inputs, **kwargs)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\layers\cudnn_recurrent.py", line 76, in call
        output, states = self._process_batch(inputs, initial_state)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\keras\layers\cudnn_recurrent.py", line 495, in _process_batch
        is_training=True)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\contrib\cudnn_rnn\python\ops\cudnn_rnn_ops.py", line 1443, in __call__
        input_data, input_h, input_c, params, is_training=is_training)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\contrib\cudnn_rnn\python\ops\cudnn_rnn_ops.py", line 1334, in __call__
        seed=self._seed)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\contrib\cudnn_rnn\python\ops\cudnn_rnn_ops.py", line 823, in _cudnn_rnn
        name=name)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\contrib\cudnn_rnn\ops\gen_cudnn_rnn_ops.py", line 104, in cudnn_rnn
        is_training=is_training, name=name)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
        op_def=op_def)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\ops.py", line 2958, in create_op
        set_shapes_for_outputs(ret)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\ops.py", line 2209, in set_shapes_for_outputs
        shapes = shape_func(op)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 627, in call_cpp_shape_fn
        require_shape_fn)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 686, in _call_cpp_shape_fn_impl
        input_tensors_as_shapes, status)
      File "C:\ProgramData\Anaconda3\envs\tensorflowGPU_1.4rc0\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
        c_api.TF_GetCode(self.status.status))
    tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'CudnnRNN' in binary running on RD1080TI. Make sure the Op and Kernel are registered in the binary running in this process.

All 15 comments

Please try with TF 1.3. This sounds like an issue with your TF installation (such issues are more likely on Windows).

It's also possible that TF simply doesn't make CuDNN RNNs available on Windows.
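
A quick way to check whether the 'CudnnRNN' op is compiled into the installed TF binary is to query the op registry directly. A diagnostic sketch for TF 1.x (note that op_def_registry is an internal module and may change between versions):

    # Diagnostic sketch (TF 1.x): is the 'CudnnRNN' op registered in this binary?
    # op_def_registry is internal API, so treat this as version-specific.
    from tensorflow.python.framework import op_def_registry

    registered_ops = op_def_registry.get_registered_ops()
    print('CudnnRNN' in registered_ops)  # False means the binary lacks the op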

Thanks for your reply. Actually, I first tried it with the official TF 1.3 release and got the same error, so I thought maybe I needed a more recent version of TF.

Then it's definitely a TF Windows issue. Please open an issue on the TF GitHub repo.

OK, let's see what TF says.

Now fixed in the Windows CMake build (the missing ops were added). Will try it out shortly.

Works nicely now with the latest nightly build (tf_nightly_gpu-1.5.0.dev20171014-cp35-cp35m-win_amd64.whl)

Comparing the LSTM and CuDNNLSTM layers in terms of performance on my problem (1-D time-series classification, ~8000 records, max 4500 timesteps, 1-dimensional feature):

For the same batch size (300), the epoch time dropped by a factor of 7.8 (from 94 s to 12 s). But I was also able to use a much larger batch size with CuDNNLSTM (1200), which reduced the epoch time further (to 5 s, a factor of 19 faster). Indeed, when I watched nvidia-smi to observe GPU utilization, it was peaking in the 90% range, something I had not seen with the regular TF LSTM (where 30-40% utilization was typical).
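
For anyone wanting to run a similar comparison, here is a minimal benchmark sketch on random data (hypothetical shapes and sizes, not the original dataset), timing one epoch of each layer:

    # Benchmark sketch: time one epoch of LSTM vs. CuDNNLSTM on the same
    # synthetic input. Shapes and sizes below are arbitrary placeholders.
    import time
    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, CuDNNLSTM, Dense

    timesteps, features, n = 500, 1, 2000
    x = np.random.rand(n, timesteps, features)
    y = np.random.randint(0, 2, size=(n, 1))

    for layer_cls in (LSTM, CuDNNLSTM):
        model = Sequential()
        model.add(layer_cls(32, input_shape=(timesteps, features)))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='rmsprop')
        start = time.time()
        model.fit(x, y, batch_size=300, epochs=1, verbose=0)
        print(layer_cls.__name__, 'epoch time: %.1fs' % (time.time() - start))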

One caveat is that without dropout available, some (though not all) trained models turn out very bad, so the old-fashioned approach of training multiple models will likely become more important. (Could dropout eventually be part of the CuDNN LSTM functionality? Maybe this is now in the realm of NVIDIA's responsibility.)
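
In the meantime, a common workaround (a sketch, not a substitute for true recurrent dropout) is to place regular Dropout layers around and between stacked CuDNNLSTM layers:

    # Workaround sketch: CuDNNLSTM accepts no dropout arguments, so apply
    # Dropout between stacked layers instead. This regularizes layer-to-layer
    # activations but is not equivalent to recurrent_dropout.
    from keras.models import Sequential
    from keras.layers import CuDNNLSTM, Dropout, Dense

    model = Sequential()
    model.add(CuDNNLSTM(64, return_sequences=True, input_shape=(1000, 1)))
    model.add(Dropout(0.2))
    model.add(CuDNNLSTM(64))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))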

Thanks, fchollet, for the help in making this possible: adding the new layer and getting the TF fix done quickly for the Windows build.

Way better performance now ... great work

The performance is great! Thank you, everyone! FYI, to install the Keras version that supports it, use this command:
pip install https://github.com/fchollet/keras/archive/cudnn.zip

Does the tf.keras LSTM in TF 1.4 implement this fast cuDNN version?

Thanks,
Dylan

Looks like the master changes that included the CuDNN LSTM implementations are in the new Keras 2.0.9:
https://github.com/fchollet/keras/releases/tag/2.0.9

Does the tf.keras LSTM in TF 1.4 implement this fast cuDNN version?

I can't find it.

tf.keras in TF 1.4 follows the Keras 2.0.8 API and thus doesn't contain these new layers. They will be in the next release.
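
Until then, a common pattern with standalone Keras is to select the layer at runtime (a sketch; the GPU check via device_lib is one of several options):

    # Sketch: use CuDNNLSTM when a GPU is visible, otherwise plain LSTM.
    from tensorflow.python.client import device_lib
    from keras.layers import LSTM, CuDNNLSTM

    def has_gpu():
        return any(d.device_type == 'GPU'
                   for d in device_lib.list_local_devices())

    RNNLayer = CuDNNLSTM if has_gpu() else LSTM
    # e.g. model.add(RNNLayer(64, input_shape=(timesteps, features)))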

I had the same problem on Windows with TensorFlow 1.3.0, and after I updated to TensorFlow 1.4.0, it's working.
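
For reference, the upgrade is a single pip command, e.g.:

    pip install --upgrade tensorflow-gpu==1.4.0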

Is there a reason that the TensorFlow dropout argument is not included in the Keras interface of CuDNNLSTM? I quickly added it in keras/layers/cudnn_recurrent.py so that it was passed on to the TensorFlow layer tensorflow.contrib.cudnn_rnn.python.ops.cudnn_rnn_ops.CudnnLSTM. It seemed to improve generalization, although I did not do an extensive test.

The regular Keras LSTM interface has both dropout and recurrent_dropout parameters, while the TensorFlow layer provides only a single dropout (with no special treatment of the recurrent weights compared to the input weights?), which may be part of the reasoning for not exposing it.
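
To illustrate the interface difference in question (a sketch; the dropout values are arbitrary):

    # The plain Keras LSTM exposes two dropout knobs; the CuDNN-backed layer
    # exposes neither. Dropout values below are arbitrary examples.
    from keras.layers import LSTM, CuDNNLSTM

    slow = LSTM(64,
                dropout=0.2,            # dropout on the input transformation
                recurrent_dropout=0.2)  # dropout on the recurrent transformation

    fast = CuDNNLSTM(64)  # accepts no dropout/recurrent_dropout arguments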

Thanks

Hey @pawarrick,
I faced the same issue, but now it is fixed for me. Follow this link:

https://github.com/tensorflow/tensorflow/issues/13696#issuecomment-599179322
