Keras: Weights for CUDNN GRU are in the wrong order

Created on 7 Dec 2017 · 16 comments · Source: keras-team/keras

I've been delving into CuDNNGRU (writing a conversion to a non-CuDNN GRU for use after training).

I think the weights in the kernel and recurrent matrices of the CuDNNGRU are in the wrong order. For example, the kernel matrix is returned as a numpy array sized as [channels, weights].

But if you delve into it, the weights are actually stored in [weights, channels] order.

The gist has a demo of this, along with what I have just worked on to find this out: a GRU cell which is compatible with the CuDNNGRU but doesn't need CUDA (so I can train on tensorflow, but deploy models without tensorflow/cuda). https://gist.github.com/joemarshall/338b0f0c0741408d044f3104b0d3b91d There is an equivalent cell in Tensorflow also, so I don't know how useful it is.

One could fix this by flipping the arrays in the CuDNNGRU cell before they go into the tensorflow code, but it would break all saved networks, so I'm not sure what should be done about it or whether it matters.
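
To make the layout mixup concrete, here is a toy numpy sketch (not from the gist) of the difference between transposing an array and reinterpreting its buffer; the latter is what the k.reshape(k.T.shape) trick in the converters further down this thread relies on:

import numpy as np

channels, weights = 2, 3
buf = np.arange(channels * weights)

# what get_weights() claims: a (channels, weights) matrix
claimed = buf.reshape(channels, weights)
# what the buffer actually holds, per this issue: (weights, channels) data
actual = buf.reshape(weights, channels)

# .T swaps the strides but not the memory; .reshape() reinterprets the same memory
print(np.array_equal(claimed.reshape(weights, channels), actual))  # True
print(np.array_equal(claimed.T, actual))                           # False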


All 16 comments

More information about the data layout can be found in #8307.

Actually, if your goal is to use GRU weights trained on GPU with CuDNN GRU in a non-GPU implementation of GRU, that's something I implemented two weeks ago in https://github.com/bzamecnik/keras/commits/cudnn-compatible-gru and I'm going to make a PR today. Yes, the weights need some kind of transposition, similar to the LSTM case. Stay tuned.

I've deployed something using the code in the gist above and it works great - basically it's just a replacement GRU cell which is compatible with CuDNN, and which also takes account of the stupid transposition bug in the standard CuDNNGRU cell (this bug).

Yeah, it's a different convention (based on the original paper) with the reset gate applied after the projection. Note also that CuDNN has recurrent_activation=sigmoid, while the default in Keras GRU is hard_sigmoid. The difference is minor; with everything else made compatible, the difference in result values is negligible.
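
For reference, in Keras versions that expose the reset_after option, a plain (CPU) GRU can be configured to follow the same conventions; a minimal sketch, with the layer size as a placeholder:

from keras.layers import GRU

# a GRU configured with the CuDNN conventions discussed above:
# reset gate applied after the recurrent projection, and sigmoid
# recurrent activation instead of the Keras default hard_sigmoid
gru = GRU(13, reset_after=True, recurrent_activation='sigmoid')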

The different convention isn't the bug though; the bug is that the weight matrices are the wrong way round: they're sized as if they are [channels, weights], but interpreted as stored in [weights, channels] order when they are sent to CuDNN. Or at least I think that is why I had to transpose all of them when implementing the reset-after-projection version.

If for example you set the first weight of every channel to 1, it actually sets all the weights of the first channel to 1. I spent ages manually setting individual weights and sending known inputs through to work out what the heck was going wrong.

I believe the same bug also exists for CuDNNLSTM. The kernel matrices should really be transposed. I reported it and proposed a fix by transposing the weights in (the first version of) #8307. However, it wasn't accepted since it would break saved models.

Yeah, @joemarshall & @myutwo150, it's possible that the weights are in the wrong order. Sorry, I wrongly interpreted the issue... Maintaining backwards compatibility in the saved weights is a problem. Possibly we could make some adapter that would detect weights in the wrong format and convert them to the better format. So far CuDNNLSTM (& CuDNNGRU) piggyback on a converter from Keras 1 to 2.

Hello, sorry to wake up a long forgotten thread, but think I need your help @bzamecnik.

Here is my use case: I am trying to load weights from a trained keras.layers.CuDNNGRU into a torch.nn.GRU (I know, risky business from the get-go). Problem: even after loading the weights, seemingly correctly, the two implementations return wildly different results.

You will find in this gist some code to reproduce the error, together with a Dockerfile to reproduce my environment exactly:

  • Ubuntu 16.04
  • CUDA 8.0
  • CuDNN 5.x
  • torch 0.4.1
  • keras 2.2.4
  • tensorflow-gpu 1.2.0

You will find in the gist two scripts which should show you that I cannot make it work in either direction:

  1. create a torch GRU, load its weights into CuDNNGRU, and check for equality
  2. create a CuDNNGRU, load its weights into a torch GRU, and check for equality

Note that the structure of the keras.layers.CuDNNGRU weights when using get_weights() is as follows:

  1. input matrix of shape (input_dimension, 3 * hidden_dimension)
  2. recurrent matrix of shape (hidden_dimension, 3 * hidden_dimension)
  3. bias of shape (2 * 3 * hidden_dimension,), where the first half is the input bias and the second half is the recurrent bias (that part I am unsure of; I cannot work it out from Keras' source code)

The structure of the torch.nn.GRU weights is as follows (a quick shape check is sketched after this list):

  1. input matrix torch.nn.GRU.weight_ih_l0 has shape (3 * hidden_dimension, input_dimension)
  2. recurrent matrix torch.nn.GRU.weight_hh_l0 has shape (3 * hidden_dimension, hidden_dimension)
  3. input bias torch.nn.GRU.bias_ih_l0 has shape (3 * hidden_dimension,)
  4. recurrent bias torch.nn.GRU.bias_hh_l0 has shape (3 * hidden_dimension,)
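
A minimal sketch of that shape check on the torch side, assuming a single-layer, unidirectional GRU and using the dimensions from the example scripts:

import torch

input_dim, hidden_dim = 7, 13
gru = torch.nn.GRU(input_dim, hidden_dim, num_layers=1, batch_first=True)

print(gru.weight_ih_l0.shape)  # torch.Size([39, 7])  == (3 * hidden_dim, input_dim)
print(gru.weight_hh_l0.shape)  # torch.Size([39, 13]) == (3 * hidden_dim, hidden_dim)
print(gru.bias_ih_l0.shape)    # torch.Size([39])     == (3 * hidden_dim,)
print(gru.bias_hh_l0.shape)    # torch.Size([39])     == (3 * hidden_dim,)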

EDIT 1: try importing torch before tensorflow in my scripts and see what happens!! complete mayhem, with some sort of buffer / stack overflow. Here is an excerpt of the trace:

7fc8aa55a000-7fc8aa59a000 rw-p 00000000 00:00 0 
7fc8aa59a000-7fc8aa5c1000 r--p 00000000 00:9a 60                         /usr/lib/locale/C.UTF-8/LC_CTYPE
7fc8aa5c1000-7fc8aa5c2000 r--p 00000000 00:9a 59                         /usr/lib/locale/C.UTF-8/LC_NUMERIC
7fc8aa5c2000-7fc8aa5c3000 r--p 00000000 00:9a 58                         /usr/lib/locale/C.UTF-8/LC_TIME
7fc8aa5c3000-7fc8aa735000 r--p 00000000 00:9a 57                         /usr/lib/locale/C.UTF-8/LC_COLLATE
7fc8aa735000-7fc8aa736000 r--p 00000000 00:9a 56                         /usr/lib/locale/C.UTF-8/LC_MONETARY
7fc8aa736000-7fc8aa737000 r--p 00000000 00:9a 55                         /usr/lib/locale/C.UTF-8/LC_MESSAGES/SYS_LC_MESSAGES
7fc8aa737000-7fc8aa73e000 r--s 00000000 00:9a 48                         /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
7fc8aa73e000-7fc8aa744000 rw-p 00000000 00:00 0 
7fc8aa744000-7fc8aa745000 r--p 00000000 00:9a 53                         /usr/lib/locale/C.UTF-8/LC_PAPER
7fc8aa745000-7fc8aa746000 r--p 00000000 00:9a 52                         /usr/lib/locale/C.UTF-8/LC_NAME
7fc8aa746000-7fc8aa747000 r--p 00000000 00:9a 51                         /usr/lib/locale/C.UTF-8/LC_ADDRESS
7fc8aa747000-7fc8aa748000 r--p 00000000 00:9a 50                         /usr/lib/locale/C.UTF-8/LC_TELEPHONE
7fc8aa748000-7fc8aa749000 r--p 00000000 00:9a 49                         /usr/lib/locale/C.UTF-8/LC_MEASUREMENT
7fc8aa749000-7fc8aa74a000 r--p 00000000 00:9a 45                         /usr/lib/locale/C.UTF-8/LC_IDENTIFICATION
7fc8aa74a000-7fc8aa74b000 r--p 00025000 00:9a 32                         /lib/x86_64-linux-gnu/ld-2.23.so
7fc8aa74b000-7fc8aa74c000 rw-p 00026000 00:9a 32                         /lib/x86_64-linux-gnu/ld-2.23.so
7fc8aa74c000-7fc8aa74d000 rw-p 00000000 00:00 0 
7ffde0031000-7ffde0058000 rw-p 00000000 00:00 0                          [stack]
7ffde01e1000-7ffde01e4000 r--p 00000000 00:00 0                          [vvar]
7ffde01e4000-7ffde01e6000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted (core dumped)

Now I am wondering if there is not a memory leak triggering the difference between the two classes in the first place!!

Hi @naifrec,

As discussed above, the weight matrices of the CuDNNGRU layer are in the wrong order and need to be transformed before being used in another library.

Also, the PyTorch weight matrix represents [kernel_r, kernel_z, kernel_h] while Keras weight matrix represents [kernel_z, kernel_r, kernel_h].

For converting from Keras to PyTorch, you can use:

import numpy as np
import torch

def convert_input_kernel(kernel):
    # split the Keras kernel into its z, r, h gate blocks, reorder them to
    # torch's r, z, h order, and reinterpret each block's memory as
    # (hidden_dimension, input_dimension)
    kernel_z, kernel_r, kernel_h = np.hsplit(kernel, 3)
    kernels = [kernel_r, kernel_z, kernel_h]
    return np.vstack([k.reshape(k.T.shape) for k in kernels])

def convert_recurrent_kernel(kernel):
    kernels = np.hsplit(kernel, 3)
    return np.vstack([k.T for k in kernels])

# layer = CuDNNGRU(...)
weights = layer.get_weights()
weight_ih = torch.from_numpy(convert_input_kernel(weights[0]))
weight_hh = torch.from_numpy(convert_recurrent_kernel(weights[1]))

Similarly, for converting from PyTorch to Keras,

def convert_input_kernel(kernel):
    kernel_r, kernel_z, kernel_h = np.vsplit(kernel, 3)
    kernels = [kernel_z, kernel_r, kernel_h]
    return np.hstack([k.reshape(k.T.shape) for k in kernels])

def convert_recurrent_kernel(kernel):
    kernels = np.vsplit(kernel, 3)
    return np.hstack([k.T for k in kernels])

# gru = torch.nn.GRU(...)
# layer = CuDNNGRU(...); weights = layer.get_weights()
weights[0] = convert_input_kernel(gru.weight_ih_l0.detach().numpy())
weights[1] = convert_recurrent_kernel(gru.weight_hh_l0.detach().numpy())
# then push them back into the Keras layer with layer.set_weights(weights)

Hello @yuyang-huang, thank you so much for your answer!

I indeed realized overnight that whatever @joemarshall originally proposed had to be applied block-wise; I was applying his trick over the entire recurrent and input matrices before, and it failed, of course.

Thank you again for the code snippet!

Hello @yuyang-huang, it seems that the biases also suffer from an order issue.

I updated my keras_torch_gru_minimal_example.py script in my gist, as well as the Dockerfile, to add a few dependencies like fire, to make it easier for you to see the problem.

I realized that the equality tests between the torch and Keras GRUs pass when you use the weights randomly initialized by Keras' CuDNNGRU. However, the biases are initialized to zero. As soon as you change them to be non-zero, the output activations differ between torch and Keras.

You can try this out yourself by running the following commands:

python3.6 keras_torch_gru_minimal_example.py --input_dimension 7 --gru_size 13 --atol 1e-6 --non_zero_bias False  # should finish WITHOUT the "Not equal to tolerance rtol=0, atol=1e-06" message
python3.6 keras_torch_gru_minimal_example.py --input_dimension 7 --gru_size 13 --atol 1e-6 --non_zero_bias True   # should finish WITH the "Not equal to tolerance rtol=0, atol=1e-06" message

So my guess is that the same type of thing is happening to the bias as to the kernels. Biases are most likely stored in a matrix of shape (2, 3 * gru_size) in the CuDNNGRU underlying C class, then flattened to this long vector of shape (2 * 3 * gru_size,). I have tried a combination of split, reshape and flatten but got nothing to work yet. Are you able to modify convert_bias to get the output to match?

It is also possible that torch.nn.GRU and keras.layers.CuDNNGRU just use biases differently at a higher level, but I sincerely doubt it.

Thanks for the help.

hey,

When I initially found out about the ordering thing for the weights, I did a load of tests with manually set weights: sticking a 1 into a single weight in the matrix, leaving the rest zero, and seeing which cell I had to change in the non-CuDNN matrix to get the same results from the model. Doing that meant I could isolate each one and work out the wrongness. Better than using random initialisers. I guess you need to do the same with the bias (I'm not sure that I didn't already do this in my experimentation, but I can't remember and I don't have the code to hand right now).
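
A sketch of that kind of probing, shown here with the plain (CPU) Keras GRU purely to illustrate the method; the sizes and the particular weight picked are arbitrary:

import numpy as np
from keras.layers import GRU, Input
from keras.models import Model

input_dim, units = 3, 2
inp = Input(shape=(1, input_dim))
layer = GRU(units)
model = Model(inp, layer(inp))

# zero every weight, then poke a single kernel entry
kernel, recurrent_kernel, bias = layer.get_weights()
kernel[:], recurrent_kernel[:], bias[:] = 0, 0, 0
kernel[0, 2 * units] = 1.0   # input channel 0 -> candidate-state weight of unit 0
layer.set_weights([kernel, recurrent_kernel, bias])

# feed a one-hot input and see which output unit responds
x = np.zeros((1, 1, input_dim), dtype=np.float32)
x[0, 0, 0] = 1.0
print(model.predict(x))  # only unit 0 should be non-zero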


Oh hang on, the code is in the gist above; you need reshape and permute dimensions to fix the Keras weights and biases.

Hi @naifrec,

Yes, indeed, the biases also need to be reordered. The convert_recurrent_kernel() above was also wrong.

The following worked for me, when the weights are filled with random values:

Keras -> torch:

import numpy as np

def convert_input_kernel(kernel):
    # Keras gate order is z, r, h; torch expects r, z, h, and each gate block's
    # memory has to be reinterpreted (reshape, not transpose) because of the
    # layout issue discussed in this thread
    kernel_z, kernel_r, kernel_h = np.hsplit(kernel, 3)
    kernels = [kernel_r, kernel_z, kernel_h]
    return np.vstack([k.reshape(k.T.shape) for k in kernels])

def convert_recurrent_kernel(kernel):
    kernel_z, kernel_r, kernel_h = np.hsplit(kernel, 3)
    kernels = [kernel_r, kernel_z, kernel_h]
    return np.vstack(kernels)

def convert_bias(bias):
    # bias layout is (input/recurrent half, gate, unit); swap the z and r gates
    bias = bias.reshape(2, 3, -1)
    return bias[:, [1, 0, 2], :].reshape(-1)

torch -> Keras (just the reverse transformation):

def convert_input_kernel(kernel):
    kernel_r, kernel_z, kernel_h = np.vsplit(kernel, 3)
    kernels = [kernel_z, kernel_r, kernel_h]
    return np.hstack([k.reshape(k.T.shape) for k in kernels])

def convert_recurrent_kernel(kernel):
    kernel_r, kernel_z, kernel_h = np.vsplit(kernel, 3)
    kernels = [kernel_z, kernel_r, kernel_h]
    return np.hstack(kernels)

def convert_bias(bias):
    bias = bias.reshape(2, 3, -1) 
    return bias[:, [1, 0, 2], :].reshape(-1)
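
Putting the Keras -> torch direction together, a minimal sketch of loading the converted weights into a torch.nn.GRU (it uses the Keras -> torch convert_* functions above; random arrays stand in for a trained layer's get_weights(), and the bias halves follow the input-first, recurrent-second layout discussed above):

import numpy as np
import torch

input_dim, hidden_dim = 7, 13

# kernel, recurrent_kernel, bias = layer.get_weights()  # from a trained CuDNNGRU
kernel = np.random.randn(input_dim, 3 * hidden_dim).astype(np.float32)
recurrent_kernel = np.random.randn(hidden_dim, 3 * hidden_dim).astype(np.float32)
bias = np.random.randn(2 * 3 * hidden_dim).astype(np.float32)

gru = torch.nn.GRU(input_dim, hidden_dim, batch_first=True)
with torch.no_grad():
    gru.weight_ih_l0.copy_(torch.from_numpy(convert_input_kernel(kernel)))
    gru.weight_hh_l0.copy_(torch.from_numpy(convert_recurrent_kernel(recurrent_kernel)))
    converted_bias = convert_bias(bias)
    gru.bias_ih_l0.copy_(torch.from_numpy(converted_bias[:3 * hidden_dim]))
    gru.bias_hh_l0.copy_(torch.from_numpy(converted_bias[3 * hidden_dim:]))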

@yuyang-huang when converting PyTorch to Keras, there are two biases in PyTorch; should I concatenate them first, and then call the convert_bias function?

@attitudechunfeng Yes, the bias_ih goes first, then bias_hh.
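
Concretely, continuing the torch -> Keras snippet further up (gru, weights, and the torch -> Keras convert_bias are assumed from above):

import numpy as np

# concatenate the two torch biases, input bias first, then reorder the gates
bias = np.concatenate([gru.bias_ih_l0.detach().numpy(),
                       gru.bias_hh_l0.detach().numpy()])
weights[2] = convert_bias(bias)
# layer.set_weights(weights)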
