Keras: Is there a Keras layer which builds a one-hot representation on the fly?

Created on 25 Dec 2016 · 11 Comments · Source: keras-team/keras

Imagine I have a pretty big dataset with X.shape = (10000000, dim). If dim is something like ~256 (the char set size), this is a disaster (memory errors, etc.). So instead I would like the ability to unpack the one-hot representation on the fly from just the integer index, and keep X.shape = (10000000, 1).


All 11 comments

Hi Marat,

I don't see anything readily available that would ensure one-hot representations. On the other hand, there shouldn't be a situation where you prefer one-hot to an actual embedding. You should be able to use the embedding layer:

https://keras.io/layers/embeddings/

The input is a list of ints and the output is an embedding in an n-dimensional space. Think of it as a one-hot embedding and a linear layer mashed into a single layer.

Please let me know if the Embedding layer works correctly for your input shape. I'm not sure if it handles multi-dimensional data correctly. You can also try wrapping it in TimeDistributed. If the Embedding layer isn't working, it might take a small PR to get it working for multi-dimensional data.

If there are some extenuating circumstances where you actually need one-hot encodings, then you can subclass Embedding layer and make sure the weight matrix is the identity matrix. You could also write a regularizer such that the weight matrix is very close to identity, and use the existing Embedding layer.

Cheers,
Ben
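
For concreteness, here is a minimal sketch of the suggestion above; the vocabulary size, sequence length, and embedding width are made-up placeholders, not values from the original post:

```python
# Minimal sketch of the Embedding suggestion above. vocab_size, seq_len and
# embedding_dim are illustrative placeholders (not from the original post).
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 256     # e.g. a character set size, as in the question
seq_len = 100        # hypothetical sequence length
embedding_dim = 64   # learned dense representation instead of a 256-wide one-hot

model = Sequential()
# Inputs are integer indices of shape (batch, seq_len); no one-hot arrays are
# ever materialized in host memory.
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_len))
model.add(LSTM(128))
model.add(Dense(vocab_size, activation='softmax'))
# sparse_categorical_crossentropy lets the targets stay as integer class ids too.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```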

It seems to work as is. Can't say anything about model quality yet, but at least it compiles and trains.

```python
left = Sequential()
left.add(Embedding(input_dim=csize, output_dim=rnn_size, input_length=q_seq_size, mask_zero=True))
left.add(LSTM(rnn_size))

right = Sequential()
right.add(Embedding(input_dim=csize, output_dim=rnn_size, input_length=t_seq_size, mask_zero=True))
right.add(LSTM(rnn_size))

merged = Sequential()
merged.add(Merge([left, right], mode='cos'))
merged.compile(optimizer='adam', loss='mse')
```

@bstriner why wouldn't you prefer one-hot over embedding? Is that to say one should always use embedding over one-hot (as a general best practice)?

I ask only because several examples using the current Keras build, like the char-based LSTM-RNNs, feed one-hot encoded arrays into a Keras layer, and I don't see them using the embedding layer instead.

@naisanza a one-hot encoding followed by a dense layer is the same as a single embedding layer. Try both and you should get the same results with different runtime. Do the linear algebra if you need to convince yourself.

The other big difference: let's say you have 256 categories. Each sample could be one unsigned byte (1 byte) or 256 floats (4*256 = 1024 bytes). Passing data back and forth between the CPU and GPU, the former should be much faster.

I try to never generate one-hot encodings on the CPU and then send them to the GPU, because it feels like such a waste. However, it is a lot easier to understand, so it might be better for examples.

You can feed embeddings into an LSTM as well. That would be like having one-hot, then dense, then LSTM, so one more layer than the current examples have.
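
To make the equivalence above concrete, here is a small NumPy check (with arbitrary sizes): multiplying a one-hot matrix by a weight matrix simply selects rows of that matrix, which is exactly what an embedding lookup does.

```python
# NumPy check: one-hot followed by a linear (bias-free) layer == embedding lookup.
import numpy as np

n_classes, dim = 10, 4
W = np.random.randn(n_classes, dim)      # shared weight matrix
indices = np.array([3, 7, 0, 3])         # a batch of integer samples

one_hot = np.eye(n_classes)[indices]     # shape (4, 10)
dense_out = one_hot.dot(W)               # one-hot -> linear layer
embedding_out = W[indices]               # embedding lookup = row selection

assert np.allclose(dense_out, embedding_out)
```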

Quick question @bstriner about the output of a network if using Embedding instead of a one-hot encoding:
For one-hot encoding an example input/output pair might be:

```python
X[0] = [1,3,2,0,8,7,0,1,0,9,2,8,4,5,0]  # input sequence 15 long, 10 possible classes
y[0] = [0,0,0,0,0,1,0,0,0,0]            # predict next word in the sequence, one-hot encoding for 10 classes
```

and thus your network structure might be:

```python
seq_length, n_classes = 15, 10
# input_shape excludes the batch dimension: (timesteps, one-hot width)
model.add(LSTM(512, return_sequences=True, input_shape=(seq_length, n_classes)))
model.add(LSTM(512, return_sequences=False))
model.add(Dense(n_classes, activation='softmax'))
```

but if you instead use an Embedding layer, I'm a bit confused what my output will be now:

```python
model.add(Embedding(n_classes, 512, input_length=seq_length))
model.add(LSTM(512, return_sequences=True))
model.add(LSTM(512, return_sequences=False))
model.add(Dense(output, activation=act))
```

I'm not predicting a one-hot output vector anymore, so I'm a bit confused about what output and act should be. I don't think I should be predicting a single number, i.e. one of my 10 possible classes, should I? Couldn't that confuse the network into thinking there's a hierarchy (e.g. 3 > 2 > 1), and also cast the problem as a regression problem (but I want classification)? It sounded like one benefit of an Embedding layer is to avoid one-hot encoding altogether, which would be really nice since one-hot encoding a text corpus takes way too much memory and usually can't fit on a GPU. Also, for whatever this output is, I assume it should be normalized to be between 0 and 1?

Any help/advice would be most appreciated, thanks.

There is a way, you just have to work around the embedding layer. Here is an example:

```python
model.add(Embedding(input_dim=16, output_dim=16, input_length=1,
                    embeddings_initializer='identity'))
```

One thing to remember: input_dim and output_dim must be equal for this to work.
To verify that this is correct, you can use the following code snippet to check the output of your layer:

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Embedding

x = [[0], [1], [2], [3], [4]]
model = Sequential()
model.add(Embedding(input_dim=5, output_dim=5, input_length=1, embeddings_initializer='identity'))
layer0_output_fcn = K.function([model.layers[0].input], [model.layers[0].output])
layer0_output = layer0_output_fcn([x])[0]
print(x)
print(layer0_output)
```
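
With the identity initializer and this input, layer0_output should have shape (5, 1, 5) (the middle axis comes from input_length=1), and layer0_output[i, 0] should be the one-hot vector for index i, i.e. the rows of a 5x5 identity matrix.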

Ah, I see. So the final output layer predicts a vector which has the same dimensions as the embedding layer. Thanks a lot @adnanmunawar.

This seems straightforward if you're using pre-trained embeddings (e.g. GloVe), since you can just transform your target values into the correct embedded vectors right away. However, this might be more difficult for the network to train if you're learning the embedding layer too, since you don't yet know what your ground-truth output vectors should even be (if the embedding weights are changing over time). I guess you would take the embedding matrix weights at each iteration and apply them to your targets to generate what the ground truth should be for that iteration?

@silburt, my post was aimed at answering the original question, which asks whether it's possible to generate a one-hot representation; I should have mentioned that in my reply.
Regarding the question about having to train the embedding weights as well, I agree: the embedding layer weights would generally be learned regardless, and choosing an identity matrix might negate the purpose of the embedding layer.
We could, however, freeze the weights in the embedding layer to prevent them from changing, as you suggested, and train the weights in the subsequent layers.

I think you were asking for a one-hot encoder, not an embedding layer. Here you go:

```python
layer = keras.layers.Lambda(lambda x: K.one_hot(K.cast(x, 'int64'), number_of_classes))(previous_layer)
```

So if the input is, for example, of shape (864, 6851, 646) with the entries being the integer indexes, the output will be (864, 6851, 646, number_of_classes); K.one_hot appends a new axis of size number_of_classes.
Here it is in the documentation: https://keras.io/backend/#one_hot
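
A self-contained version of this approach might look as follows; the shapes, layer sizes, and loss are illustrative assumptions, not part of the comment above:

```python
# Sketch of an on-the-fly one-hot expansion with Lambda + K.one_hot.
# seq_len and number_of_classes are assumed values for illustration.
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Lambda, LSTM, Dense

number_of_classes = 256
seq_len = 100

inputs = Input(shape=(seq_len,), dtype='int32')          # integer indices
one_hot = Lambda(lambda x: K.one_hot(K.cast(x, 'int64'), number_of_classes),
                 output_shape=(seq_len, number_of_classes))(inputs)
x = LSTM(128)(one_hot)                                   # one-hot built inside the graph
outputs = Dense(number_of_classes, activation='softmax')(x)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```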

@bstriner is there any reason to use an embedding layer if your first layer is not a dense layer and there is no relationship to be learned between the different categories? (as in, data that is completely interchangeable and is only valuable when it comes to grouping samples with the same value)


Maybe we should also set trainable=False on that identity Embedding layer, to ensure the weights are not modified during training.
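
Combining the two suggestions, a frozen identity Embedding would look roughly like this sketch (sizes are illustrative only):

```python
# Identity-initialized Embedding frozen with trainable=False, so it acts as a
# fixed one-hot lookup. input_dim must equal output_dim, as noted above.
from keras.models import Sequential
from keras.layers import Embedding

n_classes = 16
model = Sequential()
model.add(Embedding(input_dim=n_classes, output_dim=n_classes, input_length=1,
                    embeddings_initializer='identity', trainable=False))
```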
