From my current modeling tasks, I find it would be useful to have the flexibility to encode a categorical feature either in one-hot format or in embedding format (using the Embedding layer) right at model construction time, instead of creating dummy columns in advance for one-hot encoding (zero-based integer inputs already suffice for Embedding). Although a Lambda layer can be used for this purpose, I think a dedicated OneHot layer would be more convenient. I have already written the code for the proposed OneHot layer, which simply calls K.one_hot() internally. Feel free to share your thoughts on whether such a layer should be added to Keras. I am happy to contribute the code via a PR. Thanks.
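Roughly, the idea is a thin wrapper around K.one_hot. A minimal sketch of such a layer (not the exact code from my PR; the nb_classes argument name and the Keras 1.x custom-layer API are just illustrative):

from keras import backend as K
from keras.engine.topology import Layer

class OneHot(Layer):
    """Turns integer class indices into one-hot vectors via K.one_hot."""

    def __init__(self, nb_classes, **kwargs):
        self.nb_classes = nb_classes
        super(OneHot, self).__init__(**kwargs)

    def call(self, x, mask=None):
        # x: integer tensor of shape (batch, ...) -> one-hot tensor of shape (batch, ..., nb_classes)
        return K.one_hot(x, self.nb_classes)

    def get_output_shape_for(self, input_shape):
        return input_shape + (self.nb_classes,)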
The pseudo-code would look like this:
models = []
for feature in features:
    if is_categorical(feature):
        model = Sequential()
        if to_encode(feature) == 'one_hot':
            model.add(OneHot())
        else:
            model.add(Embedding())
        models.append(model)
    else:
        model = Sequential()
        model.add(Dense())
        models.append(model)
model = Sequential()
model.add(Merge(models, mode='concat'))
# ...more layers added...
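To make this concrete, here is a minimal sketch of the same wiring with the Keras 1.x functional API, using Lambda(K.one_hot) in place of the proposed OneHot layer (the feature shapes, nb_classes and embedding size are made up for illustration):

from keras import backend as K
from keras.layers import Input, Lambda, Embedding, Dense, Flatten, merge
from keras.models import Model

nb_classes = 20

# categorical feature, one-hot encoded inside the model
cat_ohe_in = Input(shape=(1,), dtype='uint8')
ohe = Lambda(K.one_hot, arguments={'nb_classes': nb_classes},
             output_shape=(1, nb_classes))(cat_ohe_in)
ohe = Flatten()(ohe)

# categorical feature, encoded with an Embedding layer (zero-based integers in)
cat_emb_in = Input(shape=(1,), dtype='int32')
emb = Flatten()(Embedding(input_dim=nb_classes, output_dim=8, input_length=1)(cat_emb_in))

# plain numeric feature
num_in = Input(shape=(1,))
num = Dense(4)(num_in)

merged = merge([ohe, emb, num], mode='concat')
output = Dense(1)(merged)
model = Model(input=[cat_ohe_in, cat_emb_in, num_in], output=output)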
I created a PR https://github.com/fchollet/keras/pull/3846
Using Lambda(K.one_hot) instead, as suggested by @fchollet.
There are a few catches when using Lambda(K.one_hot), but generally it's possible:
from keras import backend as K
from keras.layers import Input, Lambda

input_shape = (10, )  # sequences of length 10
nb_classes = 20
output_shape = (input_shape[0], nb_classes)

# K.one_hot expects integer class indices, hence the integer dtype
input = Input(shape=input_shape, dtype='uint8')
# output_shape has to be given explicitly, since Keras cannot always infer it for a Lambda
x_ohe = Lambda(K.one_hot, arguments={'nb_classes': nb_classes}, output_shape=output_shape)(input)
Try it like this:
import numpy as np
from keras.models import Model

# 5 sequences of length 10
X_classes = np.random.randint(0, 20, size=(5, 10))
assert Model(input, x_ohe).predict(X_classes).shape == (5, 10, 20)
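Since the point is to do the encoding inside the model, the one-hot output can then feed further layers directly. A minimal sketch continuing from the snippet above (the LSTM size, loss and optimizer are arbitrary choices):

from keras.layers import LSTM, Dense
from keras.models import Model

# run an LSTM over the sequence of one-hot vectors produced inside the model
h = LSTM(32)(x_ohe)
out = Dense(1, activation='sigmoid')(h)
model = Model(input, out)
model.compile(optimizer='adam', loss='binary_crossentropy')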
Full example in a gist: https://gist.github.com/bzamecnik/a33052ec46ee7efeb217856d98a4fb5f