Keras: High cardinality categorical outputs

Created on 5 Aug 2015 · 14 Comments · Source: keras-team/keras

Howdy,

I have a dataset where the output is one of 5k categories. I also have millions of samples. The naive representation of y_indices_naive (the outputs) is:

[1,5,4300,...]

But it seems that Keras/Theano require one-hot encodings of the output.

Problem is, np_utils.to_categorical(y_indices_naive) causes an out-of-memory error, because it then needs a 3M x 5k matrix.
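
For a sense of scale, here is a rough size estimate. It assumes exactly 3 million samples, 5,000 classes, and a dense float64 matrix; the dtype is an assumption, and a smaller dtype would shrink the one-hot figure proportionally.

```python
# Rough memory estimate for a dense one-hot label matrix.
# Assumptions (not from the original post): exactly 3,000,000 samples,
# 5,000 classes, and 8 bytes per entry (float64).
n_samples = 3_000_000
n_classes = 5_000
bytes_per_entry = 8

one_hot_bytes = n_samples * n_classes * bytes_per_entry
index_bytes = n_samples * 4  # the same labels stored as int32 indices

one_hot_gb = one_hot_bytes / 1e9
index_mb = index_bytes / 1e6
print(f"one-hot: {one_hot_gb:.0f} GB, indices: {index_mb:.0f} MB")
```

Under these assumptions the one-hot matrix is about four orders of magnitude larger than the index vector it encodes.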

Is there any way to get Keras to accept y_indices_naive without converting it to one-hot? I would be happy to add some code if someone would point out how to best do it.

sergeyf

Most helpful comment

The trick to fix the error about expecting a 3D input when using sparse_categorical_crossentropy is to format the outputs in a sparse 3-dimensional way. So instead of formatting the output like this:

y_indices_naive = [1,5,4300,...]

it should be formatted this way:

y_indices_naive = [[1,], [5,] , [4300,],...]

That will make Keras happy, and it will train the model as expected.
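
A minimal sketch of that reshaping with NumPy, assuming the labels are held in an integer array. The three sample values are placeholders, and y_sparse is a name introduced here for illustration.

```python
import numpy as np

# Hypothetical label indices, standing in for y_indices_naive.
y_indices_naive = np.array([1, 5, 4300], dtype="int32")

# Add a trailing axis: shape (n_samples,) becomes (n_samples, 1).
y_sparse = np.expand_dims(y_indices_naive, axis=-1)

print(y_sparse.shape)     # (3, 1)
print(y_sparse.tolist())  # [[1], [5], [4300]]
```

The same call on a (n_samples, timesteps) array of labels would give a 3-dimensional array of shape (n_samples, timesteps, 1).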

paulomalvar on 29 Apr 2017

👍2

All 14 comments

Theano has no support for sparse operations as far as I know (and Keras certainly doesn't either). So all data will have to be converted to dense arrays at some point.
However a 5k-dimensional output space doesn't seem very large to me.

You can solve your OOM error by one-hot encoding and training batch-by-batch instead of 3M samples at once. Break down your dataset into small batches, and for each batch:

y_batch = np_utils.to_categorical(y_indices_batch, nb_classes=5000)
model.train_on_batch(X_batch, y_batch)

As long as 1) your model fits in memory and 2) your batches are small enough, this will not cause any memory issues.
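
The batch-by-batch idea could be sketched roughly like this. It is a NumPy-only illustration: one_hot_batches is a hypothetical helper, the one-hot step is written by hand rather than with np_utils.to_categorical, and the toy arrays are made up.

```python
import numpy as np

def one_hot_batches(X, y_indices, n_classes, batch_size):
    """Yield (X_batch, y_batch) pairs, one-hot encoding labels per batch."""
    for start in range(0, len(y_indices), batch_size):
        stop = start + batch_size
        idx = y_indices[start:stop]
        # Only a (batch_size, n_classes) matrix exists at any one time.
        y_batch = np.zeros((len(idx), n_classes), dtype="float32")
        y_batch[np.arange(len(idx)), idx] = 1.0
        yield X[start:stop], y_batch

# Toy data (hypothetical): 5 samples with 2 features each.
X = np.arange(10, dtype="float32").reshape(5, 2)
y_indices = np.array([1, 5, 4300, 0, 4999])

batches = list(one_hot_batches(X, y_indices, n_classes=5000, batch_size=2))
# In real training, each pair would be passed to
# model.train_on_batch(X_batch, y_batch) instead of being collected in a list.
```

Peak label memory then depends on batch_size times the number of classes, not on the dataset size.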

fchollet on 6 Aug 2015

👍2

Theano 'tensor.nnet.categorical_crossentropy' can accept vector of integers as true distribution.

See: http://deeplearning.net/software/theano/library/tensor/nnet/nnet.html#tensor.nnet.categorical_crossentropy
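
To illustrate why a vector of integers is enough, here is a small NumPy check that the index form and the one-hot form give the same loss. This is not Theano's code, only a sketch of the arithmetic; the probabilities and labels are made up.

```python
import numpy as np

# Hypothetical predicted probabilities for 3 samples over 4 classes.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
labels = np.array([0, 1, 3])  # integer class indices

# Dense form: -sum(one_hot * log(p)) over the class axis.
one_hot = np.eye(probs.shape[1])[labels]
loss_dense = -(one_hot * np.log(probs)).sum(axis=1)

# Index form: pick log p[i, labels[i]] directly; no one-hot matrix needed.
loss_sparse = -np.log(probs[np.arange(len(labels)), labels])

assert np.allclose(loss_dense, loss_sparse)
```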

ps thanks for your great library btw!

c0stya on 1 Sep 2015

@lightcaster Thanks for the response, and sorry for the long delay.

It does indeed look like tensor.nnet.categorical_crossentropy allows the output to be a vector of integers, but I am not sure how to get Keras and Theano to play nice here.

Here is my model:

rnn_dim = 512
dense_dim = 512
model = Sequential()
model.add(Embedding(n_symbols + 1, rnn_dim,  mask_zero=True)) 
model.add(GRU(rnn_dim, dense_dim, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(dense_dim, n_symbols, activation='sigmoid'))

Note that the output is n_symbols in dimension, so if I try to do model.fit with just a vector of integers it throws me this error:

ValueError: GpuElemwise. Input dimension mis-match. Input 1 (indices start at 0) has shape[1] == 1, but the output's size on that axis is 4938

Any ideas on what to do?

sergeyf on 9 Sep 2015

The target is going to be normalized and reshaped by model.fit. You can avoid this by implementing your own version of model.train_on_batch, etc., that simply calls the Theano functions model._train, model._test, and model._predict directly.

We'll look into more direct support.

fchollet on 9 Sep 2015

Ah great, thanks. I'll see about calling the Theano functions directly.

sergeyf on 9 Sep 2015

I am facing a similar problem: the number of output classes in my case is 50000 and the loss is 'categorical_crossentropy'. If I pass the class index instead of the full one-hot vector, Keras complains that the shape should have 3 dimensions. I checked Theano's T.nnet.categorical_crossentropy, and it accepts the class index rather than the full one-hot encoded vector. Can't Keras also support this functionality?

shashankg7 on 19 Apr 2016

You should be using sparse_categorical_crossentropy instead, which accepts label indices rather than one-hot encoded labels.

fchollet on 19 Apr 2016

👍1

@fchollet I tried it, but it gave the same error (expecting 3D input but got 2D instead), so I switched to batch training with the output labels one-hot encoded.

shashankg7 on 19 Apr 2016

sparse_categorical_crossentropy works fine (it's unit-tested, and I use it regularly), so your problem lies elsewhere entirely.
