Keras: Dropout in embedding layer

Created on 10 Jul 2017 · 10 comments · Source: keras-team/keras

In this paper, the authors state that applying dropout to the input of an embedding layer by selectively dropping certain ids is an effective method for preventing overfitting.
For example, if the embedding is a word2vec embedding, this method of dropout might drop the word "the" from the entire input sequence: the input "the dog and the cat" would become "-- dog and -- cat", but it would never become "-- dog and the cat". This is useful for preventing the model from depending on particular words.
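To make this concrete, here is a toy NumPy illustration (not Keras code; the vocabulary and ids are made up) of what dropping word types, rather than individual tokens, looks like:

import numpy as np

rng = np.random.RandomState(0)
vocab = {"<pad>": 0, "the": 1, "dog": 2, "and": 3, "cat": 4}
ids = np.array([1, 2, 3, 1, 4])              # "the dog and the cat"

drop_rate = 0.2
keep = rng.rand(len(vocab)) >= drop_rate     # one Bernoulli decision per word *type*
keep[0] = True                               # never drop the padding id

dropped_ids = np.where(keep[ids], ids, 0)    # every occurrence of a dropped type vanishes together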

Although Keras currently allows applying dropout to the output vector of an embedding layer, as far as I can tell from the documentation, it does not allow applying dropout selectively to certain ids.
Since embeddings are frequently used, and the paper above states that embeddings are prone to overfitting, this seems like a feature that would be useful to a relatively wide range of users. The expected API would be something like

from keras.layers import Embedding

embedding = Embedding(x, y, dropout=0.2)

where the dropout rate signifies the fraction of ids to drop.
Would this be a worthwhile feature to add? Or is there a relatively obvious way to implement this functionality already?


All 10 comments

Isn't this behavior pretty similar to what SpatialDropout1D does?

Isn't SpatialDropout1D meant to drop dimensions from the output vector/matrix, while embedding dropout is meant to drop word types completely without touching dimensions?
If so, is there any Keras implementation of embedding dropout?

I was also trying to find a solution for (word) embedding dropout.

The Dropout specification says: _"noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features)."_

See also the SpatialDropout1D implementation here (https://github.com/keras-team/keras/blob/master/keras/layers/core.py), which actually uses the mask that is mentioned above.

So, SpatialDropout1D performs variational dropout, at least for NLP models. We tested both Dropout(noise_shape=(batch_size, 1, features)) and SpatialDropout1D(), and, as we expected, they both apply variational dropout (https://arxiv.org/pdf/1512.05287.pdf); see the sketch below.
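(The layer sizes below are illustrative, and the snippet assumes a Keras version whose Dropout substitutes the dynamic batch size for None entries in noise_shape.)

from keras.layers import Input, Embedding, Dropout, SpatialDropout1D

vocab_size, embed_dim, max_len = 20000, 128, 50      # illustrative sizes

token_ids = Input(shape=(max_len,), dtype='int32')
embedded = Embedding(vocab_size, embed_dim)(token_ids)

# Same binary mask for every timestep, i.e. whole embedding *dimensions* are dropped.
# The following two lines are two ways of getting that behaviour:
variational_a = Dropout(0.5, noise_shape=(None, 1, embed_dim))(embedded)
variational_b = SpatialDropout1D(0.5)(embedded)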

So, if you need to drop a full word type (embedding), you have to use noise_shape=(batch_size, sequence_size, 1) in the Dropout layer, or you can also create a new layer based on the SpatialDropout1D paradigm, like:

from keras import backend as K
from keras.engine import InputSpec
from keras.layers import Dropout


class TimestepDropout(Dropout):
    """Timestep Dropout.

    This version performs the same function as Dropout, however, it drops
    entire timesteps (e.g., word embeddings in NLP tasks) instead of
    individual elements (features).

    # Arguments
        rate: float between 0 and 1. Fraction of the timesteps to drop.

    # Input shape
        3D tensor with shape:
        `(samples, timesteps, channels)`

    # Output shape
        Same as input.

    # References
        - A Theoretically Grounded Application of Dropout in Recurrent Neural Networks (https://arxiv.org/pdf/1512.05287)
    """

    def __init__(self, rate, **kwargs):
        super(TimestepDropout, self).__init__(rate, **kwargs)
        self.input_spec = InputSpec(ndim=3)

    def _get_noise_shape(self, inputs):
        # One Bernoulli draw per (sample, timestep), broadcast across the feature
        # axis, so a dropped timestep loses its entire embedding.
        input_shape = K.shape(inputs)
        noise_shape = (input_shape[0], input_shape[1], 1)
        return noise_shape

Please let me know if you find this helpful and most importantly correct in terms of the expected behaviour. @keitakurita @riadsouissi

@iliaschalkidis This is an awesome explanation. I have been struggling with implementing variational dropout with Keras for a couple days now, and I think this cleared it up for me. Thanks very much!

As for the dropout on embeddings, I am confused as to where to apply the dropout. Do we apply it _before_ the embedding layer, such that word ids are dropped? Or _after_ the embedding layer, such that the embeddings for certain word ids are dropped?

EDIT:

The answer appears to be _after_ the embedding layer, as per (https://arxiv.org/pdf/1512.05287.pdf):

"... it is therefore more efficient to first map the words to the word embeddings, and only then to zero-out word embeddings based on their word type."

However, as far as I understand, if you use the TimestepDropout class as implemented above _after_ an embedding layer, you will drop full word embeddings for _tokens_ as opposed to _types_ (when I place a TimestepDropout after my embedding layer, my performance takes a big hit, which seems to confirm this). The difference is (again) described in (https://arxiv.org/pdf/1512.05287.pdf):

... "we drop word types at random rather than word tokens (as an example, the sentence 'the dog and the cat' might become '— dog and — cat' or 'the — and the cat', but never '— dog and the cat')"

Is my understanding correct? How might you perform dropout on word _types_ after embedding word _tokens_ in Keras?

@JohnGiorgi

word_embeddings = Embedding() # first map to embeddings
word_embeddings = TimestepDropout(0.10)(word_embeddings) # then zero-out word embeddings
word_embeddings = SpatialDropout1D(0.50)(word_embeddings) # and possibly drop some dimensions on every single embedding (timestep)

can be translated as

"... it is therefore more efficient to first map the words to the word embeddings, and only then to zero-out word embeddings [NOT: based on their word type.]"

This code does not zero-out word types per sequence as the authors claim, which means that:

S1: _"The big brown dog was playing with another black dog"_

if we apply dropout with rate 0.2, the sentence can become either S1' or S1'':

S1': _"The big brown - was playing with another black -"_
S1'':_"The big - dog was playing with another black -"_

If you want to zero out word embeddings based on their word type (which means, as you quoted before, that every occurrence of "the" is masked), I think you should bring back the old Embedding layer from Keras 1 (https://github.com/keras-team/keras/blob/keras-1/keras/layers/embeddings.py), which actually drops random word types per training step, in my understanding:

    def call(self, x, mask=None):
        if 0. < self.dropout < 1.:
            retain_p = 1. - self.dropout
            # One Bernoulli draw per vocabulary entry (word type), scaled by
            # 1 / retain_p, so every occurrence of a dropped type is zeroed.
            B = K.random_binomial((self.input_dim,), p=retain_p) * (1. / retain_p)
            B = K.expand_dims(B)                    # (input_dim, 1), broadcasts over the embedding dims
            W = K.in_train_phase(self.W * B, self.W)
        else:
            W = self.W
        out = K.gather(W, x)
        return out
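If someone needs the same behaviour on Keras 2, a rough, untested sketch could subclass the current Embedding layer. (This assumes Keras 2 keeps the weight matrix in self.embeddings; the class name and the type_dropout argument are made up here.)

    from keras import backend as K
    from keras.layers import Embedding


    class WordTypeDropoutEmbedding(Embedding):
        """Embedding layer that zeroes whole rows of the embedding matrix
        (word types) at random during training, like the Keras 1 code above."""

        def __init__(self, input_dim, output_dim, type_dropout=0., **kwargs):
            super(WordTypeDropoutEmbedding, self).__init__(input_dim, output_dim, **kwargs)
            self.type_dropout = type_dropout

        def call(self, inputs):
            if K.dtype(inputs) != 'int32':
                inputs = K.cast(inputs, 'int32')
            weights = self.embeddings
            if 0. < self.type_dropout < 1.:
                retain_p = 1. - self.type_dropout
                # One draw per vocabulary entry, broadcast over the embedding dims.
                mask = K.random_binomial((self.input_dim,), p=retain_p) * (1. / retain_p)
                mask = K.expand_dims(mask)                    # (input_dim, 1)
                weights = K.in_train_phase(weights * mask, weights)
            return K.gather(weights, inputs)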

My honest question would be: "Does it really matter?"

The intuition is to mask some words and learn to make correct decisions (predictions) without them; in other words, to avoid overfitting on specific words that tend to have many occurrences in the dataset. But how often do we see a task keyword, something that really matters (e.g., a person's name or an indicative verb when we train a NER model), rather than a stop word like "the" in the quote, twice or more in a single sentence?

@iliaschalkidis

Ah okay, so it is as I expected. I will definitely take a look at the Embedding layer from Keras 1, thanks for the tip! I wonder why Keras 2 no longer supports dropout on the embedding layer right out of the box?

I think our intuitions are the same here, and I do agree that it seems unlikely to matter _much_ whether we drop out word _tokens_ or word _types_ from the sequence. However, I noticed a significant drop in recall when I applied the TimestepDropout layer after my embedding layer, even with a small dropout rate of 0.1. I may have to lower it even further.

Perhaps my understanding is wrong, but if we happen to drop out a task keyword (for me that's a named entity), then we can't make a prediction on it, hence my drop in recall. Correct? Never mind, you can, of course, still make a prediction on a word even if it was "dropped".

Cheers.

@iliaschalkidis , @JohnGiorgi

I've been tackling a similar problem lately, and I think you are missing something in your solution.
Depending on what you feed your embedding into, dropout -- as implemented in vanilla Keras -- might be problematic, as the inputs are scaled during training. In my case this was not wanted, and I had to re-scale appropriately after the Dropout by multiplying by (1 - drop_rate).

This issue exists in both TimestepDropout and SpatialDropout1D.

My solution looks like this:

    import numpy as np
    from keras import backend as K
    from keras import layers as KL

    x = embedding_output  # output of the Embedding layer, shape (batch, timesteps, features)
    drop_p = 0.2
    x = KL.Lambda(lambda inpt: K.in_train_phase(
        (1 - drop_p) * K.dropout(
            inpt,
            drop_p,
            noise_shape=(K.shape(inpt)[0], K.int_shape(inpt)[1], 1),
            seed=np.random.randint(10000)),
        inpt)
    )(x)

A better solution would probably be to multiply x by a suitable random matrix and avoid the scaling and re-scaling altogether.
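For what it's worth, a rough, untested sketch of that idea: draw an unscaled 0/1 mask over timesteps yourself instead of going through K.dropout, so there is no scaling to undo (K and KL are the same backend / keras.layers aliases as above, and x is again the embedding output):

    drop_p = 0.2

    def timestep_mask(inpt):
        # Unscaled Bernoulli mask, one draw per (sample, timestep); no 1/(1 - drop_p) factor.
        mask = K.random_binomial((K.shape(inpt)[0], K.shape(inpt)[1], 1), p=1. - drop_p)
        return K.in_train_phase(inpt * mask, inpt)

    x = KL.Lambda(timestep_mask)(x)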

@tzachar: I am curious - if you don't scale the inputs when you do dropout, there must be a mismatch between training and test time. How do you fix that gap in your case? Thanks.

@hoangcuong2011
I am doing something different with my embeddings - along the lines of taking the column-wise mean of the non-zero dimensions, which makes scaling redundant.

@tzachar: interesting to know. thanks for sharing!

