The Keras implementation of Dropout references this paper: http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
The following excerpt is from that paper.
"The idea is to use a single neural net at test time without dropout. The weights
of this network are scaled-down versions of the trained weights. If a unit is retained with
probability p during training, the outgoing weights of that unit are multiplied by p at test
time as shown in Figure 2."
The Keras documentation mentions that dropout is only used at train time, and the following line from the Dropout implementation
x = K.in_train_phase(K.dropout(x, level=self.p), x)
seems to indicate that outputs from layers are indeed simply passed along at test time. Further, I cannot find code that scales down the weights after training is complete, as the paper suggests. I'm hoping somebody can show me said code in the repo and put my mind at ease :-).
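For reference, here is my understanding of the paper's scheme as a minimal NumPy sketch (the function names and setup are mine, purely for illustration; this is not code from the Keras repo):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # retain probability, following the paper's notation

def paper_dropout_train(x):
    # Training: keep each unit with probability p, zero it otherwise.
    mask = rng.random(x.shape) < p
    return x * mask

def paper_dropout_test(x):
    # Test: nothing is dropped; instead the activations (equivalently,
    # the outgoing weights) are scaled down by p.
    return x * p
```

It is this final scaling-down step that I cannot locate in the codebase.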
@mklawonn
It seems that Keras takes a different approach. Instead of sub-sampling during training and scaling down at test time, it scales up (by 1/p) during the training phase and applies no transform at test time.
Here in the implementation of Dropout layer, you see it calls K.dropout. https://github.com/fchollet/keras/blob/master/keras/layers/core.py#L87
And in K.dropout, the output from dropout is scaled up 1/p. https://github.com/fchollet/keras/blob/master/keras/backend/theano_backend.py#L929
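Conceptually it is doing something like this minimal NumPy sketch (illustrative names only; the real K.dropout uses Theano random streams, and its `level` argument is the drop probability, so the scale factor there is 1/(1 - level)):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # retain probability, to stay consistent with the paper's notation

def inverted_dropout_train(x):
    # Training: drop units, then scale the survivors up by 1/p so the
    # expected activation already matches what the test phase will see.
    mask = rng.random(x.shape) < p
    return x * mask / p

def inverted_dropout_test(x):
    # Test: identity -- no dropping and no rescaling needed.
    return x
```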
@linxihui Interesting. This is not equivalent to the traditional approach, correct? The activations which fire (i.e., are not dropped out) at each layer will output values 1/p times as high as they would otherwise (twice as high for p = 0.5). This should influence the loss, and thus the whole training phase, differently than simply dropping activations. So in the end you'll probably converge to a different model than if the traditional dropout method had been used. The effects of this procedure are unclear to me.
If I am correct and this alternate dropout method is not equivalent to the original, I feel that should be made explicit in the documentation. It's possible users (like me) were or are incorrectly assuming that dropout here is the same as dropout in the literature.
You seem very confused. Of course they are equivalent.
@fchollet Indeed I am. Could you humor me and explain why they are equivalent?
Consider network 1 which has dropout applied as Keras applies it, and network 2 which has dropout applied as in the original paper, with all other parameters/hyperparameters/design choices being the same. If the activations in network 1 are being scaled by a factor of 1/retain_probability at train time, then
Point 1: I think the loss and gradients computed by network 1 will be different from those computed by network 2.
As a result the weight updates will be different, and
Point 2: I don't see why the weights in network 1 and network 2 would converge to the same state.
Is there something I am missing with point 1 or point 2?
If not, is there something provable about the expected output for the two networks at test time? If there is, it is very much not obvious to me.
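To make the question concrete, here is a quick NumPy check of the per-unit expectations I have in mind (p is the retain probability; this sketch is mine, not code from either repo):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                        # retain probability
x = np.array([1.0, 2.0, 3.0])  # one layer's activations, held fixed
masks = rng.random((200_000,) + x.shape) < p  # many independent dropout masks

# Paper's scheme: E[train output] should match the p-scaled test output.
print((x * masks).mean(axis=0), x * p)    # both ~ [0.5, 1.0, 1.5]

# Inverted scheme: E[train output] should match the unscaled test output.
print((x * masks / p).mean(axis=0), x)    # both ~ [1.0, 2.0, 3.0]
```

In expectation over masks, each scheme hands the next layer the same signal it will see at test time, and the inverted outputs are exactly the original's scaled by the constant 1/p, which the next layer's weights could presumably absorb. If that reparameterization is the sense in which the two are "equivalent," it would be good to have that confirmed, since individual mini-batch gradients clearly differ.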
@mklawonn This could help: http://cs231n.github.io/neural-networks-2/, see Dropout in the Regularization part. I think it's also done this way in Torch, for example.
@tmannen Thanks for the link! Now I know what to call this alternative (and potentially equivalent) method: Inverted Dropout.
That link, unfortunately, still lacks an explanation of why inverted dropout and standard dropout are equivalent. Now that I know what to call it, however, I have been searching for more information on inverted dropout. I found the following links:
https://github.com/deeplearning4j/deeplearning4j/issues/373
http://stats.stackexchange.com/questions/205932/dropout-scaling-the-activation-versus-inverting-the-dropout
I also found a few other links that simply mention they use inverted dropout as opposed to standard dropout. I have yet to see anyone even claim that the two are equivalent, much less prove it. They do seem to be doing similar things in principle, but there appears to be no analysis of how they perform relative to one another. Can anyone confirm or deny that:
1) They are exactly equivalent.
2) They simply do similar things.
Thanks for everyone's help so far!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.