BERT: Why is there an extra dense layer in the pooler?

Created on 4 Nov 2018 · 1 Comment · Source: google-research/bert

I'm referring to this line

In the paper, you state

In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding. We denote this vector as C ∈ R^H. The only new parameters added during fine-tuning are for a classification layer W ∈ R^{K × H}, where K is the number of classifier labels.

But here, you have an H × H dense layer, which contradicts the above. Even more perplexing to me is that the activation of this layer is tanh! I'm surprised all the models worked with tanh instead of a ReLU activation.

I suspect that I'm missing something here. Thanks for your patience.


chsasank

Most helpful comment

Yeah it was an oversight that we didn't mention it in the paper (we'll mention it in the updated version), but we have an extra projection layer for the classifier and LM before feeding it into the classification.

However, these layers are both pre-trained with the rest of the network and are included in the pre-trained checkpoint. So the part about "the only new parameters added during fine-tuning" is correct; it's just not correct to say "output of the Transformer". It's really "output of the Transformer fed through one additional non-linear transformation".

The tanh() thing was done early to try to make it more interpretable but it probably doesn't matter either way.
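For readers following along, here is a minimal NumPy sketch of the computation described above: take the final hidden state at the first ([CLS]) position, pass it through the H × H dense layer with tanh, then apply the K × H classification layer. This is not the repository's code. The function and variable names (`pool_and_classify`, `pooler_w`, `cls_w`, and so on) are made up for illustration, the weights are random stand-ins for a pre-trained checkpoint, and the bias terms are an assumption added for completeness.

```python
import numpy as np

def pool_and_classify(sequence_output, pooler_w, pooler_b, cls_w, cls_b):
    """Illustrative pooler + classifier (hypothetical names, not the repo's API).

    sequence_output: [batch, seq_len, H] final hidden states of the Transformer.
    pooler_w, pooler_b: [H, H] and [H]; pre-trained, loaded from the checkpoint.
    cls_w, cls_b: [K, H] and [K]; the parameters that are new at fine-tuning time.
    """
    first_token = sequence_output[:, 0, :]               # [batch, H], the [CLS] position
    pooled = np.tanh(first_token @ pooler_w + pooler_b)  # extra H x H dense layer with tanh
    logits = pooled @ cls_w.T + cls_b                    # [batch, K] classification layer
    return pooled, logits

# Toy sizes and random weights, for shape checking only.
rng = np.random.default_rng(0)
batch, seq_len, H, K = 2, 5, 8, 3
sequence_output = rng.normal(size=(batch, seq_len, H))
pooler_w, pooler_b = rng.normal(size=(H, H)), np.zeros(H)
cls_w, cls_b = rng.normal(size=(K, H)), np.zeros(K)

pooled, logits = pool_and_classify(sequence_output, pooler_w, pooler_b, cls_w, cls_b)
print(pooled.shape, logits.shape)
```

In this sketch only `cls_w` and `cls_b` would be initialized from scratch for a downstream task, which matches the point that the pooler's dense layer is not a new fine-tuning parameter.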