Transformers: Why the activation function is tanh in BertPooler

Created on 12 Jul 2019 · 2 comments · Source: huggingface/transformers

I found that the activation function in the BertPooler layer is tanh, but the BERT paper never mentions tanh; it says the GELU activation function is used.

So why is there a tanh here? I'm hoping for an explanation. Thanks.

import torch.nn as nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
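
For context, here is a minimal usage sketch (not taken from the library itself): it builds the pooler with a stand-in config object that only exposes hidden_size, the one attribute the class reads, and passes random final-layer hidden states of shape (batch, seq_len, hidden_size) through it to get one pooled vector per sequence.

import torch
from types import SimpleNamespace

config = SimpleNamespace(hidden_size=768)  # stand-in for a real BertConfig
pooler = BertPooler(config)

# Dummy final-layer hidden states: (batch_size=2, seq_len=5, hidden_size=768)
hidden_states = torch.randn(2, 5, 768)
pooled = pooler(hidden_states)
print(pooled.shape)  # torch.Size([2, 768]); values lie in (-1, 1) because of the tanh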

All 2 comments

Just wanted to point out for future reference that the motivation has been explained by the original BERT authors in [this GitHub issue].
