Transformers: Why is the activation function tanh in BertPooler?

Created on 12 Jul 2019 · 2 Comments · Source: huggingface/transformers

I found that the activation function in the BertPooler layer is tanh, but the BERT paper never mentions using tanh. The paper says that the GELU activation function is applied.

So why is there a tanh here? I would appreciate an explanation. Thanks.

import torch.nn as nn


class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
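
To make the contrast in the question concrete, here is a minimal sketch in plain Python (standard library only, not the library's code) that evaluates tanh and the exact GELU definition on a few input values. The `gelu` helper and the `samples` list are assumptions made up for illustration. The sketch only shows the difference in output range: tanh squashes every input into (-1, 1), while GELU is not bounded above. It does not explain why tanh was chosen for the pooler.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Hypothetical pre-activation values, chosen only for illustration.
samples = [-5.0, -1.0, 0.0, 1.0, 5.0]

tanh_values = [math.tanh(x) for x in samples]
gelu_values = [gelu(x) for x in samples]

# Print the two activations side by side for each sample.
for x, t, g in zip(samples, tanh_values, gelu_values):
    print(f"x={x:+.1f}  tanh={t:+.4f}  gelu={g:+.4f}")
```

For large positive inputs the GELU output stays close to the input itself, whereas the tanh output saturates just below 1, so the pooled vector produced by BertPooler always has components in (-1, 1).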

Most helpful comment

Just wanted to point out, for future reference, that the motivation has been explained by the original BERT authors in [this GitHub issue].

