I noticed that the activation function in the BertPooler layer is tanh, but the BERT paper never mentions tanh; it says the GELU activation function is used.
So why is there a tanh here? I'd appreciate an explanation. Thanks.
from torch import nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
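For illustration, here is a minimal usage sketch showing the shapes involved; DummyConfig is a made-up stand-in for the real BertConfig, used only so the snippet runs on its own:

import torch

class DummyConfig:
    hidden_size = 768  # hypothetical stand-in for BertConfig.hidden_size

pooler = BertPooler(DummyConfig())
hidden_states = torch.randn(2, 128, 768)  # encoder output: (batch, seq_len, hidden_size)
pooled = pooler(hidden_states)            # (batch, hidden_size); tanh squashes values into (-1, 1)
print(pooled.shape)                       # torch.Size([2, 768])

Only the hidden state at position 0 (the [CLS] token) is used; the rest of the sequence is discarded by the pooler.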
Because that's what the BERT authors do in the official TF code:
https://github.com/google-research/bert/blob/bee6030e31e42a9394ac567da170a89a98d2062f/modeling.py#L231
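For reference, the linked TF pooler boils down to roughly the following. This is a paraphrased sketch using the TF 1.x API, not a verbatim quote; the real code also passes a kernel initializer built by a create_initializer helper, which is omitted here:

import tensorflow as tf  # TF 1.x API

def pool(sequence_output, hidden_size):
    # Take the hidden state of the first ([CLS]) token: (batch, hidden_size).
    first_token_tensor = tf.squeeze(sequence_output[:, 0:1, :], axis=1)
    # Dense projection with a tanh activation, matching BertPooler above.
    return tf.layers.dense(first_token_tensor, hidden_size, activation=tf.tanh)

So the PyTorch BertPooler is simply mirroring that behavior; GELU is used inside the transformer layers, while the pooler on top uses tanh.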
Just wanted to point out, for future reference, that the motivation has been answered by the original BERT authors in [this GitHub issue].