I noticed that the activation function in the BertPooler layer is tanh, but the BERT paper never mentions tanh; it says the GELU activation function is used.
So why is there a tanh here? I'd appreciate an explanation. Thanks.
from torch import nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
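For illustration, here is a minimal usage sketch showing the shapes involved; DummyConfig is a made-up stand-in for the real BertConfig, used only so the snippet runs on its own:

import torch

class DummyConfig:
    hidden_size = 768  # hypothetical stand-in for BertConfig.hidden_size

pooler = BertPooler(DummyConfig())
hidden_states = torch.randn(2, 128, 768)  # encoder output: (batch, seq_len, hidden_size)
pooled = pooler(hidden_states)            # (batch, hidden_size); tanh squashes values into (-1, 1)
print(pooled.shape)                       # torch.Size([2, 768])

Only the hidden state at position 0 (the [CLS] token) is used; the rest of the sequence is discarded by the pooler.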
Because that's what the BERT authors do in the official TF code:
https://github.com/google-research/bert/blob/bee6030e31e42a9394ac567da170a89a98d2062f/modeling.py#L231
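For reference, the linked TF pooler boils down to roughly the following. This is a paraphrased sketch using the TF 1.x API, not a verbatim quote; the real code also passes a kernel initializer built by a create_initializer helper, which is omitted here:

import tensorflow as tf  # TF 1.x API

def pool(sequence_output, hidden_size):
    # Take the hidden state of the first ([CLS]) token: (batch, hidden_size).
    first_token_tensor = tf.squeeze(sequence_output[:, 0:1, :], axis=1)
    # Dense projection with a tanh activation, matching BertPooler above.
    return tf.layers.dense(first_token_tensor, hidden_size, activation=tf.tanh)

So the PyTorch BertPooler is simply mirroring that behavior; GELU is used inside the transformer layers, while the pooler on top uses tanh.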
Just wanted to point out, for future reference, that the motivation has been answered by the original BERT authors in [this GitHub issue].