This question is just about the term "pooler", and maybe more of an English question than a question about BERT.
From reading this repository and its issues, I found that the "pooler layer" is placed after the Transformer encoder stack, and that it changes depending on the training task.
But I can't understand why it is called "pooler".
I googled the terms "pooler" and "pooler layer", and it seems that this is not standard ML terminology.
BTW, the pooling layer that appears in CNNs has a similar name, but it seems to be a different thing.
I agree that the name pooler might be a little confusing. The BERT model can be divided into three parts to make it easier to understand.
In the paper which describes BERT, after passing a sentence through the model, the representation corresponding to the first token ([CLS]) in the output is used for fine-tuning on classification tasks such as those in GLUE. (Span-extraction tasks like SQuAD use the per-token outputs instead.) The pooler layer does precisely that: it applies a linear transformation, followed by a tanh activation, to the representation of the first token. The linear transformation is trained while using the Next Sentence Prediction (NSP) strategy.
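To make the description above concrete, here is a minimal pure-Python sketch of what such a pooler computes. The function name `pooler`, the nested-list layout, and the identity weights are assumptions made for illustration; they are not the repository's code or trained values. In the original BERT implementation the dense layer is followed by a tanh activation, which the sketch includes.

```python
import math

def pooler(hidden_states, weight, bias):
    """Apply a dense layer plus tanh to the first token of each sequence.

    hidden_states: nested lists shaped [batch_size][seq_length][hidden_size]
    weight:        nested lists shaped [hidden_size][hidden_size], as weight[out][in]
    bias:          list of length hidden_size
    returns:       nested lists shaped [batch_size][hidden_size]
    """
    pooled = []
    for sequence in hidden_states:
        first_token = sequence[0]  # only the first ([CLS]) position is used
        row = []
        for w_row, b in zip(weight, bias):
            z = sum(w * x for w, x in zip(w_row, first_token)) + b
            row.append(math.tanh(z))
        pooled.append(row)
    return pooled

# Toy input: batch_size=2, seq_length=3, hidden_size=2
hidden_states = [
    [[0.5, -1.0], [9.0, 9.0], [9.0, 9.0]],
    [[0.0, 2.0], [9.0, 9.0], [9.0, 9.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights, not trained values
zero_bias = [0.0, 0.0]

pooled = pooler(hidden_states, identity, zero_bias)
print(len(pooled), len(pooled[0]))  # batch_size, hidden_size
```

Note that the values at positions 1 and 2 of each sequence never influence the result; only the first token's vector passes through the dense layer.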
I think it's OK to call it a "pooler" layer.
This layer transforms the output of the Transformer from shape [batch_size, seq_length, hidden_size] to [batch_size, hidden_size]. This is similar to GlobalMaxPool1D, except that instead of taking the maximum over the sequence, it simply takes the first token directly.
So functionally speaking, this is pooling.
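As a rough illustration of that comparison, the sketch below reduces the same toy tensor in both ways. The helper names `first_token_pool` and `global_max_pool` are hypothetical, and plain nested lists stand in for real tensors; the point is only that both reductions remove the seq_length dimension.

```python
def first_token_pool(hidden_states):
    # Keep only position 0 of every sequence.
    return [sequence[0] for sequence in hidden_states]

def global_max_pool(hidden_states):
    # Take the maximum over the sequence dimension, separately per feature.
    return [[max(column) for column in zip(*sequence)] for sequence in hidden_states]

# Toy input: batch_size=2, seq_length=3, hidden_size=2
hidden_states = [
    [[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]],
    [[7.0, 0.0], [6.0, 8.0], [2.0, 1.0]],
]

first = first_token_pool(hidden_states)  # [[1.0, 5.0], [7.0, 0.0]]
maxed = global_max_pool(hidden_states)   # [[3.0, 5.0], [7.0, 8.0]]
```

Both results have shape [batch_size, hidden_size]; they differ only in how each sequence is summarized.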
"The linear transformation is trained while using the Next Sentence Prediction (NSP) strategy."
Hi, I have a question about the purpose of NSP here. Since the pooler is used on downstream tasks such as sentence classification, is it helpful to use a pooler that was trained to predict the next sentence, given that the task is now to predict a label?
Thanks