Bert: Question: What does "pooler layer" mean? Why is it called pooler?

Created on 11 Jun 2020 · 3 Comments · Source: google-research/bert

This question is just about the term "pooler", and maybe more of an English question than a question about BERT.

By reading this repository and its issues, I found that the "pooler layer" is placed after the Transformer encoder stack, and that it changes depending on the training task.
But I can't understand why it is called "pooler".

I googled the terms "pooler" and "pooler layer", and it seems that this is not standard ML terminology.

BTW, the pooling layer, which appears in CNNs and similar models, has a similar name, but it seems to be a different thing.

miyamonz

👍2

Most helpful comment

I think it's OK to call it a "pooler" layer.

This layer transforms the output shape of the Transformer from [batch_size, seq_length, hidden_size] to [batch_size, hidden_size]. This is similar to GlobalMaxPool1D, except that instead of taking the maximum over the sequence it simply keeps the hidden state of the first token.

So functionally speaking, this is pooling.
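To make the shape change concrete, here is a minimal, framework-free sketch that contrasts first-token pooling with global max pooling on a toy batch. The function names and the nested-list representation are illustrative assumptions, not code from the BERT repository.

```python
def first_token_pool(sequence_output):
    """[batch_size, seq_length, hidden_size] -> [batch_size, hidden_size].

    Keeps only the hidden state at position 0 of each example.
    """
    return [example[0] for example in sequence_output]


def global_max_pool(sequence_output):
    """Same shape change, but takes the per-feature maximum over the sequence."""
    return [[max(feature) for feature in zip(*example)] for example in sequence_output]


# Toy batch: batch_size=2, seq_length=3, hidden_size=2 (made-up numbers).
batch = [
    [[1.0, 5.0], [3.0, 2.0], [0.0, 9.0]],
    [[4.0, 4.0], [7.0, 1.0], [2.0, 3.0]],
]

print(first_token_pool(batch))  # [[1.0, 5.0], [4.0, 4.0]]
print(global_max_pool(batch))   # [[3.0, 9.0], [7.0, 4.0]]
```

Both functions collapse the seq_length axis, which is why the "pooling" name fits. They differ only in how the sequence is summarized.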

secsilm on 16 Jun 2020

👍4

All 3 comments

I agree that the name "pooler" might be a little confusing. The BERT model can be divided into three parts to make it easier to understand:

Embedding layer: Looks up the embeddings for the input tokens (equivalent to multiplying their one-hot encodings by an embedding matrix)
Encoder: This is the Transformer with self-attention heads
Pooler: It takes the output representation corresponding to the first token and uses it for downstream tasks

In the paper that describes BERT, after a sentence is passed through the model, the output representation corresponding to the first token is used for fine-tuning on sentence-level classification tasks such as those in GLUE (span-level tasks like SQuAD mainly use the per-token outputs instead). The pooler layer does precisely that: it applies a linear transformation, followed by a tanh activation, to the representation of the first token. The linear transformation is trained during pre-training through the Next Sentence Prediction (NSP) objective.
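As a rough illustration of the description above, the pooler can be sketched as "take the first token's vector, then apply a dense layer with tanh". The sketch below uses plain Python lists; the function names and the identity weight matrix are placeholders chosen for the example, not the trained parameters or the repository's actual code.

```python
import math


def dense_tanh(vector, weights, bias):
    """Return tanh(vector @ weights + bias) for a single hidden vector.

    weights is a [hidden_size, hidden_size] nested list, bias a [hidden_size] list.
    """
    columns = zip(*weights)  # one column of weights per output unit
    return [
        math.tanh(sum(v * w for v, w in zip(vector, column)) + b)
        for column, b in zip(columns, bias)
    ]


def pooler(sequence_output, weights, bias):
    """[batch_size, seq_length, hidden_size] -> [batch_size, hidden_size]."""
    first_tokens = [example[0] for example in sequence_output]  # the [CLS] position
    return [dense_tanh(vector, weights, bias) for vector in first_tokens]


# Toy input: batch_size=1, seq_length=2, hidden_size=2.
sequence_output = [[[0.0, 1.0], [5.0, 5.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights, not trained values
pooled = pooler(sequence_output, identity, [0.0, 0.0])
print(pooled)  # one vector of size hidden_size per example
```

With identity weights and zero bias the result is just tanh of the first token's vector. In a trained model, the weights and bias would come from pre-training and would typically be updated further during fine-tuning.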

ameet-1997 on 14 Jun 2020

👍1

"The linear transformation is trained while using the Next Sentence Prediction (NSP) strategy."

Hi, I have a question about the purpose of this NSP training. Since the pooler is used in downstream tasks such as sentence classification, is it helpful to use a pooler that was trained to predict the next sentence, given that the task is now to predict a label?

Thanks