Bert: Features extracted from layer -1 represent sentence embedding for a sentence?

Created on 7 Nov 2018 · 9 Comments · Source: google-research/bert

Hi Jacob,
Just to make sure: do the features extracted from layer -1 represent a sentence embedding for a sentence?
And if I want to extract the sentence embedding for a sentence, should I add spaces between Chinese characters myself in the input file?
Thanks a lot!

mfxss

Most helpful comment

-1 means the last hidden layer. There are 12 (BERT-Base) or 24 (BERT-Large) hidden layers, so for BERT-Base -1, -2, -3, -4 mean layers 12, 11, 10, 9. Features are extracted for each token. There is no "sentence embedding" in BERT (the hidden state of the first token is _not_ a good sentence representation). If you want a sentence representation without any training, your best bet is to average the final hidden layer over all of the tokens in the sentence (or the second-to-last hidden layer, i.e., -2, which would be better).

If you're using the latest version of the repo, you don't need to tokenize it yourself; Chinese character tokenization is handled by tokenization.py.

jacobdevlin-google on 7 Nov 2018

👍33 🚀2 🎉2 😄2 ❤1
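For illustration only (this is not part of the original reply), the layer indexing and the averaging suggested above could be sketched as follows. The function names `resolve_layer` and `mean_pool` are hypothetical, and the per-token vectors are assumed to already be available as plain lists of floats taken from one layer, for example layer -2.

```python
from typing import List, Sequence


def resolve_layer(index: int, num_layers: int = 12) -> int:
    """Map a negative layer index (-1 = last) to a 1-based layer number.

    num_layers is 12 for BERT-Base and 24 for BERT-Large.
    """
    if not -num_layers <= index <= -1:
        raise ValueError(f"layer index must be in [-{num_layers}, -1]")
    return num_layers + 1 + index


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors from one layer into a single sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec in token_vectors:
        if len(vec) != dim:
            raise ValueError("all token vectors must have the same length")
        for i, value in enumerate(vec):
            sums[i] += value
    return [s / len(token_vectors) for s in sums]


# Toy 2-dimensional "hidden states" for a 3-token sentence (made-up numbers).
toy_layer = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence_vector = mean_pool(toy_layer)
```

With real features the vectors would have 768 (BERT-Base) or 1024 (BERT-Large) dimensions. Whether to include special tokens such as [CLS] and [SEP] in the average is a design choice the reply does not settle.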

All 9 comments

👍

IntelOSt on 7 Nov 2018

Hi,
I wonder if you have successfully extracted feature vectors for Chinese. In my experience, the procedure doesn't seem to recognize Chinese characters and displays UNK for a Chinese sentence. I don't know if it's a character encoding problem.

WalkerWen on 7 Nov 2018

Chinese wasn't supported until this weekend. Please pull the latest version and then retry.

jacobdevlin-google on 7 Nov 2018

If you want sentence representation that you don't want to train, your best bet would just to be to average all the final hidden layers of all of the tokens in the sentence (or second-to-last hidden layers, i.e., -2, would be better).

Thanks Jacob. Curious as to why second to last hidden layer may be better?

ajbarber on 3 Apr 2019

👍1

Hi @jacobdevlin-google, when you say "the hidden state of the first token is not a good sentence representation", is the "first token" here the CLS token? If so, I'm wondering why it's not a good sentence representation.

hlums on 29 Apr 2019

I got the answer from https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-use-the-hidden-state-of-the-first-token-as-default-strategy-i-e-the-cls

hlums on 29 Apr 2019

👍1

@ajbarber https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last

hlums on 29 Apr 2019

Thanks @hlums. My take is that the last layer is subject to the encoder-decoder attention mechanism, which produces the bias referred to in your link.

ajbarber on 30 Apr 2019
