Hi Jacob,
Just to make sure: do features extracted from layer -1 represent the sentence embedding for a sentence?
And if I want to extract the sentence embedding for a Chinese sentence, should I add spaces between the Chinese characters myself in the input file?
Thanks a lot!
-1 means the last hidden layer. There are 12 or 24 hidden layers, so -1, -2, -3, -4 mean layers 12, 11, 10, 9 (for BERT-Base). Features are extracted for each token. There is no "sentence embedding" in BERT (the hidden state of the first token is _not_ a good sentence representation). If you want a sentence representation that you don't want to train, your best bet would be to average the final hidden layer over all of the tokens in the sentence (or the second-to-last hidden layer, i.e., -2, which would be better).
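The averaging described above can be sketched in plain Python. This is only a minimal illustration, not code from the repo: it assumes each record looks like one JSON line written by `extract_features.py` (a `features` list whose entries carry a `token` and a `layers` list of `index`/`values` pairs). The function name `sentence_vector` and the option to skip `[CLS]`/`[SEP]` are hypothetical additions for the example.

```python
import json


def sentence_vector(record, layer_index=-2, skip_special=True):
    """Average one layer's per-token vectors into a single sentence vector.

    `record` is assumed to be one parsed JSON line with the layout
    {"features": [{"token": ..., "layers": [{"index": ..., "values": [...]}]}]}.
    """
    vectors = []
    for feature in record["features"]:
        # Optionally leave out the special tokens (an assumption of this sketch).
        if skip_special and feature["token"] in ("[CLS]", "[SEP]"):
            continue
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                vectors.append(layer["values"])
    if not vectors:
        raise ValueError("no token vectors found for layer %d" % layer_index)
    dim = len(vectors[0])
    # Element-wise mean over all collected token vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def _tok(token, last, second_to_last):
    """Build one hand-made feature entry with layers -1 and -2."""
    return {
        "token": token,
        "layers": [
            {"index": -1, "values": last},
            {"index": -2, "values": second_to_last},
        ],
    }


# Tiny made-up record with 2-dimensional vectors (real ones are 768 or 1024 wide).
line = json.dumps({
    "features": [
        _tok("[CLS]", [0.0, 0.0], [5.0, 6.0]),
        _tok("你", [9.0, 9.0], [1.0, 2.0]),
        _tok("好", [9.0, 9.0], [3.0, 4.0]),
        _tok("[SEP]", [0.0, 0.0], [7.0, 8.0]),
    ]
})
record = json.loads(line)

vec_tokens_only = sentence_vector(record)                     # layer -2, no [CLS]/[SEP]
vec_all_tokens = sentence_vector(record, skip_special=False)  # layer -2, every token
```

Whether to include `[CLS]` and `[SEP]` in the average is a judgment call; the thread only says to average over the tokens in the sentence.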
If you're using the latest version of the repo, you don't need to tokenize it yourself; Chinese character tokenization is handled by tokenization.py.
👍
Hi,
I wonder whether you managed to extract feature vectors for Chinese successfully. In my experience, the procedure doesn't seem to recognize Chinese characters and displays UNK for a Chinese sentence. I don't know if it's a character encoding problem.
Chinese wasn't supported until this weekend. Please pull the latest version and then retry.
> If you want a sentence representation that you don't want to train, your best bet would be to average the final hidden layer over all of the tokens in the sentence (or the second-to-last hidden layer, i.e., -2, which would be better).

Thanks Jacob. Curious as to why the second-to-last hidden layer may be better?
Hi @jacobdevlin-google, when you say "the hidden state of the first token is not a good sentence representation", is the "first token" here the CLS token? If so, I'm wondering why it's not a good sentence representation.
I got the answer from https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-use-the-hidden-state-of-the-first-token-as-default-strategy-i-e-the-cls
@ajbarber https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last
thanks @hlums. My take on it is I believe the last layer is subject to the encoder-decoder attention mechanism, which produces this bias referred to in your link.