Bert: Trouble to understand position embedding.

Created on 6 Nov 2018 · 22Comments · Source: google-research/bert

position_embeddings is only a matrix which is random init?

Which code part means the position info of word in sentence?

Thank you!!

Source

guotong1988

Most helpful comment

@bnicholl in BERT, the positional embedding is a learnable feature. As far as I know, the sine/cosine thing was introduced in the attention is all you need paper and they found that it produces almost the same results as making it a learnable feature:

hsm207 on 8 Jan 2020

👍4 ❤1

All 22 comments

It was randomly initialized when it was created but it was trained during pre-training with the rest of the network. You don't need to store the position info. full_position_embeddings is a tensor of shape [max_position_embeddings, width]. So full_position_embeddings[i:i+1,] is the position embedding of position i. So you can just add it to the input matrix and each item will be applied at the correct position (which is what's done right after that code).

jacobdevlin-google on 6 Nov 2018

Thank you very much.
It hard to understand why it works.
Could you please describe what it is like when the position embedding is trained?

guotong1988 on 6 Nov 2018

Well if you don't have a positional embedding matrix then the Transformer has no way of knowing the relative position of each word. It would be exactly like randomly shuffling the input sentence. So the positional embeddings let the model learn the actual sequential ordering of the input sentence (which something like an LSTM gets for free).

jacobdevlin-google on 6 Nov 2018

👍5

Thank you very much, again.
In transformer model, we don't have an LSTM.

guotong1988 on 6 Nov 2018

Sorry to trouble you.
I still don't know why this op can let the position info work.
Thank you.

guotong1988 on 6 Nov 2018

This is a general technique used in the Transformer, which has been successfully in hundreds of papers, so it's outside the scope of this page. Please see a general guide to the Transformer such as The Illustrated Transformer if you want to know more about it. They have a section on the positional embeddings with a heatmap.

jacobdevlin-google on 6 Nov 2018

👍2

It seems like position embedding is not properly implemented in Google Bert Python version. PE is reinitialized on each pass; there are no sine / cosine positional updates as per Section 3.5 of Attention is All You Need .
https://medium.com/@ranko.mosic/googles-bert-nlp-5b2bb1236d78

mosicr on 31 Dec 2018

Hi @jacobdevlin-google I have small doubt, by default the max_position_embeddings arg is set to 512 in bert_config file of the downloaded model (Cased-12 Layer model open-sourced by google).
Now I want to send a sequence larger than 512 tokens. Can I do this in any way?
What if I change the max_position_embeddings value in config file to something like 2048.
Or is this variable dependent on the trained model?

kapilkd13 on 11 Feb 2019

It is a little confused for me in two aspects.

the Transformer does not shuffle the input sequence, so the sequence information could be retained. the multi-head attent is just related to the length of embedding rather than the sequence.
when the sequence is too long, whether this position embedding works or not. If I set the max_sequence too large will that be a potential problem?

mealsd on 5 May 2019

@mealsd the largest max_sequence you can set is 512 because this was the value used during pretraining. So, there's no problem as long as your max_sequence length is less than 512.

hsm207 on 11 May 2019

@mealsd the largest max_sequence you can set is 512 because this was the value used during pretraining. So, there's no problem as long as your max_sequence length is less than 512.

Actually, I'm doing something on a Transformer structure, and maybe need 9000 length position embedding. I'm confusing whether this make sense or not.

mealsd on 11 May 2019

If your input sequence is a really long text e.g. an essay vs a sentence, then 9k length position embedding makes sense (if your hardware can handle it) . Unfortunately, BERT is only able to handle sequence of up to length 512. As an alternative, you may consider chunking your input sequence into smaller units e.g. paragraphs.

hsm207 on 11 May 2019

Isn't the positional embedding supposed to be a function of sine / cosine. @jacobdevlin-google say's this is randomly initialized. Does BERT initialize the positional embedding differently?

bnicholl on 8 Jan 2020

hsm207 on 8 Jan 2020

👍4 ❤1

Thanks for the response. Last question. Are the token embeddings the integer values I obtain when I call:

string_values_from_sentence = ['going', 'to', 'code', 'this', 'weekend']
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
print('Ids: ', tokenizer.encode( string_values_from_sentence ))

OUTPUT: Ids: [101, 2183, 2000, 15132, 2023, 5353, 102]

If so, I'm assuming these tokens are randomly initialized, and are not learned(since there of type int)?

bnicholl on 9 Jan 2020

@bnicholl no, [101, 2183, 2000, 15132, 2023, 5353, 102] are not the token embeddings.

They are the index to the token embeddings.

There's a huge matrix with dimension vocab size x embedding dimension that records the embedding for each token. So, to get the embedding for the word "going", you will go to row 101 of this matrix (assuming "going" gets tokenized to "going" instead of "go" and "ing".)

Hope this clarifies.

hsm207 on 9 Jan 2020

Thanks for clarifying. So in the example above, if the word going gets broken to ##go ##ing, how does that word embedding get learned. It is my understanding that if the word is not in the corpus, that is when the word gets split into multiple words. So, if BERT comes across a word that it is not familiar with during the fine tuning phase, does it just randomly initialize that word with a word embedding, then minimize the error during fine tuning?

bnicholl on 13 Jan 2020

Words that are unknown to BERT will get replace with a special token. I think its [UNK] (not entirely sure, since it's been awhile since I work with BERT). [UNK] tokens will have their own embedding that is learned too during the finetuning phase.

hsm207 on 13 Jan 2020

@bnicholl in BERT, the positional embedding is a learnable feature. As far as I know, the sine/cosine thing was introduced in the attention is all you need paper and they found that it produces almost the same results as making it a learnable feature:

Tks for clarifying. However I think they choose the learned position embedding because it would dramatically change corresponding with the change of context (words around) when fine tuning.

linhlt-it-ee on 27 Jan 2020

👍2

@bnicholl in BERT, the positional embedding is a learnable feature. As far as I know, the sine/cosine thing was introduced in the attention is all you need paper and they found that it produces almost the same results as making it a learnable feature:

Tks for clarifying. However I think they choose the learned position embedding because it would dramatically change corresponding with the change of context (words around) when fine tuning.

Well. That looks reasonable at the first glance.
However, sinusoidal functions have many good features, e.g., the different varying period at different dimensions. I don not think using a randomly initialized embedding can bring the same benefits.

BaoshengHeTR on 24 Feb 2020

@bnicholl in BERT, the positional embedding is a learnable feature. As far as I know, the sine/cosine thing was introduced in the attention is all you need paper and they found that it produces almost the same results as making it a learnable feature:

Tks for clarifying. However I think they choose the learned position embedding because it would dramatically change corresponding with the change of context (words around) when fine tuning.

Well. That looks reasonable at the first glance.
However, sinusoidal functions have many good features, e.g., the different varying period at different dimensions. I don not think using a randomly initialized embedding can bring the same benefits.

With the huge dataset, it could be. I think like you at first. However, in BERT CODE I trust. :). Please share with me if you discover something else. Tks

linhlt-it-ee on 25 Feb 2020

@bnicholl in BERT, the positional embedding is a learnable feature. As far as I know, the sine/cosine thing was introduced in the attention is all you need paper and they found that it produces almost the same results as making it a learnable feature:

Tks for clarifying. However I think they choose the learned position embedding because it would dramatically change corresponding with the change of context (words around) when fine tuning.

Well. That looks reasonable at the first glance.
However, sinusoidal functions have many good features, e.g., the different varying period at different dimensions. I don not think using a randomly initialized embedding can bring the same benefits.

With the huge dataset, it could be. I think like you at first. However, in BERT CODE I trust. :). Please share with me if you discover something else. Tks

with very large dataset, you can achieve such good staff? Like local periodcial and linear consistency? Hope people from google can give an explanation.

BaoshengHeTR on 25 Feb 2020

Was this page helpful?

0 / 5 - 0 ratings