BERT: Trouble understanding position embedding.

Created on 6 Nov 2018 · 22 Comments · Source: google-research/bert

image

Is position_embeddings just a randomly initialized matrix?

Which part of the code encodes the position of a word in the sentence?

Thank you!

Most helpful comment

@bnicholl in BERT, the positional embedding is a learnable feature. As far as I know, the sine/cosine encoding was introduced in the Attention Is All You Need paper, and the authors found that it produces almost the same results as a learned embedding:

image

All 22 comments

It was randomly initialized when it was created, but it was trained during pre-training with the rest of the network. You don't need to store the position info separately. full_position_embeddings is a tensor of shape [max_position_embeddings, width], so full_position_embeddings[i:i+1,] is the position embedding of position i. You can just add it to the input matrix, and each row will be applied at the correct position (which is what's done right after that code).
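The slicing and addition described above can be sketched in NumPy. This is only an illustration under stated assumptions: the table is filled with random numbers as a stand-in for the trained variable, and the toy sizes (`width = 8`, a batch of 2, 5 tokens) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

max_position_embeddings = 512  # value mentioned in this thread
width = 8                      # toy hidden size, chosen for illustration
batch_size, seq_length = 2, 5  # toy batch, chosen for illustration

# Stand-in for the learned table; in the real model this is a trainable variable.
full_position_embeddings = rng.normal(
    scale=0.02, size=(max_position_embeddings, width))

# Token embeddings for a batch: [batch_size, seq_length, width].
token_embeddings = rng.normal(size=(batch_size, seq_length, width))

# Keep only the rows for positions 0..seq_length-1, then let broadcasting
# add the same position row to every example in the batch.
position_embeddings = full_position_embeddings[:seq_length]
output = token_embeddings + position_embeddings[np.newaxis, :, :]

print(output.shape)
```

Row i of the table is always added to whatever token sits at position i, which is why no separate position index needs to be stored.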

Thank you very much.
It's hard to understand why it works.
Could you please describe what happens when the position embedding is trained?

Well if you don't have a positional embedding matrix then the Transformer has no way of knowing the relative position of each word. It would be exactly like randomly shuffling the input sentence. So the positional embeddings let the model learn the actual sequential ordering of the input sentence (which something like an LSTM gets for free).
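The "randomly shuffling" point can be checked numerically. The sketch below is a single-head self-attention in NumPy with random weights (all sizes and the permutation are made up for illustration, and it is not BERT's actual layer). Without position embeddings, permuting the input rows only permutes the output rows; once a position table is added, that no longer holds.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over rows of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_length, width = 6, 4
x = rng.normal(size=(seq_length, width))
w_q, w_k, w_v = (rng.normal(size=(width, width)) for _ in range(3))
perm = np.array([3, 0, 5, 1, 4, 2])  # an arbitrary fixed shuffle

# No position information: shuffled input gives the same rows, shuffled.
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)
no_pos_equivariant = np.allclose(out[perm], out_shuffled)

# With a position table added, the token order changes the result.
pos = rng.normal(size=(seq_length, width))
out_pos = self_attention(x + pos, w_q, w_k, w_v)
out_pos_shuffled = self_attention(x[perm] + pos, w_q, w_k, w_v)
with_pos_equivariant = np.allclose(out_pos[perm], out_pos_shuffled)

print(no_pos_equivariant, with_pos_equivariant)
```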

Thank you very much, again.
In the Transformer model, we don't have an LSTM.

image
Sorry to trouble you.
I still don't understand why this op makes the position info work.
Thank you.

This is a general technique used in the Transformer, which has been used successfully in hundreds of papers, so it's outside the scope of this page. Please see a general guide to the Transformer, such as The Illustrated Transformer, if you want to know more about it. It has a section on the positional embeddings with a heatmap.

It seems like position embedding is not properly implemented in the Google BERT Python version. PE is reinitialized on each pass; there are no sine/cosine positional updates as per Section 3.5 of Attention Is All You Need.
https://medium.com/@ranko.mosic/googles-bert-nlp-5b2bb1236d78

Hi @jacobdevlin-google, I have a small question. By default, the max_position_embeddings arg is set to 512 in the bert_config file of the downloaded model (the Cased 12-layer model open-sourced by Google).
Now I want to send a sequence longer than 512 tokens. Can I do this in any way?
What if I change the max_position_embeddings value in the config file to something like 2048?
Or does this variable depend on the trained model?

I am a little confused about two things.

  1. The Transformer does not shuffle the input sequence, so the sequence information could be retained. Multi-head attention is related only to the length of the embedding, not to the sequence.
  2. Does this position embedding still work when the sequence is very long? If I set max_sequence too large, will that be a potential problem?

@mealsd the largest max_sequence you can set is 512 because this was the value used during pretraining. So there's no problem as long as your max_sequence length is at most 512.

Actually, I'm working on a Transformer structure and may need position embeddings of length 9000. I'm unsure whether this makes sense.

If your input sequence is a really long text, e.g. an essay rather than a sentence, then a 9k-length position embedding makes sense (if your hardware can handle it). Unfortunately, BERT can only handle sequences of up to 512 tokens. As an alternative, you may consider chunking your input sequence into smaller units, e.g. paragraphs.
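One way to do the chunking suggested above is to split the token ids into fixed-size windows. The helper below is a hypothetical sketch, not part of BERT: the name `chunk_token_ids` is made up, and it assumes ids 101 and 102 mark the start and end of each window, as in the tokenizer output shown elsewhere in this thread. Splitting on paragraph boundaries instead would preserve more context within each chunk.

```python
def chunk_token_ids(token_ids, max_len=512, cls_id=101, sep_id=102):
    """Split a long list of token ids into windows of at most max_len ids,
    reserving two slots per window for the start and end markers."""
    body = max_len - 2
    if body <= 0:
        raise ValueError("max_len must be greater than 2")
    return [
        [cls_id] + token_ids[start:start + body] + [sep_id]
        for start in range(0, len(token_ids), body)
    ]

# Hypothetical long document represented by 1,200 dummy ids.
long_ids = list(range(1000, 2200))
chunks = chunk_token_ids(long_ids)
print([len(chunk) for chunk in chunks])
```

Each chunk is then fed to the model separately, and the per-chunk outputs are combined in whatever way suits the task.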

Isn't the positional embedding supposed to be a function of sine/cosine? @jacobdevlin-google says this is randomly initialized. Does BERT initialize the positional embedding differently?

@bnicholl in BERT, the positional embedding is a learnable feature. As far as I know, the sine/cosine encoding was introduced in the Attention Is All You Need paper, and the authors found that it produces almost the same results as a learned embedding:

image
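To make the contrast concrete, here is a sketch of the two options. The fixed table follows the sine/cosine formula from Section 3.5 of Attention Is All You Need, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The learned table is just a random matrix standing in for a trainable parameter; its scale and the sizes used are assumptions for illustration.

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Fixed sine/cosine table of shape [max_len, d_model]; d_model must be even."""
    positions = np.arange(max_len)[:, np.newaxis]        # [max_len, 1]
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # [1, d_model / 2]
    angles = positions / np.power(10000.0, even_dims / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)  # even dimensions
    table[:, 1::2] = np.cos(angles)  # odd dimensions
    return table

# Fixed option: computed once, never trained.
fixed_table = sinusoidal_position_encoding(max_len=512, d_model=768)

# Learned option: starts as small random values and is updated by training.
rng = np.random.default_rng(0)
learned_table = rng.normal(scale=0.02, size=(512, 768))

print(fixed_table.shape, learned_table.shape)
```

Both tables have the same shape and are used the same way (added to the token embeddings); they differ only in whether the values are computed or trained.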

Thanks for the response. Last question. Are the token embeddings the integer values I obtain when I call:

string_values_from_sentence = ['going', 'to', 'code', 'this', 'weekend']
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
print('Ids: ', tokenizer.encode( string_values_from_sentence ))

OUTPUT: Ids: [101, 2183, 2000, 15132, 2023, 5353, 102]

If so, I'm assuming these tokens are randomly initialized and are not learned (since they're of type int)?

@bnicholl no, [101, 2183, 2000, 15132, 2023, 5353, 102] are not the token embeddings.

They are the index to the token embeddings.

There's a huge matrix with dimension vocab size x embedding dimension that records the embedding for each token. So, to get the embedding for the word "going", you go to row 2183 of this matrix (assuming "going" gets tokenized to "going" instead of "go" and "##ing"). The ids 101 and 102 at the start and end belong to the special [CLS] and [SEP] tokens, not to words in the sentence.
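The lookup described above amounts to plain row indexing. A minimal sketch, assuming a random matrix as a stand-in for the trained embedding table, a vocabulary of 30522 entries, and a toy embedding width of 8:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 30522  # assumed vocabulary size
embedding_dim = 8   # toy width, chosen for illustration

# Stand-in for the trained table: one row of floats per vocabulary entry.
embedding_matrix = rng.normal(scale=0.02, size=(vocab_size, embedding_dim))

# The integer ids from the tokenizer output quoted in this thread.
ids = [101, 2183, 2000, 15132, 2023, 5353, 102]

# Each id selects one row; the result has shape [len(ids), embedding_dim].
token_embeddings = embedding_matrix[ids]
print(token_embeddings.shape)
```

The ids themselves are fixed integers and are never trained; the rows they point to are what gets learned.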

Hope this clarifies.

Thanks for clarifying. So in the example above, if the word "going" gets broken into go ##ing, how does that word embedding get learned? It is my understanding that a word gets split into multiple pieces when it is not in the corpus. So, if BERT comes across a word it is not familiar with during the fine-tuning phase, does it just randomly initialize a word embedding for that word and then minimize the error during fine-tuning?

Words that are unknown to BERT will get replaced with a special token. I think it's [UNK] (not entirely sure, since it's been a while since I worked with BERT). [UNK] tokens have their own embedding, which is also learned during the fine-tuning phase.
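For context on when [UNK] appears: WordPiece tokenization works greedily, taking the longest vocabulary entry that matches at each step, and a word falls back to [UNK] only if it cannot be assembled from vocabulary pieces at all. The sketch below is a simplified illustration with a made-up toy vocabulary; the function name and vocabulary are hypothetical, and this is not BERT's actual tokenizer code.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"go", "##ing", "week", "##end", "[UNK]"}

print(wordpiece_tokenize("going", toy_vocab))
print(wordpiece_tokenize("weekend", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because every piece is itself a vocabulary entry, each one already has its own row in the embedding matrix, so a split word needs no newly initialized embedding.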

Thanks for clarifying. However, I think they chose the learned position embedding because it can change dramatically in response to changes in context (the surrounding words) during fine-tuning.

Well, that looks reasonable at first glance.
However, sinusoidal functions have many good properties, e.g., a different period in each dimension. I do not think a randomly initialized embedding can bring the same benefits.

With the huge dataset, it could be. I thought like you at first. However, in BERT code I trust. :) Please share with me if you discover something else. Thanks.

With a very large dataset, can you achieve such good properties, like local periodicity and linear consistency? I hope people from Google can give an explanation.
