I have a question about the implementation of the position embedding. It seems like the position encoding is randomly initialized and updated during training, just like the token embeddings. What confuses me is: how does this approach learn position-specific information? Can you point out what I misunderstood?
There are two kinds of positional embeddings.
The first kind is learned embeddings [1], which learn a separate embedding for each position in the input. For example, if your sentence is:
```
words:              the    cat    sat    on     the    mat
positions:          0      1      2      3      4      5
input to network:   emb(the)+emb(pos0)  emb(cat)+emb(pos1)  emb(sat)+emb(pos2)  ...
```
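This is what answers your question: position 0 always looks up row 0 of the embedding table, position 1 always looks up row 1, and so on, so each row only ever receives gradients from tokens appearing at that position and gradually specializes to it. Here is a minimal sketch of the idea (not fairseq's actual module, which additionally handles padding offsets; the class name and shapes are illustrative):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Sketch: one trainable vector per position index."""

    def __init__(self, max_positions: int, embed_dim: int):
        super().__init__()
        # Randomly initialized and updated by backprop, just like token embeddings.
        self.pos_emb = nn.Embedding(max_positions, embed_dim)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, embed_dim)
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        # Row i of the table is only ever used at position i, so its
        # gradient updates encode information specific to that position.
        return token_emb + self.pos_emb(positions)
```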
The other kind is the sinusoidal embeddings introduced in the "Attention Is All You Need" paper. These are a fixed function of the position index, with no learned parameters [2].
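For reference, a minimal sketch of the sinusoidal variant (this uses the interleaved sin/cos layout from the paper; fairseq's actual module arranges the dimensions differently and handles padding, and this sketch assumes an even embed_dim):

```python
import math
import torch

def sinusoidal_positions(seq_len: int, embed_dim: int) -> torch.Tensor:
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    position = torch.arange(seq_len).unsqueeze(1).float()        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2).float()
                         * (-math.log(10000.0) / embed_dim))     # (embed_dim/2,)
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # no parameters: the values are fixed, nothing is learned
```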
[1] https://github.com/pytorch/fairseq/blob/master/fairseq/modules/learned_positional_embedding.py
[2] https://github.com/pytorch/fairseq/blob/master/fairseq/modules/sinusoidal_positional_embedding.py