Hi there. I am attempting to extract the word embeddings that go into the encoder, similar to what's shown here. For that purpose I loaded my finetuned BART model and extracted the encoder's embed_tokens weights with the command below:
>>> bart.state_dict()['model.encoder.embed_tokens.weight'].shape
torch.Size([50264, 1024])
I assume 50264 refers to the number of tokens in my dictionary, and that 1024 is the max sequence length.
How can I map the 50264 vectors back to their corresponding tokens? Ideally, I'd like to end up with {word_1: [val_1, val_2, ..., val_1024], ..., word_50264: [val_1, val_2, ..., val_1024]}, where word_1 is, for example, 'soup'.
Thanks!
Yes, 50264 is the number of tokens in the dictionary and 1024 is the embedding dimension.
For the main vocabulary (i.e., everything after the first 4 special symbols), you can use bart.decode to map them to the raw byte-pair encoded symbols:
import torch

# map the first few regular token ids to their 1024-dim embedding vectors
embed = {
    bart.decode(torch.tensor([i])):
        bart.state_dict()['model.encoder.embed_tokens.weight'][i]
    for i in range(4, 10)
}
embed.keys() # dict_keys(['.', ' the', ',', ' to', ' and', ' of'])
Note that these are not always "words" in the traditional sense but byte-pair encoded (BPE) symbols. Because we use a byte-level BPE, they may not even be full unicode characters. For example:
print(bart.decode(torch.tensor([17])))
# �
Also, the first four symbols are beginning-of-sentence, pad, end-of-sentence and unknown. You can access them via bart.task.source_dictionary if you need:
print([bart.task.source_dictionary[i] for i in range(4)])
# ['<s>', '<pad>', '</s>', '<unk>']
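If you want the full {token: vector} dict from the original question, a rough sketch along these lines should work (untested; note that several byte-level symbols decode to the same replacement character, so dict keys can collide, and a few ids at the very end of the dictionary may not round-trip through the BPE decoder):
import torch

weights = bart.state_dict()['model.encoder.embed_tokens.weight']  # (50264, 1024)
src_dict = bart.task.source_dictionary

embed = {}
for i in range(weights.shape[0]):
    if i < 4:
        # special symbols: <s>, <pad>, </s>, <unk>
        key = src_dict[i]
    else:
        try:
            key = bart.decode(torch.tensor([i]))
        except Exception:
            # some ids at the end of the dictionary (e.g. <mask> or padding
            # symbols) may not decode cleanly; fall back to the raw symbol
            key = src_dict[i]
    embed[key] = weights[i]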
Thanks @myleott. I'm interested in visualising the embedding array, similar to what has been done here with word2vec: https://projector.tensorflow.org/.
Am I right that I should theoretically be able to take the array X below, standardize it, and visualise it in 2D space using PCA, for example? Or is there something I'm missing?
>>> X = np.array(bart.state_dict()['model.encoder.embed_tokens.weight'])
>>> X.shape
(50264, 1024)
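In code, what I have in mind is roughly the following (scikit-learn and matplotlib are just what I'd reach for here, not anything BART-specific):
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = bart.state_dict()['model.encoder.embed_tokens.weight'].cpu().numpy()  # (50264, 1024)

# standardize each of the 1024 dimensions, then project onto the first two
# principal components for a 2D scatter plot
X_std = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_std)

plt.scatter(X_2d[:, 0], X_2d[:, 1], s=1)
plt.show()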