Hi~
How do I average sub-word embeddings to obtain word embeddings?
I only want word-level embeddings instead of sub-word-level ones; how can I get them?
Is there any tokenizer that provides a method to output the indices/masks of the sub-words belonging to each word, or something similar?
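For the index/mask part: a HuggingFace fast tokenizer exposes the token-to-word mapping via `word_ids()`, so you can group hidden states by word and mean-pool them. A minimal sketch, assuming `bert-base-uncased` and the `transformers` `AutoTokenizer`/`AutoModel` API:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The puppeteer performed."
encoding = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (seq_len, hidden_size)

# word_ids() maps each token position to the index of the word it came from
# (None for special tokens like [CLS]/[SEP]).
word_ids = encoding.word_ids()

word_embeddings = []
for word_idx in sorted(set(i for i in word_ids if i is not None)):
    positions = [pos for pos, i in enumerate(word_ids) if i == word_idx]
    word_embeddings.append(hidden[positions].mean(dim=0))

word_embeddings = torch.stack(word_embeddings)  # (num_words, hidden_size)
```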
You could feed the word on its own and use the resulting sentence embedding as the word embedding.
For example, the input
"puppeteer"
is tokenized as
'[CLS]', 'puppet', '##eer', '[SEP]'
and you then take the embedding computed over this token list as the output.
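A rough sketch of that idea, assuming `bert-base-uncased`: feed the word alone and average the hidden states of its real sub-word tokens, skipping the special tokens.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("puppeteer", return_tensors="pt")
# tokens: ['[CLS]', 'puppet', '##eer', '[SEP]']
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)

# Mean over the sub-word tokens only (positions 1..-2 skip [CLS] and [SEP]).
word_embedding = hidden[1:-1].mean(dim=0)
```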
I have a similar use case. I ran a simple experiment and observed that, when a whole sentence is fed in, the cosine similarities between a word's sub-word embeddings [subword1, subword2, subword3, ...], e.g. [subword1, subword2], [subword1, subword3], ..., tend to be above 90%.
So summing or averaging the sub-word embeddings does not change much.
Btw, I tested this with RoBERTa models, and I observed quite different results for BERT models.
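A small sketch to reproduce that check, assuming `roberta-base` and a full sentence as input; the target word index below is just an illustrative guess for where "puppeteer" lands in the sentence:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # fast tokenizer
model = AutoModel.from_pretrained("roberta-base")

sentence = "The puppeteer performed at the festival."
encoding = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]

# Collect the token positions that belong to one word via word_ids().
word_ids = encoding.word_ids()
target_word = 1  # assumed index of "puppeteer" in this sentence
positions = [p for p, w in enumerate(word_ids) if w == target_word]

# Pairwise cosine similarity between the word's sub-word embeddings.
subs = hidden[positions]
for i in range(len(subs)):
    for j in range(i + 1, len(subs)):
        sim = F.cosine_similarity(subs[i], subs[j], dim=0).item()
        print(f"cos(sub{i}, sub{j}) = {sim:.3f}")
```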
Take a look at how bert-sense does it :)
https://github.com/uhh-lt/bert-sense/blob/bfecb3c0e677d36ccfab4e2131ef9183995efaef/BERT_Model.py#L342