Transformers: How to average sub-words embeddings to obtain word embeddings?

Created on 6 Dec 2019 · 4 comments · Source: huggingface/transformers

Hi~

How can I average sub-word embeddings to obtain word embeddings?
I only want word-level embeddings instead of sub-word-level ones; how can I get them?

Is there a tokenizer method that outputs the indices/mask of the sub-words belonging to each word, or something similar?
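A minimal sketch of one way to do this, assuming a recent version of transformers where fast tokenizers expose `word_ids()` (this API postdates the original thread); the model name and sentence are only examples:

```python
# Sketch: average sub-word embeddings into word embeddings using a *fast*
# tokenizer's word_ids() mapping, which maps each token position back to the
# index of the whitespace word it came from.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The puppeteer reorganized the storeroom"
encoding = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (seq_len, hidden_size)

word_ids = encoding.word_ids()  # e.g. [None, 0, 1, 1, 2, ...]; None = special token
word_embeddings = []
for word_idx in sorted(set(i for i in word_ids if i is not None)):
    # Gather every sub-word position belonging to this word and average them.
    positions = [pos for pos, idx in enumerate(word_ids) if idx == word_idx]
    word_embeddings.append(hidden[positions].mean(dim=0))

word_embeddings = torch.stack(word_embeddings)  # (num_words, hidden_size)
print(word_embeddings.shape)
```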

wontfix

Most helpful comment

I have a similar use case. I ran a simple experiment and observed that, when a whole sentence is fed in, the cosine similarity between a word's sub-word embeddings ([subword1, subword2], [subword1, subword3], ...) tends to be above 90%,
so summing or averaging the sub-word embeddings doesn't change much.
By the way, I tested this with RoBERTa models and observed quite different results for BERT models.
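A rough sketch of how such a check could be reproduced (the checkpoint name, example sentence, and chosen word are assumptions, not from the thread):

```python
# Compare the contextual embeddings of the sub-words that make up a single
# word inside a full sentence, via pairwise cosine similarity.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

name = "roberta-base"  # swap in a BERT checkpoint to compare behaviours
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True)
model = AutoModel.from_pretrained(name)

encoding = tokenizer("The puppeteer smiled", return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]

# Positions of the sub-words belonging to word index 1 ("puppeteer").
word_ids = encoding.word_ids()
positions = [pos for pos, idx in enumerate(word_ids) if idx == 1]

# Cosine similarity between the first sub-word and each of the others.
for pos in positions[1:]:
    sim = F.cosine_similarity(hidden[positions[0]], hidden[pos], dim=0)
    print(positions[0], pos, sim.item())
```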

All 4 comments

You may use the word alone as the input and take the resulting sentence embedding as the word embedding.
For example, for the input
"puppeteer"
the tokens are
'[CLS]', 'puppet', '##eer', '[SEP]'
and you then take the embedding computed over this token list as the output.
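A hedged sketch of that suggestion: feed just the word, then pool the token embeddings (here, a mean over the non-special tokens) as the word embedding. The checkpoint name and mean pooling are assumptions, not something specified in the comment.

```python
# Embed a single word by averaging its sub-word embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("puppeteer", return_tensors="pt")
# Tokens: ['[CLS]', 'puppet', '##eer', '[SEP]']
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)

# Skip the special tokens at the first and last positions, average the rest.
word_embedding = hidden[1:-1].mean(dim=0)
print(word_embedding.shape)  # torch.Size([768])
```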


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
