Hi~
How do I average sub-word embeddings to obtain word embeddings?
I only want word-level embeddings instead of sub-word-level ones; how can I get them?
Is there any tokenizer that provides a method to output the indices/masks of the sub-words belonging to each word, or something similar?
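For the index/mask part: a HuggingFace fast tokenizer exposes the token-to-word mapping via `word_ids()`, so you can group hidden states by word and mean-pool them. A minimal sketch, assuming `bert-base-uncased` and the `transformers` `AutoTokenizer`/`AutoModel` API:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The puppeteer performed."
encoding = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (seq_len, hidden_size)

# word_ids() maps each token position to the index of the word it came from
# (None for special tokens like [CLS]/[SEP]).
word_ids = encoding.word_ids()

word_embeddings = []
for word_idx in sorted(set(i for i in word_ids if i is not None)):
    positions = [pos for pos, i in enumerate(word_ids) if i == word_idx]
    word_embeddings.append(hidden[positions].mean(dim=0))

word_embeddings = torch.stack(word_embeddings)  # (num_words, hidden_size)
```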
You could feed the word on its own and use the resulting sentence embedding as the word embedding.
For example, the input
"puppeteer"
is tokenized as
'[CLS]', 'puppet', '##eer', '[SEP]'
and you then take the embedding computed over this token list as the output.
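A rough sketch of that idea, assuming `bert-base-uncased`: feed the word alone and average the hidden states of its real sub-word tokens, skipping the special tokens.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("puppeteer", return_tensors="pt")
# tokens: ['[CLS]', 'puppet', '##eer', '[SEP]']
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)

# Mean over the sub-word tokens only (positions 1..-2 skip [CLS] and [SEP]).
word_embedding = hidden[1:-1].mean(dim=0)
```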
I have a similar use case. I ran a simple experiment and observed that, when a whole sentence is fed in, the cosine similarities between a word's sub-word embeddings [subword1, subword2, subword3, ...], e.g. [subword1, subword2], [subword1, subword3], ..., tend to be above 90%.
So summing or averaging the sub-word embeddings does not change much.
Btw, I tested this with RoBERTa models, and I observed quite different results for BERT models.
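A small sketch to reproduce that check, assuming `roberta-base` and a full sentence as input; the target word index below is just an illustrative guess for where "puppeteer" lands in the sentence:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # fast tokenizer
model = AutoModel.from_pretrained("roberta-base")

sentence = "The puppeteer performed at the festival."
encoding = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]

# Collect the token positions that belong to one word via word_ids().
word_ids = encoding.word_ids()
target_word = 1  # assumed index of "puppeteer" in this sentence
positions = [p for p, w in enumerate(word_ids) if w == target_word]

# Pairwise cosine similarity between the word's sub-word embeddings.
subs = hidden[positions]
for i in range(len(subs)):
    for j in range(i + 1, len(subs)):
        sim = F.cosine_similarity(subs[i], subs[j], dim=0).item()
        print(f"cos(sub{i}, sub{j}) = {sim:.3f}")
```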
Take a look at how bert-sense does it :)
https://github.com/uhh-lt/bert-sense/blob/bfecb3c0e677d36ccfab4e2131ef9183995efaef/BERT_Model.py#L342