Transformers: Unseen Vocab

Created on 28 Nov 2018 · 3 comments · Source: huggingface/transformers

Thank you so much for this well-documented and easy-to-understand implementation! I remember meeting you at WeCNLP and am so happy to see you push out usable implementations of the SOTA in PyTorch for the community!

I have a question: the convert_tokens_to_ids method in BertTokenizer, which provides input to the BertEncoder, uses an OrderedDict for its vocab attribute and raises a KeyError (e.g. KeyError: 'ketorolac') for any word not in the vocab. Can I create another vocab object that adds unseen words and use that in the tokenizer? Does the pretrained BertEncoder depend on the default id mapping?

It seems to me that, ideally, this repo would eventually incorporate character-level embeddings to deal with unseen words, but I don't know whether that is necessary for this use case.

All 3 comments

If you tokenize the input properly (call tokenize before convert_tokens_to_ids), it automatically falls back to subword/character-level(-like) tokens for unseen words.
You can add new words to the vocabulary, but you'll have to train the corresponding embeddings.
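A minimal sketch of the "tokenize before convert_tokens_to_ids" flow described above, assuming the bert-base-uncased checkpoint. In the original 2018 release the import came from pytorch_pretrained_bert rather than transformers, and the exact wordpiece split shown in the comments is illustrative, not guaranteed.

```python
# Sketch: tokenize first so out-of-vocab words fall back to wordpieces.
from transformers import BertTokenizer  # was pytorch_pretrained_bert in 2018

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "the patient was given ketorolac"

# Don't look up whole words directly: an unseen word either raises a KeyError
# (older releases, as in this issue) or is silently mapped to [UNK].
# tokenizer.convert_tokens_to_ids(text.split())

# Do tokenize first, so "ketorolac" is split into subword pieces,
# e.g. something like ["keto", "##rol", "##ac"] (exact pieces may vary).
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)
print(ids)
```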

Hi @siddsach,
Thanks for your kind words!
@artemisart is right: BPE progressively falls back to character-level embeddings for unseen words.
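For the other option mentioned above (adding new words to the vocabulary and training their embeddings), here is a hedged sketch using add_tokens and resize_token_embeddings from later releases of the transformers library; the 2018 pytorch_pretrained_bert release did not expose these helpers, and "ketorolac" is just the example word from the issue.

```python
# Sketch: register a domain-specific word as a whole token, then grow the
# embedding matrix so the model has a (randomly initialized) row for it.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Add the new word to the tokenizer's vocabulary.
num_added = tokenizer.add_tokens(["ketorolac"])

# Resize the embedding matrix to match the enlarged vocab. The new rows are
# randomly initialized, so they still need to be fine-tuned on in-domain data
# before they carry any useful signal.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("the patient was given ketorolac"))
# With these assumptions, "ketorolac" is now kept as a single token.
```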

If you tokenize the input properly (call tokenize before convert_tokens_to_ids), it automatically falls back to subword/character-level(-like) tokens for unseen words.
You can add new words to the vocabulary, but you'll have to train the corresponding embeddings.

Hi, what do you mean by "tokenize the input properly (tokenize before convert_tokens)"?
Could you point to a tokenization sample (before and after) or some sample code? Thank you.
