Is it possible to fine-tune BertTokenizer so that the vocab.txt file it uses gets updated for my custom dataset, or do I need to retrain the BERT model from scratch?
You can add new words to the tokenizer with add_tokens:
tokenizer.add_tokens(['newWord', 'newWord2'])
After that you need to resize the model's embedding layer so it matches the new vocabulary size:
model.resize_token_embeddings(len(tokenizer))
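Putting the two steps together, a minimal sketch might look like this (the checkpoint name and the example words are just placeholders):

```python
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Add domain-specific words that are missing from the original vocab.txt
num_added = tokenizer.add_tokens(["newWord", "newWord2"])
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")

# Grow the embedding matrix so the new token ids map to (randomly initialized) rows
model.resize_token_embeddings(len(tokenizer))
```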
Note that this simply adds a new token to the vocabulary but doesn't train its embedding (obviously). This implies that your results will be quite poor if your training data contains a lot of newly added (untrained) tokens.
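A small sketch of what "untrained" means in practice; the checkpoint, the added token, and the warm-start trick are illustrative, not something from this thread:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["newWord"])
model.resize_token_embeddings(len(tokenizer))

new_id = tokenizer.convert_tokens_to_ids("newWord")
emb = model.get_input_embeddings()      # nn.Embedding of shape (vocab_size, hidden_size)
print(emb.weight[new_id][:5])           # randomly initialized, nothing learned yet

# One possible (hypothetical) mitigation: warm-start the new row from a related
# token that already exists in the vocabulary, then fine-tune as usual.
with torch.no_grad():
    emb.weight[new_id] = emb.weight[tokenizer.convert_tokens_to_ids("word")].clone()
```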
@cronoik Once the dictionary is resized, don't I have to train the tokenizer model again?
@BramVanroy Umm... so what would be the solution if I have a custom dataset? How can I retrain this BertTokenizer model to get a new vocab.txt file?
What do you mean by tokenizer model? The tokenizer, in simple terms, is a class which splits your text into tokens based on a large vocabulary. What you have to train is the embedding layer of your model, because the weights of the new tokens start out random. This happens during the training of your model (but the new tokens could end up undertrained).
In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch (definitely for the latter). Here is a blog post from Hugging Face which shows you how to train a tokenizer + model for Esperanto: link. It really depends on your data (e.g. number of new tokens, importance of the new tokens, relation between the tokens...).
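If you do go the from-scratch route, training a new WordPiece vocabulary with the tokenizers library could look roughly like this (the corpus file, vocab size, and output directory are assumptions, not from the thread):

```python
import os
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["my_corpus.txt"],            # plain-text file(s) with your custom data
    vocab_size=30_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

os.makedirs("my-new-tokenizer", exist_ok=True)
tokenizer.save_model("my-new-tokenizer")    # writes a new vocab.txt into that directory
```

The resulting vocab.txt can then be loaded with BertTokenizer.from_pretrained("my-new-tokenizer"), but since the token ids no longer match the original checkpoint, the BERT model itself has to be pretrained from scratch to go with it.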