Is it possible to fine-tune BertTokenizer so that the vocab.txt file it uses gets updated for my custom dataset, or do I need to retrain the BERT model from scratch?
You can add new words to the tokenizer with add_tokens:
tokenizer.add_tokens(['newWord', 'newWord2'])
After that you need to resize the model's embedding layer so it matches the new vocabulary size:
model.resize_token_embeddings(len(tokenizer))
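Putting the two steps together, a minimal sketch might look like this (the checkpoint name and the example words are just placeholders):

```python
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Add domain-specific words that are missing from the original vocab.txt
num_added = tokenizer.add_tokens(["newWord", "newWord2"])
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")

# Grow the embedding matrix so the new token ids map to (randomly initialized) rows
model.resize_token_embeddings(len(tokenizer))
```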
Note that this simply adds a new token to the vocabulary but doesn't train its embedding (obviously). This implies that your results will be quite poor if your training data contains a lot of newly added (untrained) tokens.
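A small sketch of what "untrained" means in practice; the checkpoint, the added token, and the warm-start trick are illustrative, not something from this thread:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["newWord"])
model.resize_token_embeddings(len(tokenizer))

new_id = tokenizer.convert_tokens_to_ids("newWord")
emb = model.get_input_embeddings()      # nn.Embedding of shape (vocab_size, hidden_size)
print(emb.weight[new_id][:5])           # randomly initialized, nothing learned yet

# One possible (hypothetical) mitigation: warm-start the new row from a related
# token that already exists in the vocabulary, then fine-tune as usual.
with torch.no_grad():
    emb.weight[new_id] = emb.weight[tokenizer.convert_tokens_to_ids("word")].clone()
```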
@cronoik Once the dictionary is resized, don't I have to train the tokenizer model again?
@BramVanroy Umm... so what would be the solution if I have a custom dataset? How can I retrain this BertTokenizer model to get a new vocab.txt file?
What do you mean by tokenizer model? The tokenizer, in simple terms, is a class which splits your text into tokens based on a large vocabulary. What you have to train is the embedding layer of your model, because the weights of the new tokens start out random. This happens during the training of your model (but the new tokens could end up undertrained).
In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch (definitely for the latter). Here is a blog post from Hugging Face which shows you how to train a tokenizer + model for Esperanto: link. It really depends on your data (e.g. number of new tokens, importance of the new tokens, relation between the tokens...).
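If you do go the from-scratch route, training a new WordPiece vocabulary with the tokenizers library could look roughly like this (the corpus file, vocab size, and output directory are assumptions, not from the thread):

```python
import os
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["my_corpus.txt"],            # plain-text file(s) with your custom data
    vocab_size=30_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

os.makedirs("my-new-tokenizer", exist_ok=True)
tokenizer.save_model("my-new-tokenizer")    # writes a new vocab.txt into that directory
```

The resulting vocab.txt can then be loaded with BertTokenizer.from_pretrained("my-new-tokenizer"), but since the token ids no longer match the original checkpoint, the BERT model itself has to be pretrained from scratch to go with it.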