BERT: Can one expand the vocabulary for fine-tuning by replacing foreign Unicode characters?

Created on 6 Feb 2019 · 5 Comments · Source: google-research/bert

I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of vocab.txt, but I also notice that the file contains thousands of single foreign (Unicode) characters that I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and have the model still work correctly?

bsugerman

Most helpful comment

As long as the size of the vocabulary remains unchanged, you will be able to continue training from a saved checkpoint. Of course, you will have to regenerate the pre-training data after modifying the vocabulary. Then continue pre-training long enough for the model to learn the new word embeddings.
I obtained better results on a number of tasks using this approach.

gaphex on 9 Feb 2019

👍5
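
The vocabulary edit described above could look roughly like the following. This is a minimal sketch, not code from the thread: the function names `replace_unused_tokens` and `rewrite_vocab_file` are hypothetical, and it assumes vocab.txt holds one token per line with placeholder lines of the form `[unused0]`, `[unused1]`, and so on. It overwrites placeholder lines in place, so the number of lines (and therefore the shape of the embedding matrix in the checkpoint) stays the same.

```python
import re

# Assumption: placeholder tokens look like "[unused0]", "[unused1]", ...
UNUSED_SLOT = re.compile(r"^\[unused\d+\]$")


def replace_unused_tokens(vocab_lines, new_words):
    """Return a copy of vocab_lines with [unused#] slots overwritten by new_words.

    The list length never changes, so every existing token keeps its index.
    """
    existing = set(vocab_lines)
    # Drop duplicates (keeping order) and words that are already in the vocabulary.
    to_add = [w for w in dict.fromkeys(new_words) if w and w not in existing]
    slots = [i for i, tok in enumerate(vocab_lines) if UNUSED_SLOT.match(tok)]
    if len(to_add) > len(slots):
        raise ValueError(
            f"{len(to_add)} new words but only {len(slots)} [unused#] slots"
        )
    updated = list(vocab_lines)
    for index, word in zip(slots, to_add):
        updated[index] = word
    return updated


def rewrite_vocab_file(src_path, dst_path, new_words):
    """Read a vocab file, fill its [unused#] slots, and write the result."""
    with open(src_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    updated = replace_unused_tokens(lines, new_words)
    with open(dst_path, "w", encoding="utf-8") as f:
        f.write("\n".join(updated) + "\n")
    return len(updated)  # same as the original line count
```

The same index-preserving replacement could in principle be applied to other lines the corpus never uses, but those rows start from the embeddings of the tokens they replace, which is why the comment above recommends further pre-training.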

All 5 comments

@gaphex
Hello, can I ask you a question?
If I want to run additional pre-training on my own corpus, starting from the checkpoint provided by Google, do I need to perform word segmentation and sentence (clause) splitting on my corpus and then generate a new vocabulary? My corpus is not in English.
I hope to get your answer.

space-N on 14 Feb 2019

Hi @gaphex,
I'm new to BERT and I want to add domain-specific vocabulary to the BERT model's vocabulary. I know I have to replace the [unused#] lines (roughly the first 1000) with my vocabulary.
After adding my domain-specific words in those unused lines, how do I train the model?
Could you please share the code?

ali4friends71 on 26 Jun 2020

@ali4friends71 you could use the code from the Colab notebook, beginning from step 5. Check out the article for further instructions.

gaphex on 26 Jun 2020
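
For readers who cannot open the notebook, the two steps mentioned earlier in the thread (regenerate the pre-training data, then continue pre-training from the released checkpoint) map onto the `create_pretraining_data.py` and `run_pretraining.py` scripts in the google-research/bert repository. The sketch below only assembles the command lines; it is not the notebook's code. The helper name `build_pretraining_commands`, all paths, and the step counts and learning rate are illustrative assumptions, and `bert_dir` is assumed to contain the edited vocab.txt next to the original `bert_config.json` and `bert_model.ckpt` files.

```python
import shlex


def build_pretraining_commands(bert_dir, corpus_txt, work_dir,
                               max_seq_length=128,
                               max_predictions_per_seq=20,
                               num_train_steps=10000):
    """Build argument lists for the two BERT scripts (hypothetical helper)."""
    tfrecord = f"{work_dir}/pretrain.tfrecord"

    # Step 1: re-tokenize the corpus with the edited vocabulary.
    create_data = [
        "python", "create_pretraining_data.py",
        f"--input_file={corpus_txt}",
        f"--output_file={tfrecord}",
        f"--vocab_file={bert_dir}/vocab.txt",  # the modified vocab, same size
        "--do_lower_case=True",                # assumption: an uncased model
        f"--max_seq_length={max_seq_length}",
        f"--max_predictions_per_seq={max_predictions_per_seq}",
        "--masked_lm_prob=0.15",
        "--dupe_factor=5",
    ]

    # Step 2: continue pre-training from the released checkpoint.
    run_pretraining = [
        "python", "run_pretraining.py",
        f"--input_file={tfrecord}",
        f"--output_dir={work_dir}/pretraining_output",
        "--do_train=True",
        f"--bert_config_file={bert_dir}/bert_config.json",
        f"--init_checkpoint={bert_dir}/bert_model.ckpt",
        "--train_batch_size=32",
        # These two values must match the ones used in step 1.
        f"--max_seq_length={max_seq_length}",
        f"--max_predictions_per_seq={max_predictions_per_seq}",
        f"--num_train_steps={num_train_steps}",
        f"--num_warmup_steps={num_train_steps // 10}",
        "--learning_rate=2e-5",
    ]
    return create_data, run_pretraining


if __name__ == "__main__":
    # Hypothetical local paths; print the commands instead of running them.
    for cmd in build_pretraining_commands(
            "uncased_L-12_H-768_A-12", "corpus.txt", "/tmp/bert_work"):
        print(shlex.join(cmd))
```

The input corpus for `create_pretraining_data.py` is plain text with one sentence per line and a blank line between documents. How many steps count as "long enough" depends on the corpus and is not something this sketch can decide.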

@gaphex Thanks a lot.
When running the code, I got an error saying that I don't have access to cloud storage.
Should I create a GCS bucket and use it when running the code?
Is there any other way to save and load the model besides a GCS bucket?

ali4friends71 on 27 Jun 2020
