I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of vocab.txt, but I also notice there are thousands of single foreign (Unicode) characters in the file, which I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and have the model still work correctly?
As long as the size of the vocabulary remains unchanged, you will be able to continue training from a saved checkpoint. Of course you will have to re-generate the pre-training data after modifying the vocabulary. Then continue pre-training long enough so that the model learns new word embeddings.
I obtained better results on a number of tasks using this approach.
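A minimal sketch of the vocabulary edit described above, assuming the vocabulary has been read into a list with one token per line of vocab.txt. The function name `replace_unused_tokens` and the toy vocabulary are hypothetical and only illustrate the idea: new words overwrite `[unused#]` slots in place, so the list length (and therefore the embedding matrix shape in the checkpoint) stays the same.

```python
from typing import List


def replace_unused_tokens(vocab: List[str], new_words: List[str]) -> List[str]:
    """Return a copy of `vocab` with [unused#] slots overwritten by `new_words`.

    Hypothetical helper: the vocabulary size is never changed, so a saved
    checkpoint with a fixed-size embedding table can still be loaded.
    """
    existing = set(vocab)

    # Drop words that are already in the vocabulary and de-duplicate,
    # preserving the order in which the new words were given.
    to_add: List[str] = []
    for word in new_words:
        if word not in existing and word not in to_add:
            to_add.append(word)

    # Positions of placeholder tokens such as "[unused0]", "[unused1]", ...
    slots = [
        i for i, tok in enumerate(vocab)
        if tok.startswith("[unused") and tok.endswith("]")
    ]
    if len(to_add) > len(slots):
        raise ValueError(
            f"{len(to_add)} new words but only {len(slots)} [unused#] slots"
        )

    updated = list(vocab)
    for i, word in zip(slots, to_add):
        updated[i] = word

    # The whole point of the approach: the size must not change.
    assert len(updated) == len(vocab)
    return updated


# Toy vocabulary standing in for the contents of vocab.txt (assumption).
demo_vocab = ["[PAD]", "[unused0]", "[unused1]", "[unused2]",
              "[UNK]", "[CLS]", "[SEP]", "the", "cell"]
demo_result = replace_unused_tokens(demo_vocab, ["cytokine", "cell", "apoptosis"])
print(demo_result)
```

Writing the result back out one token per line gives the modified vocab.txt. The embeddings at the reused positions carry no useful meaning until the pre-training data is regenerated and pre-training is continued, as noted above.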
@gaphex
Hello, can I ask you a question?
If I want to use my own corpus for additional pre-training, starting from the checkpoint provided by Google, do I need to do the word segmentation and clause processing on my own corpus and then generate a new vocabulary? My corpus is not in English.
I hope I can get your answer.
Hi @gaphex
I'm new to BERT and I want to add domain-specific vocabulary to the BERT model's vocabulary. I know I have to replace the first 1000 lines with my vocabulary.
After adding my domain-specific words to those unused lines, how do I train the model?
Could you please share the code?
@ali4friends71 you could use the code from the Colab notebook, beginning from step 5. Check out the article for further instructions.
@gaphex Thanks a lot.
When running the code, I got an error saying that I don't have access to cloud storage.
Do I have to create a GCS bucket and use it when running the code?
And is there any way to save and load the model other than a GCS bucket?