I want to use lm_finetuning for BERT. A potential issue is vocab_size. Since I'm using Hinglish data (Hindi text written using the English alphabet), there can be new words which are not present in the English vocabulary. According to the BERT docs:
If using your own vocabulary, make sure to change vocab_size in bert_config.json. If you use a larger vocabulary without changing this, you will likely get NaNs when training on GPU or TPU due to unchecked out-of-bounds access.
How do I do this?
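For reference, my understanding is that vocab_size simply has to match the number of entries in vocab.txt. Here is a minimal sketch of what I think the doc means (not tested; the file paths are placeholders for wherever the model files live):

```python
# Keep vocab_size in bert_config.json in sync with the number of lines in vocab.txt.
import json

with open("vocab.txt", encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)

with open("bert_config.json") as f:
    config = json.load(f)

config["vocab_size"] = vocab_size  # 30522 for the stock bert-base-uncased vocab

with open("bert_config.json", "w") as f:
    json.dump(config, f, indent=2)
```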
Hi,
Also, it should produce vocab.txt and bert_config.json along with pytorch_model.bin.
How are you getting those?
We did this for SciBERT, and you might find this discussion useful https://github.com/allenai/scibert/issues/29
lm_finetuning produces pytorch_model.bin alone (and not bert_config.json).
what do you think @Rocketknight1 ?
Model name '../../models/bert/' was not found in model name list (bert-base-uncased,
bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,
bert-base-multilingual-cased, bert-base-chinese). We assumed '../../models/bert/vocab.txt'
was a path or url but couldn't find any file associated to this path or url.
Yet your fine-tuning script does not produce any such file.
The lm_finetuning script assumes you're using one of the existing models in the repo, that you're fine-tuning it for a narrower domain in the same language, and that the saved pytorch_model.bin is basically just updated weights for that model - it doesn't support changes in vocab. Altering the vocab and config would probably require more extensive retraining of the model, possibly from scratch, which this repo isn't supporting yet because of the requirement for TPUs to do it quickly enough.
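In other words, the saved pytorch_model.bin is meant to be loaded back together with the original model's vocab and config, along these lines (a sketch, not tested; the output path is a placeholder for whatever you passed to the script):

```python
# Load the fine-tuned weights back on top of the *original* bert-base-uncased vocab/config.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # vocab is unchanged
state_dict = torch.load("lm_finetuning_output/pytorch_model.bin", map_location="cpu")
model = BertModel.from_pretrained("bert-base-uncased", state_dict=state_dict)
```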
I can contribute code if @thomwolf thinks it's relevant, but I'm not sure if or how we should be supporting this use-case right now. It might have to wait until we add from-scratch training and TPU support.
Hi @Rocketknight1
Does this mean that neither pytorch_BERT nor the google_BERT implementation supports fine-tuning with new vocabulary or new sentences?
I would like to train a German model on domain-specific text: the number of German words in the multilingual model is relatively small, so I cannot access hidden states for out-of-vocabulary words, even when using synonyms generated with FastText, since those synonyms are also out of vocabulary. Is there any suggestion you can give me to alleviate this problem?
I see that Issue 405 has some suggestions, together with Issue 9.
Can I really achieve my goal by appending my vocabulary to the end of vocab.txt and adjusting the config.json accordingly? Do I need to use the Google BERT model and then convert_tf_checkpoint_to_pytorch.py, or can I use this repo directly somehow?
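Concretely, what I have in mind is something like the sketch below (the word list and paths are placeholders, and I have not tested this):

```python
# Append domain-specific words to vocab.txt and grow the word-embedding matrix to match.
import torch
from pytorch_pretrained_bert import BertModel

new_words = ["domainword1", "domainword2"]  # hypothetical domain-specific tokens

# 1. Append the new tokens to the end of vocab.txt and increase vocab_size in
#    bert_config.json by len(new_words).
with open("vocab.txt", "a", encoding="utf-8") as f:
    f.writelines(w + "\n" for w in new_words)

# 2. The pretrained word-embedding matrix also has to grow to the new size:
#    copy the existing rows and randomly initialise the rows for the new tokens.
model = BertModel.from_pretrained("bert-base-multilingual-cased")
old_emb = model.embeddings.word_embeddings
new_emb = torch.nn.Embedding(old_emb.num_embeddings + len(new_words), old_emb.embedding_dim)
new_emb.weight.data[: old_emb.num_embeddings] = old_emb.weight.data
model.embeddings.word_embeddings = new_emb
```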
Hi @gro1m
Any luck with adding vocab to bert_pytorch?
I'm using https://github.com/kwonmha/bert-vocab-builder to build the vocab. I will share my experience.
Hi, I am trying to use SciBERT, the version with its own vocab. I am wondering how to point to that vocab.txt file rather than the original one.
Edit
Found the answer: https://github.com/huggingface/pytorch-transformers/issues/69#issuecomment-443215315
You can just pass a direct path to it.
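For anyone who hits this later, a minimal sketch of what that looks like (the local directory is a placeholder and is assumed to contain SciBERT's vocab.txt plus the converted config and weights; pytorch_pretrained_bert is used here, but the renamed package works the same way):

```python
from pytorch_pretrained_bert import BertTokenizer, BertModel

# from_pretrained accepts a local directory instead of one of the listed model names
scibert_dir = "./scibert_scivocab_uncased/"
tokenizer = BertTokenizer.from_pretrained(scibert_dir)  # picks up scibert_dir/vocab.txt
model = BertModel.from_pretrained(scibert_dir)
```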
@bhoomit
To achieve BERT-level results for Hinglish, would fine-tuning the English BERT model with Hinglish data (approx. 200 MB) achieve good results, or would it be best to train the model from scratch for Hinglish?