I want to use lm_finetuning for BERT. A potential issue is vocab_size. Since I'm using Hinglish data (Hindi text written using the English alphabet), there can be new words which are not present in the English vocabulary. According to the BERT docs:
If using your own vocabulary, make sure to change vocab_size in bert_config.json. If you use a larger vocabulary without changing this, you will likely get NaNs when training on GPU or TPU due to unchecked out-of-bounds access.
How do I do this?
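For reference, my understanding is that vocab_size simply has to match the number of entries in vocab.txt. Here is a minimal sketch of what I think the doc means (not tested; the file paths are placeholders for wherever the model files live):

```python
# Keep vocab_size in bert_config.json in sync with the number of lines in vocab.txt.
import json

with open("vocab.txt", encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)

with open("bert_config.json") as f:
    config = json.load(f)

config["vocab_size"] = vocab_size  # 30522 for the stock bert-base-uncased vocab

with open("bert_config.json", "w") as f:
    json.dump(config, f, indent=2)
```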
Hi,
Also, it should produce vocab.txt and bert_config.json along with pytorch_model.bin.
How are you getting those?
We did this for SciBERT, and you might find this discussion useful https://github.com/allenai/scibert/issues/29
lm_finetuning produces pytorch_model.bin alone (and not bert_config.json).
what do you think @Rocketknight1 ?
Model name '../../models/bert/' was not found in model name list (bert-base-uncased,
bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,
bert-base-multilingual-cased, bert-base-chinese). We assumed '../../models/bert/vocab.txt'
was a path or url but couldn't find any file associated to this path or url.
Yet your fine-tuning script does not produce any such file.
The lm_finetuning script assumes you're using one of the existing models in the repo, that you're fine-tuning it for a narrower domain in the same language, and that the saved pytorch_model.bin is basically just updated weights for that model - it doesn't support changes in vocab. Altering the vocab and config would probably require more extensive retraining of the model, possibly from scratch, which this repo isn't supporting yet because of the requirement for TPUs to do it quickly enough.
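In other words, the saved pytorch_model.bin is meant to be loaded back together with the original model's vocab and config, along these lines (a sketch, not tested; the output path is a placeholder for whatever you passed to the script):

```python
# Load the fine-tuned weights back on top of the *original* bert-base-uncased vocab/config.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # vocab is unchanged
state_dict = torch.load("lm_finetuning_output/pytorch_model.bin", map_location="cpu")
model = BertModel.from_pretrained("bert-base-uncased", state_dict=state_dict)
```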
I can contribute code if @thomwolf thinks it's relevant, but I'm not sure if or how we should be supporting this use-case right now. It might have to wait until we add from-scratch training and TPU support.
Hi @Rocketknight1
Does this mean that neither pytorch_BERT nor the google_BERT implementation supports fine-tuning with new vocabulary or new sentences?
I would like to train a German model on domain-specific text: the number of German words in the multilingual model is relatively small, so I cannot access hidden states for out-of-vocabulary words, even when using synonyms generated with FastText, since those synonyms are also out of vocabulary. Is there any suggestion you can give me to alleviate this problem?
I see that Issue 405 has some suggestions, together with Issue 9.
Can I really achieve my goal by appending my vocabulary to the end of vocab.txt and adjusting the config.json accordingly? Do I need to use the Google BERT model and then convert_tf_checkpoint_to_pytorch.py, or can I use this repo directly somehow?
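Concretely, what I have in mind is something like the sketch below (the word list and paths are placeholders, and I have not tested this):

```python
# Append domain-specific words to vocab.txt and grow the word-embedding matrix to match.
import torch
from pytorch_pretrained_bert import BertModel

new_words = ["domainword1", "domainword2"]  # hypothetical domain-specific tokens

# 1. Append the new tokens to the end of vocab.txt and increase vocab_size in
#    bert_config.json by len(new_words).
with open("vocab.txt", "a", encoding="utf-8") as f:
    f.writelines(w + "\n" for w in new_words)

# 2. The pretrained word-embedding matrix also has to grow to the new size:
#    copy the existing rows and randomly initialise the rows for the new tokens.
model = BertModel.from_pretrained("bert-base-multilingual-cased")
old_emb = model.embeddings.word_embeddings
new_emb = torch.nn.Embedding(old_emb.num_embeddings + len(new_words), old_emb.embedding_dim)
new_emb.weight.data[: old_emb.num_embeddings] = old_emb.weight.data
model.embeddings.word_embeddings = new_emb
```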
Hi @gro1m
Any luck with adding vocab to bert_pytorch?
I'm using https://github.com/kwonmha/bert-vocab-builder to build the vocab. I will share my experience.
Hi, I am trying to use SciBERT, the version with its own vocab. I am wondering how to point to that vocab.txt file rather than the original one.
Edit
Found the answer: https://github.com/huggingface/pytorch-transformers/issues/69#issuecomment-443215315
You can just pass a direct path to it.
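For anyone who hits this later, a minimal sketch of what that looks like (the local directory is a placeholder and is assumed to contain SciBERT's vocab.txt plus the converted config and weights; pytorch_pretrained_bert is used here, but the renamed package works the same way):

```python
from pytorch_pretrained_bert import BertTokenizer, BertModel

# from_pretrained accepts a local directory instead of one of the listed model names
scibert_dir = "./scibert_scivocab_uncased/"
tokenizer = BertTokenizer.from_pretrained(scibert_dir)  # picks up scibert_dir/vocab.txt
model = BertModel.from_pretrained(scibert_dir)
```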
@bhoomit
To achieve BERT-level results for Hinglish, would fine-tuning the English BERT model with Hinglish data (approx. 200 MB) achieve good results, or would it be best to train the model from scratch for Hinglish?