Is there a way to change (expand) the vocab size of a pretrained model?
When I feed a new token id to the model, it raises:
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1108 with torch.no_grad():
1109 torch.embedding_renorm_(weight, input, max_norm, norm_type)
-> 1110 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1111
1112
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorMath.cpp:352
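For context, this error can be reproduced with a plain nn.Embedding whenever the token id is not smaller than the embedding table size (a minimal illustration, not taken from the model code):

import torch
import torch.nn as nn

vocab_size, hidden_size = 30522, 768        # size of the bert-base word embedding table
emb = nn.Embedding(vocab_size, hidden_size)

emb(torch.tensor([100]))                    # id inside the table: works
emb(torch.tensor([vocab_size]))             # id outside the table: RuntimeError: index out of range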
Hi,
If you want to modify the vocabulary, you should refer to this part of the original repo README https://github.com/google-research/bert#learning-a-new-wordpiece-vocabulary
If you don't want a complete new vocabulary (which would require training from scratch), but extend the pretrained one with a couple of domain specific tokens, this comment from Jacob Devlin might help:
[...] if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
(https://github.com/google-research/bert/issues/9)
I am currently experimenting with approach (a). Since there are 993 unused tokens, this might already help for the most important tokens in your domain.
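As a rough sketch of approach (b) with a recent version of this library (assuming tokenizer.add_tokens and model.resize_token_embeddings are available; older pytorch-pretrained-bert releases may not have them, and the added token strings below are placeholders):

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# append a couple of domain-specific tokens to the end of the vocab (placeholder names)
num_added = tokenizer.add_tokens(["domain_token_1", "domain_token_2"])

# grow the embedding matrix; the new rows are randomly initialized
model.resize_token_embeddings(len(tokenizer))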
@tholor and @rodgzilla answers are the way to go.
Closing this issue since there is no activity.
Feel free to re-open if needed.
@tholor I have exactly the same situation as you had. I'm wondering if you can tell me how your experiment with approach (a) went. Did it improve the accuracy? I would really appreciate it if you could share your conclusion.
Hi @thomwolf, for implementing models like VideoBERT we need to append thousands of entries to the word embedding lookup table. How could we do this in PyTorch? Are there any examples of doing so with the library?
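One possible sketch is to rebuild the nn.Embedding by hand and copy the pretrained rows over (generic PyTorch, not an official API of this library; num_new is a placeholder, and std=0.02 mirrors the initializer quoted above):

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

old = model.embeddings.word_embeddings           # nn.Embedding(vocab_size, hidden_size)
old_num, dim = old.weight.shape

num_new = 5000                                   # placeholder: thousands of new (e.g. visual) token ids
new = torch.nn.Embedding(old_num + num_new, dim)
new.weight.data.normal_(mean=0.0, std=0.02)      # randomly initialize every row...
new.weight.data[:old_num] = old.weight.data      # ...then restore the pretrained rows

model.embeddings.word_embeddings = new
model.config.vocab_size = old_num + num_new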
@tholor Can you explain how you count 993 unused tokens? I only see the first 100 unused token slots.
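For what it's worth, you can count the [unusedX] slots directly from the vocab file (the path below is a placeholder for wherever your vocab.txt lives):

with open("vocab.txt", encoding="utf-8") as f:                   # placeholder path to a BERT vocab.txt
    unused = [line.strip() for line in f if line.startswith("[unused")]
print(len(unused), unused[0], unused[-1])                        # number of [unusedX] entries, first and last one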