Is there a way to change (expand) the vocab size of a pretrained model?
When I feed a new token id to the model, it raises:
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1108 with torch.no_grad():
1109 torch.embedding_renorm_(weight, input, max_norm, norm_type)
-> 1110 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1111
1112
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorMath.cpp:352
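For context, this error can be reproduced with a plain nn.Embedding whenever the token id is not smaller than the embedding table size (a minimal illustration, not taken from the model code):

import torch
import torch.nn as nn

vocab_size, hidden_size = 30522, 768        # size of the bert-base word embedding table
emb = nn.Embedding(vocab_size, hidden_size)

emb(torch.tensor([100]))                    # id inside the table: works
emb(torch.tensor([vocab_size]))             # id outside the table: RuntimeError: index out of range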
Hi,
If you want to modify the vocabulary, you should refer to this part of the original repo README https://github.com/google-research/bert#learning-a-new-wordpiece-vocabulary
If you don't want a complete new vocabulary (which would require training from scratch), but extend the pretrained one with a couple of domain specific tokens, this comment from Jacob Devlin might help:
[...] if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
(https://github.com/google-research/bert/issues/9)
I am currently experimenting with approach (a). Since there are 993 unused tokens, this might already help for the most important tokens in your domain.
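As a rough sketch of approach (b) with a recent version of this library (assuming tokenizer.add_tokens and model.resize_token_embeddings are available; older pytorch-pretrained-bert releases may not have them, and the added token strings below are placeholders):

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# append a couple of domain-specific tokens to the end of the vocab (placeholder names)
num_added = tokenizer.add_tokens(["domain_token_1", "domain_token_2"])

# grow the embedding matrix; the new rows are randomly initialized
model.resize_token_embeddings(len(tokenizer))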
@tholor and @rodgzilla answers are the way to go.
Closing this issue since there is no activity.
Feel free to re-open if needed.
@tholor I have exactly the same situation as you had. I'm wondering if you can tell me how your experiment with approach (a) went. Did it improve the accuracy? I would really appreciate it if you could share your conclusion.
Hi @thomwolf, for implementing models like VideoBERT we need to append thousands of entries to the word embedding lookup table. How could we do this in PyTorch? Are there any examples of doing so with the library?
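One possible sketch is to rebuild the nn.Embedding by hand and copy the pretrained rows over (generic PyTorch, not an official API of this library; num_new is a placeholder, and std=0.02 mirrors the initializer quoted above):

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

old = model.embeddings.word_embeddings           # nn.Embedding(vocab_size, hidden_size)
old_num, dim = old.weight.shape

num_new = 5000                                   # placeholder: thousands of new (e.g. visual) token ids
new = torch.nn.Embedding(old_num + num_new, dim)
new.weight.data.normal_(mean=0.0, std=0.02)      # randomly initialize every row...
new.weight.data[:old_num] = old.weight.data      # ...then restore the pretrained rows

model.embeddings.word_embeddings = new
model.config.vocab_size = old_num + num_new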
@tholor Can you explain how you count 993 unused tokens? I only see the first 100 unused token slots.
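For what it's worth, you can count the [unusedX] slots directly from the vocab file (the path below is a placeholder for wherever your vocab.txt lives):

with open("vocab.txt", encoding="utf-8") as f:                   # placeholder path to a BERT vocab.txt
    unused = [line.strip() for line in f if line.startswith("[unused")]
print(len(unused), unused[0], unused[-1])                        # number of [unusedX] entries, first and last one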