Transformers: Difference between base and large tokenizer?

Created on 28 Mar 2019 · 4 comments · Source: huggingface/transformers

I understand that a cased tokenizer and an uncased one are necessarily different because their vocabs differ in casing, but how does a base tokenizer differ from a large one? Does a large tokenizer have a larger vocab?

Labels: Discussion, wontfix

All 4 comments

I haven't looked into the details of the vocabularies for each model.
If you investigate this question, be sure to share the results here; it may interest others as well!

I did a diff on the two vocabulary files and there is no difference, at least for the uncased versions. I haven't investigated the others.
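
For anyone who wants to reproduce the diff without downloading the raw vocab files, here is a minimal sketch using the library's tokenizer API. It assumes the standard `bert-base-uncased` and `bert-large-uncased` checkpoints and a recent `transformers` version where `get_vocab()` is available:

```python
from transformers import BertTokenizer

# Load the uncased tokenizers for both model sizes
# (assuming the standard public checkpoints).
base = BertTokenizer.from_pretrained("bert-base-uncased")
large = BertTokenizer.from_pretrained("bert-large-uncased")

# get_vocab() returns the token -> id mapping; per the diff above,
# the two mappings should be identical (30522 entries each).
print(len(base.get_vocab()), len(large.get_vocab()))
print(base.get_vocab() == large.get_vocab())  # expected: True
```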

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I can confirm @mhattingpete's finding.
I tokenized a large collection of text with the uncased tokenizer from both the base and the large model, and the two tokenizations are identical.
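
A small-scale sketch of that check, on a couple of sample sentences rather than a big corpus, and again assuming the standard uncased checkpoints:

```python
from transformers import BertTokenizer

base = BertTokenizer.from_pretrained("bert-base-uncased")
large = BertTokenizer.from_pretrained("bert-large-uncased")

# Sample texts stand in for the big collection used above.
texts = [
    "Transformers: difference between base and large tokenizer?",
    "Does a large tokenizer have a larger vocab?",
]
for text in texts:
    # Per the comment above, base and large should tokenize identically.
    assert base.tokenize(text) == large.tokenize(text)
print("Tokenizations match on all sample texts.")
```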
