Transformers: Difference between base and large tokenizer?

Created on 28 Mar 2019 · 4 comments · Source: huggingface/transformers

I understand that a cased tokenizer and an uncased one are necessarily different because their vocabs differ in casing, but how does a base tokenizer differ from a large one? Does a large tokenizer have a larger vocab?

Labels: Discussion, wontfix

All 4 comments

I haven't looked into the details of the vocabularies for each model.
If you investigate this question, be sure to share the results here; it may interest others as well!

I did a diff on the two vocabulary files and there is no difference, at least for the uncased versions. I haven't investigated the others.
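
For anyone who wants to reproduce the diff without downloading the raw vocab files, here is a minimal sketch using the library's tokenizer API. It assumes the standard `bert-base-uncased` and `bert-large-uncased` checkpoints and a recent `transformers` version where `get_vocab()` is available:

```python
from transformers import BertTokenizer

# Load the uncased tokenizers for both model sizes
# (assuming the standard public checkpoints).
base = BertTokenizer.from_pretrained("bert-base-uncased")
large = BertTokenizer.from_pretrained("bert-large-uncased")

# get_vocab() returns the token -> id mapping; per the diff above,
# the two mappings should be identical (30522 entries each).
print(len(base.get_vocab()), len(large.get_vocab()))
print(base.get_vocab() == large.get_vocab())  # expected: True
```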

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I can confirm @mhattingpete's finding.
I tokenized a large collection of text with the uncased tokenizer from both the base and the large model, and the two tokenizations are identical.
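
A small-scale sketch of that check, on a couple of sample sentences rather than a big corpus, and again assuming the standard uncased checkpoints:

```python
from transformers import BertTokenizer

base = BertTokenizer.from_pretrained("bert-base-uncased")
large = BertTokenizer.from_pretrained("bert-large-uncased")

# Sample texts stand in for the big collection used above.
texts = [
    "Transformers: difference between base and large tokenizer?",
    "Does a large tokenizer have a larger vocab?",
]
for text in texts:
    # Per the comment above, base and large should tokenize identically.
    assert base.tokenize(text) == large.tokenize(text)
print("Tokenizations match on all sample texts.")
```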
