Good evening!
After pre-training my RoBERTa model, I get the following list of files:
merges.txt, special_tokens_map.json, training_args.bin. I have also seen that if you add extra tokens to the tokenizer, the file added_tokens.json appears. Could I ask you to clarify the meaning of the first three files: what they contain and how they are used? And also, how can I add extra tokens when pre-training RoBERTa or any BERT-type model? A million thanks in advance!
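For context, here is a toy, stdlib-only sketch of the bookkeeping I imagine is behind added_tokens.json: each genuinely new token gets the next free id after the existing vocab. The vocab contents below are made up; with transformers I would presumably call tokenizer.add_tokens(...) and then model.resize_token_embeddings(len(tokenizer)), but I am not sure that is the full recipe.

```python
import json

# Toy vocab.json contents (made up; a real RoBERTa vocab holds ~50k entries).
vocab = {"<s>": 0, "</s>": 1, "he": 2, "llo": 3}

def add_tokens(vocab, new_tokens):
    """Give each genuinely new token the next free id, the way
    added_tokens.json records tokens appended after training."""
    added = {}
    next_id = max(vocab.values()) + 1
    for tok in new_tokens:
        if tok not in vocab and tok not in added:
            added[tok] = next_id
            next_id += 1
    return added

added = add_tokens(vocab, ["covid", "he"])  # "he" already exists, so it is skipped
print(json.dumps(added))                    # roughly what added_tokens.json would hold
```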
Be safe,
Akim
Hi.
You will get an explanation about merges.txt in this post.
@piegu, thanks for your answer! I have already read that post, though I still do not quite understand: does merges.txt contain all the possible tokens? If so, what is its purpose if we can simply take the keys from vocab.json? Thanks!
My understanding is that the file merges.txt is built during the training of the BBPE (Byte-Level BPE) tokenizer on the corpus: it gets a new entry (line) at each iteration, when the tokenizer finds the most frequent byte pair to merge.
For example, the first line can be Ġ d. Why? Because at the first iteration, the most frequent pair is a space followed by d, and the character Ġ encodes the space.
What is the consequence in the vocabulary? The token Ġd is listed.
Hope I'm right. If not, please share your explanation, as I have not found any online.
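To make this concrete, here is a minimal, self-contained sketch of such a trainer (a toy word list, ignoring the real byte-level details): it repeatedly counts adjacent symbol pairs, merges the most frequent one, and records one line per iteration, just like the lines of merges.txt.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Toy BPE-style trainer: repeatedly merge the most frequent
    adjacent symbol pair, recording one merges.txt-style line each time."""
    # Each word is a tuple of symbols; "Ġ" marks a leading space.
    corpus = Counter(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(f"{a} {b}")  # one line of merges.txt
        # Rewrite the corpus with the pair (a, b) fused into one symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

words = [("Ġ", "d", "o", "g")] * 5 + [("Ġ", "d", "a", "y")] * 4 + [("c", "a", "t")] * 2
merges = learn_merges(words, 3)
print(merges)  # the first recorded merge should be "Ġ d" (space + d is most frequent)
```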
@piegu thank you! So you mean this is the vocabulary sorted by frequency in the training data, right?
And what about these lines (which are 3rd - 7th for RoBERTa-base, for instance):
h e
i n
r e
o n
I can clearly see that these form common words if we stick them together, but why are they split?
First of all, as with GPT2, the Hugging Face (HF) tokenizer of RoBERTa is a Byte-Level Byte-Pair Encoding (BBPE) tokenizer, as stated in the documentation.
Then, we can check on this page that the attribute vocab_files_names lists 2 files
VOCAB_FILES_NAMES = {
"vocab_file": "vocab.json",
"merges_file": "merges.txt",
}
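To see how the two files relate, here is a toy illustration (file contents made up, far smaller than the real ~50k-entry files): vocab.json maps each token to its id, while merges.txt lists the learned pairs in order, and every merged pair shows up, concatenated, as a token in the vocab.

```python
import json

vocab_json = '{"Ġ": 0, "d": 1, "o": 2, "Ġd": 3, "Ġdo": 4}'  # toy vocab.json contents
merges_txt = "#version: 0.2\nĠ d\nĠd o\n"                   # toy merges.txt contents

vocab = json.loads(vocab_json)                              # token -> id
merges = [tuple(line.split()) for line in merges_txt.splitlines()[1:]]

# Each merge pair, once joined, is itself a token in the vocab.
for a, b in merges:
    assert a + b in vocab
```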
Let's open merges.txt of RoBERTa-base, for instance. The file starts like this:
#version: 0.2
Ġ t
Ġ a
h e
i n
r e
o n
Ġt he
e r
Ġ s
a t
Ġ w
Ġ o
...
_Note: In this RoBERTa tokenizer merges file, the special character Ä is used to encode the space instead of the Ġ used by the GPT2 tokenizer (explanation 1 and explanation 2), but in the corresponding RoBERTa vocab file the character Ġ is used. I do not know why._
The merges file shows which tokens get merged at each iteration (that's why there is a space between the two tokens on each line).
About your example: it means that at the third iteration, the pair he, formed from the 2 tokens h and e, is the most frequent in the corpus (the token he with no space before the h).
If, at the end of the iterations, at least one pair he is left (not merged into a larger token), it will appear in the vocab file (this also depends on the min_freq rules and the number of tokens in the vocab). Here, the id of he in the vocab file is 700.
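Here is a sketch of how those merge lines are then used at tokenization time (a toy merges list, not the real file): line order is merge priority, and the lowest-ranked adjacent pair is merged first, until no known pair remains.

```python
def bpe(word, merges):
    """Apply learned merges in priority order (merges.txt line order = rank)."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [(ranks.get((a, b), float("inf")), i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(candidates)
        if rank == float("inf"):  # no known pair left to merge
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("h", "e"), ("t", "he")]  # toy priority list
print(bpe("the", merges))           # "h e" merges first, then "t he"
```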
Hope it helps, but it would be great to get the point of view of someone from Hugging Face like @sshleifer or @sgugger.