Transformers: The purpose of the files merges.txt, special_tokens_map.json, training_args.bin and add_tokens.json

Created on 5 Jun 2020 · 6 comments · Source: huggingface/transformers

Good evening!

After I have my RoBERTa model pre-trained, I get the following list of files:
merges.txt, special_tokens_map.json, training_args.bin. I have also seen that if you add extra tokens to the tokenizer, the file add_tokens.json appears. Could I ask you to clarify the meaning of the first three files: how are they used and what do they contain? And also, how can I add extra tokens when pre-training RoBERTa or any BERT-type model? A million thanks in advance!

Be safe,
Akim
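
As a partial answer to the last question: with the transformers library, extra tokens are typically added to an existing tokenizer and the model embeddings are then resized to match. A minimal sketch, assuming roberta-base and placeholder token strings (on save, transformers writes the added tokens to a file named added_tokens.json):

from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# "<new_token_1>" and "<new_token_2>" are placeholder strings for illustration
num_added = tokenizer.add_tokens(["<new_token_1>", "<new_token_2>"])

# grow the embedding matrix so the new token ids have vectors
model.resize_token_embeddings(len(tokenizer))

# writes vocab.json, merges.txt, special_tokens_map.json, added_tokens.json, ...
tokenizer.save_pretrained("./extended-tokenizer")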

wontfix


All 6 comments

Hi.

You will get an explanation about merges.txt in this post.

@piegu , thanks for your answer! I have already read this post, though I still did not quite understand it: does merges.txt contain all the possible tokens? If so, what is its purpose if we can simply take the keys from vocab.json? Thanks!

My understanding is that the file merges.txt is built during the training of the BBPE (byte-level BPE) tokenizer on the corpus: it gets a new entry (line) at each iteration where the tokenizer finds the most frequent byte pair to merge.

For example, the first line can be Ġ d. Why? Because at the first iteration, the most frequent pair is Ġ and d (that is, a d with a space in front of it), and the character Ġ encodes the space.

What is the consequence in the vocabulary? The token Ġd is listed.

Hope I'm right. If not, please give me your explanation as I have not found any online.
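
To make this concrete, here is a minimal sketch of training such a tokenizer with the huggingface/tokenizers library (corpus.txt and out_dir are placeholder names); it produces exactly the two files discussed in this thread:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# each merge found during training becomes one line of merges.txt
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

tokenizer.save_model("out_dir")  # writes out_dir/vocab.json and out_dir/merges.txt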

@piegu thank you! So you mean this is the vocabulary sorted by frequency over the training data, right?
And what about these lines (which are 3rd - 7th for RoBERTa-base, for instance):

h e
i n
r e
o n

I clearly see that these form popular words if we stick them together, but why are they split?

First of all, as with GPT2, the Hugging Face (HF) tokenizer of RoBERTa is a byte-level Byte-Pair Encoding (BBPE) tokenizer, as written in the documentation.

Then, we can check on this page that the attribute vocab_files_names lists 2 files:

VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}
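
This can also be verified from Python; a quick sketch with roberta-base as the example checkpoint:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.vocab_files_names)
# {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}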

Let's open merges.txt of RoBERTa-base, for instance. The file starts like this:

#version: 0.2
Ġ t
Ġ a
h e
i n
r e
o n
Ġt he
e r
Ġ s
a t
Ġ w
Ġ o
...

_Note: in this RoBERTa tokenizer merges file, the space marker may display as the special character Ä instead of the Ġ used by the GPT2 tokenizer (explanation 1 and explanation 2), while the corresponding RoBERTa vocab file shows Ġ. The reason is an encoding artifact: Ġ (U+0120) is two bytes in UTF-8 (0xC4 0xA0), and a viewer that reads the file as Latin-1 renders those bytes as Ä followed by a non-breaking space; both files actually use Ġ._

The merges file shows which pair of tokens is merged at each iteration (that's why there is a space between the two tokens on each line).

About your example: it means that at the third iteration, the token pair he, formed from the 2 tokens h and e, is the most frequent one in the corpus (the token he here has no space marker in front of the h).

If at the end of the iterations there is at least one pair he left (not merged into a longer token), it will appear in the vocab file (this also depends on the min_frequency rule and the target vocab size). Here, the id of he in the vocab file is 700.
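
A quick way to see the merge and the vocab entry together (the id is the one from the roberta-base vocab file mentioned above):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("he"))               # ['he']: h and e were merged
print(tokenizer.convert_tokens_to_ids("he"))  # 700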

Hope it helps, but it would be great to get the point of view of someone from Hugging Face like @sshleifer or @sgugger.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

