Good evening!
After pre-training my RoBERTa model, I get the following list of files:
merges.txt, special_tokens_map.json, training_args.bin. I have also seen that if you add extra tokens to the tokenizer, the file added_tokens.json appears. Could I ask you to clarify the meaning of the first three files: what they contain and how they are used? And also, how can I add extra tokens when pre-training RoBERTa or any BERT-type model? A million thanks in advance!
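For context, here is a toy, stdlib-only sketch of the bookkeeping I imagine is behind added_tokens.json: each genuinely new token gets the next free id after the existing vocab. The vocab contents below are made up; with transformers I would presumably call tokenizer.add_tokens(...) and then model.resize_token_embeddings(len(tokenizer)), but I am not sure that is the full recipe.

```python
import json

# Toy vocab.json contents (made up; a real RoBERTa vocab holds ~50k entries).
vocab = {"<s>": 0, "</s>": 1, "he": 2, "llo": 3}

def add_tokens(vocab, new_tokens):
    """Give each genuinely new token the next free id, the way
    added_tokens.json records tokens appended after training."""
    added = {}
    next_id = max(vocab.values()) + 1
    for tok in new_tokens:
        if tok not in vocab and tok not in added:
            added[tok] = next_id
            next_id += 1
    return added

added = add_tokens(vocab, ["covid", "he"])  # "he" already exists, so it is skipped
print(json.dumps(added))                    # roughly what added_tokens.json would hold
```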
Be safe,
Akim
Hi.
You will get an explanation about merges.txt in this post.
@piegu, thanks for your answer! I have already read that post, though I still do not quite understand: does merges.txt contain all the possible tokens? If so, what is its purpose if we can simply take the keys from vocab.json? Thanks!
My understanding is that the file merges.txt is built during the training of the BBPE (Byte-Level BPE) tokenizer on the corpus: it gets a new entry (line) at each iteration, when the tokenizer finds the most frequent byte pair to merge.
For example, the first line can be Ġ d. Why? Because at the first iteration, the most frequent pair is a space followed by d, and the character Ġ encodes the space.
What is the consequence in the vocabulary? The token Ġd is listed.
Hope I'm right. If not, please share your explanation, as I have not found any online.
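To make this concrete, here is a minimal, self-contained sketch of such a trainer (a toy word list, ignoring the real byte-level details): it repeatedly counts adjacent symbol pairs, merges the most frequent one, and records one line per iteration, just like the lines of merges.txt.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Toy BPE-style trainer: repeatedly merge the most frequent
    adjacent symbol pair, recording one merges.txt-style line each time."""
    # Each word is a tuple of symbols; "Ġ" marks a leading space.
    corpus = Counter(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(f"{a} {b}")  # one line of merges.txt
        # Rewrite the corpus with the pair (a, b) fused into one symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

words = [("Ġ", "d", "o", "g")] * 5 + [("Ġ", "d", "a", "y")] * 4 + [("c", "a", "t")] * 2
merges = learn_merges(words, 3)
print(merges)  # the first recorded merge should be "Ġ d" (space + d is most frequent)
```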
@piegu thank you! So you mean this is the vocabulary sorted by frequency in the training data, right?
And what about these lines (which are 3rd - 7th for RoBERTa-base, for instance):
h e
i n
r e
o n
I can clearly see that these form common words if we stick them together, but why are they split?
First of all, as with GPT2, the Hugging Face (HF) tokenizer of RoBERTa is a Byte-Level Byte-Pair Encoding (BBPE) tokenizer, as stated in the documentation.
Then, we can check on this page that the attribute vocab_files_names lists 2 files
VOCAB_FILES_NAMES = {
"vocab_file": "vocab.json",
"merges_file": "merges.txt",
}
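To see how the two files relate, here is a toy illustration (file contents made up, far smaller than the real ~50k-entry files): vocab.json maps each token to its id, while merges.txt lists the learned pairs in order, and every merged pair shows up, concatenated, as a token in the vocab.

```python
import json

vocab_json = '{"Ġ": 0, "d": 1, "o": 2, "Ġd": 3, "Ġdo": 4}'  # toy vocab.json contents
merges_txt = "#version: 0.2\nĠ d\nĠd o\n"                   # toy merges.txt contents

vocab = json.loads(vocab_json)                              # token -> id
merges = [tuple(line.split()) for line in merges_txt.splitlines()[1:]]

# Each merge pair, once joined, is itself a token in the vocab.
for a, b in merges:
    assert a + b in vocab
```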
Let's open merges.txt of RoBERTa-base, for instance. The file starts like this:
#version: 0.2
Ġ t
Ġ a
h e
i n
r e
o n
Ġt he
e r
Ġ s
a t
Ġ w
Ġ o
...
_Note: In this RoBERTa tokenizer merges file, the special character Ä is used to encode the space instead of the Ġ used by the GPT2 tokenizer (explanation 1 and explanation 2), but in the corresponding RoBERTa vocab file the character Ġ is used. I do not know why._
The merges file shows which tokens get merged at each iteration (that's why there is a space between the two tokens on each line).
About your example: it means that at the third iteration, the pair he, formed from the 2 tokens h and e, is the most frequent in the corpus (the token he with no space before the h).
If, at the end of the iterations, at least one pair he is left (not merged into a larger token), it will appear in the vocab file (this also depends on the min_freq rules and the number of tokens in the vocab). Here, the id of he in the vocab file is 700.
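Here is a sketch of how those merge lines are then used at tokenization time (a toy merges list, not the real file): line order is merge priority, and the lowest-ranked adjacent pair is merged first, until no known pair remains.

```python
def bpe(word, merges):
    """Apply learned merges in priority order (merges.txt line order = rank)."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [(ranks.get((a, b), float("inf")), i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(candidates)
        if rank == float("inf"):  # no known pair left to merge
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("h", "e"), ("t", "he")]  # toy priority list
print(bpe("the", merges))           # "h e" merges first, then "t he"
```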
Hope it helps, but it would be great to get the point of view of someone from Hugging Face like @sshleifer or @sgugger.