Hi,
I have a question regarding the tokenization logic.
I'm using the RoBERTa tokenizer from fairseq:
In [15]: tokens = roberta.encode("Berlin and Munich have a lot of puppeteer to see .")
In [16]: tokens
Out[16]:
tensor([ 0, 26795, 2614, 8, 10489, 33, 10, 319, 9, 32986,
9306, 254, 7, 192, 479, 2])
Interestingly, Berlin is split into two subwords (with ids 26795 and 2614).
When I use the pytorch-transformers implementation:
In [21]: tokens = tokenizer.tokenize("<s>Berlin and Munich have a lot of puppeteer to see .</s>")
In [22]: indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
In [23]: indexed_tokens
Out[23]: [0, 5459, 8, 10489, 33, 10, 319, 9, 32986, 9306, 254, 7, 192, 479, 2]
Berlin is not split.
The roberta.encode method returns a single subword for Berlin when I start the sentence with a space. Which tokenizer is correct here?
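For reference, prefixing the sentence with a single space is enough to trigger this (a sketch rather than a pasted transcript; the expected id 5459 for " Berlin" is taken from the pytorch-transformers output above):
>>> roberta.encode(" Berlin and Munich have a lot of puppeteer to see .")
# Berlin should now be a single subword (id 5459), i.e. the same ids as in the
# pytorch-transformers example above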
This is a more complex question than it may seem, but in general I think both will behave pretty similarly in practice.
This is related to the fact that the GPT-2 tokenizer (also used by RoBERTa) expects a space before every word (see the note about this in fairseq).
At the beginning of a string there is no space, which can result in strange behavior.
Here is an example of the resulting behavior on RoBERTa. You would expect the strings "Berlin and Munich" and "Munich and Berlin" to be tokenized identically, with only the order of the tokens changed, but they are not:
>>> roberta.encode("Berlin and Munich")
tensor([ 0, 26795, 2614, 8, 10489, 2])
>>> roberta.encode("Munich and Berlin")
tensor([ 0, 448, 879, 1725, 8, 5459, 2])
In this example, the first word is split but the second is not.
In our tokenizer, to avoid this behavior, we decided to always add a space at the beginning of a string (multiple spaces don't have an effect, so it's OK to always add one) so that the tokenization is consistent.
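In code, the rule is roughly the following (an illustrative sketch only, not the actual implementation in the library):
>>> text = "Berlin and Munich"
>>> if not text.startswith(" "):
...     text = " " + text  # the GPT-2 BPE treats "Berlin" and " Berlin" as different tokens
>>> text
' Berlin and Munich'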
A side effect of this (indicated in the doc/docstring) is that the encoding/decoding process doesn't preserve the absence of a space at the beginning of a string, but on the other hand the resulting behavior is more consistent.
>>> tokenizer.encode("Berlin and Munich", add_special_tokens=True)
[0, 5459, 8, 10489, 2]
>>> tokenizer.encode("Munich and Berlin", add_special_tokens=True)
[0, 10489, 8, 5459, 2]
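To make the side effect above concrete, here is roughly what the round trip looks like (a sketch, assuming the decode defaults; the exact spacing may vary with the decode options):
>>> ids = tokenizer.encode("Berlin and Munich", add_special_tokens=True)
>>> tokenizer.decode(ids, skip_special_tokens=True)
' Berlin and Munich'  # the space added during encoding is not stripped again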
This is a short take from my point of view, but it would be nice, I think, to have @myleott's input on this as well.
Thanks for your explanation :+1:
I just ran an experiment on a downstream task (English NER) and the F1-score decreased by around 0.5%.
I'll repeat the experiment with the commit just before 0517e7a1cb4a70bdf32f8d11b56df8d3911d1792 (which introduced the whitespace rule) to find out whether that is where the performance drop comes from.
Update: I used 3bcbebd440c220adbaab657f2d13dac7c89f6453 and re-ran my NER experiment. The final F1-score is now 92.26 (consistent with a prior result of 92.31), compared to 91.81 for the latest 1.2.0 version.
Would it be possible to add a flag that uses the "original" tokenization?
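Something along these lines would cover the NER use case (the option name here is only a placeholder to illustrate the idea, not an existing argument):
>>> from pytorch_transformers import RobertaTokenizer
>>> # hypothetical flag: skip the automatic prefix space and match fairseq's behavior
>>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=False)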
We'll see what we can do (cc @LysandreJik @julien-c).
Is this difference significant relative to the run-to-run (seed) variability?
I made a few more experiments with the same dataset and different runs:
| Version | Run 1 | Run 2 | Run 3 | Avg.  |
| ------- | ----- | ----- | ----- | ----- |
| 1.2.0   | 91.81 | 91.82 | 91.78 | 91.80 |
| 3bcbebd | 92.31 | 92.26 | 92.38 | 92.32 |
On average, the difference is 0.52%.
Thanks a lot for the detailed experiments Stefan.
The comparison is pretty consistently in favor of the original tokenization, so I guess we will switch back to the fairseq tokenization as the default and add an option to use the "consistent" tokenization.
cc @LysandreJik @julien-c