Hi,
I have a question regarding the tokenization logic.
I'm using the RoBERTa tokenizer from fairseq:
In [15]: tokens = roberta.encode("Berlin and Munich have a lot of puppeteer to see .")
In [16]: tokens
Out[16]:
tensor([ 0, 26795, 2614, 8, 10489, 33, 10, 319, 9, 32986,
9306, 254, 7, 192, 479, 2])
Interestingly, Berlin is split into two subwords (with ids 26795 and 2614).
When I use the pytorch-transformers implementation:
In [21]: tokens = tokenizer.tokenize("<s>Berlin and Munich have a lot of puppeteer to see .</s>")
In [22]: indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
In [23]: indexed_tokens
Out[23]: [0, 5459, 8, 10489, 33, 10, 319, 9, 32986, 9306, 254, 7, 192, 479, 2]
Berlin is not split.
The roberta.encode method returns a single subword for Berlin when I start the sentence with a space. Which tokenizer is correct here?
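For reference, prefixing the sentence with a single space is enough to trigger this (a sketch rather than a pasted transcript; the expected id 5459 for " Berlin" is taken from the pytorch-transformers output above):
>>> roberta.encode(" Berlin and Munich have a lot of puppeteer to see .")
# Berlin should now be a single subword (id 5459), i.e. the same ids as in the
# pytorch-transformers example above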
This is a more complex question than it may seem, but in general I think both will behave pretty similarly in practice.
This is related to the fact that the GPT-2 tokenizer (also used by RoBERTa) expects a space before every word (see the note about this in fairseq).
At the beginning of a string there is no space, which can result in strange behavior.
Here is an example of the resulting behavior on RoBERTa. You would expect the strings "Berlin and Munich" and "Munich and Berlin" to be tokenized identically, with only the order of the tokens changed, but they are not:
>>> roberta.encode("Berlin and Munich")
tensor([ 0, 26795, 2614, 8, 10489, 2])
>>> roberta.encode("Munich and Berlin")
tensor([ 0, 448, 879, 1725, 8, 5459, 2])
In this example, the first word is split but the second is not.
In our tokenizer, to avoid this behavior, we decided to always add a space at the beginning of a string (multiple spaces don't have an effect, so it's OK to always add one) so that the tokenization is consistent.
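In code, the rule is roughly the following (an illustrative sketch only, not the actual implementation in the library):
>>> text = "Berlin and Munich"
>>> if not text.startswith(" "):
...     text = " " + text  # the GPT-2 BPE treats "Berlin" and " Berlin" as different tokens
>>> text
' Berlin and Munich'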
A side effect of this (indicated in the doc/docstring) is that the encoding/decoding process doesn't preserve the absence of a space at the beginning of a string, but on the other hand the resulting behavior is more consistent.
>>> tokenizer.encode("Berlin and Munich", add_special_tokens=True)
[0, 5459, 8, 10489, 2]
>>> tokenizer.encode("Munich and Berlin", add_special_tokens=True)
[0, 10489, 8, 5459, 2]
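To make the side effect above concrete, here is roughly what the round trip looks like (a sketch, assuming the decode defaults; the exact spacing may vary with the decode options):
>>> ids = tokenizer.encode("Berlin and Munich", add_special_tokens=True)
>>> tokenizer.decode(ids, skip_special_tokens=True)
' Berlin and Munich'  # the space added during encoding is not stripped again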
This is a short take from my point of view, but it would be nice, I think, to have @myleott's input on this as well.
Thanks for your explanation :+1:
I just ran an experiment on a downstream task (English NER) and the F1-score decreased by around 0.5%.
I'll repeat the experiment with the commit just before 0517e7a1cb4a70bdf32f8d11b56df8d3911d1792 (which introduced the whitespace rule) to find out whether that is where the performance drop comes from.
Update: I used 3bcbebd440c220adbaab657f2d13dac7c89f6453 and re-ran my NER experiment. The final F1-score is now 92.26 (consistent with a prior result of 92.31), compared to 91.81 for the latest 1.2.0 version.
Would it be possible to add a flag that uses the "original" tokenization?
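Something along these lines would cover the NER use case (the option name here is only a placeholder to illustrate the idea, not an existing argument):
>>> from pytorch_transformers import RobertaTokenizer
>>> # hypothetical flag: skip the automatic prefix space and match fairseq's behavior
>>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=False)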
We'll see what we can do (cc @LysandreJik @julien-c).
Is this difference significant relative to the run-to-run (seed) variability?
I made a few more experiments with the same dataset and different runs:
| Version | Run 1 | Run 2 | Run 3 | Avg.  |
| ------- | ----- | ----- | ----- | ----- |
| 1.2.0   | 91.81 | 91.82 | 91.78 | 91.80 |
| 3bcbebd | 92.31 | 92.26 | 92.38 | 92.32 |
On average, the difference is 0.52%.
Thanks a lot for the detailed experiments Stefan.
The comparison is pretty consistently in favor of the original tokenization, so I guess we will switch back to the fairseq tokenization as the default and add an option to use the "consistent" tokenization.
cc @LysandreJik @julien-c