Transformers: New GPT2 tokenizer no longer encodes Unicode characters properly in Python 3

Created on 26 Apr 2019 · 7 comments · Source: huggingface/transformers

In commit 5afa497cbfc53c679a9b22997b6312fad57ee2f8, you changed token.encode('utf-8') to simply token.

This makes the code compatible with Python 2, but it now breaks in Python 3. You'll get a KeyError whenever you try to encode a character whose Unicode code point is above 255, because ord(b) then falls outside the byte_encoder table. For example, this raises a KeyError in Python 3:

from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.encode('你')

I think what you want to do is:

if sys.version_info[0] == 2:
    # Python 2: token is a byte string, so each element is already a single byte
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
else:
    # Python 3: token is a unicode string, so convert to UTF-8 bytes first
    token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
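For context, here is a minimal standalone sketch of why the Python 3 branch needs token.encode('utf-8'). The byte_encoder dict below is only an illustrative stand-in for the tokenizer's byte-to-unicode table, not the library's exact mapping:

# Illustrative stand-in: the real table only has the 256 byte values as keys.
byte_encoder = {b: chr(b) for b in range(256)}

token = '你'  # any character outside the 0-255 range

# What the current Python 3 code effectively does: iterating a str yields
# characters, so ord(b) can be far above 255 and misses the table.
# ''.join(byte_encoder[ord(b)] for b in token)  # -> KeyError: 20320

# What the proposed fix does: iterate over the UTF-8 bytes, which are always 0-255.
print(''.join(byte_encoder[b] for b in token.encode('utf-8')))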
Labels: wontfix

Most helpful comment

I can confirm that this is happening, though it is a different dash.

All 7 comments

Just ran into this problem. This seems to be a regression from an earlier version of the Hugging Face library.

For instance, it fails when encoding the following Wikipedia snippet:

The dismemberment of the French socialist movement into many groups and—following the suppression

The dash here is an em dash ("long dash"), Unicode code point 8212. This worked in an earlier version because the tokenizer operated on bytes.
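A quick way to see the difference in plain Python (no library code involved):

# The em dash (U+2014, code point 8212) encodes to three bytes in UTF-8.
dash = '\u2014'
print(ord(dash))                   # 8212 -- not a valid key in a 0-255 byte table
print(list(dash.encode('utf-8')))  # [226, 128, 148] -- every value is a valid byte key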

I can confirm that this is happening, though it is a different dash.

Same here. This is also happening while using the GPT2 tokenizer:

Traceback (most recent call last):
  File "run_lambada_gpt2.py", line 139, in tokenize_and_encode
    token_ids = tokenizer.encode(obj)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8217

The sys version info is:
sys.version_info(major=3, minor=5, micro=5, releaselevel='final', serial=0)
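For reference, code point 8217 in that KeyError is the right single quotation mark, which the text being tokenized evidently contains. A quick check in plain Python:

print(chr(8217))                       # ’ (right single quotation mark, U+2019)
print(list('\u2019'.encode('utf-8')))  # [226, 128, 153] -- the byte values the encoder expects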

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi,
I'm about to use this tokenizer with Python 3 on WikiText.
After seeing this issue, I'm not sure whether it will work properly.

Can someone clarify, please?
From reading along, it seems like the fix suggested above did not solve the problem, right?

Hi, this looks fixed to me in the current implementation. As long as you're using a recent version of the library, you should be fine. I had no problem running a fine-tuning script on wikitext-2 last week.

If you run into anything, please let me know and I'll look into it.
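If you want to double-check your own install, here is a minimal sanity check along the lines of the snippets above (it assumes the same pytorch_pretrained_bert package used earlier in this thread; in newer releases the tokenizer ships in the transformers package instead):

from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer

# If this prints a list of token ids without raising a KeyError, non-ASCII
# input (em dashes, curly quotes, CJK characters) is handled correctly.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(tokenizer.encode('many groups and\u2014following the suppression'))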

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

