Transformers: New GPT2 tokenizer no longer encodes Unicode characters properly in Python 3

Created on 26 Apr 2019 · 7 comments · Source: huggingface/transformers

In commit 5afa497cbfc53c679a9b22997b6312fad57ee2f8, you changed token.encode('utf-8') to simply token.

This makes the code compatible with Python 2, but it now breaks in Python 3. You'll get a KeyError whenever you try to encode a character whose Unicode code point is above 255, because ord(b) then falls outside the byte_encoder table. For example, this raises a KeyError in Python 3:

from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.encode('你')

I think what you want to do is:

if sys.version_info[0] == 2:
    # Python 2: token is a byte string, so each element is already a single byte
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
else:
    # Python 3: token is a unicode string, so convert to UTF-8 bytes first
    token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
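For context, here is a minimal standalone sketch of why the Python 3 branch needs token.encode('utf-8'). The byte_encoder dict below is only an illustrative stand-in for the tokenizer's byte-to-unicode table, not the library's exact mapping:

# Illustrative stand-in: the real table only has the 256 byte values as keys.
byte_encoder = {b: chr(b) for b in range(256)}

token = '你'  # any character outside the 0-255 range

# What the current Python 3 code effectively does: iterating a str yields
# characters, so ord(b) can be far above 255 and misses the table.
# ''.join(byte_encoder[ord(b)] for b in token)  # -> KeyError: 20320

# What the proposed fix does: iterate over the UTF-8 bytes, which are always 0-255.
print(''.join(byte_encoder[b] for b in token.encode('utf-8')))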
Labels: wontfix

Most helpful comment

I can confirm that this is happening, though it is a different dash.

All 7 comments

Just ran into this problem. This seems to be a regression from an earlier version of the Hugging Face library.

For instance, it fails when encoding the following Wikipedia snippet:

The dismemberment of the French socialist movement into many groups and—following the suppression

The dash here is an em dash ("long dash"), Unicode code point 8212. This worked in an earlier version because the tokenizer operated on bytes.
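A quick way to see the difference in plain Python (no library code involved):

# The em dash (U+2014, code point 8212) encodes to three bytes in UTF-8.
dash = '\u2014'
print(ord(dash))                   # 8212 -- not a valid key in a 0-255 byte table
print(list(dash.encode('utf-8')))  # [226, 128, 148] -- every value is a valid byte key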

I can confirm that this is happening, though it is a different dash.

Same here. This is also happening while using the GPT2 tokenizer:

Traceback (most recent call last):
  File "run_lambada_gpt2.py", line 139, in tokenize_and_encode
    token_ids = tokenizer.encode(obj)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/data/anaconda/envs/py35/lib/python3.5/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8217

The sys version info is:
sys.version_info(major=3, minor=5, micro=5, releaselevel='final', serial=0)
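For reference, code point 8217 in that KeyError is the right single quotation mark, which the text being tokenized evidently contains. A quick check in plain Python:

print(chr(8217))                       # ’ (right single quotation mark, U+2019)
print(list('\u2019'.encode('utf-8')))  # [226, 128, 153] -- the byte values the encoder expects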

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi,
I'm about to use this tokenizer with Python 3 on WikiText.
After seeing this issue, I'm not sure whether it will work properly.

Can someone clarify, please?
From reading along, it seems like the fix suggested above did not solve the problem, right?

Hi, this looks fixed to me in the current implementation. As long as you're using a recent version of the library, you should be fine. I had no problem running a fine-tuning script on wikitext-2 last week.

If you run into anything, please let me know and I'll look into it.
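If you want to double-check your own install, here is a minimal sanity check along the lines of the snippets above (it assumes the same pytorch_pretrained_bert package used earlier in this thread; in newer releases the tokenizer ships in the transformers package instead):

from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer

# If this prints a list of token ids without raising a KeyError, non-ASCII
# input (em dashes, curly quotes, CJK characters) is handled correctly.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(tokenizer.encode('many groups and\u2014following the suppression'))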

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

