The GPT-2 tokenizer's decoder now adds a space at the beginning of the string upon decoding.
(Potentially causing #1254)
Model I am using (Bert, XLNet....): GPT2
Language I am using the model on (English, Chinese....): English
Steps to reproduce the behavior:
from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.decode(tokenizer.encode("test phrase"))
The expected decoded string is "test phrase"; however, it currently produces " test phrase" (note the leading space).
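Printing the repr of the round trip makes the extra space visible (same snippet as above, output as reported):

print(repr(tokenizer.decode(tokenizer.encode("test phrase"))))
# prints ' test phrase' instead of the expected 'test phrase'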
PyTorch Transformers version (or branch): master (commit e768f2322abd2a2f60a3a6d64a6a94c2d957fe89)
Using GPU? No
I am also seeing this behaviour when running the reproduction code above on my system.
It's not a bug. This is an artefact produced by BPE, as explained here: https://github.com/huggingface/pytorch-transformers/blob/d483cd8e469126bed081c59473bdf64ce74c8b36/pytorch_transformers/tokenization_gpt2.py#L106
I think the solution is to post-process whitespace after the tokeniser.
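For example, a minimal sketch of that post-processing, assuming the only artefact is the single leading space (str.lstrip is plain Python, nothing tokenizer-specific):

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
decoded = tokenizer.decode(tokenizer.encode("test phrase"))
# Drop the space the byte-level BPE decoder reintroduces.
# Caveat: this also removes an intentional leading space, so it is
# only safe when inputs never start with one.
cleaned = decoded.lstrip(" ")
assert cleaned == "test phrase"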