The GPT-2 tokenizer's decoder now adds a space at the beginning of the string upon decoding.
(Potentially causing #1254)
Model I am using (Bert, XLNet....): GPT2
Language I am using the model on (English, Chinese....): English
Steps to reproduce the behavior:
from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.decode(tokenizer.encode("test phrase"))
The expected decoded string is "test phrase"; however, it currently produces " test phrase" (note the leading space).
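Printing the repr of the round trip makes the extra space visible (same snippet as above, output as reported):

print(repr(tokenizer.decode(tokenizer.encode("test phrase"))))
# prints ' test phrase' instead of the expected 'test phrase'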
PyTorch Transformers version (or branch): master (commit e768f2322abd2a2f60a3a6d64a6a94c2d957fe89)
Using GPU? No
I am also seeing this behaviour when running the reproduction code above on my system.
It's not a bug. This is an artefact produced by BPE, as explained here: https://github.com/huggingface/pytorch-transformers/blob/d483cd8e469126bed081c59473bdf64ce74c8b36/pytorch_transformers/tokenization_gpt2.py#L106
I think the solution is to post-process whitespace after the tokeniser.
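For example, a minimal sketch of that post-processing, assuming the only artefact is the single leading space (str.lstrip is plain Python, nothing tokenizer-specific):

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
decoded = tokenizer.decode(tokenizer.encode("test phrase"))
# Drop the space the byte-level BPE decoder reintroduces.
# Caveat: this also removes an intentional leading space, so it is
# only safe when inputs never start with one.
cleaned = decoded.lstrip(" ")
assert cleaned == "test phrase"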