Transformers: GPT2 Tokenizer Decoding Adding Space

Created on 18 Sep 2019 · 2 Comments · Source: huggingface/transformers

🐛 Bug

The GPT-2 tokenizer's decoder now adds a space at the beginning of the string upon decoding.

(Potentially causing #1254)

Model I am using (Bert, XLNet....): GPT2

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • [ ] the official example scripts: (give details)
  • [x] my own modified scripts: (give details)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details)

To Reproduce

Steps to reproduce the behavior:

  1. Run the following code:

         from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer
         tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
         tokenizer.decode(tokenizer.encode("test phrase"))

Expected behavior

The expected decoded string is "test phrase". However, currently it produces " test phrase".

Environment

  • OS: OSX
  • Python version: 3.7.3
  • PyTorch version: 1.1.0
  • PyTorch Transformers version (or branch): master (#e768f2322abd2a2f60a3a6d64a6a94c2d957fe89)

  • Using GPU? No
  • Distributed or parallel setup? No
  • Any other relevant information:

All 2 comments

Also getting this effect when using the reproduction code on my system.

It's not a bug. This is an artefact produced by BPE as explained here https://github.com/huggingface/pytorch-transformers/blob/d483cd8e469126bed081c59473bdf64ce74c8b36/pytorch_transformers/tokenization_gpt2.py#L106

I think the solution is to post-process the whitespace after the tokeniser.
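A minimal, library-free sketch of that post-processing (the `decode_clean` helper name is hypothetical, and the raw string below is an assumed example of the decoder's output, not captured from the library):

```python
def decode_clean(decoded: str) -> str:
    """Strip the spurious leading space a byte-level BPE decode can introduce."""
    return decoded.lstrip(" ")

# Assumed raw decoder output for the encoded phrase "test phrase":
raw = " test phrase"
print(decode_clean(raw))  # test phrase
```

This does not change the tokeniser itself; it only cleans up the decoded string, which is safe when the original input is known not to begin with meaningful whitespace.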
