Transformers: GPT2 -- build_inputs_with_special_tokens lacking BOS and EOS tokens.

Created on 17 Mar 2020  ·  10 comments  ·  Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): GPT-2

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

The tasks I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Script:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
encoded_dict = tokenizer.encode_plus(
    text="Hello I am Moin", add_special_tokens=True, max_length=512,
    truncation_strategy="longest_first", pad_to_max_length=False,
    return_tensors=None, return_token_type_ids=True, return_attention_mask=True,
    return_overflowing_tokens=False, return_special_tokens_mask=False)

print(tokenizer.bos_token_id)
print(encoded_dict['input_ids'])

You should see that the input_ids do not include the bos_token_id. Shouldn't encode_plus be doing this?

Expected behavior

The <|endoftext|> token would appear in the encoded input, since I set add_special_tokens=True.

Environment info

  • transformers version:
  • Platform: Linux-4.15.0-54-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.2
  • PyTorch version (GPU?): 1.3.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No
wontfix

Most helpful comment

IMO this is something that should be written by the user for their specific needs (option 3). We can document more that the tokenizers are pre-set for the most common tasks the corresponding models are used for, to avoid any user being too surprised.

I feel that if we add a method, it will cover some use cases but not all and it will either be overly too complex or only used by a small percentage of the users.

All 10 comments

Hi @moinnadeem,

Thanks for posting this!
As it is implemented at the moment, you are right: the GPT2 tokenizer adds neither a BOS token at the beginning nor an EOS token at the end.
You can see, e.g., that the XLNet tokenizer has a method that adds special tokens to the encoded input string (see https://github.com/huggingface/transformers/blob/4e4403c9b44324671cb795df2ef30e70fe3b606e/src/transformers/tokenization_xlnet.py#L241), whereas the GPT2 tokenizer does not have such a function and thus falls back to the default one, which does not add any special tokens.

As far as I can see this could be a feature request, where a build_inputs_with_special_tokens() method would be added to tokenization_gpt2.py.

The expected behavior could be:
input_string -> BOS + encoded(input_string) + EOS in the case of GPT2.
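
Such a method does not exist for GPT2 at the moment, so the following is only a rough sketch of what it could look like (the subclass name is made up; it simply mirrors the XLNet-style hook linked above):

from transformers import GPT2Tokenizer

class GPT2TokenizerWithSpecialTokens(GPT2Tokenizer):
    # Sketch only: prepend BOS and append EOS, as proposed above.
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos, eos = [self.bos_token_id], [self.eos_token_id]
        if token_ids_1 is None:
            return bos + token_ids_0 + eos
        return bos + token_ids_0 + eos + token_ids_1 + eos

tok = GPT2TokenizerWithSpecialTokens.from_pretrained("gpt2")
# With add_special_tokens=True, the ids now start and end with <|endoftext|>.
print(tok.encode("Hello I am Moin", add_special_tokens=True))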

Feel free to open a PR to include this feature :-) In the meantime you can obviously just manually add the BOS and EOS tokens before encoding.
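
For example, assuming the same "gpt2" checkpoint as in the report above, that manual workaround could look like this:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Wrap the text with the special tokens by hand before encoding.
text = tokenizer.bos_token + "Hello I am Moin" + tokenizer.eos_token
input_ids = tokenizer.encode(text)

print(input_ids[0] == tokenizer.bos_token_id)   # True
print(input_ids[-1] == tokenizer.eos_token_id)  # True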

@mfuntowicz do you think such a PR would make sense?

I don't think this has been fixed, right?

It's not really a bug, because the default behavior of GPT2 is simply not to add BOS or EOS tokens. GPT2 is mainly used to generate text, so it would not make a lot of sense to append an EOS token to an input prompt. If one wants to, one can just manually add gpt2_tokenizer.eos_token to the input and the eos_token_id will be added.

I think in the original GPT2 model there are special tokens for BOS and EOS, both of which are <|endoftext|>, right? So if I want to fine-tune it, we should do the same thing -- add both BOS and EOS to the corpus for fine-tuning, right?

@zhujl1991 - yes this is correct.
We also set the BOS and EOS tokens to <|endoftext|> for GPT2, as you can verify as follows:

from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.eos_token)
print(tok.bos_token)

However, I don't think we plan on adding these tokens automatically when tokenizing an input string, because the main use case for GPT2 is open-domain text generation, where these tokens should not be added.
I agree that they could/should be added for fine-tuning.

So I'm not sure if we want to add any special "fine-tune" behavior to the GPT2Tokenizer. @LysandreJik - what do you think?

The behavior of "set add_special_tokens to True, but no special tokens are added even though the tokenizer has special tokens" looks like a bug to me anyway. If the user doesn't want to add special tokens when tokenizing, e.g., as you said, when generating text, the user should set add_special_tokens to False.

I see what you mean @zhujl1991. Thinking about backwards compatibility, and given that add_special_tokens is set to True by default, I still do not think that we should add this feature to the __call__ or encode_plus functions for GPT2. On the other hand, such functionality would be very useful for training/fine-tuning.

I see three options:

1) Overwrite the __call__ method in GPT2 to have add_special_tokens=False by default and append BOS and EOS if it is set to True => I don't like this option, as it's quite hacky and would still not be 100% backward compatible.

2) Add a new method prepare_for_training where the input is prepared for fine-tuning / training as you said.

3) Don't do anything about it and let the user overwrite such a method himself.

I would be fine with option 2), but I also don't think it's that important a feature (option 3). Let's see what @LysandreJik, @sgugger, @thomwolf and @sshleifer think.

IMO this is something that should be written by the user for their specific needs (option 3). We can document more that the tokenizers are pre-set for the most common tasks the corresponding models are used for, to avoid any user being too surprised.

I feel that if we add a method, it will cover some use cases but not all and it will either be overly too complex or only used by a small percentage of the users.
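
For reference, a user-written helper along the lines of option 3 could be as small as the sketch below (the function name is made up, and it only covers the simple "wrap every training document with BOS/EOS" case):

from transformers import GPT2Tokenizer

def prepare_finetuning_examples(tokenizer, texts):
    # User-side sketch: wrap each training document with BOS/EOS before
    # encoding, since the GPT2 tokenizer does not add these tokens itself.
    return [
        tokenizer.encode(tokenizer.bos_token + text + tokenizer.eos_token)
        for text in texts
    ]

tok = GPT2Tokenizer.from_pretrained("gpt2")
batch = prepare_finetuning_examples(tok, ["Hello I am Moin", "Another document"])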

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Ran into this too – this seems like a bug to me, or at the least not intuitive behaviour.

If there's a tokeniser that has an EOS token, and I encode with add_special_tokens=True, I'd expect it to include the EOS token at the end of the sentence.
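
For comparison, tokenizers that do implement build_inputs_with_special_tokens (XLNet, as linked above, or e.g. BERT; the "bert-base-uncased" checkpoint below is just an illustration) add their special tokens under that same flag, while GPT2 does not:

from transformers import BertTokenizer, GPT2Tokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

# BERT wraps the ids in [CLS] ... [SEP]; GPT2 returns only the plain ids.
print(bert_tok.encode("Hello I am Moin", add_special_tokens=True))
print(gpt2_tok.encode("Hello I am Moin", add_special_tokens=True))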
