Transformers: T5 special tokens not mapped to unique indices in vocabulary

Created on 19 Jun 2020  ·  16 comments  ·  Source: huggingface/transformers

The docs recommend adding the special eos_token </s> to the end of each string when encoding/decoding with T5Tokenizer. However, this token (and the other special tokens, e.g. unk_token and pad_token) isn't assigned a unique id in the lookup vocabulary; they are mapped to {0, 1, 2}, which are indices for other common words in the vocab. In practice, I find my model fails to properly produce the eos_token since it is associated with blank spaces, so the model produces run-ons during generation.

To reproduce

>>> from transformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')
>>> tokenizer.pad_token
'<pad>'
>>> tokenizer.pad_token_id
0
>>> tokenizer.eos_token
'</s>'
>>> tokenizer.eos_token_id
1
>>> tokenizer.unk_token
'<unk>'
>>> tokenizer.unk_token_id
2
>>> tokenizer.decode([0])
''
>>> tokenizer.decode([1])
''
>>> tokenizer.decode([2])
' ⁇ '



Expected behavior:



>>> tokenizer.decode([0])
'<pad>'
>>> tokenizer.decode([1])
'</s>'
>>> tokenizer.decode([2])
'<unk>'

Environment info

  • transformers version: 2.9.1

All 16 comments

Hey @sarahwie,

Thanks for your issue. I can reproduce the problem and see the reason for it. Currently, we rely on Google's sentencepiece tokenizer (https://github.com/google/sentencepiece) for encoding and decoding in T5. What happens is that tokenizer.decode(tokens) depends on the function sp_model.decode_pieces(tokens), with sp_model being an instance of sentencepiece.SentencePieceProcessor(). To convert a list of tokens such as ["<unk>", "</s>"] into one string we thus rely on sp_model.decode_pieces, so the correct decoding here is a bit out of our control.

To quickly see the problem, @thomwolf @mfuntowicz @n1t0, one can run the following code:

from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-base')
tokenizer.convert_tokens_to_string(["<unk>", "</s>"])  # gives ' ⁇ '
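
For reference, the same behaviour can be reproduced directly against sentencepiece; a minimal sketch, assuming the spiece.model file shipped with t5-base has been downloaded into the working directory:

import sentencepiece as spm

# Load the T5 sentencepiece model (the file name/path here is an assumption).
sp_model = spm.SentencePieceProcessor()
sp_model.Load("spiece.model")

# decode_pieces is what T5Tokenizer.convert_tokens_to_string delegates to;
# the special tokens are not rendered back faithfully.
print(sp_model.decode_pieces(["<unk>", "</s>"]))  # ' ⁇ '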

How do you think we should handle this problem at the moment, @thomwolf @n1t0 @mfuntowicz?

For anyone looking for a quick, temporary fix to the unending-generation problem: override the EOS token with a custom one (note that this fix does not work for unk_token or pad_token; for some reason they can't be re-mapped):

tokenizer = T5Tokenizer.from_pretrained('t5-base')
tokenizer.add_special_tokens({'eos_token':'[EOS]'})

model.resize_token_embeddings(len(tokenizer))

>>> tokenizer.eos_token_id
32100
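
If you go this route, the remapped token also has to appear in your training targets, and generation should stop on the new id; a rough sketch of how the pieces fit together (the input/target strings and generate arguments below are illustrative assumptions, not taken from the original comment):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
tokenizer.add_special_tokens({'eos_token': '[EOS]'})

model = T5ForConditionalGeneration.from_pretrained('t5-base')
model.resize_token_embeddings(len(tokenizer))   # make room for the new id 32100

# Training targets should now end with the remapped token, e.g. "some answer [EOS]".
# At inference time, stop on the new eos_token_id.
input_ids = tokenizer.encode("some input text", return_tensors="pt")
generated = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, max_length=50)
print(tokenizer.batch_decode(generated))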

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Is there any update on this? Does the bug still exist in version 3.4?

Hey guys, I would recommend using our new T5TokenizerFast which solves this problem as can be seen below:

>>> from transformers import T5TokenizerFast
>>> tokenizer = T5TokenizerFast.from_pretrained('t5-base')
>>> tokenizer.pad_token
'<pad>'
>>> tokenizer.pad_token_id
0
>>> tokenizer.eos_token
'</s>'
>>> tokenizer.eos_token_id
1
>>> tokenizer.unk_token
'<unk>'
>>> tokenizer.unk_token_id
2
>>> tokenizer.decode([0])
'<pad>'
>>> tokenizer.decode([1])
'</s>'
>>> tokenizer.decode([2])
'<unk>'

I also made a PR to fix the slow T5Tokenizer. It probably won't make it into v3.5, but it should be in the version after that.

@patrickvonplaten

Two quick questions:

  • Is there any downside to using the fast tokenizer?
  • What's the best way to patch this fix for the slow tokenizer into an existing transformers install?

Bigger question:
I ran into this no-EOS generation problem when using finetune.py, but when I set up my own T5 trainer, I somehow managed to sidestep the issue. Here are the details. Any idea why I wasn't affected once I set it up on my own?

Each item of my data set (source and target) is configured as

# max_src_len is length of longest sentence in input set
tokenized_inputs = self.tokenizer.batch_encode_plus(
         [src], max_length=max_src_len, padding="max_length", return_tensors="pt")

where each src is a string of words, with no EOS token appended (since batch_encode will append it).
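
As a quick sanity check of that assumption, here is a sketch with a toy stand-in for src (recent versions of T5Tokenizer append the EOS id automatically):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')
enc = tokenizer.batch_encode_plus(
    ["The house is wonderful."],          # toy stand-in for src
    max_length=16, padding="max_length", return_tensors="pt")
print(enc["input_ids"][0])                # the EOS id (1) sits right before the pad ids (0)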

I then train with this forward function:

def forward(model, device, batch):
    src_ids = batch["source_ids"].to(device, dtype=torch.long)
    src_mask = batch["source_mask"].to(device, dtype=torch.long)
    tgt_ids = batch["target_ids"].to(device, dtype=torch.long)

    # padded ids (pad=0) are set to -100, which means ignore for loss calculation
    tgt_ids[tgt_ids == 0] = -100
    label_ids = tgt_ids.to(device)
    out_dict = model(src_ids, attention_mask=src_mask, labels=label_ids, return_dict=True)
    loss, logits = out_dict['loss'], out_dict['logits']
    return loss, logits

# then do appropriate zero_grad(), loss.backward, etc
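
For completeness, the loop around that forward() looks roughly like this (a minimal sketch; the AdamW optimizer, learning rate, and train_loader name are my assumptions, not taken from the original setup):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # assumed optimizer/lr
model.train()
for batch in train_loader:                                   # assumed DataLoader name
    optimizer.zero_grad()
    loss, logits = forward(model, device, batch)
    loss.backward()
    optimizer.step()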

Models I train in this way do learn to generate a final token with ID=1. In particular I wrote the following verification function:

from typing import Tuple, Union

import torch

def masked_token_match(tgt_ids: torch.Tensor, outputs: torch.Tensor,
                       return_indices=False) -> Union[Tuple[int, int], Tuple[int, int, torch.Tensor]]:
    # left-shift to drop the decoder start token
    output_shifted = outputs[:, 1:]

    # create output_padded, which truncates output at tgt_ids size, filling with pad tokens
    if output_shifted.shape[1] <= tgt_ids.shape[1]:
        output_padded = torch.zeros_like(tgt_ids)
        output_padded[:output_shifted.shape[0], :output_shifted.shape[1]] = output_shifted
    else:       # output_shifted is longer, so copy all rows (bs) but only up to tgt_ids length
        output_padded = output_shifted[:, :tgt_ids.shape[1]]

    # positions either match exactly, or are ignored because the target is pad (0) or EOS (1)
    match_indices = output_padded == tgt_ids
    matches_no_eos = torch.logical_or(match_indices, tgt_ids < 2)    # ignore pad and EOS
    matches_with_eos = torch.logical_or(match_indices, tgt_ids < 1)  # ignore just pad
    total_matches_no_eos = torch.sum(torch.all(matches_no_eos, dim=1))
    total_matches_with_eos = torch.sum(torch.all(matches_with_eos, dim=1))

    return total_matches_no_eos, total_matches_with_eos
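
A hypothetical call during evaluation (variable names follow the surrounding snippets; generated_ids comes from the model.generate call shown a bit further down):

# Hypothetical usage; counts how many sequences in the batch match the targets exactly.
total_no_eos, total_with_eos = masked_token_match(tgt_ids, generated_ids)
print(f"exact matches ignoring EOS/pad: {int(total_no_eos)} / {tgt_ids.shape[0]}")
print(f"exact matches including EOS:    {int(total_with_eos)} / {tgt_ids.shape[0]}")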

For a copy task (I was debugging the original finetune behavior), I ask T5 to either
1) copy src => src (i.e. just copy word for word), or
2) copy src => (first word of source) (i.e. just copy the first word and then generate EOS).

The model learns to complete both tasks and append an EOS token after only 15-20k training examples.
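
For concreteness, the training pairs for these two debugging tasks can be built from any list of sentences; a toy sketch (the example sentences are mine, not from the original data):

sentences = ["The house is wonderful.", "I like tea."]        # toy example inputs
copy_pairs = [(s, s) for s in sentences]                      # task 1: sent => sent
first_word_pairs = [(s, s.split()[0]) for s in sentences]     # task 2: sent => first word of sent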

So why did this setup work? One possibility is that the model could still be generating additional non-zero (i.e. non-pad) tokens after EOS==1 in these sequences. But I seem to have verified that this isn't occurring, because I use this generation code during eval:

generated_ids = model.generate(src_ids, attention_mask=src_mask)       # (batch x seq length)
outputs_decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

and the outputs_decoded do correctly stop where they are supposed to.

There is no real downside to using the fast tokenizers if you don't have to look into the code.

You can take a look at the PR to see what one would have to change to make it work with an existing code base.

@sshleifer, @patrickvonplaten

I still don't understand why my tweaked version worked and did appropriately truncate the generations (see details above). @sshleifer, maybe you can easily see what finetune.py is doing differently?

@jsrozner I don't know either, but I'm interested to find out.
1) What is max_src_len?
2) When were you using finetune.py? Do you remember your command? T5Tokenizer started adding </s> to inputs a few months ago, so maybe you were using it before that? Can you reproduce the breakage with the current finetune.py/current transformers?
Another random idea: finetune.py automatically uses config.task_specific_params['summarization'] for generation, which might be bad for your use case.

cc @danyaljj
I'm just going to consolidate the discussion from #7796 here. (Also relevant is the HF forum thread.)

max_src_len above is the maximum length of any input sequence, counted in...wait for it...number of characters. Whoops. That was dumb. I intended to go through and find the maximum sequence length in tokens. I'll fix that, but I don't think it affects anything else: it turns out that max_src_len, max_tgt_len = (250, 250) for the inputs I was using. But that just means we had a lot of padding.

I was using finetune.py just last month, so I don't think it was the EOS token.

The "gibberish" generation still occurs if I just use finetune_t5.sh as written. If I do either of the following, the outputs are correct:
1) Comment out use_task_specific_params(self.model, "summarization") in finetune.py
2) Add min_length to the generate call:

        generated_ids = self.model.generate(
            batch["input_ids"],
            attention_mask=batch["attention_mask"],
            use_cache=True,
            decoder_start_token_id=self.decoder_start_token_id,
            num_beams=self.eval_beams,
            max_length=self.eval_max_length,
            min_length=0
        )

This is because config.json for t5-small and t5-base has the following (@danyaljj, this is also the answer to our question from the HF forum about where prefix is getting picked up):

  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },

But it looks like the only param that really mattered was min_length. Beam size, max_length, prefix, etc. weren't causing the problem. I verified this on both the (sent) => (sent) copy and the (sent) => (first word of sent) tasks.

So at least for my use case it seems like the tokenizer decode bug was not causing a problem? It seems like even though we, as users, couldn't decode the tokens correctly, the model still knew that 1 == EOS and that after an EOS it should print PAD. The problem was that we were forcing it to generate at least 30 tokens, hence all the gibberish I was seeing.

@sshleifer, does this make sense with your understanding of the finetune script? i.e., that failing to decode EOS shouldn't matter?

@danyaljj, given that you wanted relatively short outputs of the answers to questions, this seems like it might fix the issue for you? Give it a try and see what happens?

Thanks, @jsrozner! 🙏

In theory, this explains my issue as well since my outputs were quite short. I will repeat it and report the results here!

Yes, that makes sense, and thanks for consolidating.
We should link to this discussion in the T5 docs!

What files should I change to update docstrings?

Also, can you take a look at a few more questions / related issues so that we can clean things up? These are roughly the same questions I had in a post in the HF thread.

decoder_input_ids vs labels

  • When would we want to pass both?
  • Here's an example (that has been linked to in the HF forums) that seems to do it wrong. In particular, it passes both decoder_input_ids and lm_labels but does not right-shift the decoder_input_ids. This does not give the effect we want, since nothing ever gets right-shifted. The author probably wants to pass only labels and omit decoder_input_ids?
  • Finally, the documentation for T5ForConditionalGeneration says that if decoder_input_ids are not provided then input_ids will be used. But actually labels will be used?

In finetune.py

  • _step manually right-shifts rather than letting the model do it for us by just passing labels. Why?
  • _step calculates the loss manually, but I want to confirm that if we had also passed labels into the self(...) call, we would have gotten the same loss output when label_smoothing == 0.

@jsrozner

  • Docstrings are in modeling_t5.py and https://github.com/huggingface/transformers/blob/master/docs/source/model_doc/t5.rst
  • I can't think of a good reason to pass both decoder_input_ids and labels.
  • Correct, that example is wrong.
  • The documentation for T5ForConditionalGeneration is wrong, as you suggest.

In finetune.py:

  • We pass decoder_input_ids to avoid having the model calculate the loss. The reasoning is that, in some environments (namely TPU), replacing pad_token_id with -100 is expensive, and we do not want the loss to consider pad_token_id.

  • You would not get the same loss if you passed labels to the model, because it would not ignore pad_token_id (see the sketch below).
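
To make the two conventions concrete, here is a minimal sketch of both loss computations (t5-small and the example strings are placeholders; the manual path mirrors the finetune.py approach, using the model's internal _shift_right helper for illustration):

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer.encode("translate English to German: The house is wonderful.",
                             return_tensors="pt")
target_ids = tokenizer.encode("Das Haus ist wunderbar.", return_tensors="pt")

# Convention 1: pass labels only. The model right-shifts them internally to build
# decoder_input_ids, and any position set to -100 is ignored by the loss.
labels = target_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100
loss_from_labels = model(input_ids=input_ids, labels=labels, return_dict=True)["loss"]

# Convention 2 (finetune.py style): right-shift by hand, pass decoder_input_ids,
# and compute the loss yourself so that pad positions are ignored without using -100.
decoder_input_ids = model._shift_right(target_ids)
logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids,
               return_dict=True)["logits"]
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
loss_manual = loss_fct(logits.view(-1, logits.size(-1)), target_ids.view(-1))

# With label_smoothing == 0 and consistent masking, the two losses should match.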