Transformers: T5 Docs training example has shifted labels

Created on 19 Oct 2020 · 11 comments · Source: huggingface/transformers

https://github.com/huggingface/transformers/blob/master/docs/source/model_doc/t5.rst#L42

Here is that link quoted:

Unsupervised denoising training

In teacher-forcing style, the EOS token is then appended to the target sequence, which corresponds to the labels.

In this setup spans of the input sequence are masked by so-called sentinel tokens (a.k.a unique mask tokens) and the output sequence is formed as a concatenation of the same sentinel tokens and the real masked tokens.

Each sentinel token represents a unique mask token for this sentence and should start with <extra_id_0>, <extra_id_1>, ... up to <extra_id_99>. As a default, 100 sentinel tokens are available in transformers.T5Tokenizer.

For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:

  input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
  labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
  # the forward function automatically creates the correct decoder_input_ids
  model(input_ids=input_ids, labels=labels)

1) Shouldn't the labels be unshifted, given that decoder_input_ids = shift_right(labels) @patrickvonplaten @patil-suraj ?

2) @craffel does this look correct to you?

All 11 comments

Hey Sam, it looks like the labels in the example you quoted are not shifted - can you be more specific about why you think the labels are shifted?

Yes, I think the labels should be unshifted here (i.e., the labels should be the plain, unshifted target sequence), since shift_right takes care of preparing the shifted decoder_input_ids.

@craffel I assumed the labels were shifted because:

  • Original: The cute dog walks in the park
  • Input_ids: The <extra_id_0> walks in <extra_id_1> park
  • Labels: <extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>

input_ids starts with unmasked "The", whereas labels starts with a sentinel token.

I'm still not following - are you saying the sentinel token <extra_id_0> is the same as the start-of-sequence token? They are different tokens.

@sshleifer - I don't really understand the problem here either. In the example the labels are provided as:

<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>

which means that decoder_input_ids will be automatically created as:

<s> <extra_id_0> cute dog <extra_id_1> the <extra_id_2>

=> This looks correct to me
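The automatic shift Patrick describes can be illustrated with a small sketch: a simplified pure-Python stand-in for T5's internal `_shift_right` (the real implementation works on tensors inside `T5ForConditionalGeneration`; token strings stand in for token ids here for readability, and `<s>` stands in for the decoder start token, which for T5 is actually the pad token):

```python
# Simplified sketch of T5's _shift_right: prepend the decoder start
# token and drop the last label token, so each decoder input position
# is trained to predict the label at the same position.
def shift_right(labels, decoder_start_token="<s>"):
    return [decoder_start_token] + labels[:-1]

labels = ["<extra_id_0>", "cute", "dog", "<extra_id_1>", "the", "<extra_id_2>", "</s>"]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
# -> ['<s>', '<extra_id_0>', 'cute', 'dog', '<extra_id_1>', 'the', '<extra_id_2>']
```

This matches the decoder_input_ids Patrick lists above: the labels passed to the model are unshifted, and the shift happens inside the forward pass.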

+1 to Patrick's take

Aah, yes, for T5 we just predict the masked-out spans, unlike BART. So this looks correct.
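The "predict only the masked-out spans" point can be sketched with a toy span-corruption routine; `corrupt_spans` below is a hypothetical helper operating on word lists, not T5's actual preprocessing code:

```python
def corrupt_spans(words, spans):
    """Toy T5-style span corruption on a list of words.

    spans: list of (start, end) index pairs (inclusive) to mask.
    Returns (input_words, target_words): the input replaces each masked
    span with one sentinel token; the target is each sentinel followed by
    the words it replaced, closed by a final sentinel. Unlike BART, the
    unmasked words never appear in the target.
    """
    input_words, target_words = [], []
    sentinel = 0
    i = 0
    while i < len(words):
        span = next((s for s in spans if s[0] == i), None)
        if span is not None:
            tok = f"<extra_id_{sentinel}>"
            input_words.append(tok)
            target_words.append(tok)
            target_words.extend(words[span[0]:span[1] + 1])
            sentinel += 1
            i = span[1] + 1
        else:
            input_words.append(words[i])
            i += 1
    target_words.append(f"<extra_id_{sentinel}>")
    return input_words, target_words

words = "The cute dog walks in the park".split()
inp, tgt = corrupt_spans(words, [(1, 2), (5, 5)])
print(" ".join(inp))  # The <extra_id_0> walks in <extra_id_1> park
print(" ".join(tgt))  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```

This reproduces the docs example: the target contains only sentinels and masked words, never the surrounding text.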

In the docs, the </s> is omitted from input_ids, but will be silently added due to #5866. Is this also the correct behavior?

@ahoho => good point - I will update the docs to reflect this behavior

@patrickvonplaten, thanks! Does this mean the docs were incorrect before? I guess my question is, for the denoising training, is it correct to append the </s> token to the input_ids (not labels) or isn't it?

</s> should be appended IMO -> It's just that this is done automatically since #5866 as you mentioned above :-)
