Transformers: T5 Docs training example has shifted labels

Created on 19 Oct 2020 · 11 comments · Source: huggingface/transformers

https://github.com/huggingface/transformers/blob/master/docs/source/model_doc/t5.rst#L42

Here is that link quoted:

Unsupervised denoising training

In teacher-forcing style, the EOS token is then appended to the target sequence, which corresponds to the labels.

In this setup spans of the input sequence are masked by so-called sentinel tokens (a.k.a unique mask tokens) and the output sequence is formed as a concatenation of the same sentinel tokens and the real masked tokens.

Each sentinel token represents a unique mask token for this sentence and should start with <extra_id_0>, <extra_id_1>, ... up to <extra_id_99>. As a default, 100 sentinel tokens are available in transformers.T5Tokenizer.

For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:

  input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
  labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
  # the forward function automatically creates the correct decoder_input_ids
  model(input_ids=input_ids, labels=labels)

1) Shouldn't the labels be unshifted, given that decoder_input_ids = shift_right(labels) @patrickvonplaten @patil-suraj ?

2) @craffel does this look correct to you?

All 11 comments

Hey Sam, it looks like the labels in the example you quoted are not shifted - can you be more specific about why you think the labels are shifted?

Yes, I think the labels should be unshifted here (i.e., the labels should be the plain, unshifted target sequence), since shift_right takes care of preparing the shifted decoder_input_ids.

@craffel I assumed the labels were shifted because:

  • Original: The cute dog walks in the park
  • Input_ids: The <extra_id_0> walks in <extra_id_1> park
  • Labels: <extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>

input_ids starts with unmasked "The", whereas labels starts with a sentinel token.

I'm still not following - are you saying the sentinel token <extra_id_0> is the same as the start-of-sequence token? They are different tokens.

@sshleifer - I don't really understand the problem here either. In the example the labels are provided as:

<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>

which means that decoder_input_ids will be automatically created as:

<s> <extra_id_0> cute dog <extra_id_1> the <extra_id_2>

=> This looks correct to me
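The automatic shift Patrick describes can be illustrated with a small sketch: a simplified pure-Python stand-in for T5's internal `_shift_right` (the real implementation works on tensors inside `T5ForConditionalGeneration`; token strings stand in for token ids here for readability, and `<s>` stands in for the decoder start token, which for T5 is actually the pad token):

```python
# Simplified sketch of T5's _shift_right: prepend the decoder start
# token and drop the last label token, so each decoder input position
# is trained to predict the label at the same position.
def shift_right(labels, decoder_start_token="<s>"):
    return [decoder_start_token] + labels[:-1]

labels = ["<extra_id_0>", "cute", "dog", "<extra_id_1>", "the", "<extra_id_2>", "</s>"]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
# -> ['<s>', '<extra_id_0>', 'cute', 'dog', '<extra_id_1>', 'the', '<extra_id_2>']
```

This matches the decoder_input_ids Patrick lists above: the labels passed to the model are unshifted, and the shift happens inside the forward pass.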

+1 to Patrick's take

Aah, yes, for T5 we just predict the masked-out spans, unlike BART. So this looks correct.
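The "predict only the masked-out spans" point can be sketched with a toy span-corruption routine; `corrupt_spans` below is a hypothetical helper operating on word lists, not T5's actual preprocessing code:

```python
def corrupt_spans(words, spans):
    """Toy T5-style span corruption on a list of words.

    spans: list of (start, end) index pairs (inclusive) to mask.
    Returns (input_words, target_words): the input replaces each masked
    span with one sentinel token; the target is each sentinel followed by
    the words it replaced, closed by a final sentinel. Unlike BART, the
    unmasked words never appear in the target.
    """
    input_words, target_words = [], []
    sentinel = 0
    i = 0
    while i < len(words):
        span = next((s for s in spans if s[0] == i), None)
        if span is not None:
            tok = f"<extra_id_{sentinel}>"
            input_words.append(tok)
            target_words.append(tok)
            target_words.extend(words[span[0]:span[1] + 1])
            sentinel += 1
            i = span[1] + 1
        else:
            input_words.append(words[i])
            i += 1
    target_words.append(f"<extra_id_{sentinel}>")
    return input_words, target_words

words = "The cute dog walks in the park".split()
inp, tgt = corrupt_spans(words, [(1, 2), (5, 5)])
print(" ".join(inp))  # The <extra_id_0> walks in <extra_id_1> park
print(" ".join(tgt))  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```

This reproduces the docs example: the target contains only sentinels and masked words, never the surrounding text.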

In the docs, the </s> is omitted from input_ids, but will be silently added due to #5866. Is this also the correct behavior?

@ahoho => good point - I will update the docs to reflect this behavior

@patrickvonplaten, thanks! Does this mean the docs were incorrect before? I guess my question is, for the denoising training, is it correct to append the </s> token to the input_ids (not labels) or isn't it?

</s> should be appended IMO -> It's just that this is done automatically since #5866 as you mentioned above :-)
