This has been useful as MarianTokenizer.prepare_translation_batch and MBartTokenizer.prepare_translation_batch, and would also be a useful addition for T5Tokenizer and BartTokenizer.
@LysandreJik mentioned this in a PR so I wanted to see what others thought before starting work.
@patrickvonplaten @sgugger @mfuntowicz @patil-suraj ?
Proposed signature (I would not add this to PreTrainedTokenizer, just rename the Marian/MBart implems and add implems for Bart and T5. When pegasus and blenderbot get merged, I would also add implems there and consider a helper function.)
def prepare_seq2seq_batch(
    self,
    src_texts: List[str],
    tgt_texts: Optional[List[str]] = None,
    max_length: Optional[int] = None,
    max_target_length: Optional[int] = None,
    padding: str = "longest",
    return_tensors: str = "pt",  # assumed default; referenced below
    **kwargs,
) -> BatchEncoding:
    if max_length is None:
        max_length = self.max_len
    if max_target_length is None:
        max_target_length = max_length  # default to max_length; no need to specify twice for translation
    model_inputs: BatchEncoding = self(
        src_texts,
        add_special_tokens=True,
        return_tensors=return_tensors,
        max_length=max_length,
        padding=padding,
        truncation=True,
        **kwargs,
    )
    if tgt_texts is None:
        return model_inputs
    # Here classes can implement logic to put decoder_start_token_id at the front if they want,
    # or deal with language codes for multilingual models.
    decoder_inputs: BatchEncoding = self(
        tgt_texts,
        add_special_tokens=True,
        return_tensors=return_tensors,
        padding=padding,
        max_length=max_target_length,
        truncation=True,
        **kwargs,
    )
    for k, v in decoder_inputs.items():
        model_inputs[f"decoder_{k}"] = v
    return model_inputs  # still a BatchEncoding
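For illustration, here is a hypothetical call against the proposed method (checkpoint name and texts are only examples; the output keys follow the generic implementation above):

from transformers import MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
batch = tok.prepare_seq2seq_batch(
    src_texts=["I love seq2seq.", "Batches should be easy to make."],
    tgt_texts=["Ich liebe seq2seq.", "Batches sollten einfach sein."],
    max_length=64,
)
# batch is a BatchEncoding with input_ids, attention_mask,
# decoder_input_ids and decoder_attention_mask, ready to feed to the model.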
Motivation:
less important: only call the tokenizer once.
more important: the best time to think about special tokens is when you are writing the tokenizer, not when you are designing your training loop. We've had a lot of bugs recently involving special tokens in seq2seq training loops that likely could have been avoided if, when the tokenizer was added, the author had written a method that said: "This is how you make a batch."
This will actually be really helpful. Many people run into bugs by forgetting to add special tokens or to shift labels to the right. Should this also prepare labels along with decoder_input_ids, as needed by seq2seq models?
This sounds useful, and looks like it could then directly be used as a data_collator in Trainer.
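For instance, a minimal collator sketch (class name and example fields are hypothetical; it only assumes the tokenizer implements prepare_seq2seq_batch as proposed above):

from typing import Dict, List

import torch


class Seq2SeqDataCollator:
    """Sketch of a Trainer data_collator built on the proposed method."""

    def __init__(self, tokenizer, max_length: int = 512):
        self.tokenizer = tokenizer  # any tokenizer exposing prepare_seq2seq_batch
        self.max_length = max_length

    def __call__(self, examples: List[Dict[str, str]]) -> Dict[str, torch.Tensor]:
        # examples like [{"src": "...", "tgt": "..."}, ...]; field names are an assumption
        batch = self.tokenizer.prepare_seq2seq_batch(
            src_texts=[ex["src"] for ex in examples],
            tgt_texts=[ex["tgt"] for ex in examples],
            max_length=self.max_length,
        )
        return dict(batch)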
yes, optionally making labels would be very useful!
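One possible shape for that, as a sketch only (the shift helper below is illustrative, not the library's existing code, and the exact start token is model-specific):

import torch

def shift_right(labels: torch.Tensor, decoder_start_token_id: int) -> torch.Tensor:
    # Build decoder_input_ids by prepending the start token and dropping the last label,
    # so the decoder input at step t predicts the label at step t.
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    return shifted

# Hypothetical wiring inside prepare_seq2seq_batch, after tokenizing tgt_texts:
# labels = decoder_inputs["input_ids"]
# model_inputs["labels"] = labels
# model_inputs["decoder_input_ids"] = shift_right(labels, decoder_start_token_id)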
What do you think @thomwolf ?
I am not really in favor of adding this code... I think the user should have understood the library (tokenizers) well enough to not need such a helper function. I think we could add it in a notebook / example and point to it if people ask, but IMO the tokenizers already have too many functions exposed to the user...
> I think the user should have understood the library (tokenizers) well enough to not need such a helper function.
1) I got this wrong the first time for Marian, and the only way to support finetuning multilingual models was to add prepare_translation_batch.
2) Suraj just fixed a bug (that was also my fault, but nobody noticed) about adding decoder_start_token_id to the beginning of t5.decoder_input_ids. Both of us know the library pretty well and still had trouble getting this right.
3) The tokenizers are already set up to create model-ready batches for GLUE tasks (an awesome feature IMO; see the sketch after this list). Why shouldn't they create model-ready batches for seq2seq tasks?
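For comparison, the existing single-call GLUE-style path looks roughly like this (checkpoint and sentences are just an example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
glue_batch = tok(
    ["The cat sat.", "It rained all day."],          # sentence1 column
    ["A cat was sitting.", "The weather was wet."],  # sentence2 column
    padding=True,
    truncation=True,
    return_tensors="pt",
)
# glue_batch: input_ids, token_type_ids, attention_mask -- ready for a sequence-pair model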
@patrickvonplaten said that he is ok with this change if @mfuntowicz and @n1t0 are ok with it.
What do you guys think?
@patil-suraj we can move forward on your PR once we get some consensus that this is an OK direction to go in.
I'm feeling pretty strongly about this, as I just messed up decoder special tokens for pegasus finetuning in a very avoidable way :(. The tokenizers should handle special tokens, not some random, barely tested examples/processor* code.
After thinking a bit more about this and talking to @sshleifer, I am fine with the PR. I now agree that there are a lot of use cases for prepare_seq2seq_batch, and since it's going to be added to each model specifically, it's clean as well. Also, since it will replace prepare_translation_batch, it does not really increase function exposure.
Two things I'm still a bit uncertain about:
a) It does add a "layer" on top of the __call__ function. But I guess we do the same with generate() on top of forward().
b) Will we then also have to add this functionality in Rust? Will it be difficult to add? @n1t0
Overall, though, I'm now in favor of this design as well.
Moving forward, as discussed with @LysandreJik.