Transformers: Proposal: seq2seq tokenizers expose a prepare_seq2seq_batch method

Created on 28 Jul 2020 · 11 comments · Source: huggingface/transformers

This has been useful in MarianTokenizer.prepare_translation_batch and MBartTokenizer.prepare_translation_batch, and would also be a useful addition for T5Tokenizer and BartTokenizer.
@LysandreJik mentioned this in a PR so I wanted to see what others thought before starting work.
@patrickvonplaten @sgugger @mfuntowicz @patil-suraj ?

Proposed signature (I would not add this to PretrainedModel, just rename the Marian/MBart implems and add implems for Bart, T5. When pegasus and blenderbot get merged, I would also add implems there, or consider a helper function.)

# requires: from typing import List, Optional; from transformers import BatchEncoding
def prepare_seq2seq_batch(
        self,
        src_texts: List[str],
        tgt_texts: Optional[List[str]] = None,
        max_length: Optional[int] = None,
        max_target_length: Optional[int] = None,
        padding: str = "longest",
        return_tensors: Optional[str] = None,
        **kwargs,
    ) -> BatchEncoding:
        if max_length is None:
            max_length = self.max_len
        if max_target_length is None:
            # default to max_length: no need to specify twice for translation
            max_target_length = max_length
        model_inputs: BatchEncoding = self(
            src_texts,
            add_special_tokens=True,
            return_tensors=return_tensors,
            max_length=max_length,
            padding=padding,
            truncation=True,
            **kwargs,
        )
        if tgt_texts is None:
            return model_inputs
        # Here classes can implement logic to put decoder_start_token_id at the front if they want,
        # or deal with language codes for multilingual models.
        decoder_inputs: BatchEncoding = self(
            tgt_texts,
            add_special_tokens=True,
            return_tensors=return_tensors,
            padding=padding,
            max_length=max_target_length,
            truncation=True,
            **kwargs,
        )
        for k, v in decoder_inputs.items():
            model_inputs[f"decoder_{k}"] = v
        return model_inputs # still a BatchEncoding
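
For illustration, a minimal usage sketch of how the proposed method might be called in a fine-tuning script. The availability of prepare_seq2seq_batch on MarianTokenizer with exactly this signature is an assumption of the example; the checkpoint name is only illustrative.

from transformers import MarianMTModel, MarianTokenizer

# Hypothetical usage sketch, assuming prepare_seq2seq_batch exists on
# MarianTokenizer with the signature proposed above.
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["I love seq2seq models."],
    tgt_texts=["Ich liebe seq2seq-Modelle."],
    return_tensors="pt",
)
# batch is a BatchEncoding containing input_ids, attention_mask,
# decoder_input_ids, decoder_attention_mask, ...
outputs = model(**batch)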

All 11 comments

Motivation:
- less important: only call the tokenizer once
- more important: the best time to think about special tokens is when you are writing the tokenizer, not when you are designing your training loop. We've had a lot of bugs recently involving special tokens in seq2seq training loops that likely could have been avoided if, when the tokenizer was added, the author had written a method that said: "This is how you make a batch".

This will actually be really helpful. Many people run into bugs by forgetting to add special tokens or to shift labels to the right. Should this also prepare labels along with decoder_input_ids, as needed by seq2seq models?
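
For concreteness, a minimal sketch of the usual convention being referred to: labels are the raw target token ids, and decoder_input_ids are the labels shifted right with decoder_start_token_id at the front. The shift_tokens_right helper below is written out inline for illustration only, not an existing tokenizer method.

import torch

def shift_tokens_right(labels: torch.Tensor, decoder_start_token_id: int) -> torch.Tensor:
    # Illustrative helper: build decoder_input_ids from labels by prepending the
    # decoder start token and dropping the last position (teacher forcing).
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    return decoder_input_ids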

This sounds useful, and looks like it could then directly be used as a data_collator in Trainer.
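
For example, something like this hypothetical collator could work, assuming the dataset yields (source, target) string pairs and a tokenizer with the proposed method is in scope:

from typing import List, Tuple

def seq2seq_collator(batch: List[Tuple[str, str]]):
    # Hypothetical Trainer data_collator: the tokenizer handles padding,
    # truncation, and special tokens, so nothing seq2seq-specific leaks
    # into the training loop.
    src_texts = [src for src, _ in batch]
    tgt_texts = [tgt for _, tgt in batch]
    return tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, return_tensors="pt")

# Trainer(model=model, data_collator=seq2seq_collator, ...)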

yes, optionally making labels would be very useful!

What do you think @thomwolf ?

I am not really in favor of adding this code... I think the user should have understood the library (tokenizers) well enough to not need such a helper function. I think we could add it in a notebook / example and point to it if people ask, but IMO the tokenizers already have too many functions exposed to the user...

I think the user should have understood the library (tokenizers) well enough to not need such a helper function.

1) I got this wrong the first time for Marian, and the only way to support finetuning multilingual models was to add prepare_translation_batch.
2) Suraj just fixed a bug (that was also my fault, but nobody noticed) about adding decoder_start_token_id to the beginning of t5.decoder_input_ids. Both of us know the library pretty well and had trouble getting this right.
3) The tokenizers are already setup to create model-ready batches for GLUE tasks (an awesome feature IMO). Why shouldn't they create model-ready batches for seq2seq tasks?

@patrickvonplaten said that he is ok with this change if @mfuntowicz and @n1t0 are ok with it.
What do you guys think?

@patil-suraj we can move forward on your PR once we get some consensus that this is an OK direction to go in.

I'm feeling pretty strongly about this, as I just messed up decoder special tokens for pegasus finetuning in a very avoidable way :(. The tokenizers should handle special tokens, not some random, barely tested examples/processor* code.

After thinking a bit more about this and talking to @sshleifer, I am fine with the PR. I agree now that there are a lot of use cases when prepare_seq2seq_batch is used and since it's going to be added to each model specifically, it's clean as well. Also since it will replace prepare_translation_batch, it does not really increase function exposure.

Two things that I'm still a bit uncertain about are:
a) It does add a "layer" on top of the __call__ function. But I guess we do the same with generate() on top of forward().
b) Will we then have to add this functionality in Rust as well? Will it be difficult to add? @n1t0

Overall, though, I'm now in favor of this design as well.

Moving forward, as discussed with @LysandreJik .
