Hello, I'm trying to use seq2seq models (such as Bart and EncoderDecoderModel (bert2bert)), and I'm a little confused about input_ids, decoder_input_ids, and tgt in the model inputs.
As I understand it, in a seq2seq model the decoder input should have a special token (<s> or similar) before the sentence, and the target should have a special token (</s> or similar) after the sentence. For example: decoder_input = <s> A B C D E, target = A B C D E </s>
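The shifting described above can be sketched in plain Python. Token strings stand in for real tokenizer ids here, and the helper name is just for illustration:

```python
# Sketch of the decoder-input / target shift used in seq2seq training
# (teacher forcing). Token strings stand in for tokenizer ids.
BOS, EOS = "<s>", "</s>"

def make_decoder_pair(tokens):
    """Build the decoder input and the target from one target sentence."""
    decoder_input = [BOS] + tokens   # starts with <s>, no trailing </s>
    target = tokens + [EOS]          # ends with </s>, no leading <s>
    return decoder_input, target

dec_in, tgt = make_decoder_pair(["A", "B", "C", "D", "E"])
print(dec_in)  # ['<s>', 'A', 'B', 'C', 'D', 'E']
print(tgt)     # ['A', 'B', 'C', 'D', 'E', '</s>']
```

At each position the decoder sees the tokens up to step t and is trained to predict the token at step t+1, which is why the two sequences are the same sentence shifted by one.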
So my question is: should I set add_special_tokens=True for the encoder input_ids, and use <s> / </s> like this: input = a b c d e, decoder_input = <s> A B C D E, target = A B C D E </s>?

Hi @jungwhank,
For Bert2Bert, the pad_token is used as the decoder_start_token_id, and the input_ids and labels begin with the cls_token_id ([CLS] for BERT) and end with the sep_token_id ([SEP] for BERT).
For training, all you need to do is:
input_text = "some input text"
target_text = "some target text"
input_ids = tokenizer(input_text, add_special_tokens=True, return_tensors="pt")["input_ids"]
target_ids = tokenizer(target_text, add_special_tokens=True, return_tensors="pt")["input_ids"]
model(input_ids=input_ids, decoder_input_ids=target_ids, labels=target_ids)
The EncoderDecoderModel class takes care of adding the pad_token to the decoder_input_ids.
For inference:
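The shifting the model does internally can be sketched like this. This is a simplified re-implementation for illustration, not the actual library code, and the example ids are made up:

```python
# Simplified sketch of how labels are shifted right to produce
# decoder_input_ids inside the model (not the actual library code).
def shift_tokens_right(label_ids, decoder_start_token_id):
    """Prepend the decoder start token and drop the last label token."""
    return [decoder_start_token_id] + label_ids[:-1]

# For bert2bert the decoder start token is the pad token.
# The ids below are placeholders standing in for [CLS] ... [SEP].
pad_token_id = 0
labels = [101, 2023, 2003, 1037, 7953, 102]
print(shift_tokens_right(labels, pad_token_id))
# [0, 101, 2023, 2003, 1037, 7953]
```

This is why passing decoder_input_ids=target_ids in the training call above works: the model derives the actual shifted decoder inputs itself.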
model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)
Hope this clarifies your question. Also pinging @patrickvonplaten for more info.
Hi, @patil-suraj
Thanks for answering.
Is it the same for BartForConditionalGeneration?
Actually, I want to do a kind of translation task. Are the decoder_input_ids and labels handled the same way there?
@patil-suraj's answer is correct! For the EncoderDecoder framework, one should set model.config.decoder_start_token_id to the BOS token (which does not exist in BERT's case, so we simply use the CLS token).
Bart is a bit different:
For generation, you can just call model.generate(input_ids). input_ids always refers to the encoder input tokens for Seq2Seq models, and it is up to you whether to add special tokens or not; this is not done automatically in the generate function. For training, you pass both input_ids and decoder_input_ids, and in this case the decoder_input_ids should start with Bart's decoder start token, model.config.decoder_start_token_id: model(input_ids, decoder_input_ids=decoder_input_ids)
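Concretely, building Bart's decoder inputs by hand might look like the sketch below. The ids are made-up placeholders, and the start-token value is assumed for illustration; in practice you would read it from model.config.decoder_start_token_id:

```python
# Sketch: building Bart decoder inputs for training from plain ids.
# The id values below are assumed for illustration only.
decoder_start_token_id = 2                 # placeholder for the config value
target_ids = [713, 16, 10, 1296, 2]        # example target ids ending in </s>

# decoder_input_ids must start with decoder_start_token_id and is the
# target sequence shifted right by one position.
decoder_input_ids = [decoder_start_token_id] + target_ids[:-1]
print(decoder_input_ids)  # [2, 713, 16, 10, 1296]
```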
@patrickvonplaten
thanks for answering!
But I have a question: is there a decoder_start_token_id in BartConfig?
Should I just make my decoder_input_ids start with Bart's model.config.bos_token_id, or set model.config.decoder_start_token_id = token_id?
I think I solved the problem. Thanks
@jungwhank Great! Consider joining the awesome HF forum, if you haven't already :) It's the best place to ask such questions. The whole community is there to help you, and your questions will also help the community.