How to train a custom seq2seq model with BertModel
Hi,
I would like to use a Chinese pretrained model based on BertModel, so I tried the Encoder-Decoder Model, but it seems the Encoder-Decoder Model is not meant for conditional text generation. BartModel seems to be the model I need, but I cannot load pretrained BertModel weights into BartModel.
By the way, could I fine-tune a BartModel for seq2seq with custom data?
Any suggestions? Thanks.
Hi @chenjunweii - thanks for your issue! I will take a deeper look at the EncoderDecoder framework at the end of this week and should add a google colab on how to fine-tune it.
Using a Bert-Bert model for a seq2seq task should work with the simpletransformers library; there is working code for it.
But there is one strange thing: the saved model loads the wrong weights.
Predicting the same string multiple times works correctly, but loading the model again each time generates a new result every time. @patrickvonplaten
Hi @flozi00,
could you add a code snippet here that reproduces this bug?
Of course, it should be reproducible using this code:
import logging

import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

train_data = [
    ["one", "1"],
    ["two", "2"],
]
train_df = pd.DataFrame(train_data, columns=["input_text", "target_text"])

eval_data = [
    ["three", "3"],
    ["four", "4"],
]
eval_df = pd.DataFrame(eval_data, columns=["input_text", "target_text"])

model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 10,
    "train_batch_size": 2,
    "num_train_epochs": 10,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "evaluate_generated_text": True,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "max_length": 15,
    "manual_seed": 4,
}

encoder_type = "roberta"

model = Seq2SeqModel(
    encoder_type,
    "roberta-base",
    "bert-base-cased",
    args=model_args,
    use_cuda=True,
)

model.train_model(train_df)
results = model.eval_model(eval_df)
print(model.predict(["five"]))

# Reload the saved model and predict again; this is where the wrong weights show up.
model1 = Seq2SeqModel(
    encoder_type,
    encoder_decoder_name="outputs",
    args=model_args,
    use_cuda=True,
)
print(model1.predict(["five"]))
It's the sample code from the simpletransformers documentation.
The dataset size doesn't matter.
https://github.com/ThilinaRajapakse/simpletransformers/blob/master/README.md#encoder-decoder
Hey @flozi00, I think #4680 fixes the error.
@chenjunweii - a Bert2Bert model using the EncoderDecoder framework should be the right approach here! You can use one Bert model as the encoder and another Bert model as the decoder. You will have to fine-tune the EncoderDecoder model a bit, but it should work fine!
You can load the model via:
from transformers import EncoderDecoderModel
model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased')  # initialize Bert2Bert
and train it on conditional text generation, providing the input_ids as context, the decoder_input_ids as the text to generate, and lm_labels as your shifted text to generate. Think of decoder_input_ids and lm_labels as your normal inputs for causal text generation and input_ids as the context to condition the model on. I will soon provide a notebook that makes this clearer.
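For concreteness, here is a minimal sketch of a single training step along these lines (an illustration only; recent transformers versions expose the class as EncoderDecoderModel and accept labels instead of the older lm_labels, building the shifted decoder inputs internally):

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

# These two ids let the model build the shifted decoder inputs itself.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("the source text used as context", return_tensors="pt")   # input_ids
targets = tokenizer("the target text to generate", return_tensors="pt")      # labels

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=targets.input_ids,  # shifted internally to form the decoder inputs
)
outputs.loss.backward()  # then step your optimizer as usual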
Thank you for working on this problem and thank you for 🤗!
It looks like it is finally possible to write seq2seq models in under 10 lines of code, yay!
But I still have some questions and concerns about the EncoderDecoder framework.
Documentation says that "Causal mask will also be used by default", but I did not find how to change it. E.g. what if I am training the model without teacher forcing (just generating words one by one during training), or if I am doing inference? I would suggest adding one more argument to forward that would make it clearer both when causal masking is used and how to enable/disable it. What do you think?
It also feels weird to use BERT as a decoder. BERT is a model that is a) non-autoregressive and b) pre-trained without cross-attention modules. It is also unclear at which point the cross-attention modules are created. It would be great, if possible, to add something like a TransformerDecoder model.
Hey @Guitaricet :-) ,
First, at the moment only Bert2Bert works with the encoder-decoder framework. Also, if you use Bert as a decoder you will always use a causal mask. At the moment I cannot think of an encoder-decoder in which the decoder does not use a causal mask, so I don't see a reason why one would want to disable it. Can you give me an example where the decoder should not have a causal mask?
Do you mean auto-regressive language generation by "generating words one by one"? Auto-regressive language modeling always requires a causal mask...
Regarding the cross-attention layers: when a model like Bert is used as a decoder, cross-attention layers are added if it is not already an encoder-decoder model (like Bart or T5), and in that case it does not make sense to use the encoder-decoder wrapper. The cross-attention layers are initialized with random weights and will have to be fine-tuned. I agree that this should be made clearer in the documentation!

I'm trying to build a Bert2Bert model using EncoderDecoder, but I have a couple of quick questions regarding the format of inputs and targets for the BERT decoder.
What exactly is a good way to format the conditional input and mask for the decoder? For example, if I want to feed the decoder [I, am] and make it output [I, am, happy], how exactly do I mask the input? Do I give the decoder [CLS, I, am, MASK, ..., MASK, SEP], where the number of MASKs is such that the total number of tokens is a fixed length (like 512)? Or do I just input [CLS, I, am, MASK, SEP, PAD, ..., PAD]?
Similarly, what should the decoder's output be? Should the first token (the "output" of CLS) be the token "I"?
Lastly, is there a website or resource that explains the input and output representations of the text given to the decoder in Bert2Bert? I don't think the authors of the paper have released their code yet.
Thanks!
I will soon release a bert2bert notebook that will show how to do this. You can also take a look at this:
https://github.com/huggingface/transformers/issues/4647
Maybe it helps.
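Until that notebook is out, a minimal sketch of the usual Bert2Bert target preparation (an illustration, not an official recipe): the decoder input is simply the target text as [CLS] ... [SEP] followed by [PAD] tokens, with no [MASK] tokens, and padded positions are excluded from the loss:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Target "I am happy" becomes [CLS] I am happy [SEP] [PAD] ... up to max_length.
target = tokenizer("I am happy", padding="max_length", max_length=16, return_tensors="pt")

labels = target.input_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100  # padding positions are ignored by the loss

# decoder_input_ids are just target.input_ids; the causal (left-to-right) shift
# is handled inside the model, so no manual [MASK] filling is needed.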
Thank you @patrickvonplaten for the clarification.
It is very possible that both of these cases are rare, so the library may not need a causal_masking argument, but at least some clarification may be needed. This is the reason why I found this issue in the first place.
A dedicated TransformerDecoder class would be a much clearer way if you want to train a decoder from scratch. I also noticed that the config.is_decoder option is only documented in BertModel and not in the BertConfig class. Adding it would help a lot. (I only found it because I thought it was not documented at all and wanted to check my claim by searching for "is_decoder" in the source code.)
Again, thank you for your work, 🤗 is what the NLP community has needed for quite some time!
UPD: more reasons to use a different attention mask (not for seq2seq though): XLNet-like or ULM-like pre-training.
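On the is_decoder point, a small sketch of where it lives on the config side (an illustration; in recent transformers versions add_cross_attention=True is also needed to get cross-attention layers):

from transformers import BertConfig, BertLMHeadModel

config = BertConfig.from_pretrained(
    "bert-base-cased",
    is_decoder=True,           # enables the causal mask
    add_cross_attention=True,  # adds (randomly initialized) cross-attention layers
)
decoder = BertLMHeadModel.from_pretrained("bert-base-cased", config=config)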
Hi @patrickvonplaten ,
Thanks for the clarification on this topic and for the great work you've been doing on those seq2seq models.
Is this notebook you mentioned here already available?
Thanks.
Yeah, the code is ready in this PR: https://github.com/huggingface/transformers/tree/more_general_trainer_metric .
The script to train an Encoder-Decoder model can be accessed here: https://github.com/huggingface/transformers/blob/more_general_trainer_metric/src/transformers/bert_encoder_decoder_summary.py
And in order for the script to work, you need to use this Trainer class:
https://github.com/huggingface/transformers/blob/more_general_trainer_metric/src/transformers/trainer.py
I'm currently training the model myself. When the results are decent, I will publish a little notebook.
Hi @patrickvonplaten, thanks for sharing the scripts. However, the second link for training an encoder-decoder model is not found. Could you please upload this script? Thanks.
Sorry, I deleted the second link. You can see all the necessary code on this model page:
https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16#bert2bert-summarization-with-%F0%9F%A4%97-encoderdecoder-framework
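A minimal sketch of loading that checkpoint for summarization (an illustration; see the model card above for the exact tokenizer and generation settings to pair with it):

from transformers import BertTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # assumed pairing, check the model card

article = "(a long news article to summarize)"
inputs = tokenizer(article, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

summary_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))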
Thanks for sharing this, Patrick.
I am trying to implement an encoder-decoder with BART, but I have no idea how to do so, and I need to fine-tune the decoder model, so eventually I need to train my decoder. I am trying to use the EncoderDecoder model in my script, but I don't know how to access the decoder model to train it. Instead of using that module, I initialized BartModel as the encoder, whereas for the decoder I used BartForConditionalGeneration. Here's how I initialized the models:
encoder = BartModel.from_pretrained('facebook/bart-base')
decoder = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
And here's how I am using it.
for epoch in range(epochs):
    # ------------------------ training ------------------------
    decoder.train()
    losses = 0
    times = 0
    print('\n' + '-' * 20 + f'epoch {epoch}' + '-' * 20)
    for batch in tqdm(train_dataloader):
        batch = [item.to(device) for item in batch]
        encoder_input, decoder_input, mask_encoder_input, mask_decoder_input = batch

        lhs, hs, att, _, _, _ = encoder(
            input_ids=encoder_input,
            attention_mask=mask_encoder_input,
            output_attentions=True,
            output_hidden_states=True,
        )
        past = (lhs, hs, att)

        logits, _, _, _ = decoder(
            input_ids=decoder_input,
            attention_mask=mask_decoder_input,
            encoder_outputs=past,
        )

        out = logits[:, :-1].contiguous()
        target = decoder_input[:, 1:].contiguous()
        target_mask = mask_decoder_input[:, 1:].contiguous()
        loss = util.sequence_cross_entropy_with_logits(out, target, target_mask, average="token")
        loss.backward()

        losses += loss.item()
        times += 1
        update_count += 1

        if update_count % num_gradients_accumulation == num_gradients_accumulation - 1:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
I am calculating perplexity from the loss, and I am getting a perplexity score of 1000+, which is bad. I would like to know what my model is lacking, and whether I could use the EncoderDecoder module instead.
@AmbiTyga from what I know, BART is already an encoder-decoder model, with a BERT-like encoder and a GPT-like decoder. So you are encoding-decoding in the encoder and encoding-decoding in the decoder, which I don't think is a good idea. For the moment EncoderDecoderModel supports only BERT.
@iliemihai So can you show me how to use BART in a case like the one I have coded above?
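Since BART already is a full encoder-decoder, a minimal sketch of fine-tuning it directly (an illustration, not from this thread) would simply use BartForConditionalGeneration end to end instead of splitting the encoder and decoder:

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("the source text", return_tensors="pt")
targets = tokenizer("the target text", return_tensors="pt")

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=targets.input_ids,  # decoder inputs are derived from the labels internally
)
outputs.loss.backward()  # then optimizer.step() etc. as in the loop above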
@patrickvonplaten is Bert the only model that is supported as a decoder? I was hoping to train a universal model, so I wanted to use xlm-roberta (xlmr) as both encoder and decoder; is this possible given the current EncoderDecoder framework? I know bert has a multilingual checkpoint, but performance-wise an xlm-roberta model should be better. I noticed the notebook https://github.com/huggingface/transformers/blob/16e38940bd7d2345afc82df11706ee9b16aa9d28/model_cards/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16/README.md does roberta2roberta; is the same code applicable to xlm-roberta?
I tried following the same template with xlmr, but I noticed that the output is the same regardless of the input - the is_decoder flag is properly set to True in the decoder, but the issue persists.
Hey @spookypineapple - good question! Here is the PR that adds XLM-Roberta to the EncoderDecoder models: https://github.com/huggingface/transformers/pull/6878
It will not make it into 3.1.0, but it should be available on master in ~1-2 days.
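Once that lands on master, the initialization should look the same as for bert2bert (a sketch under that assumption):

from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base",  # encoder
    "xlm-roberta-base",  # decoder (cross-attention layers are added and randomly initialized)
)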
I'm pulling from master, so I should get at least the necessary code artifacts to get bert2bert to work. However, I'm seeing (for a bert2bert setup using bert-base-multilingual-cased) that the output of the decoder remains unchanged regardless of the input to the encoder; this behavior seems to persist with training... The code I'm using to initialize the EncoderDecoder model is as follows:
import torch
from transformers import (
    MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
    AdamW,
    get_linear_schedule_with_warmup,
    AutoConfig,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    EncoderDecoderModel,
)

model_type = 'bert'
model_name = config_name = tokenizer_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name,
    do_lower_case=False,
    cache_dir=None,
    force_download=False,
)
config = AutoConfig.from_pretrained(
    config_name,
    cache_dir=None,
    force_download=False,
)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    model_name,  # encoder
    model_name,  # decoder
    from_tf=bool(".ckpt" in model_name),
    config=config,
    cache_dir=None,
)

if model_type in ['bert']:
    tokenizer.bos_token = tokenizer.cls_token
    tokenizer.eos_token = tokenizer.sep_token

model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.tie_weights()
model.decoder.config.use_cache = False

input_str1 = "this is the first example"
input_str2 = "and heres another example for you"

input_encodings1 = tokenizer.encode_plus(
    input_str1,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
input_encodings2 = tokenizer.encode_plus(
    input_str2,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

gen1 = model.generate(
    input_ids=input_encodings1.input_ids,
    attention_mask=input_encodings1.attention_mask,
    max_length=25,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
gen2 = model.generate(
    input_ids=input_encodings2.input_ids,
    attention_mask=input_encodings2.attention_mask,
    max_length=25,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

dec1 = [tokenizer.decode(ids, skip_special_tokens=True) for ids in gen1]
dec2 = [tokenizer.decode(ids, skip_special_tokens=True) for ids in gen2]

print(dec1)
print(dec2)
# the outputs are identical even though the inputs are different
Hey @spookypineapple,
A couple of things regarding your code:
1) .from_encoder_decoder_pretrained() usually does not need a config. The way you use this function, with a config passed in, means that you are overwriting the encoder config, which is not recommended when loading an encoder-decoder model from two pretrained "bert-base-multilingual-cased" checkpoints. Also, from_tf will only apply to the encoder; you would additionally have to pass decoder_from_tf.
2) An encoder-decoder model initialized from two pretrained "bert-base-multilingual-cased" checkpoints needs to be fine-tuned before any meaningful results can be seen.
=> You might want to check these model cards of bert2bert, which explain how to fine-tune such an encoder-decoder model: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16
Hope this helps!
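Concretely, the simpler initialization described above looks roughly like this (a sketch, not a complete training script):

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased",  # encoder
    "bert-base-multilingual-cased",  # decoder
)

# No config override; just set the special token ids used for generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Fine-tune on a seq2seq dataset before expecting meaningful generate() output.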
It does help indeed! Thankyou @patrickvonplaten