How to train a custom seq2seq model with BertModel
Hi,
I would like to use a Chinese pretrained model based on BertModel, so I tried the Encoder-Decoder Model, but it seems the Encoder-Decoder Model is not meant for conditional text generation. BartModel seems to be the model I need, but I cannot load pretrained BertModel weights into BartModel.
By the way, could I fine-tune a BartModel for seq2seq with custom data?
Any suggestions? Thanks.
Hi @chenjunweii - thanks for your issue! I will take a deeper look at the EncoderDecoder framework at the end of this week and should add a google colab on how to fine-tune it.
Using a Bert-Bert model for a seq2seq task should work with the simpletransformers library; there is working code for it.
But there is one strange thing: the saved model loads the wrong weights.
Predicting the same string multiple times works correctly, but loading the model again each time generates a new result every time. @patrickvonplaten
Hi @flozi00,
could you add a code snippet here that reproduces this bug?
Of course, it should be reproducible using this code:
import logging

import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

train_data = [
    ["one", "1"],
    ["two", "2"],
]
train_df = pd.DataFrame(train_data, columns=["input_text", "target_text"])

eval_data = [
    ["three", "3"],
    ["four", "4"],
]
eval_df = pd.DataFrame(eval_data, columns=["input_text", "target_text"])

model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 10,
    "train_batch_size": 2,
    "num_train_epochs": 10,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "evaluate_generated_text": True,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "max_length": 15,
    "manual_seed": 4,
}

encoder_type = "roberta"

model = Seq2SeqModel(
    encoder_type,
    "roberta-base",
    "bert-base-cased",
    args=model_args,
    use_cuda=True,
)

model.train_model(train_df)
results = model.eval_model(eval_df)
print(model.predict(["five"]))

# Reload the saved model and predict again; this is where the wrong weights show up.
model1 = Seq2SeqModel(
    encoder_type,
    encoder_decoder_name="outputs",
    args=model_args,
    use_cuda=True,
)
print(model1.predict(["five"]))
It's the sample code from the simpletransformers documentation.
The dataset size doesn't matter.
https://github.com/ThilinaRajapakse/simpletransformers/blob/master/README.md#encoder-decoder
Hey @flozi00, I think #4680 fixes the error.
@chenjunweii - a Bert2Bert model using the EncoderDecoder framework should be the right approach here! You can use one Bert model as the encoder and another Bert model as the decoder. You will have to fine-tune the EncoderDecoder model a bit, but it should work fine!
You can load the model via:
from transformers import EncoderDecoderModel
model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased')  # initialize Bert2Bert
and train it on conditional text generation, providing the input_ids as context, the decoder_input_ids as the text to generate, and lm_labels as your shifted text to generate. Think of decoder_input_ids and lm_labels as your normal inputs for causal text generation and input_ids as the context to condition the model on. I will soon provide a notebook that makes this clearer.
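For concreteness, here is a minimal sketch of a single training step along these lines (an illustration only; recent transformers versions expose the class as EncoderDecoderModel and accept labels instead of the older lm_labels, building the shifted decoder inputs internally):

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

# These two ids let the model build the shifted decoder inputs itself.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("the source text used as context", return_tensors="pt")   # input_ids
targets = tokenizer("the target text to generate", return_tensors="pt")      # labels

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=targets.input_ids,  # shifted internally to form the decoder inputs
)
outputs.loss.backward()  # then step your optimizer as usual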
Thank you for working on this problem and thank you for 🤗!
It looks like it is finally possible to write seq2seq models in under 10 lines of code, yay!
But I still have some questions and concerns about the EncoderDecoder framework.
Documentation says that "Causal mask will also be used by default", but I did not find how to change it. E.g. what if I am training the model without teacher forcing (just generating words one by one during training), or if I am doing inference? I would suggest adding one more argument to forward that would make it clearer both when causal masking is used and how to enable/disable it. What do you think?
It also feels weird to use BERT as a decoder. BERT is a model that is a) non-autoregressive and b) pre-trained without cross-attention modules. It is also unclear at which point the cross-attention modules are created. It would be great, if possible, to add something like a TransformerDecoder model.
Hey @Guitaricet :-) ,
First, at the moment only Bert2Bert works with the encoder-decoder framework. Also, if you use Bert as a decoder you will always use a causal mask. At the moment I cannot think of an encoder-decoder in which the decoder does not use a causal mask, so I don't see a reason why one would want to disable it. Can you give me an example where the decoder should not have a causal mask?
Do you mean auto-regressive language generation by "generating words one by one"? Auto-regressive language modeling always requires a causal mask...
Regarding the cross-attention layers: when a model like Bert is used as a decoder, cross-attention layers are added if it is not already an encoder-decoder model (like Bart or T5), and in that case it does not make sense to use the encoder-decoder wrapper. The cross-attention layers are initialized with random weights and will have to be fine-tuned. I agree that this should be made clearer in the documentation!

I'm trying to build a Bert2Bert model using EncoderDecoder, but I have a couple of quick questions regarding the format of inputs and targets for the BERT decoder.
What exactly is a good way to format the conditional input and mask for the decoder? For example, if I want to feed the decoder [I, am] and make it output [I, am, happy], how exactly do I mask the input? Do I give the decoder [CLS, I, am, MASK, ..., MASK, SEP], where the number of MASKs is such that the total number of tokens is a fixed length (like 512)? Or do I just input [CLS, I, am, MASK, SEP, PAD, ..., PAD]?
Similarly, what should the decoder's output be? Should the first token (the "output" of CLS) be the token "I"?
Lastly, is there a website or resource that explains the input and output representations of the text given to the decoder in Bert2Bert? I don't think the authors of the paper have released their code yet.
Thanks!
I will soon release a bert2bert notebook that will show how to do this. You can also take a look at this:
https://github.com/huggingface/transformers/issues/4647
Maybe it helps.
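Until that notebook is out, a minimal sketch of the usual Bert2Bert target preparation (an illustration, not an official recipe): the decoder input is simply the target text as [CLS] ... [SEP] followed by [PAD] tokens, with no [MASK] tokens, and padded positions are excluded from the loss:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Target "I am happy" becomes [CLS] I am happy [SEP] [PAD] ... up to max_length.
target = tokenizer("I am happy", padding="max_length", max_length=16, return_tensors="pt")

labels = target.input_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100  # padding positions are ignored by the loss

# decoder_input_ids are just target.input_ids; the causal (left-to-right) shift
# is handled inside the model, so no manual [MASK] filling is needed.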
Thank you @patrickvonplaten for the clarification.
It is very possible that both of these cases are rare, so the library may not need a causal_masking argument, but at least some clarification may be needed. This is the reason why I found this issue in the first place.
A dedicated TransformerDecoder class would be a much clearer way if you want to train a decoder from scratch. I also noticed that the config.is_decoder option is only documented in BertModel and not in the BertConfig class. Adding it would help a lot. (I only found it because I thought it was not documented at all and wanted to check my claim by searching for "is_decoder" in the source code.)
Again, thank you for your work, 🤗 is what the NLP community has needed for quite some time!
UPD: more reasons to use a different attention mask (not for seq2seq though): XLNet-like or ULM-like pre-training.
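On the is_decoder point, a small sketch of where it lives on the config side (an illustration; in recent transformers versions add_cross_attention=True is also needed to get cross-attention layers):

from transformers import BertConfig, BertLMHeadModel

config = BertConfig.from_pretrained(
    "bert-base-cased",
    is_decoder=True,           # enables the causal mask
    add_cross_attention=True,  # adds (randomly initialized) cross-attention layers
)
decoder = BertLMHeadModel.from_pretrained("bert-base-cased", config=config)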
Hi @patrickvonplaten ,
Thanks for the clarification on this topic and for the great work you've been doing on those seq2seq models.
Is this notebook you mentioned here already available?
Thanks.
Yeah, the code is ready in this PR: https://github.com/huggingface/transformers/tree/more_general_trainer_metric .
The script to train an Encoder-Decoder model can be accessed here: https://github.com/huggingface/transformers/blob/more_general_trainer_metric/src/transformers/bert_encoder_decoder_summary.py
And in order for the script to work, you need to use this Trainer class:
https://github.com/huggingface/transformers/blob/more_general_trainer_metric/src/transformers/trainer.py
I'm currently training the model myself. When the results are decent, I will publish a little notebook.
Hi @patrickvonplaten, thanks for sharing the scripts. However, the second link for training an encoder-decoder model is not found. Could you please upload this script? Thanks.
Sorry, I deleted the second link. You can see all the necessary code on this model page:
https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16#bert2bert-summarization-with-%F0%9F%A4%97-encoderdecoder-framework
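A minimal sketch of loading that checkpoint for summarization (an illustration; see the model card above for the exact tokenizer and generation settings to pair with it):

from transformers import BertTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # assumed pairing, check the model card

article = "(a long news article to summarize)"
inputs = tokenizer(article, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

summary_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))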
Thanks for sharing this, Patrick.
I am trying to implement an encoder-decoder with BART, but I have no idea how to do so, and I need to fine-tune the decoder model, so eventually I need to train my decoder. I am trying to use the EncoderDecoder model in my script, but I don't know how to access the decoder model to train it. Instead of using that module, I initialized BartModel as the encoder, whereas for the decoder I used BartForConditionalGeneration. Here's how I initialized the models:
encoder = BartModel.from_pretrained('facebook/bart-base')
decoder = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
And here's how I am using it.
for epoch in range(epochs):
    # ------------------------ training ------------------------
    decoder.train()
    losses = 0
    times = 0
    print('\n' + '-' * 20 + f'epoch {epoch}' + '-' * 20)
    for batch in tqdm(train_dataloader):
        batch = [item.to(device) for item in batch]
        encoder_input, decoder_input, mask_encoder_input, mask_decoder_input = batch

        lhs, hs, att, _, _, _ = encoder(
            input_ids=encoder_input,
            attention_mask=mask_encoder_input,
            output_attentions=True,
            output_hidden_states=True,
        )
        past = (lhs, hs, att)

        logits, _, _, _ = decoder(
            input_ids=decoder_input,
            attention_mask=mask_decoder_input,
            encoder_outputs=past,
        )

        out = logits[:, :-1].contiguous()
        target = decoder_input[:, 1:].contiguous()
        target_mask = mask_decoder_input[:, 1:].contiguous()
        loss = util.sequence_cross_entropy_with_logits(out, target, target_mask, average="token")
        loss.backward()

        losses += loss.item()
        times += 1
        update_count += 1

        if update_count % num_gradients_accumulation == num_gradients_accumulation - 1:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
I am calculating perplexity from the loss, and I am getting a perplexity score of 1000+, which is bad. I would like to know what my model is lacking, and whether I could use the EncoderDecoder module instead.
@AmbiTyga from what I know, BART is already an encoder-decoder model, with a BERT-like encoder and a GPT-like decoder. So you are encoding-decoding in the encoder and encoding-decoding in the decoder, which I don't think is a good idea. For the moment EncoderDecoderModel supports only BERT.
@iliemihai So can you show me how to use BART in a case like the one I have coded above?
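Since BART already is a full encoder-decoder, a minimal sketch of fine-tuning it directly (an illustration, not from this thread) would simply use BartForConditionalGeneration end to end instead of splitting the encoder and decoder:

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("the source text", return_tensors="pt")
targets = tokenizer("the target text", return_tensors="pt")

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=targets.input_ids,  # decoder inputs are derived from the labels internally
)
outputs.loss.backward()  # then optimizer.step() etc. as in the loop above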
@patrickvonplaten is Bert the only model that is supported as a decoder? I was hoping to train a universal model, so I wanted to use xlm-roberta (xlmr) as both encoder and decoder; is this possible given the current EncoderDecoder framework? I know bert has a multilingual checkpoint, but performance-wise an xlm-roberta model should be better. I noticed the notebook https://github.com/huggingface/transformers/blob/16e38940bd7d2345afc82df11706ee9b16aa9d28/model_cards/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16/README.md does roberta2roberta; is the same code applicable to xlm-roberta?
I tried following the same template with xlmr, but I noticed that the output is the same regardless of the input - the is_decoder flag is properly set to True in the decoder, but the issue persists.
Hey @spookypineapple - good question! Here is the PR that adds XLM-Roberta to the EncoderDecoder models: https://github.com/huggingface/transformers/pull/6878
It will not make it into 3.1.0, but it should be available on master in ~1-2 days.
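Once that lands on master, the initialization should look the same as for bert2bert (a sketch under that assumption):

from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base",  # encoder
    "xlm-roberta-base",  # decoder (cross-attention layers are added and randomly initialized)
)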
I'm pulling from master, so I should get at least the necessary code artifacts to get bert2bert to work. However, I'm seeing (for a bert2bert setup using bert-base-multilingual-cased) that the output of the decoder remains unchanged regardless of the input to the encoder; this behavior seems to persist with training... The code I'm using to initialize the EncoderDecoder model is as follows:
import torch
from transformers import (
    MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
    AdamW,
    get_linear_schedule_with_warmup,
    AutoConfig,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    EncoderDecoderModel,
)

model_type = 'bert'
model_name = config_name = tokenizer_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name,
    do_lower_case=False,
    cache_dir=None,
    force_download=False,
)
config = AutoConfig.from_pretrained(
    config_name,
    cache_dir=None,
    force_download=False,
)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    model_name,  # encoder
    model_name,  # decoder
    from_tf=bool(".ckpt" in model_name),
    config=config,
    cache_dir=None,
)

if model_type in ['bert']:
    tokenizer.bos_token = tokenizer.cls_token
    tokenizer.eos_token = tokenizer.sep_token

model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.tie_weights()
model.decoder.config.use_cache = False

input_str1 = "this is the first example"
input_str2 = "and heres another example for you"

input_encodings1 = tokenizer.encode_plus(
    input_str1,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
input_encodings2 = tokenizer.encode_plus(
    input_str2,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

gen1 = model.generate(
    input_ids=input_encodings1.input_ids,
    attention_mask=input_encodings1.attention_mask,
    max_length=25,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
gen2 = model.generate(
    input_ids=input_encodings2.input_ids,
    attention_mask=input_encodings2.attention_mask,
    max_length=25,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

dec1 = [tokenizer.decode(ids, skip_special_tokens=True) for ids in gen1]
dec2 = [tokenizer.decode(ids, skip_special_tokens=True) for ids in gen2]

print(dec1)
print(dec2)
# the outputs are identical even though the inputs are different
Hey @spookypineapple,
A couple of things regarding your code:
1) .from_encoder_decoder_pretrained() usually does not need a config. The way you use this function, with a config passed in, means that you are overwriting the encoder config, which is not recommended when loading an encoder-decoder model from two pretrained "bert-base-multilingual-cased" checkpoints. Also, from_tf will only apply to the encoder; you would additionally have to pass decoder_from_tf.
2) An encoder-decoder model initialized from two pretrained "bert-base-multilingual-cased" checkpoints needs to be fine-tuned before any meaningful results can be seen.
=> You might want to check these model cards of bert2bert, which explain how to fine-tune such an encoder-decoder model: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16
Hope this helps!
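Concretely, the simpler initialization described above looks roughly like this (a sketch, not a complete training script):

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased",  # encoder
    "bert-base-multilingual-cased",  # decoder
)

# No config override; just set the special token ids used for generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Fine-tune on a seq2seq dataset before expecting meaningful generate() output.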
It does help indeed! Thankyou @patrickvonplaten