Hi, everyone. I need help with the encoder-decoder model. I'm trying to train a model that generates a title for a short text.
I'm creating a basic encoder-decoder model with BERT:
from transformers import EncoderDecoderModel, BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased')
After training on my data, generation returns the same result regardless of the input when the model is in model.eval() mode. If I switch the model back to train mode, different results are generated.
Here is the code I use for training:
# Additional imports used below (pad_sequences comes from Keras)
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from transformers import AdamW
from IPython.display import clear_output
import matplotlib.pyplot as plt

device = torch.device("cuda")

# train_sentences / train_gt, max_len_abstract / max_len_title and model_weigth
# are defined elsewhere in the notebook
tokenized_texts = [tokenizer.tokenize(sent) for sent in train_sentences]
tokenized_gt = [tokenizer.tokenize(sent) for sent in train_gt]

# Encoder inputs: convert tokens to ids and pad/truncate to max_len_abstract
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(
    input_ids,
    maxlen=max_len_abstract,
    dtype="long",
    truncating="post",
    padding="post"
)

# Decoder inputs: convert tokens to ids and pad/truncate to max_len_title
input_ids_decode = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_gt]
input_ids_decode = pad_sequences(
    input_ids_decode,
    maxlen=max_len_title,
    dtype="long",
    truncating="post",
    padding="post"
)

# Attention masks: 1.0 for real tokens, 0.0 for padding
attention_masks_encode = [[float(i > 0) for i in seq] for seq in input_ids]
attention_masks_decode = [[float(i > 0) for i in seq] for seq in input_ids_decode]

input_ids = torch.tensor(input_ids)
input_ids_decode = torch.tensor(input_ids_decode)
attention_masks_encode = torch.tensor(attention_masks_encode)
attention_masks_decode = torch.tensor(attention_masks_decode)

train_data = TensorDataset(input_ids, input_ids_decode, attention_masks_encode, attention_masks_decode)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=4)

model.cuda()

# AdamW with weight decay on everything except bias and LayerNorm parameters
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)

model.train()
train_loss_set = []
train_loss = 0
for i in range(4):
    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_ids_de, b_attention_masks_encode, b_attention_masks_decode = batch
        optimizer.zero_grad()
        model.zero_grad()
        # Note: the attention masks prepared above are not passed to the model here
        loss, outputs = model(input_ids=b_input_ids, decoder_input_ids=b_input_ids_de, lm_labels=b_input_ids_de)[:2]
        train_loss_set.append(loss.item())
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        # Redraw the loss curve after every batch
        clear_output(True)
        plt.plot(train_loss_set)
        plt.title("Training loss")
        plt.xlabel("Batch")
        plt.ylabel("Loss")
        plt.show()
        if step != 0 and step % 20 == 0:
            torch.save(model.state_dict(), model_weigth)
    print(f'Epoch {i}')
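For reference, at evaluation time I call generation roughly like this (a minimal sketch; test_text and the generation arguments here are placeholders rather than my exact values):

model.eval()
# Encode one test abstract and move it to the GPU
test_ids = tokenizer.encode(test_text, return_tensors="pt").to(device)
with torch.no_grad():
    generated = model.generate(
        test_ids,
        decoder_start_token_id=tokenizer.cls_token_id,
        max_length=max_len_title,
    )
print(tokenizer.decode(generated[0], skip_special_tokens=True))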
Maybe I'm doing something wrong? I would be grateful for any advice.
I trained a BERT model from the pretrained models, and the output embeddings are all the same regardless of the input and attention mask during prediction. But when I set model.train(), the model gives different embeddings for different inputs. I'm quite confused, to be honest. I suppose that's the same problem?
Hi @Mantisus,
Multiple bugs were fixed in #4680. Can you please take a look at whether this error persists?
Hi, @patrickvonplaten
Yes, the latest update fixed the generation issue.
But I suspect that I am not training the model correctly.
As the decoder_input_ids and lm_labels parameters I supplied the same values: the text to be generated. But logic suggests that lm_labels should be the text shifted one token to the right, starting with a pad token.
I tried to train the model that way, but in that case the loss drops almost immediately to nearly 0 and the model does not learn.
I am somewhat confused about what format the training data should be organized in. I will be glad of any advice from you.
However, when training the model with decoder_input_ids == lm_labels, I get pretty good results even on a small dataset (12,500 examples), but I think they could be better.
Hi @Mantisus,
Doing decoder_input_ids = lm_labels is correct. Let's say you want to fine-tune a Bert2Bert model for summarization. Then you should do the following (untested example):
from transformers import EncoderDecoderModel, BertTokenizerFast
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
context = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
summary = "'Liana Barrientos has been married 10 times, sometimes within two weeks of each other. Prosecutors say the marriages were part of an immigration scam. She is believed to still be married to four men, and at one time, she was married to eight men at once. Her eighth husband was deported in 2006 to his native Pakistan."
input_ids = tokenizer.encode(context, return_tensors="pt")
decoder_input_ids = tokenizer.encode(summary, return_tensors="pt")
loss, *args = bert2bert(input_ids=input_ids, decoder_input_ids=decoder_input_ids, lm_labels=decoder_input_ids)
The reason that you don't have to shift the lm_labels is that Bert does that automatically for you here: https://github.com/huggingface/transformers/blob/0866669e751bef636fa693b704a28c1fea9a17f3/src/transformers/modeling_bert.py#L951
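In other words, inside the BERT LM head the loss is computed on shifted tensors, roughly equivalent to the following (a simplified sketch, not the actual library code; padding handling is omitted):

import torch.nn.functional as F

def shifted_lm_loss(prediction_scores, lm_labels, vocab_size):
    # The prediction at position t is scored against the token at position t+1,
    # so the caller does not need to shift the labels themselves.
    shift_logits = prediction_scores[:, :-1, :].contiguous()
    shift_labels = lm_labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, vocab_size), shift_labels.view(-1))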
BTW, the summary example was just taken from: https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb
If it's a longer training setup, the best way for us to check your code is for you to provide a Google Colab that we can copy and tweak ourselves :-)
Great, thanks for the example @patrickvonplaten
It is convenient that BERT takes care of everything.
The code that I use for training is not much different from the example I posted above. The only difference is that, since I use Google Colab for training, I wrapped the creation of the input tensors in generators to reduce RAM consumption on large datasets.
https://colab.research.google.com/drive/1uVP09ynQ1QUmSE2sjEysHjMfKgo4ssb7?usp=sharing
I am doing something similar to Mantisus, but fine-tuning on a large dataset and trying to do it in parallel. My code is actually quite similar to his Google Colab, but I am trying to wrap the model in torch.nn.DataParallel so that I can increase the batch size to 32 and use two GPUs. I can get the training to work, as far as I can tell, but since the generate function is only exposed on the underlying model, when I try to run generate I get blank tokens as output. I must be doing something wrong.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

if(multi_gpu):
    bert2bert_o = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "bert-base-cased")
    bert2bert = torch.nn.DataParallel(bert2bert_o)
else:
    bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "bert-base-cased")

# convert to GPU model
#bert2bert.to(device)
torch.cuda.set_device(0)
bert2bert.cuda(0)

# put in training mode
bert2bert.train()
Then the rest of the code essentially looks like the @Mantisus code from the Google Colab. How do I access generate properly, and does anybody know whether the same parameters pass all the way through to the underlying model (I would assume .train() and .eval() work)?
Here's the training block; I've adapted it to look like the @Mantisus code. The other goofy thing I don't understand is how to access the right loss, since the wrapped parallel model returns a squeezed loss tensor, so I've been doing this and I don't know if it's right:
loss, outputs = bert2bert(input_ids=input_ids_encode,
                          decoder_input_ids=input_ids_decode,
                          attention_mask=attention_mask_encode,
                          decoder_attention_mask=attention_mask_decode,
                          labels=labels)[:2]
if(multi_gpu):
    loss = loss[0]
And finally, here's the code that has been augmented to attempt to use generate by accessing the module attribute of the wrapped model, which I am not sure is working properly:
bert2bert.eval()
test_input = tokenizer.encode(["This is a test!"], return_tensors='pt')
with torch.no_grad():
    generated = bert2bert.module.generate(test_input,
                                          decoder_start_token_id=bert2bert.module.config.decoder.pad_token_id,
                                          do_sample=True,
                                          max_length=100,
                                          top_k=200,
                                          top_p=0.75,
                                          num_return_sequences=10)
Thank you. This is all really great stuff by the way.
Hey @HodorTheCoder,
Sorry for the late reply. I have been working on the encoder-decoder framework and verified
that it works, but only on single GPU training.
This model + model card shows how to train a Bert2Bert model and how it should be used:
https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16
Regarding your code, why do you do
bert2bert.module.generate(...)
instead of just doing
bert2bert.generate(...)
?
The encoder-decoder model inherits from PreTrainedModel and thus has direct access to generate(...); see here:
https://github.com/huggingface/transformers/blob/0b6c255a95368163d2b1d37635e5ce5bdd1b9423/src/transformers/modeling_encoder_decoder.py#L29
Also no need to wrap everything into the torch.no_grad() context -> generate() is always in no_grad mode.
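For a plain EncoderDecoderModel (not wrapped in DataParallel) that would look like this (a minimal sketch; the sampling arguments are omitted and the values are illustrative):

bert2bert.eval()
input_ids = tokenizer.encode("This is a test!", return_tensors="pt")
# generate() can be called directly on the model
generated = bert2bert.generate(
    input_ids,
    decoder_start_token_id=bert2bert.config.decoder.pad_token_id,
    max_length=100,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))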
Hope this helps! I will be off for the next two weeks - if it's urgent feel free to ping @sshleifer (hope it's fine to ping you here Sam ;-) )
Thank you so much for your work @patrickvonplaten
Yes, all pings are welcome. We also have the https://discuss.huggingface.co/ if you want some hyperparameter advice!
@patrickvonplaten
Thanks for your response! I successfully trained a bert2bert EncoderDecoderModel wrapped in torch.nn.DataParallel. I could only fit a batch size of 16 on a single Titan XP, but was able to train with a batch size of 32 using two of them.
You may well be right about generate propagating properly; I think when I tried that initially I wasn't training properly (I wasn't updating the loss and optimizer between batches, so there was zero convergence).
What I ended up doing was training, saving the module's state dict, and then reloading it on a single GPU for inference. Worked great.
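The save/reload pattern looks roughly like this (a sketch of my setup; the file name is a placeholder and case_selection is the checkpoint name used in the code below):

# after training, save the weights of the model wrapped inside DataParallel
torch.save(bert2bert.module.state_dict(), "bert2bert_weights.pt")

# later, rebuild a plain EncoderDecoderModel on a single GPU and load the weights for inference
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(case_selection, case_selection)
bert2bert.load_state_dict(torch.load("bert2bert_weights.pt"))
bert2bert.cuda(0)
bert2bert.eval()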
Coming from mainly TensorFlow, it took me a while to understand how to get torch to do what I wanted, but the Hugging Face documentation has been great, and hopefully anybody searching will come across these posts.
ALSO:
To anybody else to whom it was not immediately obvious when converting to a parallel model: you have to mean() the loss, or it won't take the losses from both GPUs into account when calculating the gradients and optimizer step. So, in my previous example, I erroneously had loss[0], which isn't right. I changed it to the following training block that properly uses the loss. It is set up with a flag (multi_gpu) that I pass as input when I want to train on two GPUs instead of one. Below is an abstracted code block.
FYI: I definitely get better results training with batch size 32 as opposed to 16. Couldn't fit batch size 64 on the GPUs; might be time to upgrade to some Titan RTXs. Anybody got $5k?
tokenizer = BertTokenizer.from_pretrained(case_selection)

if(multi_gpu):
    bert2bert_o = EncoderDecoderModel.from_encoder_decoder_pretrained(case_selection, case_selection)
    bert2bert = torch.nn.DataParallel(bert2bert_o)
else:
    bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(case_selection, case_selection)

# set up adam optimizer
param_optimizer = list(bert2bert.named_parameters())
no_decay = ['bias', 'gamma', 'beta']

# separate decay groups
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]

# create optimizer object
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5)

num_epochs = 4
for epoch in range(num_epochs):
    start = datetime.datetime.now()
    batches = batch_generator(tokenizer, input_text, target_text, batch_size=batch_size)
    # enumerate over the batch yield function
    for step, batch in enumerate(batches):
        batch = tuple(t.to(device) for t in batch)
        input_ids_encode, attention_mask_encode, input_ids_decode, attention_mask_decode, labels = batch
        optimizer.zero_grad()
        bert2bert.zero_grad()
        loss, outputs = bert2bert(input_ids=input_ids_encode,
                                  decoder_input_ids=input_ids_decode,
                                  attention_mask=attention_mask_encode,
                                  decoder_attention_mask=attention_mask_decode,
                                  labels=labels)[:2]
        if(multi_gpu):
            # DataParallel returns one loss per GPU: average them before backprop
            train_loss_set.append(loss.mean().item())
            loss.mean().backward()
            display_loss = loss.mean().item()
        else:
            train_loss_set.append(loss.item())
            loss.backward()
            display_loss = loss.item()
        optimizer.step()