Transformers: Training a new language model with custom loss and input representation

Created on 28 Apr 2020 · 9 comments · Source: huggingface/transformers

โ“ Questions & Help

I'm following https://huggingface.co/blog/how-to-train, which gives an overview of training a new language model. However, the script the guide points to, run_language_modeling.py, abstracts away a lot of details, and it's not clear whether it's possible to train with a custom loss or input representation.

For example, what if I want to train using 3 sequences concatenated together instead of the 2 used in the original BERT paper? (e.g. [context, context, question], or [sent1, sent2, sent3] where the task is to predict whether sent1, sent2, sent3 are three consecutive sentences or not.)

Do I need to modify the source code to achieve this? Is there any documentation on how to modify the underlying model or loss functions?

wontfix

All 9 comments

We recently implemented a new Trainer, which should make it easy to change the training loop. We don't have example scripts showing how to override it yet. Here's the trainer file.

If I wanted to modify the way the loss is handled, what I would do is create a specific trainer that inherits from Trainer. I would then simply override the _training_step method to handle the loss/losses.
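
For instance, a subclass could look roughly like this. This is only a sketch: it assumes the _training_step(model, inputs, optimizer) signature on master at the time (check trainer.py for the exact one), skips fp16 and gradient accumulation, and uses a plain cross-entropy as a stand-in for whatever loss you actually want.

```python
from torch import nn
from transformers import Trainer


class CustomLossTrainer(Trainer):
    def _training_step(self, model, inputs, optimizer):
        # Simplified: no fp16/apex handling and no gradient accumulation scaling here.
        model.train()
        inputs = {k: v.to(self.args.device) for k, v in inputs.items()}

        # The key name depends on your data collator ("labels" or "masked_lm_labels").
        labels = inputs.pop("labels")

        # Without labels, the model returns its prediction scores as the first output.
        logits = model(**inputs)[0]

        # Swap in any criterion you need here.
        loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))

        loss.backward()
        return loss.item()
```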

In order to modify the input representation to build inputs made up of three sequences, the best would be for you to create a Dataset similar to TextDataset, which builds your inputs as you wish.
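
As a starting point, a three-sequence dataset could be sketched like this. It assumes one example per line with the three sequences separated by tabs (just one possible format) and a simple [CLS] ... [SEP] ... [SEP] ... [SEP] layout; adapt both to your data.

```python
import torch
from torch.utils.data import Dataset


class ThreeSequenceDataset(Dataset):
    """Reads lines of the form "seq1\tseq2\tseq3" and encodes each as one input."""

    def __init__(self, tokenizer, file_path, block_size=512):
        self.examples = []
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) != 3:
                    continue
                ids = [tokenizer.cls_token_id]
                for part in parts:
                    ids += tokenizer.encode(part, add_special_tokens=False)
                    ids.append(tokenizer.sep_token_id)
                self.examples.append(ids[:block_size])

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)
```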

You can then modify the run_language_modeling.py file to use your dataset. Let me know if I can help further!

Hi, this is super useful advice for getting me started! After looking at the files you pointed out, it seems like in order for me to implement the input representation and custom loss function, I need to modify transformers.modeling_bert.py.

I have 2 questions.

  1. If I implement my own local version of modeling_bert.py, how should I instantiate the BertForMaskedLM class? The way the example does it is with AutoModelWithLMHead.from_pretrained - this obscures how to actually instantiate a particular model class.

  2. For concatenating the 3 sequences in the input, how would I make sure a [SEP] token is inserted between each sequence? My line_by_line data file looks as follows:

sequence 1 \t sequence 2 \t sequence 3 \n
sequence 1 \t sequence 2 \t sequence 3 \n
sequence 1 \t sequence 2 \t sequence 3 \n
sequence 1 \t sequence 2 \t sequence 3 \n
...

I think my desired input looks like this:

[sequence 1's tokens] [sep] [sequence 2's tokens] [sep] [sequence 3's tokens]

and I'd like position embeddings to be applied to each of sequences 1, 2, and 3.

I don't think you would have to modify the modeling_bert.py file. You may be thinking that because models like BertForMaskedLM can compute the loss themselves if you give them labels.

However, those classes only compute the loss if you hand them the labels, and the loss function itself cannot be changed. I don't know what your specific use case is, but if you want to use a custom loss function you could retrieve the model's hidden states and compute your loss yourself, outside the model. The tensor containing the model's last hidden states is the first value in the model's output tuple when you don't pass labels.
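
As a small illustration (toy token ids, arbitrary checkpoint as a stand-in): with the 2.x-style tuple outputs, a head model like BertForMaskedLM called without labels returns its prediction scores as the first element, which you can then feed to any criterion you like.

```python
import torch
from torch import nn
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # stand-in checkpoint

input_ids = torch.tensor([[101, 7592, 103, 102]])  # toy ids incl. [CLS]=101, [MASK]=103, [SEP]=102
outputs = model(input_ids)                         # no labels -> no built-in loss in the tuple
logits = outputs[0]                                # shape: (batch, seq_len, vocab_size)

labels = torch.full_like(input_ids, -100)          # -100 = ignored positions
labels[0, 2] = 7592                                # supervise only the masked slot

criterion = nn.CrossEntropyLoss(ignore_index=-100) # replace with your own loss
loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
```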

Since the way you construct your dataset differs from the "traditional" way of handling sequences, I think you would have to build your own encoding method on top of encode, encode_plus or batch_encode_plus. Unfortunately, the logic we have for building inputs is very specific to the models' pre-training and sequence-classification setups, so it only works for two sequences.

We did discuss at one point handling more than 2 sequences when building inputs but never got to it.

Our models create position embeddings on their own, so unless you use a different type of positional embedding than the model's, you shouldn't have to do anything to handle those.
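
To make the two-sequence limitation concrete, here is a sketch (bert-base-uncased is used purely as a stand-in for your own tokenizer): encode_plus handles a pair, and the third sequence has to be appended by hand.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # stand-in tokenizer

# The built-in special-token logic covers at most a pair: [CLS] seq1 [SEP] seq2 [SEP]
pair = tokenizer.encode_plus("sequence 1", "sequence 2", add_special_tokens=True)

# Append the third sequence (and its separator) manually.
third = tokenizer.encode("sequence 3", add_special_tokens=False)
input_ids = pair["input_ids"] + third + [tokenizer.sep_token_id]
```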

Thanks for the pointers. I'll take a shot at implementing encode/encode_plus/batch_encode_plus. One question before I do so.

It seems like a lot of changes have been made in the 3 weeks since 2.8.0 came out, and they affect the files I want to work with. Specifically, I don't think trainer.py was even used by run_language_modeling.py 3 weeks ago.

Do you recommend moving forward with my project using the latest code changes on the master branch, or using the March 02 2020 snapshot (which I'm guessing is the 2.8.0 release snapshot)?

The files you referred me to were all on master. It seems like you can't run them unless transformers is installed from source (the pip-installed version isn't compatible). I'm a bit concerned about using master - I tried training a tokenizer on it and it seemed slower, which implies the latest changes may not have gone through thorough testing.

Hi, I've thought over your advice a bit more and I think there's an easier solution. Suppose the 3 sequences of my input have disjoint vocabularies (I think this is a decent assumption for my particular dataset/use case).

E.g. each line is (seq1, seq2, seq3), where seq1 is English, seq2 is French, and seq3 is Spanish.

Could I just train 3 different tokenizers and tell BertForMaskedLM that the total vocab size is the sum of the 3 tokenizers' vocab sizes?

I also realized there's a token_type_ids parameter in BertEmbeddings, and it's implemented for an arbitrary value of config.type_vocab_size. It seems like I can just set config.type_vocab_size=3 and pass in token_type_ids=[0, 0, ..., 1, 1, ..., 2, 2].
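
Something like this minimal sketch (the sizes are placeholders, and the model is built from a fresh config rather than a pretrained checkpoint, since a pretrained BERT only ships 2 token type embeddings):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(vocab_size=30000, type_vocab_size=3)  # one segment id per sub-sequence
model = BertForMaskedLM(config)                           # fresh model, random weights

len1, len2, len3 = 5, 4, 6  # lengths of the three encoded sequences (incl. special tokens)
token_type_ids = torch.tensor([[0] * len1 + [1] * len2 + [2] * len3])
input_ids = torch.randint(0, config.vocab_size, token_type_ids.shape)  # stand-in ids

outputs = model(input_ids, token_type_ids=token_type_ids)
```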

Does this seem reasonable?

Thanks so much for your help!

Indeed, there have been a lot of changes in the last few weeks! Since the last release, we've moved to a new Trainer class, which abstracts away most of the code that isn't especially important for users (fp16, multi-GPU, gradient accumulation, etc.). We've also completely moved to the tokenizers library, which uses Rust as a backend for tokenization. Until now the support for it was only okay; now it's fully supported.

You would use the previous snapshot only if you want a script that doesn't rely on any abstraction. The previous scripts were mostly standalone, which makes them a good learning experience for understanding every small detail of training. However, they may be lengthy to adapt to your particular use case. That's what we're trying to fix with the new Trainer, where re-implementing a few methods is simple.

I find it odd that training a tokenizer on master was slower for you. We do test thoroughly, and the new tokenizers have gone through a lot of trial and error and a lot of tests to get to where they are now.

I think training three different tokenizers would work, but keep in mind that this would require a lot of memory. The embeddings take up a big portion of your model's memory, so beware of the total vocabulary size.

You would also need to think about how you want to separate the three tokenizers, and especially how to make sure there is no overlap. Using a separate tokenizer for each sequence and then shifting the token indices would probably be the most robust approach (cc @n1t0).
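
A rough sketch of that shifting idea (tok_en, tok_fr and tok_es are placeholders for whatever three tokenizers you train; special tokens would still need a single shared set of ids):

```python
def encode_shifted(tok_en, tok_fr, tok_es, seq1, seq2, seq3):
    # Offset the second and third vocabularies so the three id ranges never overlap.
    off_fr = tok_en.vocab_size
    off_es = tok_en.vocab_size + tok_fr.vocab_size

    ids1 = tok_en.encode(seq1, add_special_tokens=False)
    ids2 = [i + off_fr for i in tok_fr.encode(seq2, add_special_tokens=False)]
    ids3 = [i + off_es for i in tok_es.encode(seq3, add_special_tokens=False)]

    # The model's vocab_size must then cover off_es + tok_es.vocab_size.
    return ids1 + ids2 + ids3
```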

Yup, that's the implementation I had in mind for separating the tokenizers!

As a new user, I think the new abstractions make calling the API and running the scripts a lot easier, but they obscure some of the underlying code - especially for someone who doesn't have experience with the abstraction libraries you are using. I think I will go with the previous snapshot and probably switch over to master if I get stuck.

Please disregard my comment about tokenizer training being slower; I did something wrong on my end.

Indeed, we plan to add examples showing how to use the Trainer for custom tasks down the road!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
