Transformers: resize_token_embeddings error for Transformer-XL

Created on 31 Mar 2020 · 14 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using: Transformer-XL

Language I am using the model on: English

The problem arises when using:

  • [ ] my own modified scripts: a fine-tuning script for TransfoXLLMHeadModel

To reproduce

The following code aims to add two new tokens, 'wug' and 'wugs', to the vocabulary. After adding them to the tokenizer, we call resize_token_embeddings on the model to update its input embeddings to the correct dimension for the new tokens.

import torch
from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')

tokenizer.add_tokens(['wug', 'wugs'])
model.resize_token_embeddings(len(tokenizer))

Running the above gives the following error:

Traceback (most recent call last):
  File "bug.py", line 9, in <module>
    model.resize_token_embeddings(len(tokenizer))
  File "/home/AD/rdsie/anaconda3/envs/lign251/lib/python3.7/site-packages/transformers/modeling_utils.py", line 198, in resize_token_embeddings
    model_embeds = base_model._resize_token_embeddings(new_num_tokens)
  File "/home/AD/rdsie/anaconda3/envs/lign251/lib/python3.7/site-packages/transformers/modeling_utils.py", line 213, in _resize_token_embeddings
    new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
  File "/home/AD/rdsie/anaconda3/envs/lign251/lib/python3.7/site-packages/transformers/modeling_utils.py", line 234, in _get_resized_embeddings
    old_num_tokens, old_embedding_dim = old_embeddings.weight.size()
  File "/home/AD/rdsie/anaconda3/envs/lign251/lib/python3.7/site-packages/torch/nn/modules/module.py", line 576, in __getattr__
    type(self).__name__, name))
AttributeError: 'AdaptiveEmbedding' object has no attribute 'weight'

It seems that the function resize_token_embeddings() does not currently account for the particulars of the input embeddings used by TransfoXLLMHeadModel.
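
For reference, a quick inspection (illustrative, reusing the model loaded above) shows why the generic resize logic breaks: the Transformer-XL input embedding is an AdaptiveEmbedding that holds one nn.Embedding per frequency cluster, so there is no single weight tensor to resize.

# Illustrative check, not part of the original report
emb = model.get_input_embeddings()
print(type(emb).__name__)        # AdaptiveEmbedding
print(hasattr(emb, 'weight'))    # False -> hence the AttributeError
for i, layer in enumerate(emb.emb_layers):
    print(i, tuple(layer.weight.shape))  # one embedding matrix per cluster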

Expected behavior

We expect resize_token_embeddings to update the embedding layers for the new vocabulary size, so that the model can be used correctly with the new tokens.

Thank you in advance

All 14 comments

Hi @vsieplus ,

This is a known bug and sadly we don't have a solution for it right now. TransfoXLLMHeadModel uses adaptive weight embeddings, which makes this function harder to implement. It should be implemented in the long run though - I will note it down. @thomwolf @LysandreJik

@patrickvonplaten Does the same problem apply to XLNet?

No, it should not. XLNet uses the standard nn.Embedding, so it should be fine.
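
For example, the analogous snippet should run without error for XLNet (untested sketch, mirroring the reproduction above):

from transformers import XLNetTokenizer, XLNetLMHeadModel

model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')

tokenizer.add_tokens(['wug', 'wugs'])
model.resize_token_embeddings(len(tokenizer))  # works: a single nn.Embedding is resized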

Hi, I faced the same issue and wrote some dirty code as a workaround in modeling_utils.py. The main idea is to just operate on the last embedding layer:

def _resize_token_embeddings(self, new_num_tokens):
    old_embeddings = self.get_input_embeddings()

    if type(self).__name__ == 'TransfoXLModel':
        # since the 'TransfoXLModel' has multiple embedding layers, the last layer is resized
        new_num_tokens_last = new_num_tokens
        for emb_layer in old_embeddings.emb_layers[:-1]:
            new_num_tokens_last -= emb_layer.weight.size(0)

        new_embeddings_last = self._get_resized_embeddings(old_embeddings.emb_layers[-1], new_num_tokens_last)
        new_embeddings = old_embeddings
        new_embeddings.emb_layers[-1] = new_embeddings_last
    else:
        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)

    self.set_input_embeddings(new_embeddings)
    return self.get_input_embeddings()

It works for me (at least I get no error). Can someone confirm that this makes sense? Maybe @patrickvonplaten ?
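
A quick sanity check (illustrative, reusing the model and tokenizer from the reproduction at the top, with the patched method above):

tokenizer.add_tokens(['wug', 'wugs'])
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings()
print(emb.emb_layers[-1].weight.shape)         # the last cluster now includes the added tokens
print(tokenizer.convert_tokens_to_ids('wug'))  # new token id sits at the end of the vocab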

Sorry for bothering again @patrickvonplaten, but this is important for me: can you or someone else comment on whether my "fix" above makes sense?
Thanks in advance!

This looks okay to me, though I think you can patch a custom _resize_token_embeddings(self, new_num_tokens) onto TransfoXLPreTrainedModel to avoid the type check (and leave the default behavior for the other models).

Actually adding such a method to TransfoXLPreTrainedModel would solve this issue AFAICT. Since you wrote it @RafaelWO, you should make a PR with it :-)

Thanks for your feedback @sgugger ! I will move the logic into the TransfoXLPreTrainedModel and make my first PR :)

Out of curiosity, why do you go with

for emb_layer in old_embeddings.emb_layers[:-1]:
    new_num_tokens_last -= emb_layer.weight.size(0)

Wouldn't just emb_layer = old_embeddings.emb_layers[-1] work out? Also, are wug and wugs often used? If they're syntax tokens, which are frequent, you might want to add them to the corresponding embedding group.

I think the for loop is to make sure new_num_tokens_last is accurate by subtracting the sizes of the other embedding layers.

I agree that ideally, the method written on TransfoXLPreTrainedModel should have an argument to decide which embedding layer to add the new tokens to (defaulting to the last one).
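
Concretely, with the wt103 defaults (cutoffs [20000, 40000, 200000] and a vocab size of 267735; numbers assumed here, not taken from this thread), the subtraction works out like this:

# Assumed per-cluster sizes: 20000, 20000, 160000, 67735
new_num_tokens = 267735 + 2          # original vocab plus 'wug' and 'wugs'
other_layers = [20000, 20000, 160000]
new_num_tokens_last = new_num_tokens - sum(other_layers)
print(new_num_tokens_last)           # 67737 -> the last cluster is resized to this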

Yes that's correct @sgugger, thanks for answering.

I understand the idea of the proposed parameter, but I wonder whether it makes sense: if you add the new token to, e.g., the first layer, you would also have to insert it at the same position in the tokenizer and shift all tokens after it.

@TevenLeScao

Also, are wug and wugs often used?

In my case I want to add a cls_token, which is not included in the pretrained tokenizer.

Ah, my bad, I misread the :-1 as -1:. I've looked again at the ProjectedAdaptiveLogSoftmax, and adding elsewhere should be fine if you update the cutoffs attribute to make sure it takes into account the changed embedding size.

Adding at the end is a good baseline; the only issue is that you're going to lose out on some of the benefits of the adaptive softmax as you're often going to have to access the bigger softmax layer whereas you usually want to have the frequent tokens (such as cls) on smaller ones.
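
For reference (not from this thread, attribute names as in modeling_transfo_xl and worth double-checking), the cluster boundaries live both in the config and on the adaptive softmax head, so both would need to stay consistent after a resize:

# Illustrative inspection of the cluster boundaries, wt103 defaults assumed
print(model.config.cutoffs)   # e.g. [20000, 40000, 200000]
print(model.crit.cutoffs)     # same boundaries plus the vocab size as the final one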

update the cutoffs attribute to make sure it takes into account the changed embedding size.

Adding at the end is a good baseline; the only issue is that you're going to lose out on some of the benefits of the adaptive softmax as you're often going to have to access the bigger softmax layer whereas you usually want to have the frequent tokens (such as cls) on smaller ones.

Yes and yes, that's true.

But as I mentioned above: if you add such a common token into the first, smaller layer and adjust the cutoffs (which would be the preferred way to do it), you have a conflict with the tokenizer, because there the new token is at the end of the vocabulary and not at position 20001 as in your model (default cutoffs [20000, 40000, 200000]).

Or am I missing something?

Yes, that is also going to be a problem, but it shouldn't be too hard to solve with a simple conversion function that shifts the other tokens. The cleanest way to do it would probably be to update the tokenizer yourself but I am not sure how easy that would be.
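
To illustrate the idea (a hypothetical sketch, not an existing helper): if a token were inserted at the start of the second cluster, every existing id at or above that position would shift up by one, and the tokenizer's vocab would have to be rebuilt accordingly.

INSERT_AT = 20000  # hypothetical insertion point, i.e. the first cutoff

def shift_token_id(old_id):
    # Map an id from the original vocab to the vocab with one token inserted at INSERT_AT
    return old_id if old_id < INSERT_AT else old_id + 1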

Thanks a lot @sgugger for answering here! As @sgugger mentioned, it'd be great if you could add a _resize_token_embeddings() function to TransfoXLPreTrainedModel.

The solution looks great to me @vsieplus :-)

You could make it a bit more compact, but that's a nitpick:

    embeddings = self.get_input_embeddings()
    new_num_tokens_last = new_num_tokens - sum(emb.weight.shape[0] for emb in embeddings.emb_layers[:-1])
    new_embeddings_last = self._get_resized_embeddings(embeddings.emb_layers[-1], new_num_tokens_last)
    embeddings.emb_layers[-1] = new_embeddings_last

    self.set_input_embeddings(embeddings)