Transformers: ❓ How to finetune `token_type_ids` of RoBERTa?

Created on 10 Sep 2019  ·  17 Comments  ·  Source: huggingface/transformers

❓ Questions & Help

RoBERTa model does not use token_type_ids.

However, it is mentioned in the documentation:

you will have to train it during finetuning

Indeed, I would like to train it during finetuning. I tried to load the model with:

model = RobertaModel.from_pretrained('roberta-base', type_vocab_size=2)

But I received the error:

RuntimeError: Error(s) in loading state_dict for RobertaModel:
size mismatch for roberta.embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([1, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).


So how can I create my RoBERTa model from the pretrained checkpoint, in order to finetune the use of token type IDs?

wontfix


All 17 comments

What I have done is:

import torch
from pytorch_transformers import RobertaModel

# Resize the token type embedding from 1 row to 2, copying the pretrained row into both.
model = RobertaModel.from_pretrained('roberta-base')
model.config.type_vocab_size = 2
single_emb = model.embeddings.token_type_embeddings
model.embeddings.token_type_embeddings = torch.nn.Embedding(2, single_emb.embedding_dim)
model.embeddings.token_type_embeddings.weight = torch.nn.Parameter(single_emb.weight.repeat([2, 1]))
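
A quick sanity check (with made-up token IDs, just for illustration) that the resized layer now accepts segment ID 1:

input_ids = torch.tensor([[0, 713, 16, 10, 1296, 2]])   # hypothetical token IDs
token_type_ids = torch.tensor([[0, 0, 0, 1, 1, 1]])
with torch.no_grad():
    outputs = model(input_ids, token_type_ids=token_type_ids)
print(outputs[0].size())   # expected: torch.Size([1, 6, 768])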

But it seems quite clumsy...

What is the 'official' way to go?

Just using it without doing anything special doesn't work?

model = RobertaModel.from_pretrained('roberta-base')
model(input_ids, token_type_ids=token_type_ids)

RoBERTa does not use segment IDs in pre-training.

As you mentioned in #1114, we can use it like BERT, but we should pass only 0s (if token_type_ids contains a 1, it will throw an error).

I would like to fine-tune RoBERTa using a vocabulary of 2 for the token_type_ids (so the token_type_ids can contain 0 or 1).

Hopefully by doing this, RoBERTa can learn the difference between token_type_id = 0 and token_type_id = 1 after fine-tuning.
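
For concreteness, here is a hedged sketch of the BERT-style pattern I mean (the token IDs are made up, just to show the segment-ID layout):

# Sentence pair: segment 0 for the first sentence, segment 1 for the second.
input_ids      = [0, 31414, 232, 2, 2, 713, 16, 372, 2]   # hypothetical token IDs
token_type_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1]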

Did I misunderstand issue #1114 ?

Yes, just feed token_type_ids during finetuning.

The embeddings for 2 token type ids are there, they are just not trained.

Nothing special to do to activate them.

@thomwolf

I'm sorry, I still don't get it, and I still think we need to modify the model after loading the pretrained checkpoint...

Can you try this code and see if you get the same output?

from pytorch_transformers import RobertaTokenizer, RobertaModel
import torch

model = RobertaModel.from_pretrained('roberta-base')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
print("Config show size of {}\n".format(model.config.type_vocab_size))

src = torch.tensor([tokenizer.encode("<s> My name is Roberta. </s>")])
segs = torch.zeros_like(src)
print("Using segment ids : {}".format(segs))
outputs = model(src, token_type_ids=segs)
print("Output = {}\n".format(outputs[0].size()))

segs[:, 4:] = torch.tensor([1, 1, 1, 1])
print("Using segment ids : {}".format(segs))
outputs = model(src, token_type_ids=segs)
print("Output = {}".format(outputs[0].size()))

My output shows:

Config show size of 1
Using segment ids : tensor([[0, 0, 0, 0, 0, 0, 0, 0]])
Output = torch.Size([1, 8, 768])
Using segment ids : tensor([[0, 0, 0, 0, 1, 1, 1, 1]])

RuntimeError Traceback (most recent call last)
in ()
14 segs[:, 4:] = torch.tensor([1, 1, 1, 1])
15 print("Using segment ids : {}".format(segs))
---> 16 outputs = model(src, token_type_ids=segs)
17 print("Output = {}".format(outputs[0].size()))
8 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1504 # remove once script supports set_grad_enabled
1505 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1506 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1507
1508
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:193


Which in my opinion makes sense, because:

print(model.embeddings.token_type_embeddings.weight.size())

shows:

torch.Size([1, 768])

(And we need [2, 768] if we want to use 2 types of segment IDs)
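
For comparison, BERT's checkpoint does ship a 2-row token type embedding; a quick, hedged check, assuming bert-base-uncased is available:

from pytorch_transformers import BertModel

bert = BertModel.from_pretrained('bert-base-uncased')
print(bert.config.type_vocab_size)                          # 2
print(bert.embeddings.token_type_embeddings.weight.size())  # torch.Size([2, 768])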

Yes. The problem is that in the RoBERTa model's config file, type_vocab_size = 1, while for BERT it's 2. That causes the problem. I'm trying to set it manually to 2 to see what happens.

You are right, we've dived deeply into this issue with @LysandreJik and unfortunately there is no solution that would, at the same time, keep full backward compatibility for people who have been using RoBERTa up to now and allow training and fine-tuning token type embeddings for RoBERTa.

So, unfortunately, it won't be possible to fine-tune token type embeddings with RoBERTa.
We'll remove the pointers to this possibility in the doc and docstring.

I simply set all token_type_ids to 0 and I can finetune on SQuAD 2.0. I achieve an 86.8 F1 score, which looks reasonable, though still worse than the reported 89.4 F1 score.
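
In code, that amounts to something like the following hedged sketch (a toy input; the real SQuAD pipeline builds input_ids from question/context pairs):

import torch
from pytorch_transformers import RobertaModel, RobertaTokenizer

model = RobertaModel.from_pretrained('roberta-base')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

input_ids = torch.tensor([tokenizer.encode("<s> A toy input. </s>")])
# Stock RoBERTa has a single token type embedding row, so every segment ID
# must be 0 (equivalently, the token_type_ids argument can be omitted).
token_type_ids = torch.zeros_like(input_ids)
outputs = model(input_ids, token_type_ids=token_type_ids)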

Thanks for the investigation @thomwolf !

It makes sense to not allow finetuning token type embeddings with RoBERTa (because of pretraining).

However, it's still possible to load the pretrained model and manually modify it to allow finetuning, right ?

If so, maybe we can add an example of how to do such a thing. My code for this is:

import torch.nn as nn
from pytorch_transformers import RobertaModel

# Load pretrained model
model = RobertaModel.from_pretrained('roberta-base')

# Update config to finetune token type embeddings
model.config.type_vocab_size = 2

# Create a new embeddings layer, with 2 possible segment IDs instead of 1
model.embeddings.token_type_embeddings = nn.Embedding(2, model.config.hidden_size)

# Initialize it
model.embeddings.token_type_embeddings.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
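
A hedged usage sketch of the model modified as above, with a made-up sentence pair whose second half gets segment ID 1:

import torch
from pytorch_transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
src = torch.tensor([tokenizer.encode("<s> First part. </s> Second part. </s>")])
segs = torch.zeros_like(src)
segs[:, src.size(1) // 2:] = 1          # second half of the pair gets segment ID 1
outputs = model(src, token_type_ids=segs)
print(outputs[0].size())                # runs without the index-out-of-range error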

_It seems to work, but I would like some feedback in case I missed something :)_

@tbright17
By setting all token type IDs to 0, you're not actually using them. That's fine, because RoBERTa does not use them anyway, but people might need them for some downstream tasks. This issue is about that case :)

@Colanim
I think Thomas's fix is okay. If you need token_type_ids for some tasks, you can always add new arguments to the forward method. There is no need to use token_type_ids as an argument for the RobertaModel class.

I think there is some confusion here ^^

As I understood, Thomas didn't fix anything. The current API of RoBERTa already handles token_type_ids in the forward method, but to use them you need to set all token_type_ids to 0 (as you mentioned).

It makes sense (see pretraining of RoBERTa) and should not be changed, as Thomas mentioned. Only documentation may need to be updated.


But I opened this issue because, for my task, I need to use 2 types of token_type_ids (0 and 1). I was asking how to do this with the current API, what I need to modify, etc.

Okay I see. Sorry for the confusion.

@Colanim
Thanks for raising this issue. I ran into it recently too: I tried to use the token type IDs created by RobertaTokenizer.create_token_type_ids_from_sequences(), but when I used them as the model's input, I got an index out of range error.

I like the way you manually fixed the token type embedding layer. Do you by any chance have a comparison of performance with and without the adjustment you made? And if so, what was the downstream task you were using RoBERTa for? I am curious, as I would like to do relationship classification for two sequence inputs.

@wise-east
Sorry, I didn't compare with and without. I used RoBERTa for text summarization, and I think it has only a small impact on performance (for my task).

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Same problem. It seems that RoBERTa has no NSP task, so segment IDs have no meaning for RoBERTa.

