Transformers: embeddings after fine tuning

Created on 25 Mar 2019 · 19 comments · Source: huggingface/transformers

Hi,
I have fine-tuned 'bert-base-uncased' using the run_lm_finetuning.py script on my domain-specific text corpus.

I got a pytorch_model.bin file after fine-tuning.

Now, how do I load that model and get embeddings?

And if I can do that, do the embeddings change because I have fine-tuned?

Can someone please guide me on this?

Discussion wontfix

Most helpful comment

I'm assuming that @harmanpreet93 means 50k steps.

All 19 comments

I used this code and it worked.

 import torch
 from pytorch_pretrained_bert import BertModel

 output_model_file = '/path/to/pytorch_model.bin'
 model_state_dict = torch.load(output_model_file)
 model = BertModel.from_pretrained('bert-base-uncased', state_dict=model_state_dict)
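
For anyone wondering how to actually get embeddings out of the loaded model, here is a minimal sketch using the pytorch-pretrained-bert API of that time. The sentence is a placeholder and `model` is the fine-tuned BertModel loaded above. Since fine-tuning updates the weights, these embeddings will generally differ from the stock bert-base-uncased ones.

 import torch
 from pytorch_pretrained_bert import BertTokenizer

 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
 model.eval()

 tokens = ['[CLS]'] + tokenizer.tokenize('example domain sentence') + ['[SEP]']
 input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

 with torch.no_grad():
     # The old API returns all encoder layers by default, plus the pooled [CLS] output
     encoded_layers, pooled_output = model(input_ids)

 cls_embedding = encoded_layers[-1][:, 0, :]  # [CLS] vector from the final hidden layer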

Hi @KavyaGujjala

I was fine-tuning the 'BERT base uncased' model, as per the Google BERT repository, on my domain-specific data. I used the existing WordPiece vocab and ran pre-training for 50000 steps on the in-domain text to learn the compositionality. Through manual evaluation I found that the updated embeddings had improved. But I really want to add my domain words to the vocab, as they are important for my downstream tasks.

Can you please tell me how you handled the domain vocab while fine-tuning? Did you add your domain words to the vocab file? If yes, did you append the words to the vocab file or just replace the [unusedXX] entries in it?

Did you see any improvements after fine-tuning your models? How did you evaluate the embedding quality for your domain, if not manually?

Hi @harmanpreet93

I haven't used the fine-tuning code from the actual Google BERT repo, but I did use its pretraining code. I used the fine-tuning code from the PyTorch repo, but the embeddings changed every time I loaded the model. So I am just pretraining on my domain-specific data.

My first question: are you doing fine-tuning or pretraining?

I have pretrained using the bert-base-uncased model for 10000 steps. After that I got the sentence representation using the [CLS] token from the final hidden layer and compared the cosine similarity between sentences. I didn't get good results, though. I don't know what I am missing. Maybe, as you have mentioned, I need to add words to the vocab file.
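
As a rough illustration of that comparison, here is a minimal sketch, assuming cls_a and cls_b are [CLS] vectors extracted from the final hidden layer as described above:

 import torch.nn.functional as F

 def cls_cosine_similarity(cls_a, cls_b):
     # cls_a, cls_b: [CLS] vectors of shape (hidden_dim,)
     return F.cosine_similarity(cls_a.unsqueeze(0), cls_b.unsqueeze(0)).item()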

Hi @KavyaGujjala

My first question: are you doing fine-tuning or pretraining?

Yes, I too used the pre-training code from the BERT repo. After pretraining on domain data for 50k steps, the embeddings got updated.

Compared the cosine similarity between sentences. I didn't get good results, though.

Q: How did you evaluate the quality of sentence embeddings? By not getting good results, do you mean that the similar sentences weren't scored at the top or the cosine scores weren't good?

In my case the similarity scores decreased, but the relevant sentences started showing up in the top-k results.

Maybe, as you have mentioned, I need to add words to the vocab file.

I'm referring to this issue from the BERT repo. It suggests the following approaches:

But if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.

Q: Did you try either of these approaches with the BERT repo? I tried approach (b) by adding my domain words to the vocab file and updating vocab_size as per the repo. I ran into the following error:
ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((33297, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.
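
(The mismatch occurs because the checkpoint still holds the original 30522-row embedding matrix. Below is a minimal sketch of the idea behind approach (b) on the PyTorch side, not the exact script the BERT repo describes; it assumes `model` is a loaded BertModel and grows the word-embedding matrix with randomly initialized rows for the new tokens.)

 import torch

 old_weights = model.embeddings.word_embeddings.weight.data  # shape (30522, 768)
 num_new_tokens = 33297 - old_weights.size(0)

 # Randomly initialize the extra rows, mirroring truncated_normal(stddev=0.02)
 new_rows = torch.empty(num_new_tokens, old_weights.size(1)).normal_(mean=0.0, std=0.02)

 resized = torch.nn.Embedding(33297, old_weights.size(1))
 resized.weight.data = torch.cat([old_weights, new_rows], dim=0)
 model.embeddings.word_embeddings = resized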

Q: How were the sentence embeddings calculated? By averaging the word embeddings or any other approach?

Hi @harmanpreet93

Yes, I too used the pre-training code from the BERT repo. After pretraining on domain data for 50k steps, the embeddings got updated.

What is the size of the dataset you used? My dataset has about 1 million domain-specific sentences.

Q: How did you evaluate the quality of sentence embeddings? By not getting good results, do you mean that the similar sentences weren't scored at the top or the cosine scores weren't good?

Yeah, the cosine scores aren't good enough. Even dissimilar sentences are getting better similarity scores than the similar ones. Does training for more steps solve this issue?

Q: Did you try either of these approaches with the BERT repo? I tried approach (b) by adding my domain words to the vocab file and updating vocab_size as per the repo. I ran into the following error:
ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((33297, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.

I haven't tried adding domain-specific words to the vocab, but are you using the pretrained model's bert_config.json file along with the domain-specific trained model checkpoint? If so, have you changed the vocab size? And did you do it the way jacobdevlin mentioned, i.e. writing a script to initialize random weights? I didn't really understand that part; what exactly are we supposed to do to initialize the weights?
Also, why didn't you try replacing the [unusedX] tokens?

Q: How were the sentence embeddings calculated? By averaging the word embeddings or any other approach?

I used the [CLS] token embedding from the final hidden layer output (as they mention in the paper, the [CLS] token can be used for sequence-level classification). I initially tried max pooling and mean pooling of the word embeddings but got really bad results (every cosine similarity score was above 0.7).
Which approach did you follow?
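
One possible reason mean pooling gives uniformly high scores is that padding tokens get averaged in. A minimal sketch of masked mean pooling, assuming token_embeddings of shape (batch, seq_len, hidden) and a 0/1 attention_mask of shape (batch, seq_len):

 def masked_mean_pool(token_embeddings, attention_mask):
     # Zero out padding positions, then average only over real tokens
     mask = attention_mask.unsqueeze(-1).float()
     summed = (token_embeddings * mask).sum(dim=1)
     counts = mask.sum(dim=1).clamp(min=1e-9)
     return summed / counts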

To add to this, in their paper they mention they get the best results by concatenating the last four layers.

#! In your model setup
# Indices of the layers to concatenate
self.bert_layers = [-1, -2, -3, -4]
self.bert = BertModel.from_pretrained('your-checkpoint.pth')

#! In your forward method
# The old API returns all encoder layers by default (output_all_encoded_layers=True)
all_bert_layers, _ = self.bert(bert_ids, attention_mask=bert_mask)
bert_concat = torch.cat([all_bert_layers[i] for i in self.bert_layers], dim=-1)

# If you use a mask:
## Pooling by also setting masked items to zero
bert_mask = bert_mask.float().unsqueeze(2)
## Multiply output with mask to only retain non-padding tokens
bert_pooled = torch.mul(bert_concat, bert_mask)

# First item [CLS] is the sentence representation.
# Use bert_concat instead of bert_pooled if you didn't use a mask
final_bert = bert_pooled[:, 0, :]

@BramVanroy Thanks for the info. I am using the original BERT repo's pretraining code and used the extract_features.py code to get the last four layers' output for all the tokens in a sequence. By concatenating the last four layers, do you mean adding all four layers' embeddings for each token?

@KavyaGujjala If you look at my code, you can see that I mean concatenating across the hidden-dim axis. So let's say the output of a single BERT layer is batch_size x seq_len x hidden_dim; then concatenating the last four ends up with batch_size x seq_len x (hidden_dim * 4).
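
A quick shape check of that concatenation with dummy tensors (purely illustrative, with small placeholder dimensions):

 import torch

 batch_size, seq_len, hidden_dim = 2, 8, 768
 # Stand-ins for the last four encoder layers
 layers = [torch.randn(batch_size, seq_len, hidden_dim) for _ in range(4)]

 concat = torch.cat(layers, dim=-1)
 print(concat.shape)  # torch.Size([2, 8, 3072]), i.e. hidden_dim * 4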

Hi @harmanpreet93 ,

You ran 50k epochs for fine tuning?

Thanks
Mahesh

@search4mahesh Yes, I ran 50k epochs for fine-tuning!

@harmanpreet93 that must be a lot of compute time; the example only mentions about 3 epochs.

I'm assuming that @harmanpreet93 means 50k steps.

You are right @BramVanroy I meant 50k steps.

@harmanpreet93 Have you solved the problem of "ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((33297, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader."?

Hi @Firmiana1220

I was following the solutions mentioned here in the BERT repo. It suggests the following two approaches:

But if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.

I was going for option (b), but I couldn't make any progress. Therefore, I decided to pre-train the model just on my domain data rather than leveraging the already pre-trained models. I'm still looking for a better approach.

@KavyaGujjala Thanks for the code sample. Is the masking in it redundant? Unless I'm misunderstanding, you mask out all the unused tokens, but then simply grab the 'CLS' token. The masking would be required if you use all the values, but seems redundant if you're just using the sentence representation.

Hi @snard6, I didn't use that masking code as such. For now I have a fine-tuned model for my domain-specific data and am using the [CLS] token for the representation, which gave comparatively better results.

I used the PyTorch BERT code to get a fine-tuned model and then the extract_features code to get the [CLS] token as the sentence representation.

Hi @harmanpreet93
I tried option (a) and it works. I replaced the [unusedX] entries in vocab.txt with words from my own domain and left the other entries unchanged. If you don't change the size of vocab.txt, it works. But if your domain words would make the vocab larger than 30522 entries, maybe you should try option (b).
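
For anyone trying the same thing, a rough sketch of that replacement; the file paths and the domain word list are placeholders, and the vocab size stays at 30522 because only [unusedX] lines are overwritten:

 domain_words = ['myword1', 'myword2']  # placeholder list of domain terms

 with open('vocab.txt', encoding='utf-8') as f:
     vocab = [line.rstrip('\n') for line in f]

 words = iter(domain_words)
 for i, token in enumerate(vocab):
     if token.startswith('[unused'):
         try:
             vocab[i] = next(words)
         except StopIteration:
             break

 with open('vocab_domain.txt', 'w', encoding='utf-8') as f:
     f.write('\n'.join(vocab) + '\n')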

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
