Transformers: embeddings after fine tuning

Created on 25 Mar 2019 · 19 comments · Source: huggingface/transformers

Hi,
I have fine-tuned 'bert-base-uncased' using the run_lm_finetuning.py script on my domain-specific text corpus.

I got a pytorch_model.bin file after fine-tuning.

Now, how do I load that model and get embeddings?

And if I can do that, do the embeddings change because I have fine-tuned?

Can someone please guide me on this?

Discussion wontfix

Most helpful comment

I'm assuming that @harmanpreet93 means 50k steps.

All 19 comments

I used this code and it worked.

 import torch
 from pytorch_pretrained_bert import BertModel

 output_model_file = '/path/to/pytorch_model.bin'
 model_state_dict = torch.load(output_model_file)
 model = BertModel.from_pretrained('bert-base-uncased', state_dict=model_state_dict)
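
For anyone wondering how to actually get embeddings out of the loaded model, here is a minimal sketch using the pytorch-pretrained-bert API of that time. The sentence is a placeholder and `model` is the fine-tuned BertModel loaded above. Since fine-tuning updates the weights, these embeddings will generally differ from the stock bert-base-uncased ones.

 import torch
 from pytorch_pretrained_bert import BertTokenizer

 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
 model.eval()

 tokens = ['[CLS]'] + tokenizer.tokenize('example domain sentence') + ['[SEP]']
 input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

 with torch.no_grad():
     # The old API returns all encoder layers by default, plus the pooled [CLS] output
     encoded_layers, pooled_output = model(input_ids)

 cls_embedding = encoded_layers[-1][:, 0, :]  # [CLS] vector from the final hidden layer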

Hi @KavyaGujjala

I was fine-tuning the 'BERT base uncased' model, as per the Google BERT repository, on my domain-specific data. I used the existing WordPiece vocab and ran pre-training for 50000 steps on the in-domain text to learn the compositionality. Through manual evaluation I found that the updated embeddings had improved. But I really want to add my domain words to the vocab, as they are important for my downstream tasks.

Can you please tell me how you handled the domain vocab while fine-tuning? Did you add your domain words to the vocab file? If yes, did you append the words to the vocab file or just replace the [unusedXX] entries in it?

Did you see any improvements after fine-tuning your models? How did you evaluate the embedding quality for your domain, if not manually?

Hi @harmanpreet93

I haven't used the fine-tuning code from the actual Google BERT repo, but I did use its pretraining code. I used the fine-tuning code from the PyTorch repo, but the embeddings changed every time I loaded the model. So I am just pretraining on my domain-specific data.

My first question: are you doing fine-tuning or pretraining?

I have pretrained using the bert-base-uncased model for 10000 steps. After that I got the sentence representation using the [CLS] token from the final hidden layer and compared the cosine similarity between sentences. I didn't get good results, though. I don't know what I am missing. Maybe, as you have mentioned, I need to add words to the vocab file.
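
As a rough illustration of that comparison, here is a minimal sketch, assuming cls_a and cls_b are [CLS] vectors extracted from the final hidden layer as described above:

 import torch.nn.functional as F

 def cls_cosine_similarity(cls_a, cls_b):
     # cls_a, cls_b: [CLS] vectors of shape (hidden_dim,)
     return F.cosine_similarity(cls_a.unsqueeze(0), cls_b.unsqueeze(0)).item()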

Hi @KavyaGujjala

My first question: are you doing fine-tuning or pretraining?

Yes, I too used the pre-training code from the BERT repo. After pretraining on domain data for 50k steps, the embeddings got updated.

Compared the cosine similarity between sentences. I didn't get good results, though.

Q: How did you evaluate the quality of sentence embeddings? By not getting good results, do you mean that the similar sentences weren't scored at the top or the cosine scores weren't good?

In my case the similarity scores decreased, but the relevant sentences started showing up in the top-k results.

Maybe, as you have mentioned, I need to add words to the vocab file.

I'm referring to this issue from the BERT repo. It suggests the following approaches:

But if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.

Q: Did you try either of these approaches with the BERT repo? I tried approach (b) by adding my domain words to the vocab file and updating vocab_size as per the repo. I ran into the following error:
ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((33297, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.
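
(The mismatch occurs because the checkpoint still holds the original 30522-row embedding matrix. Below is a minimal sketch of the idea behind approach (b) on the PyTorch side, not the exact script the BERT repo describes; it assumes `model` is a loaded BertModel and grows the word-embedding matrix with randomly initialized rows for the new tokens.)

 import torch

 old_weights = model.embeddings.word_embeddings.weight.data  # shape (30522, 768)
 num_new_tokens = 33297 - old_weights.size(0)

 # Randomly initialize the extra rows, mirroring truncated_normal(stddev=0.02)
 new_rows = torch.empty(num_new_tokens, old_weights.size(1)).normal_(mean=0.0, std=0.02)

 resized = torch.nn.Embedding(33297, old_weights.size(1))
 resized.weight.data = torch.cat([old_weights, new_rows], dim=0)
 model.embeddings.word_embeddings = resized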

Q: How were the sentence embeddings calculated? By averaging the word embeddings or any other approach?

Hi @harmanpreet93

Yes, I too used the pre-training code from the BERT repo. After pretraining on domain data for 50k steps, the embeddings got updated.

What is the size of the dataset you used? My dataset has about 1 million domain-specific sentences.

Q: How did you evaluate the quality of sentence embeddings? By not getting good results, do you mean that the similar sentences weren't scored at the top or the cosine scores weren't good?

Yeah, the cosine scores aren't good enough. Even dissimilar sentences are getting better similarity scores than the similar ones. Does training for more steps solve this issue?

Q: Did you try either of these approaches with the BERT repo? I tried approach (b) by adding my domain words to the vocab file and updating vocab_size as per the repo. I ran into the following error:
ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((33297, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.

I haven't tried adding domain-specific words to the vocab, but are you using the pretrained model's bert_config.json file along with the domain-specific trained model checkpoint? If so, have you changed the vocab size? And did you do it the way jacobdevlin mentioned, i.e. writing a script to initialize random weights? I didn't really understand that part; what exactly are we supposed to do to initialize the weights?
Also, why didn't you try replacing the [unusedX] tokens?

Q: How were the sentence embeddings calculated? By averaging the word embeddings or any other approach?

I used the [CLS] token embedding from the final hidden layer output (as they mention in the paper, the [CLS] token can be used for sequence-level classification). I initially tried max pooling and mean pooling of the word embeddings but got really bad results (every cosine similarity score was above 0.7).
Which approach did you follow?
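
One possible reason mean pooling gives uniformly high scores is that padding tokens get averaged in. A minimal sketch of masked mean pooling, assuming token_embeddings of shape (batch, seq_len, hidden) and a 0/1 attention_mask of shape (batch, seq_len):

 def masked_mean_pool(token_embeddings, attention_mask):
     # Zero out padding positions, then average only over real tokens
     mask = attention_mask.unsqueeze(-1).float()
     summed = (token_embeddings * mask).sum(dim=1)
     counts = mask.sum(dim=1).clamp(min=1e-9)
     return summed / counts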

To add to this, in their paper they mention they get the best results by concatenating the last four layers.

#! In your model setup
# Indices of the layers to concatenate
self.bert_layers = [-1, -2, -3, -4]
self.bert = BertModel.from_pretrained('your-checkpoint.pth')

#! In your forward method
# The old API returns all encoder layers by default (output_all_encoded_layers=True)
all_bert_layers, _ = self.bert(bert_ids, attention_mask=bert_mask)
bert_concat = torch.cat([all_bert_layers[i] for i in self.bert_layers], dim=-1)

# If you use a mask:
## Pooling by also setting masked items to zero
bert_mask = bert_mask.float().unsqueeze(2)
## Multiply output with mask to only retain non-padding tokens
bert_pooled = torch.mul(bert_concat, bert_mask)

# First item [CLS] is the sentence representation.
# Use bert_concat instead of bert_pooled if you didn't use a mask
final_bert = bert_pooled[:, 0, :]

@BramVanroy Thanks for the info. I am using the original BERT repo's pretraining code and used the extract_features.py code to get the last four layers' output for all the tokens in a sequence. By concatenating the last four layers, do you mean adding all four layers' embeddings for each token?

@KavyaGujjala If you look at my code, you can see that I mean concatenating across the hidden-dim axis. So let's say the output of a single BERT layer is batch_size x seq_len x hidden_dim; then concatenating the last four ends up with batch_size x seq_len x (hidden_dim * 4).
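
A quick shape check of that concatenation with dummy tensors (purely illustrative, with small placeholder dimensions):

 import torch

 batch_size, seq_len, hidden_dim = 2, 8, 768
 # Stand-ins for the last four encoder layers
 layers = [torch.randn(batch_size, seq_len, hidden_dim) for _ in range(4)]

 concat = torch.cat(layers, dim=-1)
 print(concat.shape)  # torch.Size([2, 8, 3072]), i.e. hidden_dim * 4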

Hi @harmanpreet93 ,

You ran 50k epochs for fine tuning?

Thanks
Mahesh

@search4mahesh Yes, I ran 50k epochs for fine-tuning!

@harmanpreet93 that must be a lot of compute time; the example only mentions about 3 epochs.

I'm assuming that @harmanpreet93 means 50k steps.

You are right @BramVanroy I meant 50k steps.

@harmanpreet93 Have you solved the problem of "ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((33297, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader."?

Hi @Firmiana1220

I was following the solutions mentioned here in the BERT repo. It suggests the following two approaches:

But if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.

I was going for option (b), but I couldn't make any progress. Therefore, I decided to pre-train the model just on my domain data rather than leveraging the already pre-trained models. I'm still looking for a better approach.

@KavyaGujjala Thanks for the code sample. Is the masking in it redundant? Unless I'm misunderstanding, you mask out all the unused tokens, but then simply grab the 'CLS' token. The masking would be required if you use all the values, but seems redundant if you're just using the sentence representation.

Hi @snard6, I didn't use that masking code as such. For now I have a fine-tuned model for my domain-specific data and am using the [CLS] token for the representation, which gave comparatively better results.

I used the PyTorch BERT code to get a fine-tuned model and then the extract_features code to get the [CLS] token as the sentence representation.

Hi @harmanpreet93
I tried option (a) and it works. I replaced the [unusedX] entries in vocab.txt with words from my own domain and left the other entries unchanged. If you don't change the size of vocab.txt, it works. But if your domain words would make the vocab larger than 30522 entries, maybe you should try option (b).
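
For anyone trying the same thing, a rough sketch of that replacement; the file paths and the domain word list are placeholders, and the vocab size stays at 30522 because only [unusedX] lines are overwritten:

 domain_words = ['myword1', 'myword2']  # placeholder list of domain terms

 with open('vocab.txt', encoding='utf-8') as f:
     vocab = [line.rstrip('\n') for line in f]

 words = iter(domain_words)
 for i, token in enumerate(vocab):
     if token.startswith('[unused'):
         try:
             vocab[i] = next(words)
         except StopIteration:
             break

 with open('vocab_domain.txt', 'w', encoding='utf-8') as f:
     f.write('\n'.join(vocab) + '\n')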

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
