Bert: Sentence embedding for STS task by fine-tuning bert

Created on 17 Dec 2018 · 67 Comments · Source: google-research/bert

I tried to fine-tune the BERT model as an embedding model, which maps sentences to a space where the cosine similarity between two sentence embedding vectors can be interpreted as the sentence similarity level. Specifically, the embedding vectors (corresponding to [CLS]) of every sentence pair are first normalized to unit L2 norm. Next, the dot product (i.e. cosine similarity) of each embedding pair is multiplied by 5, then fitted to the labels with an MSE loss. The learning rate was changed to 1e-5 to mitigate a convergence problem.
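The objective described above can be sketched in plain Python. This is only an illustration of the loss computation under stated assumptions: the function names and the toy 2-dimensional vectors are hypothetical, and a real setup would backpropagate this loss through BERT in a deep learning framework.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def scaled_cosine(emb_a, emb_b, scale=5.0):
    """Dot product of the unit-normalized embeddings, multiplied by `scale`."""
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    return scale * sum(x * y for x, y in zip(a, b))

def mse_loss(pairs, labels, scale=5.0):
    """Mean squared error between scaled cosine similarities and gold labels."""
    preds = [scaled_cosine(a, b, scale) for a, b in pairs]
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

# Toy 2-d stand-ins for [CLS] vectors (hypothetical values).
pairs = [([1.0, 0.0], [2.0, 0.0]), ([1.0, 0.0], [0.0, 3.0])]
labels = [5.0, 0.0]
loss = mse_loss(pairs, labels)
```

Note that a scaled cosine ranges over [-5, 5], while STS-B labels lie in [0, 5], so part of the output range is never targeted by the labels.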

After 20 epochs of fine-tuning, the resulting 12-layer model only gets a Pearson correlation coefficient of 0.66 on the validation set, far worse than a BERT model fine-tuned in the simple regression manner (12-layer, Pearson = 0.89).

My question is, is there any suggestion about fine-tuning BERT for sentence embeddings? Thanks!

| method | PPMCC (STS-B dev) |
|---|---|
|bert, no FT, cosine similarity between sentence embedding ([CLS]) | 0.29 |
|bert, FT, simple regression | 0.89 |
|bert, FT, cosine similarity between sentence embedding ([CLS]) | 0.66 |
|bert, no FT, cosine similarity between mean-pooled sequence embeddings (mean_pool([CLS], tok1, ..., [SEP])) | 0.59 |
| average word vector (spaCy, en_core_web_lg) | 0.54 |
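PPMCC in the table header stands for the Pearson product-moment correlation coefficient. A minimal sketch of how such a score can be computed from predicted and gold similarity values follows; the function name is an assumption for illustration, and in practice a library routine would normally be used.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predictions and gold labels.
preds = [0.1, 0.4, 0.8, 0.9]
gold = [0.5, 2.0, 4.0, 4.5]
score = pearson(preds, gold)
```

Because the coefficient is invariant to positive affine rescaling of either input, raw cosine similarities and cosine similarities multiplied by 5 give the same score.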

From here :

I quote from google-research/bert#164 (comment)

> And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).

Suggestion from my side, if you want to use cosine distance anyway, then please focus on the rank not the absolute value. Namely, do not use:

cosine(A, B) > 0.9 then A and B are similar

please consider the following instead:

if cosine(A, B) > cosine(A, C) then A is more similar to B than C.
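The rank-based comparison suggested above can be sketched as follows. The vectors and the helper names are hypothetical and only illustrate comparing similarities relative to each other rather than against a fixed threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to `query`."""
    return sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)

# Toy embeddings (hypothetical values): A is the query, B and C are candidates.
A = [1.0, 0.2]
candidates = {"B": [0.9, 0.3], "C": [0.1, 1.0]}
ranking = rank_by_similarity(A, candidates)  # B ranks ahead of C here
```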

You can't use cosine distance directly between the embeddings produced by BERT. You need to apply another layer between the distance and BERT.

So on top of BERT, you put your custom layer (I used LSTM, but a simple Feed Forward network might be enough depending on what you're trying to achieve), and use the output of this custom layer to compute the cosine distance.

@Colanim Thank you very much for this hint. BTW have you validated the resulting embeddings on STS-B validation set after adding a custom layer such as LSTM? How much is the pearson correlation coefficient?

I used a Siamese-type network, separating the sentence representation by BERT.

Basically, I input one sentence into BERT, get the feature vectors for this sentence, and feed these feature vectors into an LSTM. I do the same for the second sentence, then compare the outputs of the LSTMs with a distance (I used Manhattan distance, but cosine distance works as well).

My vanilla network is not as good as the end-to-end BERT: I could reach 0.78 (Pearson correlation). This is because in my model, BERT represents the sentences one by one, and not the sentence pair together.

After a few improvements (ensembling), I could reach 0.82, but that's all.

Why exactly are you using cosine similarity ? I believe this approach is not good compared to the end-to-end BERT.

0.78 looks better than my method.

I have a dataset of 1e6 sentences and I need to be able to calculate the similarity between any two samples. If I use the end-to-end BERT, I'll have to forward 1e12 sentence pairs, which takes hundreds of years. But if I calculate cosine similarity after extracting an embedding for every sentence, a great deal of time can be saved, because matrix multiplication is much faster than a BERT forward pass.
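A minimal sketch of the precompute-then-multiply idea: each sentence is embedded once, the rows are normalized, and all pairwise cosine similarities come from one matrix product. The pure-Python helpers and toy vectors are assumptions for illustration only.

```python
import math

def normalize_rows(matrix):
    """Scale every row to unit L2 norm (rows are assumed to be non-zero)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row])
    return out

def cosine_matrix(embeddings):
    """All pairwise cosine similarities, i.e. U times U-transposed for unit rows U."""
    unit = normalize_rows(embeddings)
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

# Three toy sentence embeddings (hypothetical values), each computed only once.
embeddings = [[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]]
sims = cosine_matrix(embeddings)
```

With an array library, the same computation is a single matrix product of the normalized embedding matrix with its transpose, which is what makes it cheap compared with one BERT forward pass per pair.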

This comes from the fact that you apply your distance metric directly to the BERT embeddings. You need an additional layer.

Do you actually have labels for the 1e12 sentence pairs?
You're right: if you extract an embedding for every sentence and use it later to compare two sentences, you will save time. However, you will also lose accuracy, because BERT will treat each sentence independently and no longer as a pair (so the interaction between the two sentences is lost).

Just out of curiosity, how long does it take to extract 1e6 sentence embeddings from BERT? What about memory?

> Do you actually have labels for the 1e12 sentence pairs?

I don't have labels; what I need to do is predict similarity values for the 1e12 pairs.

> Just out of curiosity, how long does it take to extract 1e6 sentence embeddings from BERT? What about memory?

Less than 8 hours, using a single Titan X (Pascal). Max sequence length was 32, IIRC.
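The "hundreds of years" estimate can be checked with back-of-the-envelope arithmetic from the figures quoted in this exchange (1e6 sentences encoded in under 8 hours). The sketch below assumes, optimistically, that a forward pass on a sentence pair costs the same as one on a single sentence; the function name and parameters are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365

def years_for_pairwise(num_sentences, hours_per_pass, ordered=True):
    """Estimate the years needed to score every pair with end-to-end BERT.

    `hours_per_pass` is the time to run all `num_sentences` through BERT once.
    Assumes a pair costs as much as a single sentence (likely an underestimate).
    """
    n = num_sentences
    num_pairs = n * n if ordered else n * (n - 1) // 2
    hours = hours_per_pass * (num_pairs / n)
    return hours / HOURS_PER_YEAR

years_all = years_for_pairwise(10**6, 8)                    # all 1e12 ordered pairs
years_unique = years_for_pairwise(10**6, 8, ordered=False)  # unordered pairs only
```

Under these assumptions both estimates come out at several hundred years on a single GPU, consistent with the claim above.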

You indeed have a lot of data ^^
Then I think you can try adding a custom layer after BERT. But expect the Pearson correlation to decrease.

fyi, starting from 1.6.1, bert-as-service supports serving fine-tuned model out of the box. See: https://github.com/hanxiao/bert-as-service#serving-a-fine-tuned-bert-model

To upgrade, please do:

pip install -U bert-serving-server bert-serving-client

I am doing something similar. I used BERT embeddings with max pooling and computed cosine similarity, and it works better than GloVe embeddings.

I wonder what your current approach to sentence similarity is. I have a similar concern, and I do not want to input two sentences to the BERT model at a time because that would dramatically increase the number of training samples.

Did you use the last state of LSTM output, or you did a pooling over all the state of LSTM output? Thanks!

Have you seen the table below the original post? I also tried ELMo (without FT, without adding network layers) and mean-pooled sentence vectors yield a pearson score of 0.66

No, I haven't seen the table in the original post. I just evaluated it on my end-to-end task. The overall accuracy increased after I switched to BERT embeddings.
So I assume that the CLS embedding doesn't represent the sentence? I am trying a CNN or an LSTM with pooling or the last state. I wonder what the default solution is for calculating sentence similarity using BERT embeddings.

> Did you use the last state of LSTM output, or you did a pooling over all the state of LSTM output?

I just used the last state of the LSTM


I tried several things, and for my problem the best solution was reduce-mean pooling over all the feature vectors output by BERT. I then fed this into an LSTM (I believe a simple feed-forward network would work equally well, if not better) and took the last state of the LSTM.

This last state of the LSTM is my sentence representation. I can then compare it to other sentence representations with a distance. So in the end I have a Siamese network.

_Note : CLS embedding is a better solution for sentence representation if you can fine tune your model._
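Reduce-mean pooling over the per-token feature vectors can be sketched as below. Skipping padding positions via a mask is an assumption added here for illustration (the comment above only says reduce-mean pooling), and the function name and toy values are hypothetical.

```python
def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by one padding position (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
```

The result is one fixed-size vector per sentence regardless of sentence length, which is what gets passed on to the next layer.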

Ummm...I am confused. After reduce-mean pooling, you get a batch_size * bert_dim size output? So for each sample, you feed a vector to LSTM? How is this reduce-mean done in your solution?

> After reduce-mean pooling, you get a batch_size * bert_dim size output?

Right.

> So for each sample, you feed a vector to LSTM?

Actually, my problem is about document representation. So I need to input several vectors to the LSTM. But it is working with only 1 vector (a document of only 1 sentence).

> How is this reduce-mean done in your solution?

I used BERT-as-service.

Oh I see. It seems I should try several approaches to find the best one. My problem is that my training set is really small. I also thought about multiplying the CLS vectors of the two sentences directly, but that seems to introduce many variables.
I'm still trying to find the best approach.

Keep us updated :)

Hi @Colanim, sorry to crash into your conversation, but I had this question about your statement below,

> After a few improvements (ensembling), I could reach 0.82, but that's all.

When you say ensembling, what exactly have you done? So is it like getting output from different models and doing a majority vote (or some similar heuristic), or is something fancier than that?

Hi,
Thank you all for sharing your results here. I am working on the same task, calculating similarity scores of sentences. I am wondering if someone has tried to fine-tune BERT using the approach of Conneau et al. [arXiv:1705.02364, figure 1]?

> When you say ensembling, what exactly have you done? So is it like getting output from different models and doing a majority vote (or some similar heuristic), or is something fancier than that?

I've done 2 things. First, ensembling as you said (get the output from different models and average the results). Second, I use more than one distance metric: instead of a single distance, I compute two (Manhattan distance and cosine distance) and use a feed-forward network to compute the final result from them.
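The two ideas in this reply can be sketched as follows: averaging the per-example outputs of several models, and computing both distances as a small feature vector. The feed-forward network that combines the two distances is not shown; all names and values are hypothetical, and this is not the commenter's actual code.

```python
import math

def manhattan(u, v):
    """L1 (Manhattan) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_features(u, v):
    """Both distances as a feature vector for a downstream feed-forward layer."""
    return [manhattan(u, v), cosine_distance(u, v)]

def ensemble_average(predictions_per_model):
    """Average per-example predictions across models."""
    n_models = len(predictions_per_model)
    return [sum(col) / n_models for col in zip(*predictions_per_model)]

features = distance_features([1.0, 0.0], [0.0, 1.0])
averaged = ensemble_average([[4.0, 1.0], [5.0, 2.0]])
```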

Hi, could you share your code? I have the same problem: I need to encode a set of sentences into embeddings and, given another sentence, find the top-n nearest sentences.

I'm sorry but I can't share the code.

Simply use Bert as service to encode your sentences. Then you have several choices for comparing the top sentences : Siamese network, k nearest neighbors
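For the nearest-neighbour option, a brute-force top-n lookup over precomputed sentence vectors might look like the sketch below. It assumes the vectors have already been produced by some encoder; the names and toy values are hypothetical, and for large corpora an approximate index would normally replace the linear scan.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n(query, corpus, n=2):
    """Return the indices of the `n` corpus vectors most similar to `query`."""
    scored = ((cosine(query, vec), idx) for idx, vec in enumerate(corpus))
    return [idx for _, idx in heapq.nlargest(n, scored)]

# Precomputed toy sentence vectors (hypothetical values).
corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = top_n([1.0, 0.05], corpus, n=2)
```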

Thank you all the same.
We tried using BERT to encode sentences directly, but it failed and only got 0.5 AUC on the evaluation dataset.

Hi,
Would anyone mind open-sourcing their code for a cosine-similarity approach to ranking the similarity of sentences? We've tried using "BERT as a Service" but we got an MRR score of 0.02 vs. the 10-20 that you'd expect.

The advantage (I think) of cosine similarity is that you can save the embedding for a sentence, instead of re-running a similarity algorithm again and again.

Hi @cdluminate, when you use the embedding vector for [CLS], are you using the vector from the last layer (shape: 1 * 768), or from all the layers (shape: 12 * 768)? Thanks.

@ymcdull If you are going to reproduce my result in https://github.com/google-research/bert/issues/276#issuecomment-447838902, use the vector from only the last layer.

Hi, we are also working on a similar task. Have you tried a Conv approach besides LSTM? If so, how were the results?

@cadobe

I didn't try Conv, only LSTM :/

@Colanim Thanks for your quick response! I just checked the "BERT as a service" code and am still a bit confused: for a single batch, reduce-mean outputs one vector, so how did you feed this vector into the LSTM to fetch the last hidden state?

As you said, the output of Bert as service gives one vector (for one sentence.)

My goal was to be able to represent a whole document, so at each timestep I input one sentence representation (the vector returned by Bert as service). So the LSTM is there to gather several sentence representations into a document representation.

Maybe LSTM is not the right way to go, and Conv is better, I don't know about that.

i c, thanks! For a single sentence embedding (not a document embedding), how about feeding all token embeddings from BERT's last layer sequentially into an LSTM/RNN (rather than mean pooling) and taking the last hidden state of the LSTM as the sentence embedding instead of the CLS embedding? One more question: can the BERT output embeddings represent every term/token's embedding?

> Here is also one question, can the BERT output embeddings represent all terms/tokens' embedding?

How do you obtain the sentence embedding with mean pooling from BERT if it doesn't produce per-token embeddings?

@cadobe

> i c, thanks! For one sentence embedding (not a document embedding), how about feeding all token embeddings of BERT last layer output sequentially as input of LSTM/RNN's hidden state input.(Not a mean pooling way), and take the last hidden state of the LSTM to treat as sentence embedding rather than the CLS embedding.

It's definitely possible. Keep in mind that you will have to train the RNN you use to create sentence embeddings. Mean pooling (or the other strategies offered by BERT as service) has the big advantage of needing no training, plus faster training/inference (RNNs are not parallelisable).

But anyway, the author of BERT specifically said that BERT was not meant to represent sentences, either with the CLS token or with the mean of the token embeddings. So maybe using an LSTM is a good way to get a good sentence representation; you have to try it yourself and compare. But if you want something that works fast and is easy to implement, I would advise against it ^^

Where did they say that? Was it on Twitter?

@Colanim sorry to interrupt. I am trying to use BERT for paraphrasing. I have two sentences (A and B) and have encoded each sentence using getting-elmo-like-contextual-word-embedding. After encoding each sentence separately, an LSTM layer is attached to the outputs of each sentence. Later on, I concatenate the two encoded sentences and add dense layers on top for classification (whether the sentences are related or not).

Issue: my network is not learning i.e. accuracy and loss are not improving.

Could you please suggest, what's wrong in this process?

Thanks :)

> Where did they say that? Was it on Twitter?

71

> i am trying to use BERT for paraphrasing. So, I have two sentences (A and B) and, have encoded each sentence using getting-elmo-like-contextual-word-embedding. After encoding each sentence separately, LSTM layer are attached at the outputs of both the sentences. Later on, I concatenate both the encoded sentences and add dense layers over it for classification (whether both sentences are related or not).

In your case, you have 2 sentences, which means their combined length will likely not exceed 512. So you should use BERT end to end; there is no need to build something complicated like an LSTM over BERT, which is not going to increase your score.

In my case I used an LSTM over BERT because I had no other choice: I was working with documents too big to fit into BERT's limit of 512.

Your task is very similar to the MNLI dataset. I would advise end-to-end BERT training.

Hi,
I'm also working on a similar task. I'm using the 'bert_multi_cased_L-12_H-768_A-12/1' model for Japanese texts. I followed the tensorflow_hub method and got both pooled_output and sequence_output for one sentence.
Do you have any idea how good this 'pooled_output' vector is as a sentence representation?
Otherwise, is it better to feed the feature vectors to an LSTM and compare similarity between sentences?

If you can use BERT end-to-end to classify your sentences, I believe it will always be better this way.

However, because of BERT's input size limit, it might be necessary to first represent each sentence through BERT, and then use an LSTM (or something else) to create a representation for long documents. In that case you don't have much choice anyway.

Hello @cdluminate

If you can share any code hints for this

bert, FT, simple regression | 0.89

that would be awesome!
My task is simple: sentence similarity, so less than 512 symbols. I assume I don't need an additional network in this case. Current solution: Google's Universal Sentence Encoder + kNN (hnswlib). It works reasonably well (I don't know how to measure accuracy), but could be better. I'm trying to understand how to perform FT and see some numbers. Thanks!

@realsergii My code for that experiment was simply based on https://github.com/Colanim/BERT_STS-B (IIRC). Fine-tuning is required to adapt BERT to the regression task on STS-B.

Hey, Thanks for the reply.

In case I want to feed the feature vectors to an LSTM layer and get sentence embeddings, what weights should I give the LSTM layer?
Without fixed weights, the output of the LSTM layer changes every time I run the code.

I tried the mean and also the max of the feature vectors; both gave bad results. (I am using the run_pretraining code to get domain-specific contextual embeddings, and that returns only word-level embeddings.)
Any idea how to do this?

> In case I want to use the feature vectors on LSTM layer and get sentence embeddings what weights can I give for LSTM layer?
> Because without weights output from LSTM layer will change everytime I run the code.

You have to train your model on a dataset; the network will learn the weights by itself. And indeed, the weights are going to be different with each training run.

Yeah, sounds correct.

How do I get the [CLS] token from bert model at code level?
Any idea on how to do this?

> How do I get the [CLS] token from bert model at code level?

This will give you the last layer of the transformer, just take the first token.
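Taking "the first token" of the last layer can be illustrated with nested lists. The sketch assumes a last-layer output shaped [batch, seq_len, hidden] in which [CLS] occupies position 0 of every sequence; the function name and values are hypothetical.

```python
def cls_vectors(sequence_output):
    """Pick the first-token vector from each sequence in a [batch, seq_len, hidden] nested list."""
    return [sequence[0] for sequence in sequence_output]

# Hypothetical last-layer output: batch of 2 sentences, 3 tokens each, hidden size 2.
sequence_output = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
cls = cls_vectors(sequence_output)
```

With a tensor library, the same selection is usually written as a slice such as `sequence_output[:, 0, :]`.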

Thanks a lot!!

Done! Got Pearson 0.897. Looks like awesome result! @cdluminate could you please explain why this fine-tuned model is not good for your task?

Simply because I need the embedding vectors. The regression model suffers from combinatorial explosion in the number of sentence pairs. In one of my datasets it takes more than 100 years to calculate all similarity scores.

How did you get this? Does running the run_classifier.py code on the STS-B data from GLUE give a fine-tuned model?
Can we use that model to find the similarity between two sentences from any other data?

@KavyaGujjala I was using the code from here https://github.com/Colanim/BERT_STS-B, so for STS-B I was using run_scorer.py. I tried (via Bert-as-service) to use this fine-tuned model to find similarity between two sentences on my data, but the results were worse than by using Universal Sentence Encoder. My guess it's because of specifics of my data: very short sentences, rather noun phrases and verb phrases. BERT should be working fine on normal sentences, like in STS-B dataset.

@realsergii How were you finding the similarity between sentences? using [CLS] token as sentence representation? Also by saying fine-tuning on your data were you using labelled corpus? If so, can you please state one example?

@KavyaGujjala I was using L2 space of https://github.com/nmslib/hnswlib
I tried [CLS] token as well as pooled tokens options.
I did not fine-tune on my data, only on STS-B. I do not have a labeled corpus. My task is not related to any domain, but rather to general language understanding. So I expected that fine-tuning on STS-B would give good results.

@realsergii Thanks for the information. So you are saying that a model fine-tuned on the STS-B dataset from GLUE should work well in your case (finding similarity between any other sentences), because it's not domain related, correct?
I have domain-specific data. I want to fine-tune the BERT base model on this data because using the [CLS] token without fine-tuning doesn't help much. But I don't have a labelled corpus either, so I'm trying to figure out how to do this. My task is solely about finding similarity between sentences.
@cdluminate @Colanim Any idea on how to do this?

> So you are saying a fine-tuned model on STS-B dataset from GLUE should work well in your case (finding similarity between any other sentences), because it's not domain related, correct?

I think it should work well on non-domain-specific sentences similar to those in the STS-B dataset. But my case is a bit different: my inputs are noun phrases rather than sentences, so BERT fine-tuned on STS-B works, but not always well enough.

I concatenate the first token ([CLS]), a mean pooler, a max pooler and an attention pooler, all of these features taken from layer 9 of BERT, feed the concatenated vector to a Linear layer, and use cosine distance to evaluate the similarity of two sentences.
I think this can help you encode a sentence in vector space.
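
One possible reading of the pooling scheme described above, as a minimal pure-Python sketch. The `query` vector used for attention pooling and the toy `layer9` values are hypothetical, since the comment does not say how the attention pooler is parameterized, and the final Linear layer is left out.

```python
import math


def mean_pool(tokens):
    """Average each dimension over all token vectors."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def max_pool(tokens):
    """Take the maximum of each dimension over all token vectors."""
    dim = len(tokens[0])
    return [max(t[i] for t in tokens) for i in range(dim)]


def attention_pool(tokens, query):
    """Softmax-weighted average; weights come from dot products with `query`."""
    scores = [sum(q * x for q, x in zip(query, t)) for t in tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]


def concat_features(tokens, query):
    """Concatenate [CLS], mean-, max- and attention-pooled vectors."""
    cls_vec = tokens[0]  # the first token is assumed to be [CLS]
    return cls_vec + mean_pool(tokens) + max_pool(tokens) + attention_pool(tokens, query)


# Hypothetical hidden states of one layer for a 3-token input, hidden size 2.
layer9 = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
features = concat_features(layer9, query=[0.0, 0.0])
```

With an all-zero query the attention weights are uniform, so the attention-pooled part coincides with the mean-pooled part. In a trained model the query would be a learned parameter.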

I fine-tuned BERT (the PyTorch version from huggingface) to generate sentence embeddings for cosine similarity computation. An FC layer + tanh activation is added on top of the CLS token to generate the sentence embeddings.

I fine-tuned the model on the STS-B task and got a 0.83 Pearson score on the dev set. Any suggestions for getting closer to the 0.89 achieved by the simple regression model?

Hi, Volinh, how is the performance?

Hi, Beekbin, could you share your code? Does concatenating other tokens and other layers help?

@zizhec, I haven't tried concatenating other tokens or other layers.

Here is the model I tried:

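import torch
import torch.nn as nn
from pytorch_pretrained_bert.modeling import BertModel, BertPreTrainedModel  # assumed package, inferred from the calls below

class BertPairSim(BertPreTrainedModel):
    def __init__(self, config, emb_size=1024):
        super(BertPairSim, self).__init__(config)
        self.emb_size = emb_size
        self.bert = BertModel(config)
        # Projection from BERT's pooled [CLS] output to the sentence embedding.
        self.emb = nn.Linear(config.hidden_size, emb_size)
        self.activation = nn.Tanh()
        self.cos_fn = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
        self.apply(self.init_bert_weights)

    def calcSim(self, emb1, emb2):
        # Cosine similarity between two batches of sentence embeddings.
        return self.cos_fn(emb1, emb2)

    def forward(self, input_ids, attention_mask):
        # token_type_ids is passed as None; only the pooled output is used.
        _, pooled_output = self.bert(input_ids, None, attention_mask,
                                     output_all_encoded_layers=False)
        emb = self.activation(self.emb(pooled_output))
        return emb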

The whole project including training process is here
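
As a rough illustration of how embeddings from such a model could be trained against STS-B labels, here is a pure-Python sketch of an MSE loss on scaled cosine similarity. The scale of 5 follows the setup described in the opening post of this thread; whether the linked project uses the same objective is an assumption, and the toy embeddings are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def scaled_cosine_mse(pairs, gold_scores, scale=5.0):
    """Mean squared error between scale * cosine(emb1, emb2) and the gold score."""
    errors = []
    for (emb1, emb2), gold in zip(pairs, gold_scores):
        predicted = scale * cosine_similarity(emb1, emb2)
        errors.append((predicted - gold) ** 2)
    return sum(errors) / len(errors)


# Hypothetical embedding pairs: one identical pair, one orthogonal pair.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
gold = [5.0, 0.0]  # STS-B gold scores range from 0 to 5
loss = scaled_cosine_mse(pairs, gold)
```

In a real training loop the same quantity would be computed on tensors so that gradients flow back into the encoder.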

Mark.
What does FT mean?

Fine-Tune (FT)

> pearson correlation coefficient

Hi, sorry for asking stupid questions. I am a beginner in the NLP field.
I would also like to learn a fine-tuned model for calculating similarities between a lot of sentences.
How do you use the Pearson correlation coefficient to evaluate on your validation set?
Have you already obtained an actual vector for every sentence?

@yolearn please take a look at the details of the STS-B benchmark. They should answer your question.
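
In short, STS-B provides a human similarity score from 0 to 5 for each sentence pair, and a model is evaluated by correlating its predicted similarities with those scores. A minimal sketch with made-up numbers:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical gold scores for four sentence pairs (0 to 5 scale).
gold = [0.0, 1.5, 3.0, 4.5]
# Hypothetical cosine similarities predicted for the same four pairs.
predicted = [0.1, 0.4, 0.7, 1.0]

r = pearson(predicted, gold)
```

Pearson correlation is unaffected by positive linear rescaling, so cosine similarities do not need to be mapped onto the 0 to 5 range before evaluation.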

Thanks for sharing. I finally understand the details.
I would like to confirm what you did.
You have a lot of unlabelled data, so you wanted to get embeddings from a model fine-tuned on the STS-B benchmark and then apply it to your data to calculate similarity.
Have I misunderstood anything?

Did you use the last state of the LSTM output, or did you pool over all the states of the LSTM output?

I just used the last state of the LSTM

I tried several things, and for my problem, the best solution was to apply reduce-mean pooling over all the feature vectors output by BERT. Then I fed this into an LSTM (I believe a simple feed-forward network would work equally well, if not better) and took the last state of the LSTM.

This last state of the LSTM is my sentence representation. I can then compare it to other sentence representations using a distance metric. So in the end I have a Siamese network.

_Note: the CLS embedding is a better solution for sentence representation if you can fine-tune your model._
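
The Siamese structure described above can be sketched in pure Python as follows. This is not the author's code: it swaps the LSTM for a single feed-forward layer (which the comment only speculates would work), the weights `W` and `B` are hypothetical, and the padding mask is an added assumption.

```python
import math


def masked_mean_pool(token_vectors, mask):
    """Average token vectors, skipping positions where mask is 0 (padding)."""
    kept = [t for t, m in zip(token_vectors, mask) if m == 1]
    dim = len(kept[0])
    return [sum(t[i] for t in kept) / len(kept) for i in range(dim)]


def feed_forward(vec, weights, bias):
    """One dense layer with tanh activation."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def encode(token_vectors, mask, weights, bias):
    """Shared encoder: the same weights are used for both sentences."""
    return feed_forward(masked_mean_pool(token_vectors, mask), weights, bias)


def euclidean(a, b):
    """Euclidean distance between two sentence representations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical shared weights and toy token features (not real BERT output).
W = [[0.5, -0.5], [1.0, 1.0]]
B = [0.0, 0.0]
sent_a = ([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])  # last token is padding
sent_b = ([[2.0, 3.0]], [1])

dist = euclidean(encode(*sent_a, W, B), encode(*sent_b, W, B))
```

Both inputs pass through the same `encode` function with the same weights, which is what makes the network Siamese.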

Hi Cola. You added an LSTM layer on top of BERT. I wonder whether you have published a paper, as I would like a reference. Your GitHub profile doesn't list an email, so I could only reply here. Thanks.

@SmartMapple Sorry, I haven't published anything, and the code is not public, so unfortunately I can't share it.

OK, it doesn't matter. I appreciate you sharing.
