transformers 🚀 - How to use BERT for finding similar sentences or similar news?

Hi,
BERT out-of-the-box is not the best option for this task, as the run-time in your setup scales with the number of sentences in your corpus. I.e., if you have 10,000 sentences/articles in your corpus, you need to classify 10k pairs with BERT, which is rather slow.

A better option is to generate sentence embeddings: Every sentence / article is mapped to a fixed sized vector. You need to map your 3k articles only once to a vector.

A new query is then also mapped to a vector. In this setup, you only need to run BERT for one sentence (at inference), independent how large your corpus is.

Then, you can use cosine-similiarity, or manhatten / euclidean distance to find sentence embeddings that are closest = that are the most similar.

I released today a framework which uses pytorch-transformers for exactly that purpose:
https://github.com/UKPLab/sentence-transformers

I also uploaded an example for semantic search, where each sentence in a corpus is mapped to a vector and than cosine-similarity is used to find the most similar sentences / vectors:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py

Let me know if you have further questions.

nreimers on 25 Jul 2019

👍92 ❤77 🎉38 👀1

I think you can use faiss for storing and finding similar embeddings.

stefan-it on 25 Jul 2019

👍26 ❤2

@nreimers Amazing!! Thank you so much. What you created is a real-life savior! Can this be used for finding similar news(given title and abstract)? I ran the code and I have the following doubts.
Which model should I use?
bert-large-nli-stsb-mean-tokens vs bert-base-nli-mean-tokens vs bert-large-nli-mean-tokens (what are the datasets on which all these models are trained on?)

Can I use faiss to compute the search/distance of the vectors instead of L2/Manhattan/Cosine distances?

Many thanks to @stefan-it for introducing me to faiss.

Raghavendra15 on 25 Jul 2019

@nreimers I don't think scipy.spatial.distance.cdist is good enough, it takes a lot of time to compute the results, almost 10 minutes on a corpus of 3.9k news articles. I think I should try using faiss. I don't know anything about faiss but I will try.

Raghavendra15 on 26 Jul 2019

Hi @Raghavendra15,
regarding the model I sadly cannot be helpful, you would need to test them. In general, sentence embeddings methods (like Inference, Universal Sentence Encoder or my git) work well for short text, i.e., for sentences. For longer text with multiple sentences their performance often decrease and average word embeddings or tf-idf is in many case a much better choice. For longer texts, all these sentence embeddings methods are not really needed.

It would be great if you have some training data. Then, it would be quite easy to fine-tune a model specifically for your task. It should achieve much better performances than the pre-trained models.

I think the issue is not scipy.spatial.distance.cdist. On a corpus with 100k embeddings and 1024 embedding size, it requires about 0.2 seconds per query (if you can batch queries, even less time is needed).

I think the issue might be the generation of the 4k sentence embeddings? Transformer networks like BERT are extremely slow on CPUs. However, on a GPU, the implementation can process about 2000 sentences per seconds. On a GPU, only about 40 sentences.

But the corpus must only be processed once and can then be stored & loaded from disk. At inference, you just need to generate one embedding for the respective query.

You can of course combine this with faiss. Faiss generates index structures that allow a quick search in vector space and is especially suitable if you have a high number (millions) of vectors. For 4k vectors, scipy takes about 0.008 seconds per queries to find the most similar vectors.

So either something is really strange with scipy on your computer, or the long run-time comes from the generation of the embeddings.

nreimers on 26 Jul 2019

👍16

@nreimers Thank you very much for your response. You're absolutely right, most of the time taken is for generating the embedding for 4k sentences. I'm now confused between choosing this model over XLNet, XLNet has achieved the state of the art results.

By your comments on faiss, As long as I have a smaller dataset, results from faiss and scipy won't make any difference? However, If I had millions or billions of news articles then using faiss makes sense right? For smaller datasets, there is no difference in terms of quality of matches between faiss and scipy(the results are the same for computing the distances)?

I have one important question, If I want to train the model as you suggested which would yield better results, In that case, I should have labeled dataset right? However, for news, I only have titles and abstract about that news. Is there a way to train them without the labels?

Raghavendra15 on 30 Jul 2019

Hi,
XLNet achieved state-of-the-art performance for supervised tasks like classification. But it is unclear if it generates also good embeddings for unsupervised tasks.

In the framework you can choose XLNet, but I was only able to produce results that are slightly below those of BERT.

Others also have problems getting a good performance with Xlnet for supervised tasks, as it appears that it is extremely sensitive to the hyper parameters.

If you have millions of docs, faiss makes sense. With scipy, you get exact scores. With faiss, the scores are fuzzy and the returned most similar vectors must not necessarily be the actual most similar vectors. There can be small variations. But I think the difference will be small.

Often you have in your data some structure, like categories or links between news articles. This structure can be used to fine-tune a model. Let's say you have links linking to similar events. Than you train the network with triplet loss with the two linked articles and one random other article as negative example.

This will give you a vector space where (possibly) linked articles are close.

nreimers on 30 Jul 2019

👍9

@nreimers Thank you very much for your quick response.
Are the existing model "bert-large-nli-stsb-mean-tokens" better than the google news word2vec google_news_300, they claim that-" We are publishing pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases."
Is the pretrained "bert-large-nli-stsb-mean-tokens" better than google's pre-trained news vectors?

For training the existing model to improve results for news similarity, the problem I have is I can't create a dataset to compute triplet loss. For triplet loss to work in the case of news similarity for query news ['a'], I need to find a news article ['b'] which is similar as a positive example and a dissimilar news article ['c'] as a negative example. Like positive example and negative example.

However, If I run the news every day then, new entities/topics are going to pop up every single day? I need to update my embeddings right? I don't know how to handle this situation.

Raghavendra15 on 1 Aug 2019

Google News vectors are just word vectors, you still need a strategy to derive sentence embeddings from these. But as mentioned earlier, average word embeddings is a promising idea for your task. Note, average word embeddings models will be added soon to the repository.

Constantly updating of the model is not needed. News are changing, but the used words remain the same. So training once should give you a model that can be used for a long time.

nreimers on 1 Aug 2019

@nreimers Thank you very much! Any tentative date by when the average word embeddings will be added to the repository?

I want to know how to evaluate the results of similar sentences numerically, for example when I use your model to evaluate for a given news, finding similar news in the corpus.

Is there a way to measure numerically how good the similar sentences are in the below example? I used BLEU score, but the problem is, it's not an accurate measure of similarity. BLEU score doesn't consider the context of the sentence, it just blindly counts whether a word in the query sentence is present in the similar sentence regardless of where the word is placed.

For an item, I get related items.
In the below example, the first title in relatedItems is similar, however, the second item in "relatedItems" is not at all similar which talks about Stephen Colbert and Joe Biden.
Suppose I use word2vec model for the above task it might give me two totally different sentences as relatedItems, In that case, how can I evaluate both the models and claim numerically which one is better?

Example:

{"title": "Google Is Rolling Out A New Version Of Android Auto - Here's What You Can Expect",
"abstract": "The new Android Auto. Google If you use Android Auto, you're about to receive to a nice upgrade.",
}
"relatedItems":
[{

"title": "New Android ransomware is spreading through text messages",
"abstract": "Thereu2019s a new type of Android ransomware making the rounds that leverages SMS to spread, according to a new report from cyberappsecurity com",
},
{
"title": "Stephen Colbert Brings Curtain Down On Democratic Debates With Joe Biden Tweaks",
"abstract": "Stephen Colbert closed his second of two live Late Show monologues with a spree of zingers directed at Joe Biden, mixing in plenty for the o",

}
]}

Raghavendra15 on 1 Aug 2019

Bleu wouldn't be a good measure, because the best similarity metric to find similar news would be: Bleu (of course).

What you would need is an annotated Corpus. For a given article, get for example the 20 articles with the highest tf idf similarity. Then annotate every pair as similar or not.

With this data you can compare different methods with Ndcg about how well they rank the 20 candidate articles.

Avg. Word embeddings should be included within the next two weeks to the repo.

nreimers on 1 Aug 2019

❤2

@nreimers When you say -"Bleu wouldn't be a good measure, because the best similarity metric to find similar news would be: Bleu (of course)."
Do you mean when I get similar news like in the above example, BLEU is the best metric to measure how similar the two news articles are? Please correct me if I understood this wrong.

In the STS benchmark, I saw a pair in the training dataset with gold-standard human evaluated scores. The following paid had a score of 5, however, when I use BLEU scores for 1gram they don't get a score of 1. Instead, they get the following scores. BLEU looks for the exact word to be present in the reference sentence that's the problem I feel. There's no notion of similarity.

s=word_tokenize("The polar bear is sliding on the snow")
reference = [s]
candidate =word_tokenize("The polar bear is sliding across the snow")
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

Individual 1-gram: 0.875000
Individual 2-gram: 0.714286
Individual 3-gram: 0.500000
Individual 4-gram: 0.400000

reference sentence has 8 words out of which the candidate matches exactly 7 words, so 7/8 score for 1-gram matches.

I'm not sure how the STS benchmarks are evaluated, I'm currently looking into them. If you have any leads or a document I would be more than happy to read them.

Thank you very much for your help :)

Raghavendra15 on 1 Aug 2019

No, BLEU is a terrible idea for evaluation.

STS is usually evaluated using Pearson correlation between gold and predicted labels. But Pearson correlation is also a bad idea:
https://aclweb.org/anthology/C16-1009

I strongly recommend to use Spearman correlation for comparison.

nreimers on 2 Aug 2019

@nreimers Kudos on the COLING paper! It's very well written. In the paper, you have mentioned How Pearson correlation can be misleading or ill-suited for the semantic text-similarity task. However, you did not suggest to use Spearman correlation instead of Pearson correlation. But for me, you suggested me to use Spearman correlation why? (That's my current understanding of the paper)

Can I use the Spearman rank correlation from scipy?
Basically, I want to compare the BERT output sentences from your model and output from word2vec to see which one gives better output.
So there is a reference sentence and I get a bunch of similar sentences as I mentioned in the previous example [ please refer to the JSON output in the previous comments].

Will the below code is the right way to do the comparison?
In your sentence transformer, you have used the same below package in SentenceEvaluator class. I couldn't figure out how to use that class for my comparison.

Will you please give me some idea in this regard?

Example code:
from scipy.stats import spearmanr
x = [1, 2, 3] ---> I will use BERT and word2vec embeddings here.
x_corr = [2, 4, 6]
corr, p_value = spearmanr(x, x_corr)
print (corr)

Raghavendra15 on 8 Aug 2019

Hi @Raghavendra15
The issue with pearson correlation is, that it assumes a linear correlation between the system output and gold labels. Adding a montone function to the system output can change the scores (make them better or worse), which does not really make sense in applications.

Assume you have a systems that predicts the perfect gold scores, however, the output is output=sqrt(gold_label).

This system would get a really low Pearson correlation. However, for every application, this system would be perfect, as it predicts the gold labels. With Spearman correlation, you don't have this issue. There, just the ranking of the scores are important.

In general I think the STS tasks (or the STS benchmark) are not really well suited to evaluated approaches. The STS tasks with Pearson/Spearman correlation weights every score similar, but in applications, we are often only interested in certain examples.

For example, if we search for pairs with the highest similarity, then we don't care how the scores are for low similarity pairs. A system that gives a perfect score for high similarity pairs and a random score for low similarity pairs would be great for this application. However, this system would get a low Pearson/Spearman correlation, as it fails to correctly order the somewhat-similar and unsimilar pairs.

If you want so estimate the similarity of two vectors, you should use cosine-similarity or Manhatten/Euclidean distance.

Spearman correlation is only used for the comparison to gold scores.

Assume you have the pairs:
x_1, y_1
x_2, y_2

...
for every (x_i, y_i) you have a score s_i from 0 ... 1 indicating a gold label score for their similarity.

You can check how good the embeddings are by computing the cosine similarity between the embeddings for (x_i, y_i) and then you compute the Spearman correlation between these computes cosine similarity scores and the gold score s_i.

Note: Currently I add methods to compute average word embeddings and similar methods to the repository. So a comparison will become easier.

nreimers on 8 Aug 2019

@nreimers Last week you added the methods to compute average word embeddings should I use that method when I get a sentence embedding or will there be a pre-trained average word embedding weights?
In the below code I will get the embeddings once I pass the input strings. Should I use the compute avg word embedding method on top of this?

corpus = ['A man is eating a food.',
'A man is eating a piece of bread.' ]
corpus_embeddings = embedder.encode(corpus)
or
By any chance, pre-trained avg-word embedding weights will be uploaded to the repository by any time this week.

Raghavendra15 on 16 Aug 2019

Hi @Raghavendra15
I just uploaded v0.2.0 to github and PyPi:
https://github.com/UKPLab/sentence-transformers

You can update with pip install -U sentence-transformers

I added an example for average word embeddings (+a DAN layer that is trainable):
https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_stsbenchmark_avg_word_embeddings.py

You can also use it without the DAN layer. There is also a tokenizer implemented that allows the usage of the word2vec Google News vectors. These vectors contain phrases like 'New_York'. These phrases are detected by the tokenizer and mapped to the correct embedding for New_York. But there is currently no example for this in the repo. If you need help, let me know.

To get avg. word embeddings only (without DAN), the code must look like this:

# Map tokens to traditional word embeddings like GloVe
word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

corpus_embeddings = model.encode(corpus)

Next release will update include support for RoBERTa and add other sentence embeddings methods (like USE, LASER), which will be trainable.

nreimers on 16 Aug 2019

👍1

@nreimers Thank you very much! You spoke my mind with RoBERTa, I was about to ask you about it. But with the avg-embedding approach, I won't be using BERT at all right?

In addition to that, I won't be training the model. I don't think I fully understand this. Earlier I would pass a pretrained weight model into SentenceTransformer, however now I won't pass anything related to BERT, does that mean I won't be using BERT?

Raghavendra15 on 16 Aug 2019

@Raghavendra15
The framework offers you a lot of flexibility. You can choose between the following embedding approaches:

BERT or XLNet (RoBERTa and other will follow)
Traditional word embeddings like GloVe, word2vec etc.

Then, you can choose between different pooling modes: Mean pooling, max pooling, usage of the CLS token for BERT / XLNet.

Finally, if you like, you can add feed-forward networks to create a deep-averaging network.

If you have training data, I can recommend this combination:
BERT + mean-pooling

This gave the best performance for many cases.

If you have training data, you need a low computation time and performance is not that important, choose this combination:
GloVe embeddings (or something similar) + mean-pooling + 1 or 2 dense layers

If you don't have training data, choose:
GloVe embeddings (or something similar) + mean-pooling

As you can see, there are various options you can choose from, depending if you have training data and how important is a high speed vs. a good performance.

Once I have RoBERTa integrated, how suitable it is for the generation of sentence embeddings. My experiences with XLNet was that the performance is slightly below the performance of BERT for sentence embeddings. Maybe RoBERTa is better for sentence embeddings, maybe not.

Averaging BERT without fine-tuning on data gave really poor results. However, what you can of course try, is to use one of the existent pretrained BERT models like 'bert-base-nli-mean-tokens', which is BERT+mean-pooling, fine-tuned on NLI data to generate meaningful sentence embeddings.

nreimers on 16 Aug 2019

👍2

@nreimers Thank you very much! Why didn't you choose (word2vec) Google news vectors? Is there any particular reason for choosing Glove embedding over word2vec? I'm curious to know how RoBERTa will perform! 😃

Raghavendra15 on 20 Aug 2019

👍1

@Raghavendra15
There are two reasons:
1) Google news word2vec is quite large, it requires about 12 GB of RAM to read it in. Not that ideal for an example script. GloVe embeddings are about 10 times smaller.
2) In most of my experiments, the Google news word2vec vectors did not yield good performances. GloVe embeddings were often a bit better. I especially like the embeddings by Levy et al (trained on dependencies) and by Komninos. I also conducted a larger comparison between word embeddings (https://arxiv.org/abs/1707.06799, Table 5).

But note, using the Google news word2vec vectors is quite easy. In training_stsbenchmark_avg_word_embeddings.py replace

word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

with

word_embedding_model = models.WordEmbeddings.from_text_file('GoogleNews-vectors-negative300.txt.gz')

First experiments with RoBERTa are done: On STSbenchmark, it increases the Spearman correlation by about 1 - 2 percentage points. I will see how it will perform on other datasets.

Best, Nils Reimers

nreimers on 20 Aug 2019

👍1

This issue is very interesting, thanks for sharing your experiments and framework @nreimers!

thomwolf on 20 Aug 2019

❤2

@nreimers I read your paper on word embedding comparison, however, when I saw the GLEU scoreboard for STS benchmark Glove scored very less compared to word2vec, Isn't it contradictory to your paper? Also in your paper, the comparisons are on a certain set of tasks like Entity Recognition, NER but not on Semantic Textual Similarity. I don't know much about it, I'm trying to learn. Do my questions make sense?

Is there any significant difference between using glove.840B.300d.zip (contains 840 billion words vectors trained on the common crawl ) vs glove.6B.300d.txt.gz (contains 6 billion words vectors wikipedia+Gigaword), Is it like more words the better? also, they're trained on different datasets, will that make a huge difference when applied to news similarity?

Raghavendra15 on 21 Aug 2019

👍1

See the GloVe website / paper for the differences. 6B was trained on 6 billion words from Wikipedia, 840B was trained on 840 Billion words from common crawl.

It depends on the task and data which one is more suitable. If you have a lot of rare words, and those play an important role for your task, 840B is often better. If you have clean data / only common words are important for your task, 6B often works better.

However, the differences are often only minor between the two versions.

In my paper I only compare embeddings for supervised task, only for sequence tagging.

In unsupervised tasks, you can get completely different results. Further, how word embeddings are averaged has a big impact. Some authors don't ignore stop words, instead they propose some complicated weighting scheme. If stop words are ignored, performances can be improved up tp 10 percentage points, sometimes outperforming complex weighting approaches.

Best,
Nils Reimers

nreimers on 21 Aug 2019

Thank you for your work, Nils, it is brillant!

I would like to design a sentence level semantic search engine using email data (Enron dataset).

I am still a little bit confused about how I should be fine-tuning models on such dataset (maybe I am missing something obvious).

Thanks.

Gogan

ggndtes on 21 Aug 2019

@ggndtes In general BM25 will be really hard to beat on this type of task. See this paper where they compare sentence embeddings with BM25 on an end-to-end retrieval task (given: question, find similar / duplicate questions in a large corpus):
https://arxiv.org/pdf/1811.08008.pdf

A complex sentence embedding method only achieves 1 - 2 percentage points improvement against BM25 (Table 2, Dual Encoder Paralex vs. Okapi BM 25).

Especially if you have more than just a sentence, carefully constructed BM25 for example with Elasticsearch is really really hard to beat. If you are interested in a production system, I would highly recommend to first try Elasticsearch (or similar), beating it will be difficult.

Back to your question how you can tune it:
The big question narrows down to: What are your queries, what are your documents. Are your documents complete emails? Or only email subjects? Or only sentences within emails?

Are your queries inputs from the user, email subjects or complete emails?

In general you would need to construct same sort of similarity. Currently I can only think of imperfect method to create similarity labels. One option would be: Triplet loss with 2 emails from the same inbox vs. one random other subject. But this would I think create rather bad embeddings.

Currently I can't think of a good method to create similarity labels for that dataset. And as mention, even with perfect labels, it will be really hard to beat BM25.

Best,
-Nils Reimers

nreimers on 21 Aug 2019

❤3 👍3

@nreimers The sentence encoder actually takes quite a lot of time to load the Glove embeddings, Is there a way where I can make it load from the disk or make it faster?

Raghavendra15 on 22 Aug 2019

@Raghavendra15 When you run the code the first time, the embeddings are downloaded and stored in the path of the script. In follow-up executions, the embeddings file is loaded from disk.

GloVe embeddings are quite large, so loading it can take some time.

There are two ways to speed it up:
1) Limit the vocab size, i.e., don't load all the ~400k embeddings. Pass the parameter 'max_vocab_size' to the method 'from_text_file' when called.
2) Save the WordEmbeddings model to disc. In follow-up executions, you can load the (binary) model directly from disc and you don't have to read in and parse in the text file.

Should work something like this:

word_model = WordEmbeddings.from_text_file('my-glove-file.txt')
word_model.save('my/output/folder/GloveWordModel')

# In follow-up calls, should be faster
word_model = WordEmbeddings.load('my/output/folder/GloveWordModel')

nreimers on 22 Aug 2019

@nreimers Wow!! It works blazingly fast!
I was trying to play with the below code. Thank you very much for the help :)
Code in In WordEmbeddings.py file:

 with gzip.open(embeddings_file_path, "rt", encoding="utf8") if embeddings_file_path.endswith('.gz') else open(embeddings_file_path, encoding="utf8") as fIn:
            iterator = tqdm(fIn, desc="Load Word Embeddings", unit="Embeddings")
            for line in iterator:

Also, can I load the model similar to that for BERT pre-trained weights? such as the below code?

embedder = SentenceTransformer('bert-large-nli-stsb-mean-tokens')
Can I load the above pre-trained weights somehow just like you have load method for glove weights?

Is the avg embedding with Glove better than "bert-large-nli-stsb-mean-tokens" the BERT pre-trained model you have loaded in the repository? How's RoBERTa doing? Your work is amazing! Thank you so much again!

Raghavendra15 on 23 Aug 2019

@Raghavendra15 Sure you can:

word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)


model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
model.save('my/output/folder/avg-glove-embeddings')

# Load the Model:
model = SentenceTransformer('my/output/folder/avg-glove-embeddings')

Which model is better depends extremely on your data and on your task. The BERT models work good if you have clean data, which is not too domain specific and rather descriptive. This is due to the nature on which data it was fine-tuned (on NLI dataset).

Average GloVe embeddings works I think better if you have noisy data, really domain specific data or very short sentences or very large paragraphs.

Experiments with RoBERTa are finished. Paper will be uploaded next week to arxiv. In my experiments, I could not observe a major difference between BERT and RoBERTa for sentence embeddings: Sometimes BERT is a little bit better, sometimes RoBERTa. But nothing that is significant. XLNet was so far in general worse than BERT.

Best
-Nils Reimers

nreimers on 23 Aug 2019

@nreimers Thanks! but my question is how can make the pretrained BERT model faster like loading the below model.
embedder = SentenceTransformer('bert-large-nli-stsb-mean-tokens')
When I run the encoder for BERT it takes a lot of time like 10-15 minutes for 4k sentences.
embedder.encode(corpus) ---> This takes around 10minutes for "bert-large-nli-stsb-mean-tokens"
However, Glove model does the job in 30 secs. bert-large-nli-stsb-mean-tokens is similar to glove pretrained word vectors right? Then is there a way to convert speed up the BERT sentence encoder?

Raghavendra15 on 26 Aug 2019

@Raghavendra15 No, the BERT model and average GloVe embeddings are completely different.

GloVe embeddings have one vector for each word in a language, for example, the word 'apple' is mapped to the vector 0.31 0.42 0.15 ....

To compute avg. GloVe embeddings, you just perform some memory lookup: Every word is mapped to the vector and then you compute the mean values.

BERT (https://arxiv.org/abs/1810.04805) is much more complex: Words in a sentence are first broken down to subwords, which are than mapped to vectors (which is the fast part).

But after that, a transformer network is run over the complete sentence: For BERT-base it has 12 layers, for BERT-large, it has 24 layers. This produces vectors for each word which depend on the context of the complete sentence.

If you have the two sentences:

Apple is a healthy fruit
Apple presented their new iPhone

With GloVe, Apple are mapped in both cases to the same vector. With BERT, the two Apple words are mapped to different embeddings. In the first case, it is mapped closer to words like Banana, Mango etc., in the second sentence, it is mapped closer to words like Microsoft, Google etc.

But this comes with a cost: Transformer networks are rather slow. This is especially true if you have only a CPU or an older GPU.

On a CPU, you can process with BERT about 80 sentences / second (with GloVe, more than 5k). On a Nividia V100 GPU, the speed is a bit better: About 2000 sentences / second (BERT-base).

The runtime for transformer networks is quadratic with the sentence length. If your sentence is twice as long, the runtime increases 4x.

So the only ways to speed-up the BERT model:

Try to figure out what the optimal batch size is for your system (you can pass the batch size as a parameter to encode())
Use the base, not the large model. The large model is multiple times slower than the base model.
Be careful with your sentences lengths. Maybe truncate your sentences
Get a better / faster GPU (or multiple GPUs). Running BERT on CPU is horrible.

I hope this of some help for you.

Best regards
-Nils Reimers

nreimers on 26 Aug 2019

👍21 ❤4

This is an outstanding explanation Nils – you should blog or tweet, I'm sure lots of people would be interested in reading more from you!

julien-c on 26 Aug 2019

👍8

@nreimers Brilliant explanation! :D You're a life saviour :-)
I need your help with this issue. Can I use sentence transformer for this case?
https://github.com/huggingface/pytorch-transformers/issues/1170

Raghavendra15 on 1 Sep 2019

@nreimers Very patient brilliant explanation. Wish u a happy life.

xiao2mo on 4 Sep 2019

@nreimers
Let's say you have a sufficient training set for information retrieval, such as that from fever.ai.

We used black-box Bayesian Optimization to train BM25 on Elasticsearch... producing close to the results described in the SOTA evidence retrieval from the UKP-Athene team, but were still a few % off SOTA, without entity extraction or any other ML preprocessing.

Shouldn't it be the case that a well trained encoder transformer with cosine-loss, with specific weights for a query and document / sentence in the result set, should be able to beat an arbitrary algorithm like BM25?

And that it could be deployed at scale using faiss or hsnw?

pertschuk on 30 Sep 2019

Hi @pertschuk
If the recall of BM25 is quite good, I would aim for re-ranking instead of a full semantic search.

In re-ranking, you retrieve e.g. 100 documents with your BM25 algorithm. Then, you run BERT to compare each document with your query to get one score (0...1).
Next, you sort these scores.

Your original ranking from BM25 is then replaced with these BERT-based scores ranking.

Sentence embeddings often have challenges in information retrieval as the false positive probability is higher than BM25. I.e., if you compare two dissimilar sentences with sentence embeddings, the probability of getting a high similarity score is higher for approaches like Sentence-BERT / InferSent / USE, than it is for BM25.

In Information Retrieval, you usually have a large set of unrelated docs, i.e., this higher false positive rate leads to really bad consequences that you find many unrelated documents, leading to a performance usually lower than BM25.

The re-ranking approach prevents this to happen: BM25 gives you a rather clean candidate set, and your neural re-ranking approach can then do the hard work and determine which of the n documents matches the query the best.

Best regards
Nils Reimers

nreimers on 1 Oct 2019

👍6 ❤1

Great thank you, this makes sense.

We are currently using reranking on top 9 documents, but maybe could increase this number since our reranking recall is quite high ~.95 based on a RoBERTA regression model.

link: https://github.com/koursaros-ai/koursaros/blob/master/examples/pipelines/factchecking/services/scorer/__main__.py

I guess then the challenge becomes the scale of re-ranking, because there would be ~700 sentences to rerank with this larger set, and we can maybe run 100/s on SoTA transformer.

I wrote a FEVER dataset loader and am currently training a sentence reranking model based on your cosine loss, I am hoping to achieve the greater performance afforded by precomputing embeddings and running KNN to rerank, I will publish results here when I have them.

pertschuk on 1 Oct 2019

Yes, larger candidate sets can actually be quite interesting.

What you can also try is the faster, destilled BERT from hugging face. It achieves similar results like BERT, but is faster.

Sometimes, a larger set with worse (but cheaper) models achieves better overall results than a small set with a better (but expensive) model.

Best
-Nils Reimers

nreimers on 1 Oct 2019

❤1 👍1

Hi,
BERT out-of-the-box is not the best option for this task, as the run-time in your setup scales with the number of sentences in your corpus. I.e., if you have 10,000 sentences/articles in your corpus, you need to classify 10k pairs with BERT, which is rather slow.

A better option is to generate sentence embeddings: Every sentence / article is mapped to a fixed sized vector. You need to map your 3k articles only once to a vector.

A new query is then also mapped to a vector. In this setup, you only need to run BERT for one sentence (at inference), independent how large your corpus is.

Then, you can use cosine-similiarity, or manhatten / euclidean distance to find sentence embeddings that are closest = that are the most similar.

I released today a framework which uses pytorch-transformers for exactly that purpose:
https://github.com/UKPLab/sentence-transformers

I also uploaded an example for semantic search, where each sentence in a corpus is mapped to a vector and than cosine-similarity is used to find the most similar sentences / vectors:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py

Let me know if you have further questions.

Hi, Can t

Hi,
BERT out-of-the-box is not the best option for this task, as the run-time in your setup scales with the number of sentences in your corpus. I.e., if you have 10,000 sentences/articles in your corpus, you need to classify 10k pairs with BERT, which is rather slow.

A better option is to generate sentence embeddings: Every sentence / article is mapped to a fixed sized vector. You need to map your 3k articles only once to a vector.

A new query is then also mapped to a vector. In this setup, you only need to run BERT for one sentence (at inference), independent how large your corpus is.

Then, you can use cosine-similiarity, or manhatten / euclidean distance to find sentence embeddings that are closest = that are the most similar.

I released today a framework which uses pytorch-transformers for exactly that purpose:
https://github.com/UKPLab/sentence-transformers

I also uploaded an example for semantic search, where each sentence in a corpus is mapped to a vector and than cosine-similarity is used to find the most similar sentences / vectors:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py

Let me know if you have further questions.

Can this use GPUs, if so how ?

duttsh on 6 Oct 2019

Hi @duttsh,
Yes, GPU is supported out of the box. You just need the necessary Cuda drivers and then you can train / perform inference on the GPU without any changes

Best regards
Nils Reimers

nreimers on 6 Oct 2019

Thanks Nils

Sent from my iPhone

On Oct 6, 2019, at 1:43 AM, Nils Reimers notifications@github.com wrote:

Hi @duttsh,
Yes, GPU is supported out of the box. You just need the necessary Cuda drivers and then you can train / perform inference on the GPU without any changes

Best regards
Nils Reimers

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

duttsh on 7 Oct 2019

@nreimers (Nils) one more question, when will you have pre-trained models of RoBERTa available ? or if they are please send me the name.

duttsh on 7 Oct 2019

@duttsh I can try to upload it, but in my experimemts I didn't see any improvements from Roberta for sentence embeddings.

Best regards
Nils Reimers

nreimers on 7 Oct 2019

Thanks, can you please upload. Also I believe Roberta will increase the accuracy of inference. Right ?

Sent from my iPhone

On Oct 6, 2019, at 7:18 PM, Nils Reimers notifications@github.com wrote:

@duttsh I can try to upload it, but in my experimemts I didn't see any improvements from Roberta for sentence embeddings.

Best regards
Nils Reimers

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

duttsh on 7 Oct 2019

@duttsh In my experiment, I didn't observe any differences between BERT and RoBERTa when used for different sentence embeddings tasks.

nreimers on 10 Oct 2019

@nreimers thanks. If you could share he name if your RoBERTa model, would be great.

duttsh on 10 Oct 2019

After a couple months of research, the best approach I've found for building semantic search is to integrate with an existing BM25 search platform such as Elasticsearch, and then rerank the top n results using a neural network regression trained to score a combination, on a dataset such as MS MARCO.

Per @nreimers comment, something like BM25 produces a cleaner result set, and training a model to look at query passage pairs at the same time rather than training a cosine loss and comparing precomputed vectors enables it to use attention to more accurately rank passages.

Check out this project that implements such a system: https://github.com/koursaros-ai/nboost

pertschuk on 25 Nov 2019

👍3 ❤1

Took me a long time to reply but thanks so much @nreimers for your incredibly clear explanations and responses.

Thanks also to @pertschuk for sharing the results of your research, this is very helpful.

ggndtes on 25 Nov 2019

Hi,
I'm looking into the question of finding prior art for patents.
This means for one patent application (around 20pages) we would like to find the closest 100 patents in a corpus of 100 million patents. The search results of patent offices could be used as training material.
We thought about tf-idf, word2vec, GloVe etc. So far transformers like BERT seemed to be too slow for such a task.
Now with SBERT and SRoBERTa and powerful AI accelerators, we ask ourselves, if we shouldn't be so quick to exclude transformers.
Any advice? Has anyone applied SBERT to such an amount of data? Anyone using AI accelerators such as Jetson?

wolf-tag on 16 Dec 2019

@wolf-tag
The two biggest issues with my research into building a transformer cosine loss solution based on SBert at scale (I was working with ~6million wikipedia articles, much smaller than all of the patents).

Evaluating the solution. To rebuild all of the 6 million vectors and put them into a FAISS index takes like (200/s to encode) takes like 6-8 hours, and then more time to actually query your test set and calculate something like MRR. Building a good model often requires dozens of evals, tweaking, etc.
Memory usage. There are various compression methods, but currently vector indexes are pretty memory-hungry. https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors. This has possible solutions. It seems like probably to make this scalable, you would need much smaller embeddings than like 1024 for BERT large.

If you are well funded and have lots of GPU/ TPU and memory, it's feasible, and I would look at Patent-BERT, and incorporate that into sentence transformers.

One final thought to keep in mind - I have found that almost everything out there, patents included, have summaries (abstracts). At an even more micro scale, humans often tend to summarize a paragraph with the first sentence. You can leverage this to optimize your solutions, by choosing to look at the summary text instead of all of it.

pertschuk on 16 Dec 2019

👍2

Thank you for your quick reply.

Did I get it right: you used one vector for each wikipedia article which is the result of SBERT's pooling operation?

wolf-tag on 17 Dec 2019

👍1

Hi @wolf-tag
Personally I think tf-idf / BM25 is the best strategy for your task, due to various reasons.

First, it is important to differentiate between false positive and false negative rates:
False positive: A non-similar pair of docs is judged as similar, even they are not similar.
False negative: A similar pair of docs is judged as dissimilar.

TF-IDF/BM25 has a low false positive rate and a high false negative rate, i.e., if a pair is judged as similar, there is a high chance that they are actually similar.

Sentence embeddings methods (avg. GloVe embeddings, InferSent, USE, SBERT etc.) have a reverse characteristic: high false positive rates, low false negative rates. It seldom misses a similar pair, but a pair judged as similar must not necessarily be similar.

For Information Retrieval, you have an extrem imbalance. You have 1 search query and 100 Mio. documents, i.e., you perform 100 Mio pairwise comparisons.

Sentence embeddings with a high false positive rate will return many pairs where the embeddings think they are similar, but they are not. Your result set of 10 documents will be often completely garbage.

TF-IDF / BM25 might miss some relevant documents, but the 10 document you will find will be of high quality.

Second, in my experience, sentence embeddings methods work best for sentences. For (longer) documents, the results are often not that great. Here, word overlap (with tf-idf / BM25) is really hard to beat.

Third: In our experiments in Question Answering (given a question, find in Millions of answers on StackOverflow the correct one), TF-IDF / BM25 is extremely hard to beat. It often performs much better than sentence embeddings methods + it is much quicker.

So far our experiments with end-to-end representation learning for information retrieval rather failed.

What works quite good is a re-ranking approach: You use BM25 to retrieve the top 100 documents. Than, you take a neural approach like BERT to re-rank these 100 results and you present the top-10 results (the 10 results with the highest score according to the neural re-ranker) to the user. This often gives a nice boost to pure BM25 ranking, and the runtime is not too-bad, as you must only re-rank 100 documents.

Best regards
Nils Reimers

nreimers on 17 Dec 2019

👍19 ❤6

Dear Nils,

thank you for your detailed explanation.

Indeed, recent publications for the prior art task do hardly show any improvements when using word2vec, GloVe or doc2vec compared to tf-idf.

I was just curious as Google now uses BERT for the search engine, and I suppose they are more interested in high precision than in high recall, so somehow they seem to master the high false positive rates. Maybe they do so as recommended by you (re-ranking).

I hoped that one of the newer methods would somehow have a positive impact on this task. Just wishful thinking, I fear.

wolf-tag on 17 Dec 2019

Hi @wolf-tag
an interesting paper could be this:
https://arxiv.org/abs/1811.08008

In Table 2 you see, that BM25 outperforms untrained sentence embeddings methods like avg. word2vec. If you have a lot of training data, you can tune the dual encoder so that it performs better than BM25 for the tested task (finding similar questions).

However, the task of finding similar questions involves rather short documents (often only a sentence). For longer documents, I would guess that BM25 still outperforms sentence embeddings methods.

In the paper it would have been interesting to compare the methods also against neural re-ranking, to see if the trained end-to-end retrieval is better or worse than the BM25 + re-ranking approach.

Best regards
Nils Reimers

nreimers on 17 Dec 2019

👍2

Thank you for the hint.

If TF-IDF / BM25 is still the best option for long documents, there seems to be a lot of room for improvement for future research, as this method does not use context, does not follow any semantic approach such as WSD, WordNet or synsets, does neither use trained models nor exploit available training data and does not use any language specific resources (e.g. stemming or noun phrase identification). Maybe some kind of challenge is needed to encourage research in this field.

Best regards,

Wolfgang

wolf-tag on 20 Dec 2019

Hey @nreimers deep thanks for all the info!
(Hopefully) quick question: what would be the optimal setup to find similarities (and build a search engine) between objects defined by a combination of senses?

For instance, consider a DB:
Object 1: "pizza", "street food", "Italian cuisine"
Object 2: "khachapuri", "street food", "Georgian cuisine", "cheese", "bread"

And then, a query "cheesy street food".

I'm using USE + hnswlib now, works pretty good, but only if the query string is more than 1 word. The more words, the better.

realsergii on 22 Dec 2019

Hi @realsergii
Not sure if USE is the best match for that task. From the given example, I would again think that you would get quite far with BM25 and for example Elasticsearch. Elasticsearch is quite great for indexing complex objects and search over them.

Of course you would need to tune the search a bit, e.g. that longer n-grams give higher scores, maybe combined with stemming / lemmatization of words.

Otherwise, for individual words, I think word embeddings (like Word2vec / GloVe) are quite great. Sentence embeddings often have difficulties to give a good representation for words or short phrases, as these systems were not trained for it.

Also this repo could be interesting, which combines Elasticsearch BM25 with BERT re-ranking:
https://towardsdatascience.com/elasticsearch-meets-bert-building-search-engine-with-elasticsearch-and-bert-9e74bf5b4cf2

This could potentially also be combined with a simple average word embedding re-ranking approach.

I hope that helps.

Best
Nils Reimers

nreimers on 22 Dec 2019

👍3

@nreimers thanks Nils!
Just one more clarification - what would change if in my DB I replace each word/phrase with the 1st sentence of Wikipedia entry which is the closest to the respective word/phrase ?
So in that case, would USE or SBERT be a good choice?

realsergii on 22 Dec 2019

Hi @realsergii
Sounds a bit complicated and you would have several other issues (how to find the correct article, what about small spelling variances).

Word embeddings are quite strong on finding similar words. As the context is rather small, I don't see too much benefit from using a sentence embedding methods to disambiguate words. 'Cheese' in your context will most often refer to the food, and not to e.g. a company or a strategy in a computer game.

Best
Nils Reimers

nreimers on 22 Dec 2019

Thanks @nreimers
My idea is not just to find similar words/phrases, but to find similar senses.
E.g. "welding" is similar to "joining", "building", in my understanding.
In order to comprehend this, a machine needs to know what are all of those concepts, described in more basic words, right?
One way to teach a machine is to create a vector from a sentence where the sense is described by Wikipedia (and thus, in more basic concepts).
Other way is to just get a sense (as a vector) (by word/phrase) from a model trained on Wikipedia and other sources.
This is my understanding.

Please suggest what sounds better.

realsergii on 22 Dec 2019

Hi @realsergii
That is exactly what words embeddings are great for: to find similar words, e.g. welding is similar to joining / building.

Mapping words to Wikipedia definitions sounds unnecessary complicated and I doubt you get good results with this (compared to simple word embeddings). At the end, as you have a fixed word to Wikipedia article mapping, you will get a fixed word -> vector mapping. But it is much more complicated and the quality will be much lower.

I would train word2vec / GloVe on large amount of text from your domain and then you can use these word embeddings for comparing word similarities.

nreimers on 22 Dec 2019

👍1

That’s why using word embedding. Because tf-did just find lexicon
overlapping but not the similar semantic meaning

On Mon, Dec 23, 2019 at 03:58 Nils Reimers notifications@github.com wrote:

Hi @realsergii https://github.com/realsergii
That is exactly what words embeddings are great for: to find similar
words, e.g. welding is similar to joining / building.

Mapping words to Wikipedia definitions sounds unnecessary complicated and
I doubt you get good results with this (compared to simple word
embeddings). At the end, as you have a fixed word to Wikipedia article
mapping, you will get a fixed word -> vector mapping. But it is much more
complicated and the quality will be much lower.

I would train word2vec / GloVe on large amount of text from your domain
and then you can use these word embeddings for comparing word similarities.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/huggingface/transformers/issues/876?email_source=notifications&email_token=AIEAE4GQI2RXKMS4HDVY2Y3QZ7BIBA5CNFSM4IGKGJT2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHPYXRA#issuecomment-568298436,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AIEAE4EKRBKQQRZYQTNZXKDQZ7BIBANCNFSM4IGKGJTQ
.

pohanchi on 22 Dec 2019

Thanks! So it feels like Elasticsearch 7.3+ with a bunch of dense_vectors from GloVe for all components of my objects (e.g. "khachapuri", "street food", "Georgian cuisine", "cheese", "bread") is the most proper data structure for my system (semantic search engine).

I don't even need MySQL (for text representation storage) + separate index based on e.g. Faiss (for vectors and index) and don't need to sync them. Everything can be inside Elasticsearch (however the query speed will be slower than Faiss but I can live with that for now).

@nreimers just want to clarify that I don't need to even look into BERT/USE etc direction, right?

realsergii on 23 Dec 2019

@realsergii
If your queries and documents are only words or short phrases, I think there is no benefit from using BERT / USE. Sentence embeddings can be helpful when you have more text (at least a sentence) and when words can be used ambiguously (like the word apple).

nreimers on 23 Dec 2019

👍1

@realsergii I mentioned it prior in this thread, but if you're using elastic search as your backend, check out NBoost, which acts as a proxy on top of ES and uses BERT to rerank the n top results.

We recently released TinyBERT distilled version of the base models which are about 10x faster (critical when it comes to search). See https://arxiv.org/abs/1909.10351 for distilling custom models by the same method.

pertschuk on 23 Dec 2019

👍2

@Raghavendra15 @nreimers Did you end up trialing it out with Faiss ? What were the results. ?
I have a similar use case where I have a domain dataset (about 100k english sentences) related to fires, I want to find synthetic multilingual sentences in different languages (arabic, italian, chinese etc.). My thought was to download the Wikipedia corpus (source) for each language and embed both wikipedia and my fire data and find synthetic sentences.

By following this example Semantic Similarity

I was able to download the multilingual trained models distiluse-base-multilingual-cased
Embed about 4.2 Million Arabic sentences (took about 7hrs on p2.xlarge instance, 85% GPU utilization) and 100k Fire Sentences (took couple of minutes)

This is where it hangs/very slow -

Running the similarity using cdist seems to run forever, had to cancel after running it for a day. I did not expect for it to take this long. Even though it was very straightforward. Figuring there should be a more optimized way of doing this.

Is there something wrong with the steps I have taken, appreciate any help.

Cheers !
Ayub

mohammedayub44 on 17 Feb 2020

Update: I ran it with faiss library using FlatIndex (as it give the most accurate results). On a p2.xlarge instance it was amazingly fast- building and searching took only 30 mins. I could not compare the results to scipy's cdist but for a sample of 10,000. I saw than >90% of the results lie in the top 5 matches found by faiss distance.

mohammedayub44 on 18 Feb 2020

@Raghavendra15 @mohammedayub44 FYI, two relevant papers that recently came out from Google and Microsoft:

Pre-training embeddings for large-scale retrieval https://arxiv.org/abs/2002.03932
TwinBERT: distillation for efficient retrieval https://arxiv.org/abs/2002.06275

peacej on 25 Feb 2020

👍4

How to train BERT on LinkedIn pages?

waayadi on 7 Mar 2020

Hi @wolf-tag
Personally I think tf-idf / BM25 is the best strategy for your task, due to various reasons.

First, it is important to differentiate between false positive and false negative rates:
False positive: A non-similar pair of docs is judged as similar, even they are not similar.
False negative: A similar pair of docs is judged as dissimilar.

TF-IDF/BM25 has a low false positive rate and a high false negative rate, i.e., if a pair is judged as similar, there is a high chance that they are actually similar.

Sentence embeddings methods (avg. GloVe embeddings, InferSent, USE, SBERT etc.) have a reverse characteristic: high false positive rates, low false negative rates. It seldom misses a similar pair, but a pair judged as similar must not necessarily be similar.

For Information Retrieval, you have an extrem imbalance. You have 1 search query and 100 Mio. documents, i.e., you perform 100 Mio pairwise comparisons.

Sentence embeddings with a high false positive rate will return many pairs where the embeddings think they are similar, but they are not. Your result set of 10 documents will be often completely garbage.

TF-IDF / BM25 might miss some relevant documents, but the 10 document you will find will be of high quality.

Second, in my experience, sentence embeddings methods work best for sentences. For (longer) documents, the results are often not that great. Here, word overlap (with tf-idf / BM25) is really hard to beat.

Third: In our experiments in Question Answering (given a question, find in Millions of answers on StackOverflow the correct one), TF-IDF / BM25 is extremely hard to beat. It often performs much better than sentence embeddings methods + it is much quicker.

So far our experiments with end-to-end representation learning for information retrieval rather failed.

What works quite good is a re-ranking approach: You use BM25 to retrieve the top 100 documents. Than, you take a neural approach like BERT to re-rank these 100 results and you present the top-10 results (the 10 results with the highest score according to the neural re-ranker) to the user. This often gives a nice boost to pure BM25 ranking, and the runtime is not too-bad, as you must only re-rank 100 documents.

Best regards
Nils Reimers

I am working on a use case where I need to get similar documents (2-3 pages average), when I upload a 1 page document. For me reducing false negatives is a priority, at the same time I don't want too many false positives. Can I first implement an embedding model to get let's say 200 similar documents and then apply TFIDF/BM25 to filter out irrelevant documents

ruthwik081 on 9 Mar 2020

I just recently started on NLP and "AI" and have been following this thread. Having a similar use case (less than 10k documents --> find similar documents and also do a multi-label classification) I am very interested in your opinion on BERT-AL:

https://openreview.net/pdf?id=SklnVAEFDB

woiza on 10 Mar 2020

@Raghavendra15 @nreimers Did you end up trialing it out with Faiss ? What were the results. ?
I have a similar use case where I have a domain dataset (about 100k english sentences) related to fires, I want to find synthetic multilingual sentences in different languages (arabic, italian, chinese etc.). My thought was to download the Wikipedia corpus (source) for each language and embed both wikipedia and my fire data and find synthetic sentences.

By following this example Semantic Similarity

I was able to download the multilingual trained models distiluse-base-multilingual-cased

Embed about 4.2 Million Arabic sentences (took about 7hrs on p2.xlarge instance, 85% GPU utilization) and 100k Fire Sentences (took couple of minutes)

This is where it hangs/very slow -

Running the similarity using cdist seems to run forever, had to cancel after running it for a day. I did not expect for it to take this long. Even though it was very straightforward. Figuring there should be a more optimized way of doing this.

Is there something wrong with the steps I have taken, appreciate any help.

Cheers !
Ayub

Did you consider using XLM-R for your multilingual approach? (Generates language independent embeddings for semantic similarity)

timpal0l on 16 Mar 2020

@timpal0l
I tested XLM-R for multilingual sentence embeddings.

If used out-of-the-box (without further fine-tuning), the results are really bad, far worse than mBERT (mBERT is also really bad without fine-tuning).

The vector spaces for XLM-R are not aligned across languages, i.e. the same sentence in two different languages are mapped to completely different points in vector space.

However, when fine-tuned, you can get quite nice results with XLM-R for cross-lingual tasks. Currently I prepare some code + paper + models, which will be release soon in the sentence-transformers repository.

Best
Nils Reimers

nreimers on 16 Mar 2020

@nreimers Thanks for you reply!

I see. I have a unlabelled corpus consisting of several languages that I wish to fine tune XLM-R (just update the language models weights to get more domain specific embeddings). Not a down stream task like classification.

I cant seem to find any example code of doing this, have you managed to do this with XLM-R using HuggingFace? Could you give me any pointers?

Cheers

timpal0l on 16 Mar 2020

Hi @timpal0l
I think this is the file you need
https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py

I haven't tested it by myself.

Best
Nils Reimers

nreimers on 17 Mar 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 16 May 2020

Hi @nreimers, I am using sentence transformers for finding stack overflow duplicate questions. I want to train the model from scratch but I am facing some issues. My training set contains questions and its duplicates only. Is it possible to train the model from this type of training data set.

sajit9285 on 16 May 2020

Hi @sajit9285
Just positive examples won't work. You somehow need to teach the network what is similar and what not.

But usually it is not an issue, as getting negative pairs is quite easy. The most simple strategy is just to sample two questions randomly. In 99,9999% of the cases they are non-duplicate and get the negative label.

A better strategy is to use hard negatives, as with the random strategy, your negatives are too easy to spot. One better way would to sample another random question with the same stackoverflow tag and treat it as negative. Or to find a similar question with ElasticSearch BM25 and to assume that it is a negative example

nreimers on 17 May 2020

👍2

@nreimers Thanks for your reply. I will try your stated methods.
I used word2vec averaging method and sentence transformers with pretrained model('bert-base-nli-mean-tokens') for ranking similar questions, and I found word2vec averaging method (for sentence embeddings) performed better. May be the data has lots of tech terms!
That's why I am thinking of training the model from scratch.

sajit9285 on 17 May 2020

@sajit9285 Yes, the NLI data sadly does not contain any computer science / programming specific examples, so it does not learn these terms. word2vec is trained on a much wider range of topics, so it has an understanding of programming terms.

nreimers on 17 May 2020

@nreimers So will it work if I trained it from scratch as stated in the that github repo?

sajit9285 on 17 May 2020

@sajit9285 As always, it depends on the quality of your training data. But I saw quite some good improvements for domain specific terms / sentences if you train it on appropriate training data

nreimers on 18 May 2020

👍1

@sajit9285 Is it not better to use the existing weights as a base, rather than train something from scratch?

timpal0l on 18 May 2020

@nreimers I will give a try. Thanks :)

sajit9285 on 18 May 2020

@timpal0l Yeah ofcourse, anytime they are better than random weights.

sajit9285 on 18 May 2020

@nreimers You are a beast! A lot of questions I had were addressed on here!

cabal-chan on 20 May 2020

@nreimers I have tried a lot replacing AllNLI files with my own dataset files in the same format. I have also changed labels (inside nliReader class' member function named get_labels) from 3 labels(contradiction,neutral,entailment) to two labels (true,false) for my task. But it is still printing those three labels and unable to detect my dataset. I tried a lot but now need your help now. I task that I trying to perform is fine tuning on bert which takes paired para/sentences as input.

SageAgastya on 18 Jun 2020

Hello @nreimers. I run all models available in https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/

only using

model = SentenceTransformer(model_name)

The model that gave the best results was: distiluse-base-multilingual-cased
The results are very similar than USE (Universal Sentence Enconder).

My questions are:

How I can improve the results of distiluse-base-multilingual-cased without clean all weird text cases that I have in dataset?
Should I need explore fine-tune parameters? How I can do that?
Should I need add more layers after the end layer? What are your suggestions?
There are any way to use GloVe pre-trained model with SBERT? If yes, how I can do that?

I want to understand how I can navigate in your beautiful SBERT to do a little/easy modifications that can bring me better results.

Thank you for your help.

njsdias on 18 Jun 2020

Hi,
BERT out-of-the-box is not the best option for this task, as the run-time in your setup scales with the number of sentences in your corpus. I.e., if you have 10,000 sentences/articles in your corpus, you need to classify 10k pairs with BERT, which is rather slow.

A better option is to generate sentence embeddings: Every sentence / article is mapped to a fixed sized vector. You need to map your 3k articles only once to a vector.

A new query is then also mapped to a vector. In this setup, you only need to run BERT for one sentence (at inference), independent how large your corpus is.

Then, you can use cosine-similiarity, or manhatten / euclidean distance to find sentence embeddings that are closest = that are the most similar.

I released today a framework which uses pytorch-transformers for exactly that purpose:
https://github.com/UKPLab/sentence-transformers

I also uploaded an example for semantic search, where each sentence in a corpus is mapped to a vector and than cosine-similarity is used to find the most similar sentences / vectors:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py

Let me know if you have further questions.

Hi there! Phenomenal work! I just had one question, how do transformer encodings (say BERT) compare against encodings from models like Google's Universal Sentence Encoder on a textual semantic similarity task?

algoromeo on 26 Jun 2020

Hi @algoromeo
Universal Sentence Encoder (USE) spans several different architectures. The USE large is based on transformer networks like BERT, i.e., the architectures are quite comparable. A big advantage of BERT is the language model pre-training, which induces a lot of information about language in the model. This pre-training is missing in USE.
USE also has CNN networks, which are faster and runtime scales better with the input length. But their performance is usually worse than the transformer based architectures. So you trade speed for lower accurarcy.

nreimers on 27 Jun 2020

❤1

Hi @algoromeo
Universal Sentence Encoder (USE) spans several different architectures. The USE large is based on transformer networks like BERT, i.e., the architectures are quite comparable. A big advantage of BERT is the language model pre-training, which induces a lot of information about language in the model. This pre-training is missing in USE.
USE also has CNN networks, which are faster and runtime scales better with the input length. But their performance is usually worse than the transformer based architectures. So you trade speed for lower accurarcy.

Thank you for your timely and apt reply! Gave me the much needed clarity! Cheers!

algoromeo on 28 Jun 2020

Hi @nreimers , thank you for your detailed explanation on many issues around sentence-bert and semantic textual similarity search. I am currently working on a social science project in which I am trying to measure the "cultural distinctiveness" (basically whether people are different from each other when they comment) of Reddit users based on their comments in certain posts.

I am thinking of treating all comments of each user as a document. Hopefully, I could obtain document embeddings using sentence transformers. Alternatively, I could use GloVe or Latent Semantic Analysis as embeddings of the document. After that, I am also hoping to compare each individual with the collectives he/she belongs to. So comparing text generated one user against text generated by a group of pre-defined people (and do that iteratively for every user in the dataset). Do you think sentence BERT is a suitable method to embed documents? Could you recommend any work related to the thing I am trying to do, please? Thank you!

SamALIENWARE on 4 Jul 2020

Hi @SamALIENWARE
I am afraid that Sentence-BERT is not suitable for that.

BERT (&Co.) have a quadratic runtime and quadratic memory requirement with the text length. I.e., for long documents you would need extremely large memory and have an extremely long runtime. This is why BERT & Co. limit the length for the input document to 512 word pieces, which are about 300 words.

For your purpose I would use avg. GloVe embeddings (which are already implemented in the sentence-transformers project) or LSA/LDA (e.g. from Gensim).

Best
Nils Reimers

nreimers on 6 Jul 2020

Hi @SamALIENWARE
I am afraid that Sentence-BERT is not suitable for that.

BERT (&Co.) have a quadratic runtime and quadratic memory requirement with the text length. I.e., for long documents you would need extremely large memory and have an extremely long runtime. This is why BERT & Co. limit the length for the input document to 512 word pieces, which are about 300 words.

For your purpose I would use avg. GloVe embeddings (which are already implemented in the sentence-transformers project) or LSA/LDA (e.g. from Gensim).

Best
Nils Reimers

Thanks a million, @nreimers ! I will definitely try your suggestions out.

I have tried distilled sentence BERT out yesterday. Perhaps its because there aren't that many data (19000+ users in my dataset), the "sentence" embeddings were acquired in a relatively short time period. Then I used k-means clustering on the embeddings and calculated the sum of the distance of each vector to the centroids of the clusters. I am thinking that the larger the sum, the more "distinct" the user's content is since it's semantically far from everyone else's.

So after I got embeddings using GloVe or LSA/LDA, do you think the euclidean distance to k-means centroids is a good representation of semantic textual similarity in a non-pairwise situation (1 vs. many)? Or is it better to stick to cosine similarity (calculate pairwise cosine similarity and then average), as the embedding models are trained using this metric?

Thank you again for your valuable time. I do appreciate it. Have a nice day!

SamALIENWARE on 6 Jul 2020

@nreimers , I have tried a lot replacing AllNLI files with my own dataset files in the same format. I have also changed labels (inside nliReader class' member function named get_labels) from 3 labels(contradiction,neutral,entailment) to two labels (true,false) for my task. But it is still printing those three labels and unable to detect my dataset. I tried a lot but now need your help now. I task that I trying to perform is fine tuning on bert which takes paired para/sentences as input.

SageAgastya on 6 Jul 2020

@nreimers Brillant Work!!! I just wanted to understand when we are doing evaluation we are using STS Bench Mark but when we have domain-specific data do we still need to STS or we can split our data into test and train and evaluate.

saurabhsaxena86 on 19 Jul 2020

Hi @saurabhsaxena86
No, in that case you don't need STS. If your domain specific data is suitable, you can of course train on that.

nreimers on 20 Jul 2020

Hey @nreimers, I am bit confused on how to go about training the model from scratch on my dataset. Is there some resource which I can refer to. I am having hard time figuring out how to create the dataloader and train the model on specific data.

Shubhamsaboo on 4 Aug 2020

Hi @saurabhsaxena86 , can you please share the code on how you have trained the model on your domain specific data?
That would be of great help!

Shubhamsaboo on 4 Aug 2020

Hi @Shubhamsaboo
Currently only the these scripts with training examples exists:
https://github.com/UKPLab/sentence-transformers/tree/master/examples/training_transformers

More examples for training will be pushed soon. Further, I currently work on a more extensive documentation.

nreimers on 5 Aug 2020

HI @nreimers , how can i find average feature vector from the embeddings given by sentence bert for large sentence similarity comparison ?

ankitkr3 on 10 Aug 2020

Hi @ankitkr3
Not sure what you mean. But you can use:
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/util.py#L12

With quite large sets of sentences to compute the cosine sim between all of them.

This example might also be relevant:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_quora_duplicate_questions/application_duplicate_questions_mining.py

nreimers on 10 Aug 2020

hi @nreimers I want to calculate the similarity between two large/medium paragraphs. How can i achieve that with the current available models?

ankitkr3 on 10 Aug 2020

Hi @ankitkr3
Currently the models are trained and optimized for sentence length inputs. You can also input longer inputs, up to 510 word pieces (which is the limit for BERT).

One way to compare paragraphs and to get a similarity score would be to use Sentence Mover Distance:
https://www.aclweb.org/anthology/P19-1264v2.pdf

Code available here:
https://github.com/eaclark07/sms

I did not use this code / approach, but I heard that it can produce quite good results when you compare paragraphs with each other.

nreimers on 10 Aug 2020

@nreimers are you planning to make it available for large sentences or paragraphs ?

ankitkr3 on 10 Aug 2020

This thread help-me a lot, thanks guys! A question: I have more or less 7000 documents with 300 words each in average and some text from users. My idea is use the text from users as queries to "retrieval" or "ranking" these documents,but I dont know which is the best strategy, STS task or Learn to Rank? Any ideas are welcome.

finardi on 26 Aug 2020

Hi @finardi
Do you have training data in the form (user-query, relevant_doc)? If yes, you can use the MultipleNegativesRankingLoss, which is a learn to rank loss function: https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss

nreimers on 27 Aug 2020

👍1

Hi @nreimers,

For a project I need a good document embeddings for news articles example from cnn.com, what if we use Bert for each paragraph of the article and then average all the embedding vectors, do you think that naive approach could lead to good performance? I am trying to search for similar news covering the same event from different news outlets. Example scenario could be that the company X goes bankrupt, cnn.com will cover the event, CNBC will cover the event, fox.com, and so on, so if the user selects one article the system will show other articles covering the same event, I will pre-filter the news from too far in the future and too far in the past because probably those news are not covering the same event. I have all the news in a database.

Do you think with a simple average word embedding offer good performance in this application? and I forget using Bert? for my application I want the best accuracy I can use GPUs for inference if that is required.

bushjavier on 27 Aug 2020

Hi @bushjavier
I am not sure if BERT / SBERT will work that well for your task.

For documents, the best approach is usually to use TF-IDF / BM25. Often, these documents on the same events have so many word overlaps, that it is quite easy to identify similar documents.

Embedding approaches are more suitable if you want to compare sentences. There, the word overlap can be quite small where TD-IDF / BM25 fails. However, when you average them for larger text, it is quite unclear what the averaged embedding will look like.

Further, you have issues with docs with different lengths.

Assume you have doc A, reporting about event X.
Then you have doc B, reporting about event X but then also providing background information or reporting about other events. With BM25, this is no issue, it detects that the information of A is include in doc B. With averaged embeddings, the embeddings for doc A and B can be quite different.

nreimers on 28 Aug 2020

@nreimers thanks for the crystal clear explanation! I will proceed with TF-IDF / BM25.

bushjavier on 28 Aug 2020

HI @nreimers, i've used bert embeddings with success to perform a smart search on a huge software manual. It works pretty well.
I use 'SentenceTransformer('distiluse-base-multilingual-cased')'. In the corpus i set all the paragraph of the manual and i compare the user query with cosine similarity.
I get great results when the query is a sentence of some words: i.e. "what are supperted servers", "how to upgrade the program", etc. But when the query is a single word, or a single word mispelled or composed by random char (i.e. kjashdjkah) i get false positive often on the first corpus item.
I've made another project in which this issue is more evident: i made a document classificator using an OCR to scan the image and then comparse the results with some sentences. In this case some words read by the OCR are non sense, becouse not all the words are correctly found, and my results a set of words with some strange char. Like i said before in these cases i get false positive, often on the the first corpus item.
Is ther a way to avoid that?

kind regards
Gianluca

Calabrone on 31 Aug 2020

HI @nreimers, i've used bert embeddings with success to perform a smart search on a huge software manual. It works pretty well.
I use 'SentenceTransformer('distiluse-base-multilingual-cased')'. In the corpus i set all the paragraph of the manual and i compare the user query with cosine similarity.
I get great results when the query is a sentence of some words: i.e. "what are supperted servers", "how to upgrade the program", etc. But when the query is a single word, or a single word mispelled or composed by random char (i.e. kjashdjkah) i get false positive often on the first corpus item.
I've made another project in which this issue is more evident: i made a document classificator using an OCR to scan the image and then comparse the results with some sentences. In this case some words read by the OCR are non sense, becouse not all the words are correctly found, and my results a set of words with some strange char. Like i said before in these cases i get false positive, often on the the first corpus item.
Is ther a way to avoid that?

kind regards
Gianluca

Perhaps use word embeddings for single word documents instead of contextualized embeddings such as Bert. And for misspelled words - can't you perform a dictionary checkup to see if the word exists in a vocabulary or not?

timpal0l on 31 Aug 2020

can't you perform a dictionary checkup to see if the word exists in a vocabulary or not?

I can't becouse some queries may contains acronyms or other technician abbreviation (i.e. F24) that aren't present in a vocabulary.
Now i'm trying the way suggested from @nreimers some post above: using bm24 and then sentence embeddings to update the score.

Calabrone on 31 Aug 2020

can't you perform a dictionary checkup to see if the word exists in a vocabulary or not?

I can't becouse some queries may contains acronyms or other technician abbreviation (i.e. F24) that aren't present in a vocabulary.

Cant you build your own word vector model on your domain specific data to learn acronyms and common terms that are not present in a vocabulary. And then allow for the x nearest neighbours?

timpal0l on 31 Aug 2020

Man! this Github far more insightful than some of those NLP blogs out there!!! So thanks for the QNA's.
Here articles and sentences are used for semantic search but what about other datasets like well any DB(SQL or JSON)?

Well for an intuitive example suppose I have an object {Product: Cool Refrigerator, Price: 5000} and if i type a query like "refrigerators under 5000" the result should be Result: Cool Refrigerators there are models out there with similar solutions... but was curios if anyone had good references and solutions.

THanks

shashankMadan-designEsthetics on 28 Sep 2020

👍1

This Discussion is GOLD!!!
@nreimers - I salute your patience and knowledge-graphs embedded in your brains :). You answered almost every question with details. I accidentally bumped on this trying to read through issues.

I usually use Jina, Nboost and Sentence-Transformer depending on the problem statement along with transformers. I learnt a lot from this discussion. Thanks again to everyone who contributed. This discussion should be part of the Huggingface newsletter.

ssameerr on 9 Oct 2020

Hi, I m trying to merge the output of LDA with BERT.
@nreimers Could you please route me with right steps on how to merge the output of LDA vectors with BERT for topic modelling task. Which BERT pre-trained model will be best suited for this task. Sentence-transformer or any ?

sarojadevi on 4 Nov 2020

@sarojadevi Never worked with LDA and BERT. Can't help here, sorry.

nreimers on 4 Nov 2020

@nreimers Hi, I have used sentence-transformer to finetune on my dateset which has Anchor,Positive,Negative.
I use tripletloss as loss function, but after training, the word embeddings are very similar, so they cosine-similar are all close to 1.
Could tell me how to solve this problem?
By the way, my goal is to find the sentences closest to the input sentence, and all the sentence is about 400 words. Do you have any better suggestions？

zitaozz on 10 Nov 2020

Hi @zitaozz
You can try this loss, where you only need anchor and positives (if suitable):
https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss

Do you mean word embeddings or sentence embeddings?

nreimers on 10 Nov 2020

@nreimers Thanks for you replying! But my dataset seems not suitable for this loss because some positive sentences are similar.
Yes, I mean sentence embeddings. I don' t know why they are very similar after training.

zitaozz on 10 Nov 2020

Hi @nreimers
Sorry to barge in on this issue, but i had a query.
I used the scibert model from the hugging face repository and finetuned it on the SNLI and MNLI datasets using the sentence transformer repository. So I was trying to perform a similarity match between a query sentence and a larger text (title and abstract from arxiv papers). However, I wasn't able to get really good performance (as in the matched samples didn't look too good).

I understand that bert is trained on single sentences and not on multiple. But do you have any recommendations as to how I can improve. Thanks :)

MukundVarmaT on 11 Nov 2020

Hi @MukundVarmaT
As often, it depends on the training data. SNLI and MNLI are not really good training sets for this.

Have a look at this paper:
https://arxiv.org/abs/2004.07180

nreimers on 11 Nov 2020

👍1

Transformers: How to use BERT for finding similar sentences or similar news?

Most helpful comment

All 124 comments

Related issues