Provided I have a query and a pair of passages, is it possible to rank the passages according to which one best matches the query? Can I tweak this for that purpose? Thoughts?
Hi @shaheenkdr - you can use Flair to embed your text and get a vector for your query as well as vectors for each text passage. You need to choose a combination of word embeddings you wish to use and the type of document embedding. If you don't have training data, you can only use DocumentPoolEmbeddings, which are essentially a bag-of-embeddings approach.
Here's an example script:
import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings, WordEmbeddings
# first, declare how you want to embed
embeddings = DocumentPoolEmbeddings(
[WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')])
# your query
query = Sentence('I love Berlin')
# some texts
paragraph_1 = Sentence('Paris is an interesting city')
paragraph_2 = Sentence('The computer is new')
# embed everything
embeddings.embed([query, paragraph_1, paragraph_2])
# use cosine similarity
cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
# get similarity between embeddings of query and paragraph 1
similarity_to_paragraph_1 = cos(query.embedding, paragraph_1.embedding)
print(similarity_to_paragraph_1)
similarity_to_paragraph_2 = cos(query.embedding, paragraph_2.embedding)
print(similarity_to_paragraph_2)
This finds that paragraph 1 is more similar to the query.
If you want to use the more powerful DocumentLSTMEmbeddings you need to train them for this task using labeled data. But they would probably work much better than simple DocumentPoolEmbeddings.
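For reference, training such a model with Flair's text classifier looks roughly like this. A minimal sketch, not the exact API of every release: the data/ path and parameters are placeholders, and class names such as ClassificationCorpus have moved between Flair versions:
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
# load a corpus in FastText format (train.txt / dev.txt / test.txt under data/)
corpus = ClassificationCorpus('data/')
# trainable LSTM document embeddings on top of word embeddings
document_embeddings = DocumentLSTMEmbeddings([WordEmbeddings('glove')])
# the classifier learns the LSTM weights from your labels
classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary(),
                            multi_label=False)
# training tunes the document embeddings for your task as a side effect
trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/classifier', max_epochs=10)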
@alanakbik Do you have any links that explain how to train and test with labelled datasets and then predict on an unlabelled dataset?
Not sure if I understand the question. But generally, all Flair embeddings are PyTorch vectors, so you can directly use them in your code like any other PyTorch tensors and build PyTorch models on top, train, predict, etc.
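As a minimal sketch of what that means in practice (the linear layer here is made up purely for illustration):
import torch
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')])
sentences = [Sentence('I love Berlin'), Sentence('The computer is new')]
embeddings.embed(sentences)
# stack the per-sentence vectors into an ordinary PyTorch batch tensor
batch = torch.stack([s.embedding for s in sentences])  # shape: (2, embedding_dim)
# build any PyTorch module on top, e.g. a toy 2-class linear classifier
model = torch.nn.Linear(batch.size(1), 2).to(batch.device)
logits = model(batch)
print(logits.shape)  # torch.Size([2, 2])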
@alanakbik Can I use BERT for this case ? If so, does it improve accuracy?
Yes you can :) Whether it improves accuracy depends on your use case / architecture / data etc, but it is definitely something to try. We'd be interested to hear about your results.
Can you provide your email ID, in case any further doubts arise? And I'll close the issue :)
@alanakbik - When I try your code exactly as you have it, I get a dimensionality error (see below).
But if I change dim=1 to dim=0 it works fine. I am not sure why that works, but it does.
RuntimeError Traceback (most recent call last)
22
23 # get similarity between embeddings of query and paragraph 1
---> 24 similarity_to_paragraph_1 = cos(query.embedding, paragraph_1.embedding)
25 print(similarity_to_paragraph_1)
26
/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
/usr/local/lib/python3.6/site-packages/torch/nn/modules/distance.py in forward(self, x1, x2)
78 @weak_script_method
79 def forward(self, x1, x2):
---> 80 return F.cosine_similarity(x1, x2, self.dim, self.eps)
RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
Ah yes, there was a small change in the way we embed documents a while back: sentence.embedding is now a 1-D vector, so the cosine similarity must be computed along dim=0. Now you need the following script:
import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings, WordEmbeddings
# first, declare how you want to embed
embeddings = DocumentPoolEmbeddings(
[WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')])
# your query
query = Sentence('I love Berlin')
# some texts
paragraph_1 = Sentence('Paris is an interesting city')
paragraph_2 = Sentence('The computer is new')
# embed everything
embeddings.embed([query, paragraph_1, paragraph_2])
# use cosine similarity
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
# get similarity between embeddings of query and paragraph 1
similarity_to_paragraph_1 = cos(query.embedding, paragraph_1.embedding)
print(similarity_to_paragraph_1)
similarity_to_paragraph_2 = cos(query.embedding, paragraph_2.embedding)
print(similarity_to_paragraph_2)
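To rank more than two passages, compute the similarity for each one and sort. Continuing the script above (embeddings, cos and query are already defined; the passages are just example data):
passages = [Sentence('Paris is an interesting city'),
            Sentence('The computer is new'),
            Sentence('Berlin has great museums')]
embeddings.embed(passages)
# sort passages by descending cosine similarity to the query
ranked = sorted(passages,
                key=lambda p: cos(query.embedding, p.embedding).item(),
                reverse=True)
for passage in ranked:
    print(passage.to_plain_string(), cos(query.embedding, passage.embedding).item())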
Hi,
Thank you for the excellent tutorials!! And Flair is awesome!!
If you want to use the more powerful DocumentLSTMEmbeddings you need to train them for this task using labeled data. But they would probably work much better than simple DocumentPoolEmbeddings.
I trained a DocumentLSTMEmbeddings model with my labeled data and it is working very well!
But I still have a question.
I have several classes as output, so the document can be classified as "Class 1", "Class 2", ... "Class 50"... "Class 100", etc.
The "print(sentence.labels)" gives me the top rated class and its probability. Is there a way to check the probability of all the other Classes as well? The reason that I am asking is that I need to provide the "Top 5" classes as the result for a sentence.
Thank you!
Hello @dsmaciel, yes this is possible. Are you training a multi-class or a single-class classifier?
If it is multi-class, you can lower the classification threshold below 0.5 to get more predictions:
from flair.data import Sentence
from flair.models import TextClassifier
# load the model (set the path to your trained model)
classifier = TextClassifier.load("path/to/your/model.pt")
# set a lower classification threshold if you want
classifier.multi_label_threshold = 0.1
# make a document
review = Sentence("I love Berlin.", use_tokenizer=True)
# classify
classifier.predict(review)
# iterate over predicted labels and print them
for label in review.labels:
print(label)
If it is single-class, you can set multi_class_prob=True in the predict method to get the distribution over all predictions, like this:
# make a document
review = Sentence("I love Berlin.", use_tokenizer=True)
# classify
classifier.predict(review, multi_class_prob=True)
# iterate over predicted labels and print them
for label in review.labels:
print(label)
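To then get exactly the "Top 5", you can sort the returned labels by their confidence score (Flair's Label objects carry value and score attributes):
# keep the five highest-scoring labels
top_5 = sorted(review.labels, key=lambda label: label.score, reverse=True)[:5]
for label in top_5:
    print(label.value, label.score)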
Hope this helps!
Yes!!! That helps a lot!! Thank you very much!!
@alanakbik useful answers!
Thank you for a great post! Is there an example of doing this with BERT and DocumentPoolEmbeddings?
Aah - found a BERT embedding example here: https://towardsdatascience.com/covid-19-with-a-flair-2802a9f4c90f if someone is interested. Thanks!
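For completeness, a minimal sketch of plugging BERT into DocumentPoolEmbeddings (BertEmbeddings was the class name in older Flair releases; newer versions use TransformerWordEmbeddings instead):
from flair.data import Sentence
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings
# pool BERT's contextual subword embeddings into one document vector
embeddings = DocumentPoolEmbeddings([BertEmbeddings('bert-base-uncased')])
sentence = Sentence('I love Berlin')
embeddings.embed([sentence])
print(sentence.embedding.shape)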