Provided I have a query and a pair of passages, is it possible to rank the passages according to which one best matches the query? Can I tweak this for that purpose? Thoughts?
Hi @shaheenkdr - you can use Flair to embed your text and get a vector for your query as well as vectors for each text passage. You need to choose a combination of word embeddings you wish to use and the type of document embedding. If you don't have training data, you can only use DocumentPoolEmbeddings, which are essentially a bag-of-embeddings approach.
Here's an example script:
import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings, WordEmbeddings
# first, declare how you want to embed
embeddings = DocumentPoolEmbeddings(
[WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')])
# your query
query = Sentence('I love Berlin')
# some texts
paragraph_1 = Sentence('Paris is an interesting city')
paragraph_2 = Sentence('The computer is new')
# embed everything
embeddings.embed([query, paragraph_1, paragraph_2])
# use cosine similarity
cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
# get similarity between embeddings of query and paragraph 1
similarity_to_paragraph_1 = cos(query.embedding, paragraph_1.embedding)
print(similarity_to_paragraph_1)
similarity_to_paragraph_2 = cos(query.embedding, paragraph_2.embedding)
print(similarity_to_paragraph_2)
This finds that paragraph 1 is more similar to the query.
If you want to use the more powerful DocumentLSTMEmbeddings you need to train them for this task using labeled data. But they would probably work much better than simple DocumentPoolEmbeddings.
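For reference, training such a model with Flair's text classifier looks roughly like this. A minimal sketch, not the exact API of every release: the data/ path and parameters are placeholders, and class names such as ClassificationCorpus have moved between Flair versions:
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
# load a corpus in FastText format (train.txt / dev.txt / test.txt under data/)
corpus = ClassificationCorpus('data/')
# trainable LSTM document embeddings on top of word embeddings
document_embeddings = DocumentLSTMEmbeddings([WordEmbeddings('glove')])
# the classifier learns the LSTM weights from your labels
classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary(),
                            multi_label=False)
# training tunes the document embeddings for your task as a side effect
trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/classifier', max_epochs=10)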
@alanakbik Do you have any links that explain how to train and test with labelled datasets and then predict on an unlabelled dataset?
Not sure if I understand the question. But generally, all Flair embeddings are PyTorch vectors, so you can directly use them in your code like any other PyTorch tensors and build PyTorch models on top, train, predict, etc.
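As a minimal sketch of what that means in practice (the linear layer here is made up purely for illustration):
import torch
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')])
sentences = [Sentence('I love Berlin'), Sentence('The computer is new')]
embeddings.embed(sentences)
# stack the per-sentence vectors into an ordinary PyTorch batch tensor
batch = torch.stack([s.embedding for s in sentences])  # shape: (2, embedding_dim)
# build any PyTorch module on top, e.g. a toy 2-class linear classifier
model = torch.nn.Linear(batch.size(1), 2).to(batch.device)
logits = model(batch)
print(logits.shape)  # torch.Size([2, 2])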
@alanakbik Can I use BERT for this case ? If so, does it improve accuracy?
Yes you can :) Whether it improves accuracy depends on your use case / architecture / data etc, but it is definitely something to try. We'd be interested to hear about your results.
Can you provide your email ID, in case any further doubts arise? And I'll close the issue :)
@alanakbik - When I try your code exactly as you have it, I get a dimensionality error (see below).
But if I change dim=1 to dim=0 it works fine. I am not sure why that works, but it does.
RuntimeError Traceback (most recent call last)
22
23 # get similarity between embeddings of query and paragraph 1
---> 24 similarity_to_paragraph_1 = cos(query.embedding, paragraph_1.embedding)
25 print(similarity_to_paragraph_1)
26
/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
/usr/local/lib/python3.6/site-packages/torch/nn/modules/distance.py in forward(self, x1, x2)
78 @weak_script_method
79 def forward(self, x1, x2):
---> 80 return F.cosine_similarity(x1, x2, self.dim, self.eps)
RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
Ah yes, there was a small change in the way we embed documents a while back: sentence.embedding is now a 1-D vector, so the cosine similarity must be computed along dim=0. Now you need the following script:
import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings, WordEmbeddings
# first, declare how you want to embed
embeddings = DocumentPoolEmbeddings(
[WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')])
# your query
query = Sentence('I love Berlin')
# some texts
paragraph_1 = Sentence('Paris is an interesting city')
paragraph_2 = Sentence('The computer is new')
# embed everything
embeddings.embed([query, paragraph_1, paragraph_2])
# use cosine similarity
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
# get similarity between embeddings of query and paragraph 1
similarity_to_paragraph_1 = cos(query.embedding, paragraph_1.embedding)
print(similarity_to_paragraph_1)
similarity_to_paragraph_2 = cos(query.embedding, paragraph_2.embedding)
print(similarity_to_paragraph_2)
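To rank more than two passages, compute the similarity for each one and sort. Continuing the script above (embeddings, cos and query are already defined; the passages are just example data):
passages = [Sentence('Paris is an interesting city'),
            Sentence('The computer is new'),
            Sentence('Berlin has great museums')]
embeddings.embed(passages)
# sort passages by descending cosine similarity to the query
ranked = sorted(passages,
                key=lambda p: cos(query.embedding, p.embedding).item(),
                reverse=True)
for passage in ranked:
    print(passage.to_plain_string(), cos(query.embedding, passage.embedding).item())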
Hi,
Thank you for the excellent tutorials!! And Flair is awesome!!
If you want to use the more powerful DocumentLSTMEmbeddings you need to train them for this task using labeled data. But they would probably work much better than simple DocumentPoolEmbeddings.
I trained a DocumentLSTMEmbeddings model with my labeled data and it is working very well!
But I still have a question.
I have several classes as output, so the document can be classified as "Class 1", "Class 2", ... "Class 50"... "Class 100", etc.
The "print(sentence.labels)" gives me the top rated class and its probability. Is there a way to check the probability of all the other Classes as well? The reason that I am asking is that I need to provide the "Top 5" classes as the result for a sentence.
Thank you!
Hello @dsmaciel, yes this is possible. Are you training a multi-class or a single-class classifier?
If it is multi-class, you can lower the classification threshold below 0.5 to get more predictions:
from flair.data import Sentence
from flair.models import TextClassifier
# load the model (set the path to your trained model)
classifier = TextClassifier.load("path/to/your/model.pt")
# set a lower classification threshold if you want
classifier.multi_label_threshold = 0.1
# make a document
review = Sentence("I love Berlin.", use_tokenizer=True)
# classify
classifier.predict(review)
# iterate over predicted labels and print them
for label in review.labels:
print(label)
If it is single-class, you can set multi_class_prob=True in the predict method to get the distribution over all predictions, like this:
# make a document
review = Sentence("I love Berlin.", use_tokenizer=True)
# classify
classifier.predict(review, multi_class_prob=True)
# iterate over predicted labels and print them
for label in review.labels:
print(label)
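To then get exactly the "Top 5", you can sort the returned labels by their confidence score (Flair's Label objects carry value and score attributes):
# keep the five highest-scoring labels
top_5 = sorted(review.labels, key=lambda label: label.score, reverse=True)[:5]
for label in top_5:
    print(label.value, label.score)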
Hope this helps!
Yes!!! That helps a lot!! Thank you very much!!
@alanakbik useful answers!
Thank you for a great post! Is there an example of doing this with BERT and DocumentPoolEmbeddings?
Aah - found a BERT embedding example here: https://towardsdatascience.com/covid-19-with-a-flair-2802a9f4c90f if someone is interested. Thanks!
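For completeness, a minimal sketch of plugging BERT into DocumentPoolEmbeddings (BertEmbeddings was the class name in older Flair releases; newer versions use TransformerWordEmbeddings instead):
from flair.data import Sentence
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings
# pool BERT's contextual subword embeddings into one document vector
embeddings = DocumentPoolEmbeddings([BertEmbeddings('bert-base-uncased')])
sentence = Sentence('I love Berlin')
embeddings.embed([sentence])
print(sentence.embedding.shape)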