Hello,
I am working on a thesis about sentence embeddings and I would like to ask how exactly I could reproduce the reported results of InferSent on the STS Benchmark dataset (75.8 Pearson coef.). I simply downloaded the test set, encoded each pair of sentences (lower-casing them) and computed the cosine similarity between them, but I only got to a Pearson coef. of 71.0.
I know that you write in the description that I can use the SentEval project to reproduce the results, but it computes loads of other tasks that I am not interested in. I also looked at the SentEval code and I am a bit confused about how exactly it preprocesses the data and computes the scores. From what I understood, it uses the STS dataset from 2014 (I tried that one too, but only got a Pearson coef. of 65.8). I would much rather reproduce the result myself.
So the question is: what is the easiest way to reproduce the results on the STS Benchmark without using the (too complicated for me) SentEval module? Should I preprocess the sentences somehow?
Thanks in advance.
I did use tokenize=True and I also lower-cased the text. Here is my code:
import torch
from pandas import read_csv
from csv import QUOTE_NONE
from numpy import diag
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

# Load the STS Benchmark test split: column 4 is the gold score,
# columns 5 and 6 are the two sentences of each pair.
data = read_csv('data/stsbenchmark/sts-test.csv', sep='\t', usecols=[4, 5, 6], header=None, quoting=QUOTE_NONE)
target_score = data[4].tolist()
sentences1 = [x.strip().lower() for x in data[5].tolist()]
sentences2 = [x.strip().lower() for x in data[6].tolist()]

# Load the pretrained InferSent model and point it at the GloVe vectors.
model = torch.load('data/infersent.allnli.pickle', map_location=lambda storage, loc: storage)
model.set_glove_path('data/glove.840B.300d.txt')
model.build_vocab(sentences1 + sentences2, tokenize=True)

# Encode both sides of each pair and take the cosine similarity of each pair
# (the diagonal of the full pairwise similarity matrix).
vectors1 = model.encode(sentences1, tokenize=True)
vectors2 = model.encode(sentences2, tokenize=True)
score = diag(cosine_similarity(vectors1, vectors2))
print('Pearson coef.:', pearsonr(score, target_score)[0])
From this I get a Pearson coef. of 0.71.
Hi, thanks for your interest in InferSent.
Sorry, I read your post too fast. The problem is not the preprocessing. For preprocessing STSBenchmark, we just use tokenize=True and lower-casing.
However, for STSBenchmark, the results we report are not obtained by just computing the cosine similarity between sentence embeddings on the test set. We use the same approach as for SICK Relatedness (please refer to our paper for more details on the technique used).
We use the training set to learn a projection over the sentence embeddings.
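To make this concrete, here is a rough sketch of such a "train a regressor on the train split, evaluate Pearson on the test split" setup, continuing from the snippet above (it reuses the loaded InferSent model). It uses the |u - v| and u * v pair features from the paper, but swaps in a plain ridge regression as a simplified stand-in for the actual projection/classifier, so the load_split helper, the Ridge regressor and the file names are assumptions for illustration only and the resulting number will not exactly match the paper:

# Sketch (not the exact method): learn a regression from pair features to the
# gold score on the train split, then evaluate Pearson on the test split.
# 'model' is the InferSent model loaded as in the snippet above.
import numpy as np
from pandas import read_csv
from csv import QUOTE_NONE
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def load_split(path):
    # Hypothetical helper: same columns as in the snippet above.
    data = read_csv(path, sep='\t', usecols=[4, 5, 6], header=None, quoting=QUOTE_NONE)
    scores = data[4].tolist()
    s1 = [x.strip().lower() for x in data[5].tolist()]
    s2 = [x.strip().lower() for x in data[6].tolist()]
    return scores, s1, s2

train_y, train_s1, train_s2 = load_split('data/stsbenchmark/sts-train.csv')
test_y, test_s1, test_s2 = load_split('data/stsbenchmark/sts-test.csv')

# Build the vocabulary over everything the model will have to encode.
model.build_vocab(train_s1 + train_s2 + test_s1 + test_s2, tokenize=True)

def pair_features(s1, s2):
    u = model.encode(s1, tokenize=True)
    v = model.encode(s2, tokenize=True)
    # Pair features from the paper: element-wise |u - v| and u * v.
    return np.c_[np.abs(u - v), u * v]

reg = Ridge(alpha=1.0)  # simplified stand-in for the learned projection
reg.fit(pair_features(train_s1, train_s2), train_y)
pred = reg.predict(pair_features(test_s1, test_s2))
print('Pearson coef.:', pearsonr(pred, test_y)[0])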
You can still reproduce our results using SentEval though. It should be relatively easy to use. We encourage you to try it (start by just making the "bow.py" example work), especially if you are writing a thesis about sentence embeddings: having multiple evaluations would add great value to your thesis.
Thanks
Thank you, I will have a look ;)
Note related to this issue: SentEval has been modified recently:
The results in the paper for InferSent and bag-of-words are now reproducible in SentEval for all STS tasks after the fix. You will have to re-download the transfer task data (./get_transfer_data.bash) to get the proper preprocessing on these tasks. Thanks.
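For reference, a minimal sketch of what driving the STSBenchmark task through SentEval looks like, loosely following the examples shipped with the repo (bow.py and the InferSent example). The parameter names, the data path and the engine interface shown here are assumptions based on the current examples and may differ between SentEval versions, so check examples/ in the repo for the authoritative version:

# Minimal SentEval driver for STSBenchmark (sketch only).
import torch
import senteval

PATH_TO_DATA = 'SentEval/data'  # wherever get_transfer_data.bash put the task data

# Load InferSent as in the snippet above.
model = torch.load('data/infersent.allnli.pickle', map_location=lambda storage, loc: storage)
model.set_glove_path('data/glove.840B.300d.txt')

def prepare(params, samples):
    # Build the InferSent vocabulary over all sentences of the task.
    params.infersent.build_vocab([' '.join(s) for s in samples], tokenize=False)

def batcher(params, batch):
    # SentEval hands over batches of tokenized sentences; return their embeddings.
    sentences = [' '.join(s) for s in batch]
    return params.infersent.encode(sentences, tokenize=False)

params_senteval = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params_senteval['infersent'] = model

se = senteval.engine.SE(params_senteval, batcher, prepare)
results = se.eval(['STSBenchmark'])
print(results)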