Fasttext: Nearest Neighbor in Python

Created on 14 Dec 2017 · 5 Comments · Source: facebookresearch/fastText

It would be great to be able to do NN search in Python.

All 5 comments

Hello @cpury,

Thank you for your post. We do have an implementation of 'find_nearest_neighbor'. It is part of the utilities under

fastText.util.find_nearest_neighbor

I'm closing this issue for now, but please feel free to reopen this at any point if this doesn't resolve your issue.

Thanks,
Christian

@cpuhrsch nice name!

Aw, I did not see the utils! Awesome. You could consider documenting it, though.

One clarification and addition - the find_nearest_neighbor function in https://github.com/facebookresearch/fastText/blob/master/python/fastText/util/util.py takes vectors rather than words as inputs. I tried for a while to get the vectors from words using the other python utilities, but didn't manage to get the same results as when using ./fasttext nn at the command-line.

One of the answers suggested in another thread at https://github.com/facebookresearch/fastText/issues/322 involves wrapping the command line utility, which works but incurs the setup cost of loading the model for each call.

So I wrote the small solution below, which uses pexpect (installed using pip install pexpect) to wrap the fasttext command-line utility, keeps the spawned process alive throughout, and does some rather ad hoc string-parsing to separate the results.

While I would not recommend anything so makeshift for production, this may be helpful if you're looking for a handy way to run a batch of nearest neighbor queries in Python for an experiment.

(No complaint intended here. So far fastText has worked super well for me and been easy to use. Thanks very much to all involved for contributing, and I hope sharing a small extra wrapper in this way is appropriate.)

import pexpect

FASTTEXT_PATH = ...  # path to fasttext binary
REVIEWS_MODEL_PATH = ...  # path to your model
NUM_NEIGHBORS = 10


class NNLookup:
    """Class for using the command-line interface to fasttext nn to look up neighbours.
    It's rather fiddly and depends on exact text strings. But it is at least short and simple."""
    def __init__(self, model_path):
        # encoding='utf-8' makes pexpect return str rather than bytes on Python 3.
        self.nn_process = pexpect.spawn(
            '%s nn %s %d' % (FASTTEXT_PATH, model_path, NUM_NEIGHBORS), encoding='utf-8')
        # expect_exact matches the prompt literally; expect() would treat '?' as a regex operator.
        self.nn_process.expect_exact('Query word?')  # Flush the first prompt out.

    def get_nn(self, word):
        self.nn_process.sendline(word)
        self.nn_process.expect_exact('Query word?')
        output = self.nn_process.before
        # Drop the first line (the echoed query); each remaining line starts with a neighbour.
        lines = output.strip().split('\n')[1:]
        return [word] + [line.split()[0] for line in lines if line.strip()]
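
The string parsing in get_nn can be pulled out into a pure function, which makes it possible to test without spawning the binary. The sketch below assumes that the text captured between two prompts is an echoed query line followed by one "word score" pair per line. The name parse_nn_output and the sample text are made up for illustration and are not actual fastText output.

```python
def parse_nn_output(raw):
    """Parse text captured between two 'Query word?' prompts.

    Assumption (hypothetical format): the first line is the echoed query and
    every following non-empty line is '<neighbour> <score>'.
    Returns a list of (neighbour, score) tuples.
    """
    if isinstance(raw, bytes):  # pexpect returns bytes unless an encoding is set
        raw = raw.decode("utf-8")
    neighbours = []
    for line in raw.splitlines()[1:]:  # skip the echoed query line
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        word, score = parts
        neighbours.append((word, float(score)))
    return neighbours


# Made-up sample, shaped like what the wrapper might capture.
sample = b" sad\r\nunhappy 0.81\r\nsorrowful 0.77\r\n\r\n"
print(parse_nn_output(sample))
```

Keeping the parser separate from the process handling means a change in the command-line output format only has to be fixed in one place.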

I suggested a fix in #552 that results in behavior similar to the ./fasttext nn command.

To run multiple queries by word efficiently and get results as (word, similarity) pairs, I created the following class:

import numpy as np


class FastTextNN:
    """Nearest-neighbor lookup by word on top of a loaded fastText model."""

    def __init__(self, ft_model, ft_matrix=None):
        self.ft_model = ft_model
        self.ft_words = ft_model.get_words()
        self.word_frequencies = dict(zip(*ft_model.get_words(include_freq=True)))
        self.ft_matrix = ft_matrix
        if self.ft_matrix is None:
            # Build the (vocabulary size x dimension) matrix once, so each query is a single matmul.
            self.ft_matrix = np.empty((len(self.ft_words), ft_model.get_dimension()))
            for i, word in enumerate(self.ft_words):
                self.ft_matrix[i, :] = ft_model.get_word_vector(word)

    def find_nearest_neighbor(self, query, vectors, n=10, cossims=None):
        """Return the n rows of vectors closest to query by cosine similarity.

        query is a 1d numpy array, the vector to search around.
        vectors is a 2d numpy array with one candidate vector per row.
        cossims is an optional 1d numpy array of size len(vectors) holding
        precomputed dot products of each row with query, passed for efficiency.

        Returns a list of (row index, cosine similarity) pairs.
        """
        if cossims is None:
            cossims = np.matmul(vectors, query)

        norms = np.sqrt((query**2).sum() * (vectors**2).sum(axis=1))
        cossims = cossims / norms
        # Rank 0 is skipped on the assumption that it is the query word itself.
        result_i = np.argpartition(-cossims, range(n + 1))[1:n + 1]
        return list(zip(result_i, cossims[result_i]))

    def nearest_words(self, word, n=10, word_freq=None):
        result = self.find_nearest_neighbor(self.ft_model.get_word_vector(word), self.ft_matrix, n=n)
        if word_freq:
            # Filtering happens after the top n are chosen, so fewer than n words may be returned.
            return [(self.ft_words[r[0]], r[1]) for r in result if self.word_frequencies[self.ft_words[r[0]]] >= word_freq]
        else:
            return [(self.ft_words[r[0]], r[1]) for r in result]

While testing the similar words, I found that some of the most similar words were not good results, and some were spelling mistakes (especially when the fastText model is trained on a large dataset with a low min_count; the default is 5).
So I added a word_freq option that limits the returned results to words occurring at least word_freq times.
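
One side effect of filtering after the top n have been chosen is that fewer than n words may come back. A possible variation, sketched below with plain NumPy and a made-up toy vocabulary, masks out rare words and the query itself before ranking. The function name nearest_frequent and all data here are hypothetical; with a real model, the matrix, words, and counts would come from the class above.

```python
import numpy as np


def nearest_frequent(query_idx, matrix, counts, n=10, min_count=0):
    """Return up to n (row_index, cosine) pairs for the row at query_idx,
    considering only rows whose count is at least min_count."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms[:, None]
    cossims = unit @ unit[query_idx]
    # Exclude rare words and the query itself before ranking.
    allowed = np.asarray(counts) >= min_count
    allowed[query_idx] = False
    candidates = np.flatnonzero(allowed)
    order = candidates[np.argsort(-cossims[candidates])][:n]
    return [(int(i), float(cossims[i])) for i in order]


# Toy data: "gooood" stands in for a rare misspelling.
words = ["good", "great", "gooood", "bad"]
counts = [900, 700, 3, 800]
matrix = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [-1.0, 0.0]])

result = nearest_frequent(0, matrix, counts, n=2, min_count=5)
print([(words[i], round(sim, 3)) for i, sim in result])
```

This sorts all candidates, which is simpler than argpartition but slower on a large vocabulary; it is meant to show the order of operations, not to be a tuned implementation.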

Slight mod to @dwiddows's very helpful class. I ran into an encoding issue and also wanted to use pandas, so in case it's useful, here it is:

import pexpect
import pandas as pd

FASTTEXT_PATH = '/Users/jeff/Documents/fastText/fasttext'
REVIEWS_MODEL_PATH = '/Users/jeff/Documents/fastText/genre_multi.bin'
NUM_NEIGHBORS = 20


class NNLookup:
    """Class for using the command-line interface to fasttext nn to look up neighbours.
    It's rather fiddly and depends on exact text strings. But it is at least short and simple."""
    def __init__(self, model_path):
        self.nn_process = pexpect.spawn('%s nn %s %d' % (FASTTEXT_PATH, model_path, NUM_NEIGHBORS))
        # expect_exact matches the prompt literally; expect() would treat '?' as a regex operator.
        self.nn_process.expect_exact('Query word?')  # Flush the first prompt out.

    def get_nn_records_df(self, word):
        self.nn_process.sendline(word)
        self.nn_process.expect_exact('Query word?')
        # pexpect returns bytes on Python 3, so decode before splitting into lines.
        lines = self.nn_process.before.decode('utf8').splitlines()
        # Drop the echoed query line, then keep only "<term> <similarity>" rows.
        rows = [line.split() for line in lines[1:]]
        rows = [row for row in rows if len(row) == 2]
        df = pd.DataFrame(rows, columns=['term', 'similarity'])
        df['inquery'] = word
        return df

```python
lookup = NNLookup(REVIEWS_MODEL_PATH)
lookup.get_nn_records_df("sad").head(5).to_clipboard(sep='\t')
```
