spaCy: vocab.set_vector() has accidentally quadratic runtime

Created on 25 Feb 2018 · 7 comments · Source: explosion/spaCy

There had been reports that vocab.set_vector() was slow, which I figured was okay --- yeah, adding vectors one by one is a bit inefficient, but you only do it once and then save out the model.

But I was just converting one of the recent FastText vectors, and it was going at 30 vectors per second --- so it would've taken like 30 hours!

It turns out vectors.add() needs to find the next empty row, and it's using a set() object to keep track of the rows currently unset. I guess I figured there would only be a few of those, from deletions. However, if you first initialize the table with the shape you want, all the rows will be unset --- so each of the n insertions iterates over up to n empty rows to find the next one, which is quadratic overall.

The easiest fix is likely to just use a heap here instead of a set. However, the mechanism does feel pretty dumb overall. There should be a smarter way to do this.
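
Roughly, the heap-based bookkeeping could look like this minimal Python sketch (the class and method names are made up for illustration --- this isn't spaCy's actual code):

import heapq

class UnsetRows:
    '''Track unset rows with a min-heap plus a set (illustrative only).'''
    def __init__(self, n_rows):
        self._unset = set(range(n_rows))
        self._heap = list(range(n_rows))  # ascending, so already a valid min-heap

    def mark_unset(self, row):
        # A deleted vector frees its row again.
        if row not in self._unset:
            self._unset.add(row)
            heapq.heappush(self._heap, row)

    def mark_set(self, row):
        # O(1): just forget the row; its stale heap entry is skipped later.
        self._unset.discard(row)

    def next_unset(self):
        # Pop stale entries, then return the lowest unset row.
        while self._heap and self._heap[0] not in self._unset:
            heapq.heappop(self._heap)
        return self._heap[0] if self._heap else -1

Each row is pushed and popped at most once per time it becomes unset, so n insertions cost O(n log n) overall instead of O(n^2).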

Labels: bug, feat / vectors, help wanted


All 7 comments

In the meantime, if you're affected by this bug, here's a mitigation. This script loads vectors in text format, and saves the nlp model to disk.

Edit: If you're using spaCy v2.0.10, it's now much more convenient to use spacy init-model for this. It now supports .tgz and .zip files, making it very easy to convert FastText vectors.


'''
Make a loadable spaCy model directory from a FastText vectors file.

https://fasttext.cc/docs/en/english-vectors.html
'''
from pathlib import Path
import plac
import spacy
import numpy
import tqdm
from spacy.vocab import Vectors


def fasttext2spacy(lang, in_loc, out_dir):
    in_loc = Path(in_loc)
    out_dir = Path(out_dir)
    nlp = spacy.blank(lang)
    with in_loc.open('r', encoding='utf8') as file_:
        # The first line of the .vec format is "<n_words> <width>".
        header = next(file_)
        n_words, width = header.split()
        n_words = int(n_words)
        width = int(width)
        # Preallocate the full table and fill it row by row, avoiding the
        # slow one-at-a-time set_vector() path entirely.
        data = numpy.zeros((n_words, width), dtype='f')
        keys = []
        for i, line in tqdm.tqdm(enumerate(file_), total=n_words):
            word, vector = line.split(' ', 1)
            data[i] = numpy.asarray(vector.split(), dtype='f')
            _ = nlp.vocab[word]  # make sure the word has a vocab entry
            keys.append(word)
        nlp.vocab.vectors = Vectors(data=data, keys=keys)
    nlp.to_disk(out_dir)

if __name__ == '__main__':
    plac.call(fasttext2spacy)
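
If you save this as, say, fasttext2spacy.py (the filename is just an example), you'd run it like: python fasttext2spacy.py en vectors.vec /path/to/output, where the vectors file is one of the FastText downloads linked above.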

Note that the resulting vocab won't have values for lex.prob or lex.cluster. If you need those, you can set them from a data file or another spaCy model.
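
For instance, here's a minimal sketch of filling in lex.prob from a tab-separated data file (the filename and format here are hypothetical):

# probs.tsv: one "word<TAB>log-probability" pair per line (hypothetical format)
with open('probs.tsv', encoding='utf8') as file_:
    for line in file_:
        word, prob = line.rsplit('\t', 1)
        nlp.vocab[word].prob = float(prob)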

@honnibal
_unset is used for only 4 purposes:

  1. Checking whether an id belongs to it
  2. Removing an id from it
  3. Finding the total count of _unset
  4. Finding the minimum id in _unset

For the first 3, we can use a simple boolean array. The total count can be stored in a separate variable, updated on every add and remove.
For point 4, when the key and row are not found, we end up doing a linear search.

If key-and-row misses are rare, this would be a good approach.
If not, then a heap sounds like the only way left.

I feel we should look into AVL or red-black trees too. heapq has linear runtime for removing an arbitrary index from the list. AVL or red-black trees would give logarithmic time for all operations, which would still run pretty well compared to the constant-time operations of a set.
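
To make the heapq point concrete, here's a small standard-library sketch: eagerly removing an arbitrary element from a heapq-managed list costs linear time, while lazy deletion defers the cleanup to pop time.

import heapq

heap = list(range(10))  # an ascending list is already a valid min-heap

# Eager removal of an arbitrary element: O(n) scan plus O(n) re-heapify.
heap.remove(4)
heapq.heapify(heap)

# Lazy deletion: mark the element dead in O(1), skip it when it surfaces.
dead = {7}
while heap and heap[0] in dead:
    heapq.heappop(heap)
print(heap[0])  # 0 --- the minimum live element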

Thanks for this.

I'm worried that the bintrees library says it's unmaintained. I'm also not really thrilled by the pure-Python container library it suggests instead, as that will surely have higher memory use than a C-level solution.

What if we used the C++ STL priority_queue and a boolean array? The boolean array can handle your points 1-3 above. We would keep all (but not only!) the available rows in the queue. We'd then have something like this:

from libcpp.queue cimport priority_queue

cdef struct UnsetC:
    priority_queue[int] queue
    char* status # I don't think bool works here? 8 bits doesn't seem a hardship.
    int count

# The struct is passed by pointer so the updates persist.
cdef char is_unset(UnsetC* unset, int i) nogil:
    return unset.status[i] == 0

cdef void mark_as_set(UnsetC* unset, int i) nogil:
    unset.status[i] = 1
    unset.count -= 1
    # Don't iterate the priority queue. But if we just set the row at the
    # top, pop it.
    # Note that the priority queue stores negative i, to sort ascending.
    if unset.queue.top() == -i:
        unset.queue.pop()

cdef int count_unset(UnsetC* unset) nogil:
    return unset.count

cdef int next_available(UnsetC* unset) nogil:
    # Lazily discard entries for rows that were set after being pushed.
    cdef int row = -unset.queue.top()
    while unset.status[row]:
        unset.queue.pop()
        row = -unset.queue.top()
    return row

@honnibal Won't mark_as_set be linear when i is not at the top? How would mark_as_set handle an unset at an arbitrary position?
An additional point: next_available() would run in linear time in the worst case, when the only remaining unset row is the last one.

What about completely removing the array and the priority_queue and going with the C++ STL set? It would solve all the problems.

A sorted set in C++:

set[int] _unset

remove key - O(log n)

_unset.erase(key)

add key - O(log n)

_unset.insert(key)

count - O(1)

_unset.size()

check if an element exists - O(log n)

_unset.count(key)

get minimum - O(1)

if not _unset.empty():
    *_unset.begin()
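
For comparison, here's a rough Python analogue of that interface, using the third-party sortedcontainers package (purely illustrative --- the variable names are made up, and this isn't a proposal for spaCy's internals):

from sortedcontainers import SortedSet  # pip install sortedcontainers

_unset = SortedSet(range(10))

_unset.discard(3)        # remove a key
_unset.add(3)            # add a key
n_unset = len(_unset)    # count
exists = 5 in _unset     # membership test
if _unset:
    minimum = _unset[0]  # smallest unset row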

> @honnibal Won't mark_as_set be linear when i is not at the top? How would mark_as_set handle an unset at an arbitrary position?

No, because we don't require all entries in the priority queue to be available. If the top happens to be unavailable we go ahead and pop it, but otherwise we leave the queue alone. Then next time we do next_available, we check the status array to make sure we're not returning a value that's unavailable. At that point we know we're dealing with the top, so we can pop it without walking the priority queue.

Essentially we remove from the queue lazily.

STL set might be better though. I didn't know about this -- I'm used to thinking of Python's sets.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
