When the queries passed in a single call to Vectors.most_similar() return different numbers of results (some fewer than the specified n), the function fails with a cryptic numpy exception: ValueError: setting an array element with a sequence. Apparently numpy raises this when you try to create an array from lists of different lengths:
>>> np.array([[1, 2], [1, 2, 3]], dtype="int64")
ValueError Traceback (most recent call last)
<ipython-input-86-2d17c17be6b1> in <module>
----> 1 np.array([[1, 2], [1, 2, 3]], dtype="int64")
ValueError: setting an array element with a sequence.
I think these lines are causing it (https://github.com/explosion/spaCy/blob/master/spacy/vectors.pyx#L361-L363), but I can't convince my debugger to step into the Cython. Here's some further evidence:
(Pdb) vocab.vectors.most_similar(np.asarray([query_vectors[0]]), n=10)[0].shape
(1, 8)
(Pdb) vocab.vectors.most_similar(np.asarray([query_vectors[1]]), n=10)[0].shape
(1, 10)
(Pdb) vocab.vectors.most_similar(np.asarray([query_vectors[2]]), n=10)[0].shape
(1, 9)
(Pdb) vocab.vectors.most_similar(np.asarray([query_vectors[3]]), n=10)[0].shape
(1, 10)
(Pdb) vocab.vectors.most_similar(query_vectors, n=10)
*** ValueError: setting an array element with a sequence.
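For what it's worth, here's a minimal sketch (my guess at the failure mode, not spaCy's actual code) of how stacking those ragged per-query results would reproduce the error:

import numpy as np

# Hypothetical per-query key arrays with the mismatched lengths
# seen in the pdb session above (8, 10, 9, 10 matches).
per_query_keys = [
    np.zeros(8, dtype="int64"),
    np.zeros(10, dtype="int64"),
    np.zeros(9, dtype="int64"),
    np.zeros(10, dtype="int64"),
]
# Stacking the ragged rows into a single array raises the same error:
np.array(per_query_keys, dtype="int64")
# ValueError: setting an array element with a sequence.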
It's possible that this is just a weird edge case, since I'm populating my vocab / vectors table from scratch using a relatively small corpus (~1k docs). But this could also be a realistic issue for the pre-trained vocab/vectors when n is large.
Thanks for the report and the detailed analysis!
Looks like a bug to me, and something we should definitely investigate further.
Any chance you have a small reproducible code snippet (with a mockup vocab maybe?) that triggers this error? That would help us dig into this faster :-)
Hi @svlandeg, I came up with a (very haphazard) example that raises this error:
import gensim
import numpy as np
import spacy

lang = "en"
embed_size = 100
texts = [
    "Have you listened to the new Fiona Apple album yet?",
    "I've had it on repeat since yesterday, and wow, it's so so great.",
    "Almost makes the 8-year wait worth it!",
]
spacy_lang = spacy.blank(lang)
docs = spacy_lang.pipe(texts)
sents = [[tok.text for tok in doc] for doc in docs]
# generate custom fasttext word embedding vectors
# (gensim 3.x parameter names; gensim 4.x renamed size/iter
# to vector_size/epochs)
ft = gensim.models.fasttext.FastText(
    sentences=sents,
    size=embed_size,
    min_count=1,
    window=5,
    iter=5,
)
# reset vectors on vocab object w/ desired embedding size
# see: https://spacy.io/usage/vectors-similarity#custom
spacy_lang.vocab.reset_vectors(width=embed_size)
for word in ft.wv.vocab:
    spacy_lang.vocab.set_vector(word, ft.wv[word])
query_vectors = np.asarray([spacy_lang.vocab.get_vector(word) for word in ["music", "album", "I"]])
keys, _, _ = spacy_lang.vocab.vectors.most_similar(query_vectors, n=5)
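In the meantime, a crude workaround (assuming per-query calls are affordable) is to query one vector at a time, since the single-query calls in my pdb session above succeed:

# Hypothetical workaround: call most_similar per query and keep
# however many matches each query actually returns.
results = []
for qv in query_vectors:
    keys, rows, scores = spacy_lang.vocab.vectors.most_similar(
        np.asarray([qv]), n=5
    )
    results.append((keys[0], scores[0]))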
Thanks for digging in!
Ah, entertaining bugs. Here most_similar is also searching the empty all-0 padding rows of the internal vectors table (the table keeps some padding so it doesn't have to resize for every new vector, only when it gets full). "music" isn't assigned a vector in the model, so it gets the default all-0 vector, and its closest matches come out of those all-0 padding rows. Those individual matches get filtered out because the table knows the rows aren't in use, but then you end up with fewer matches for some queries than for others.
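A rough numpy simulation of that mechanism (hypothetical, not the actual vectors.pyx code), for anyone curious:

import numpy as np

n = 5
# Simulated internal table: 4 real vectors plus 4 all-0 padding rows.
table = np.vstack([np.random.rand(4, 10), np.zeros((4, 10))])
in_use = np.array([True] * 4 + [False] * 4)

# An OOV word like "music" gets the default all-0 query vector,
# so every dot-product score ties at 0 and padding rows can land
# in the top n...
query = np.zeros(10)
scores = table @ query
best = np.argsort(-scores)[:n]

# ...and filtering out the rows that aren't in use leaves fewer
# than n matches for this query.
kept = best[in_use[best]]
print(len(kept))  # 4, not 5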