When the queries passed in a single call to Vectors.most_similar() return different numbers of results (some fewer than the specified n), the function fails with a cryptic numpy exception: ValueError: setting an array element with a sequence. Apparently numpy raises this when you try to create an array from lists of different lengths:
>>> np.array([[1, 2], [1, 2, 3]], dtype="int64")
ValueError Traceback (most recent call last)
<ipython-input-86-2d17c17be6b1> in <module>
----> 1 np.array([[1, 2], [1, 2, 3]], dtype="int64")
ValueError: setting an array element with a sequence.
I think these lines are causing it (https://github.com/explosion/spaCy/blob/master/spacy/vectors.pyx#L361-L363), but I can't convince my debugger to step into the Cython. Here's some further evidence:
(Pdb) vocab.vectors.most_similar(np.asarray([query_vectors[0]]), n=10)[0].shape
(1, 8)
(Pdb) vocab.vectors.most_similar(np.asarray([query_vectors[1]]), n=10)[0].shape
(1, 10)
(Pdb) vocab.vectors.most_similar(np.asarray([query_vectors[2]]), n=10)[0].shape
(1, 9)
(Pdb) vocab.vectors.most_similar(np.asarray([query_vectors[3]]), n=10)[0].shape
(1, 10)
(Pdb) vocab.vectors.most_similar(query_vectors, n=10)
*** ValueError: setting an array element with a sequence.
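For what it's worth, here's a minimal sketch (my guess at the failure mode, not spaCy's actual code) of how stacking those ragged per-query results would reproduce the error:

import numpy as np

# Hypothetical per-query key arrays with the mismatched lengths
# seen in the pdb session above (8, 10, 9, 10 matches).
per_query_keys = [
    np.zeros(8, dtype="int64"),
    np.zeros(10, dtype="int64"),
    np.zeros(9, dtype="int64"),
    np.zeros(10, dtype="int64"),
]
# Stacking the ragged rows into a single array raises the same error:
np.array(per_query_keys, dtype="int64")
# ValueError: setting an array element with a sequence.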
It's possible that this is just a weird edge case, since I'm populating my vocab / vectors table from scratch using a relatively small corpus (~1k docs). But this could also be a realistic issue for the pre-trained vocab/vectors when n is large.
Thanks for the report and the detailed analysis!
Looks like a bug to me, and something we should definitely investigate further.
Any chance you have a small reproducible code snippet (with a mockup vocab maybe?) that triggers this error? That would help us dig into this faster :-)
Hi @svlandeg, I came up with a (very haphazard) example that raises this error:
import gensim
import numpy as np
import spacy

lang = "en"
embed_size = 100
texts = [
    "Have you listened to the new Fiona Apple album yet?",
    "I've had it on repeat since yesterday, and wow, it's so so great.",
    "Almost makes the 8-year wait worth it!",
]
spacy_lang = spacy.blank(lang)
docs = spacy_lang.pipe(texts)
sents = [[tok.text for tok in doc] for doc in docs]
# generate custom fasttext word embedding vectors
# (gensim 3.x parameter names; gensim 4.x renamed size/iter
# to vector_size/epochs)
ft = gensim.models.fasttext.FastText(
    sentences=sents,
    size=embed_size,
    min_count=1,
    window=5,
    iter=5,
)
# reset vectors on vocab object w/ desired embedding size
# see: https://spacy.io/usage/vectors-similarity#custom
spacy_lang.vocab.reset_vectors(width=embed_size)
for word in ft.wv.vocab:
    spacy_lang.vocab.set_vector(word, ft.wv[word])
query_vectors = np.asarray([spacy_lang.vocab.get_vector(word) for word in ["music", "album", "I"]])
keys, _, _ = spacy_lang.vocab.vectors.most_similar(query_vectors, n=5)
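In the meantime, a crude workaround (assuming per-query calls are affordable) is to query one vector at a time, since the single-query calls in my pdb session above succeed:

# Hypothetical workaround: call most_similar per query and keep
# however many matches each query actually returns.
results = []
for qv in query_vectors:
    keys, rows, scores = spacy_lang.vocab.vectors.most_similar(
        np.asarray([qv]), n=5
    )
    results.append((keys[0], scores[0]))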
Thanks for digging in!
Ah, entertaining bugs. Here most_similar is also searching the empty all-0 padding rows of the internal vectors table (the table keeps some padding so it doesn't have to resize for every new vector, only when it gets full). "music" isn't assigned a vector in the model, so it gets the default all-0 vector, and its closest matches come out of those all-0 padding rows. Those individual matches get filtered out because the table knows the rows aren't in use, but then you end up with fewer matches for some queries than for others.
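A rough numpy simulation of that mechanism (hypothetical, not the actual vectors.pyx code), for anyone curious:

import numpy as np

n = 5
# Simulated internal table: 4 real vectors plus 4 all-0 padding rows.
table = np.vstack([np.random.rand(4, 10), np.zeros((4, 10))])
in_use = np.array([True] * 4 + [False] * 4)

# An OOV word like "music" gets the default all-0 query vector,
# so every dot-product score ties at 0 and padding rows can land
# in the top n...
query = np.zeros(10)
scores = table @ query
best = np.argsort(-scores)[:n]

# ...and filtering out the rows that aren't in use leaves fewer
# than n matches for this query.
kept = best[in_use[best]]
print(len(kept))  # 4, not 5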