Gensim: SparseTermSimilarityMatrix - TypeError: 'numpy.float32' object is not iterable

Created on 17 May 2019 · 12 Comments · Source: RaRe-Technologies/gensim

I am using gensim 3.7.3 and Python 3.6.

I am following the exact example of SoftCosineSimilarity at https://radimrehurek.com/gensim/similarities/docsim.html
but with my own dataset and embeddings trained on Fasttext.
Dictionary and WordEmbeddingSimilarityIndex are constructed without problems, but I get an error when constructing SparseTermSimilarityMatrix. I found a similar issue that was solved in the pull request below, but I still get this error. However, when I ran the exact same code with Word2Vec and gensim's bundled common_texts, it worked. Why doesn't it work in my case? Is it related to FastText?

https://github.com/RaRe-Technologies/gensim/pull/2356

My code:

from gensim.models import FastText
from gensim.corpora import Dictionary
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix

model = FastText.load('fasttext_vector_100')
# this line works
model.wv.most_similar(positive=['test'], topn=2)
termsim_index = WordEmbeddingSimilarityIndex(model.wv)
# texts is similar to common_texts, list of lists of strings
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(document) for document in texts]
# it fails here
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

TypeError                                 Traceback (most recent call last)
<ipython-input-129-c33cf3beaa3e> in <module>()
----> 1 similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\similarities\termsim.py in __init__(self, source, dictionary, tfidf, symmetric, positive_definite, nonzero_limit, dtype)
    232             most_similar = [
    233                 (dictionary.token2id[term], similarity)
--> 234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]
    236 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\similarities\termsim.py in <listcomp>(.0)
    231             num_rows = nonzero_limit - num_nonzero
    232             most_similar = [
--> 233                 (dictionary.token2id[term], similarity)
    234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\keyedvectors.py in most_similar(self, t1, topn)
   1418         else:
   1419             most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
-> 1420             for t2, similarity in most_similar:
   1421                 if similarity > self.threshold:
   1422                     yield (t2, similarity**self.exponent)

TypeError: 'numpy.float32' object is not iterable


magiob

Most helpful comment

I found the root cause: num_rows is np.int64, whereas most_similar requires topn to be an int. That is good news, because it means that the fixes in #2356 and #2461 were ok. It is also a good motivation for another fix, since most_similar should accept any integer (type numbers.Integral), not just int.
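The type mismatch described above can be sketched without gensim. The example below is a minimal illustration, not gensim's actual code: most_similar here is a hypothetical stand-in function, and Int64Like is a hypothetical substitute for np.int64 (which, like this class, is not a built-in int but is registered as numbers.Integral and supports __index__). It shows a topn check that accepts any integral type.

```python
import numbers
import operator


@numbers.Integral.register
class Int64Like:
    """Hypothetical stand-in for an integer scalar that is not a built-in int."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


def most_similar(scores, topn=10):
    """Return the topn (term, score) pairs; topn=None returns all pairs."""
    if topn is not None:
        # Checking against numbers.Integral instead of int also admits
        # integer scalars from other libraries.
        if not isinstance(topn, numbers.Integral):
            raise TypeError("topn must be an integer or None")
        topn = operator.index(topn)  # convert any integral type to a plain int
        if topn < 1:
            return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if topn is None else ranked[:topn]


scores = {"cat": 0.9, "dog": 0.5, "fox": 0.7}
print(isinstance(Int64Like(0), int))            # a plain `int` check rejects it
print(most_similar(scores, topn=Int64Like(0)))  # empty list of pairs
print(most_similar(scores, topn=Int64Like(2)))  # two best pairs
```

A check written as isinstance(topn, int) would skip the topn < 1 branch for such a value, which is the kind of gap the comment above describes.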

@piskvorky I hope I will not be the root cause of a 3.7.4 bugfix release. 😅

Witiko on 17 May 2019


All 12 comments

@Witiko can you please have a look?

piskvorky on 17 May 2019

Hi there,

I came across the same issue a few days ago.
I was about to open an issue as well to offer a fix I found.

The issue lies in line 248 of gensim/similarities/termsim.py.
There is a test intended to ensure no more than nonzero_limit elements are inserted. The test is off and allows this limit to be broken, which makes the loop break at the next iteration.

The fix is simply to replace the "<=" test with a "<".

Hope this helps you guys!

mkoa on 17 May 2019

@piskvorky I will take a closer look when I am on a PC, but this seems to be the same issue as the one reported by @tvrbanec earlier (https://github.com/RaRe-Technologies/gensim/issues/2105#issuecomment-457622349, #2356, #2461). There is a large amount of duplication with the most_similar methods and it seems like some implementations (FastText) still interpret topn=0 as topn=None, returning an array of all similarities instead of a (word_id, similarity) list.
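To illustrate why a topn=0 call that is treated like topn=None produces this TypeError, here is a minimal sketch that does not use gensim. Both most_similar_buggy and most_similar_fixed are hypothetical functions written for this example; the `not topn` test is only one plausible way such a confusion could arise and is not taken from gensim's source.

```python
def most_similar_buggy(similarities, topn=None):
    """Hypothetical bug: `not topn` is true for both None and 0."""
    if not topn:
        return list(similarities.values())  # bare scores, not (term, score) pairs
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:topn]


def most_similar_fixed(similarities, topn=None):
    """Only topn=None means 'return every score'; topn=0 yields an empty list."""
    if topn is None:
        return list(similarities.values())
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max(topn, 0)]


def unpack_pairs(rows):
    """Mimics the caller, which expects to unpack (term, score) pairs."""
    return [(term, score) for term, score in rows]


similarities = {"cat": 0.9, "dog": 0.5}
try:
    unpack_pairs(most_similar_buggy(similarities, topn=0))
except TypeError as error:
    print("buggy:", error)  # a float cannot be unpacked into (term, score)
print("fixed:", unpack_pairs(most_similar_fixed(similarities, topn=0)))
```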

@mkoa Thank you for the report, but this seems unrelated. Moreover, the limit should not be broken, because column_nonzero counts the diagonal elements, whereas nonzero_limit is the maximum number of nonzero elements outside the diagonal, so the invariant should be preserved (although the naming is a little confusing, I'll admit).
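The counting convention described here can be sketched as follows. This is a simplified model under stated assumptions, not the code in termsim.py: remaining_rows and has_room are hypothetical helpers, and a plain list stands in for the column_nonzero array. It shows that when the counter starts at 1 for the diagonal element, a "<=" comparison against nonzero_limit never lets the off-diagonal count exceed the limit, while the number of rows requested from most_similar eventually drops to 0.

```python
def remaining_rows(column_nonzero, column, nonzero_limit):
    """Rows still requestable for a column (the num_rows of the traceback)."""
    num_nonzero = column_nonzero[column] - 1  # exclude the diagonal element
    return nonzero_limit - num_nonzero


def has_room(column_nonzero, column, nonzero_limit):
    """True while one more off-diagonal element fits into the column."""
    return column_nonzero[column] <= nonzero_limit


nonzero_limit = 2
column_nonzero = [1]  # every column starts with its diagonal element
requested = []
while has_room(column_nonzero, 0, nonzero_limit):
    requested.append(remaining_rows(column_nonzero, 0, nonzero_limit))
    column_nonzero[0] += 1  # one off-diagonal element inserted

print(requested)                                         # rows requested per step
print(column_nonzero[0] - 1)                             # off-diagonal elements stored
print(remaining_rows(column_nonzero, 0, nonzero_limit))  # rows left for a full column
```

Under this reading, a full column yields a request for 0 rows, which is where the topn=0 handling of most_similar comes into play.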

Witiko on 17 May 2019

@Witiko Thank you very much for the heads up!
You are right, this is the same issue as the one you referenced above, which I missed.

I can confirm I got the same error as @magiob but with a word2vec model on my side. index.most_similar(t1, num_rows) is called with num_rows=0 and returns a numeric array even with the latest pull.

My fix aims at directly preventing a call to index.most_similar with topn=0 but does not address the actual root cause then. Thanks for the explanation!

mkoa on 17 May 2019

@mkoa That is a useful suggestion. Changing termsim.py as follows should fix this issue:

232,235c232,238
<             most_similar = [
<                 (dictionary.token2id[term], similarity)
<                 for term, similarity in index.most_similar(t1, num_rows)
<                 if term in dictionary.token2id]
---
>             if num_rows > 0:
>                 most_similar = [
>                     (dictionary.token2id[term], similarity)
>                     for term, similarity in index.most_similar(t1, topn=num_rows)
>                     if term in dictionary.token2id]
>             else:
>                 most_similar = []

Even though this does not address the root cause, it is still good defensive programming.
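For readers who cannot edit termsim.py, the same guard can be expressed as a wrapper around the index object. The sketch below is an assumption-laden illustration: GuardedIndex and StubIndex are hypothetical classes invented for this example, and whether SparseTermSimilarityMatrix accepts such a duck-typed wrapper in place of a real similarity index has not been verified here.

```python
import operator


class GuardedIndex:
    """Hypothetical wrapper that never forwards topn < 1 to the wrapped index."""

    def __init__(self, index):
        self._index = index

    def most_similar(self, term, topn=10):
        topn = operator.index(topn)  # also turns np.int64-like values into int
        if topn < 1:
            return []  # same effect as the `else: most_similar = []` branch
        return list(self._index.most_similar(term, topn=topn))


class StubIndex:
    """Stand-in index that misbehaves for topn=0, like the reported bug."""

    def most_similar(self, term, topn=10):
        if topn == 0:
            return [0.9, 0.5]  # bare scores instead of (term, score) pairs
        return [("dog", 0.9), ("fox", 0.5)][:topn]


guarded = GuardedIndex(StubIndex())
print(guarded.most_similar("cat", topn=0))  # empty list, stub is never called
print(guarded.most_similar("cat", topn=1))  # forwarded to the stub
```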

Witiko on 17 May 2019

Suggested fixes:

1. Apply the defensive patch from https://github.com/RaRe-Technologies/gensim/issues/2496#issuecomment-493498587, because code purity seems less of an issue than a broken implementation, especially when a proper fix of most_similar has proven to be so elusive. This fix closes the issue.
2. Fix all implementations of most_similar, so that they correctly handle topn=0. Why this has not been fixed by #2356 and #2461 remains to be investigated. 🤔 The third time is the charm, I suppose. This is less of a priority, since no code seems to rely on the behavior of most_similar(topn=0).
3. Make column_nonzero count from zero, so that variables are named consistently, as suggested in https://github.com/RaRe-Technologies/gensim/issues/2496#issuecomment-493461508. This is not a priority, but it will make a reader's life a little easier given how terse the SparseTermSimilarityMatrix code already is.

Witiko on 17 May 2019


Yes, I can confirm that code that used to work on gensim==3.7.2 now throws the following error on gensim==3.7.3:
TypeError: cannot unpack non-iterable numpy.float32 object
when executing:
SparseTermSimilarityMatrix(similarity_index, dictionary)

tvrbanec on 19 May 2019

I use Gensim 3.7.3. When I executed:

word_vectors = Word2Vec.load(WORD_EMBEDDING_DIR + WORD_EMBEDDING_FILENAME).wv
similarity_matrix = word_vectors.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)

I received:
  File "/home/piotr/.local/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1420, in most_similar
    for t2, similarity in most_similar:
TypeError: cannot unpack non-iterable numpy.float32 object

I fixed it completely by applying the patch from https://github.com/RaRe-Technologies/gensim/pull/2356#issuecomment-493498587

piofel on 12 Jun 2019

@piofel Thank you for confirming the fix. After #2497 is merged, this should no longer be an issue.

Witiko on 12 Jun 2019

I am having the same problem with my own word2vec model while following the tutorial here:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

Is there a timetable for publishing the fix? Or is there any workaround other than downgrading to 3.7.2?

mehmetilker on 19 Jun 2019

@mehmetilker: The fix is published, see https://github.com/RaRe-Technologies/gensim/issues/2496#issuecomment-493498587. Hopefully, #2497 will be merged soon; what do you think, @mpenkov?

Witiko on 19 Jun 2019

