I am using gensim 3.7.3 and Python 3.6.
I am following the exact example of SoftCosineSimilarity at https://radimrehurek.com/gensim/similarities/docsim.html
but with my own dataset and embeddings trained on Fasttext.
Dictionary and WordEmbeddingSimilarityIndex run without problems, but I get an error when constructing SparseTermSimilarityMatrix. I found a similar issue that was fixed in the pull request below, but I still get the error. The exact same code works with Word2Vec and the common_texts dataset imported from gensim. Why doesn't it work in my case? Is it related to FastText?
https://github.com/RaRe-Technologies/gensim/pull/2356
My code:
from gensim.models import FastText
from gensim.corpora import Dictionary
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
model = FastText.load('fasttext_vector_100')
# this line works
model.wv.most_similar(positive=['test'], topn=2)
termsim_index = WordEmbeddingSimilarityIndex(model.wv)
# texts is similar to common_texts, list of lists of strings
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(document) for document in texts]
# it fails here
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
TypeError Traceback (most recent call last)
<ipython-input-129-c33cf3beaa3e> in <module>()
----> 1 similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary) # construct similarity matrix
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\similarities\termsim.py in __init__(self, source, dictionary, tfidf, symmetric, positive_definite, nonzero_limit, dtype)
232 most_similar = [
233 (dictionary.token2id[term], similarity)
--> 234 for term, similarity in index.most_similar(t1, num_rows)
235 if term in dictionary.token2id]
236
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\similarities\termsim.py in <listcomp>(.0)
231 num_rows = nonzero_limit - num_nonzero
232 most_similar = [
--> 233 (dictionary.token2id[term], similarity)
234 for term, similarity in index.most_similar(t1, num_rows)
235 if term in dictionary.token2id]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\keyedvectors.py in most_similar(self, t1, topn)
1418 else:
1419 most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
-> 1420 for t2, similarity in most_similar:
1421 if similarity > self.threshold:
1422 yield (t2, similarity**self.exponent)
TypeError: 'numpy.float32' object is not iterable
@Witiko can you please have a look?
Hi there,
I came across the same issue a few days ago.
I was about to open an issue as well to offer a fix I found.
The issue lies in line 248 of gensim/similarities/termsim.py.
There is a test intended to ensure no more than nonzero_limit elements are inserted. The test is off and allows this limit to be broken, which makes the loop break at the next iteration.
The fix is simply to replace the "<=" test with a "<".
Hope this helps you guys!
@piskvorky I will take a closer look when I am on a PC, but this seems to be the same issue as the one reported by @tvrbanec earlier (https://github.com/RaRe-Technologies/gensim/issues/2105#issuecomment-457622349, #2356, #2461). There is a large amount of duplication with the most_similar methods and it seems like some implementations (FastText) still interpret topn=0 as topn=None, returning an array of all similarities instead of a (word_id, similarity) list.
@mkoa Thank you for the report, but this seems unrelated. Moreover, the limit should not be broken, because column_nonzero counts the diagonal elements, whereas nonzero_limit is the maximum number of nonzero elements outside the diagonal, so the invariant should be preserved (although the naming is a little confusing, I'll admit).
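A minimal sketch of the invariant described above, not gensim's actual code. It assumes that column_nonzero starts at 1 because the diagonal entry is already counted, and that the remaining budget is computed as nonzero_limit minus the number of off-diagonal entries; the candidate terms are made up.

```python
nonzero_limit = 3   # maximum number of nonzero entries outside the diagonal
column_nonzero = 1  # assumed to start at 1: the diagonal entry is counted

candidates = ["a", "b", "c", "d", "e"]  # hypothetical similar terms
inserted = []
for term in candidates:
    # The "<=" test still respects the limit, because the counter
    # is offset by one for the diagonal entry.
    if column_nonzero <= nonzero_limit:
        inserted.append(term)
        column_nonzero += 1

off_diagonal = column_nonzero - 1          # entries outside the diagonal
num_rows = nonzero_limit - off_diagonal    # remaining budget for this column
```

Under these assumptions the column ends up with exactly nonzero_limit off-diagonal entries, and the remaining budget num_rows is 0, which is how a most_similar call with topn=0 can arise.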
@Witiko Thank you very much for the heads up!
You are right, this is the same issue as the one you referenced above, which I missed.
I can confirm I got the same error as @magiob but with a word2vec model on my side. index.most_similar(t1, num_rows) is called with num_rows=0 and returns a numeric array even with the latest pull.
My fix only prevents index.most_similar from being called with topn=0; it does not address the actual root cause. Thanks for the explanation!
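The failure mode discussed above can be reproduced with a toy stand-in. This is only a sketch: the most_similar function below is hypothetical and is not gensim's implementation; it merely treats topn=0 like topn=None and returns bare similarity scores instead of (term, similarity) pairs. The similarity values are made up.

```python
# Toy stand-in for a most_similar method; NOT gensim's actual implementation.
SIMILARITIES = {"cat": 0.9, "dog": 0.8, "car": 0.1}  # made-up values


def most_similar(topn=10):
    """Return (term, similarity) pairs, or bare scores when topn is falsy."""
    ranked = sorted(SIMILARITIES.items(), key=lambda pair: pair[1], reverse=True)
    if not topn:  # topn=0 is treated like topn=None
        return [similarity for _, similarity in ranked]
    return ranked[:topn]


def collect(topn):
    """Unpack every result as a pair, like the loop in the traceback does."""
    return [(term, similarity) for term, similarity in most_similar(topn)]


pairs = collect(2)

message = None
try:
    collect(0)  # each element is a bare float, so unpacking fails
except TypeError as error:
    message = str(error)
```

With topn=2 the caller gets pairs it can unpack; with topn=0 it gets bare floats, and the unpacking raises a TypeError of the same kind as the one reported in this thread.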
@mkoa That is a useful suggestion. Changing termsim.py as follows should fix this issue:
232,235c232,238
< most_similar = [
<     (dictionary.token2id[term], similarity)
<     for term, similarity in index.most_similar(t1, num_rows)
<     if term in dictionary.token2id]
---
> if num_rows > 0:
>     most_similar = [
>         (dictionary.token2id[term], similarity)
>         for term, similarity in index.most_similar(t1, topn=num_rows)
>         if term in dictionary.token2id]
> else:
>     most_similar = []
Even though this does not address the root cause, it is still good defensive programming.
Suggested fixes:
- [...] most_similar has proven to be so elusive. This fix closes the issue.
- [...] most_similar, so that they correctly handle topn=0. Why this has not been fixed by #2356 and #2461 remains to be investigated. The third time is the charm, I suppose. This is less of a priority, since no code seems to rely on the behavior of most_similar(topn=0).
- [...] column_nonzero count from zero, so that variables are named consistently, as suggested in https://github.com/RaRe-Technologies/gensim/issues/2496#issuecomment-493461508. This is not a priority, but it will make a reader's life a little easier given how terse the SparseTermSimilarityMatrix code already is.
I found the root cause: num_rows is np.int64, whereas most_similar requires topn to be an int. That is good news, because it means that the fixes in #2356 and #2461 were ok. It is also a good motivation for another fix, since most_similar should accept any integer (type numbers.Integral), not just int.
@piskvorky I hope I will not be the root cause of a 3.7.4 bugfix release. 😅
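The difference between checking for int and checking for numbers.Integral can be sketched without NumPy. FakeInt64 below is a hypothetical stand-in for a NumPy integer scalar: it is integer-like but not a subclass of int, and it is registered with numbers.Integral much as NumPy registers its own integer types. The two guard functions are guesses at the shape of such a check; the thread does not show the exact gensim code.

```python
import numbers


class FakeInt64:
    """Hypothetical stand-in for np.int64: integer-like, but not an int subclass."""

    def __init__(self, value):
        self.value = value

    def __int__(self):
        return self.value

    def __index__(self):
        return self.value

    def __lt__(self, other):
        return self.value < other


# Register as a virtual subclass, similar to what NumPy does for its integers.
numbers.Integral.register(FakeInt64)


def strict_guard(topn):
    """Recognises only the built-in int, so an integer-like zero slips through."""
    return isinstance(topn, int) and topn < 1


def tolerant_guard(topn):
    """Recognises any registered integer type."""
    return isinstance(topn, numbers.Integral) and topn < 1


zero = FakeInt64(0)
strict_result = strict_guard(zero)      # the zero is not recognised
tolerant_result = tolerant_guard(zero)  # the zero is recognised
```

On Python 3, isinstance(np.int64(0), int) is likewise False while isinstance(np.int64(0), numbers.Integral) is True, so a guard of the first kind misses a NumPy zero. Casting with int(num_rows) on the caller side would be another way to sidestep the mismatch.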
Yes, I can confirm that code that used to work on gensim==3.7.2 now throws the following error on gensim==3.7.3:
TypeError: cannot unpack non-iterable numpy.float32 object
when executing code:
SparseTermSimilarityMatrix(similarity_index, dictionary)
I use Gensim 3.7.3. When I executed:
word_vectors = Word2Vec.load(WORD_EMBEDDING_DIR + WORD_EMBEDDING_FILENAME).wv
similarity_matrix = word_vectors.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)
I received:
File "/home/piotr/.local/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1420, in most_similar
for t2, similarity in most_similar:
TypeError: cannot unpack non-iterable numpy.float32 object
I fixed it completely by following https://github.com/RaRe-Technologies/gensim/pull/2356#issuecomment-493498587
@piofel Thank you for confirming the fix. After #2497 is merged, this should no longer be an issue.
I am having the same problem with my own word2vec model while following the tutorial here:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb
Is there a timetable for publishing the fix? Or is there any workaround other than downgrading to 3.7.2?
@mehmetilker: The fix is published, see https://github.com/RaRe-Technologies/gensim/issues/2496#issuecomment-493498587. Hopefully, #2497 will be merged soon; what do you think, @mpenkov?