Gensim: Cannot build similarity matrix when the dictionary's word indices are not continuous

Created on 3 May 2018 · 12 Comments · Source: RaRe-Technologies/gensim

Description

For some reason, my gensim.corpora.Dictionary contains word indices that are not continuous. I hit a KeyError in the similarity_matrix() function when it tried to re-index from 0 to n-1.

Steps/Code/Corpus to Reproduce

import gensim.downloader as api

def get_similarity_matrix(dictionary):
    '''
    In some cases, the dictionary's word indices are not continuous.
    For example, the dictionary contains these items: (1, u'car'), (2, u'so'), (4, 'nice')
    '''
    w2v_model = api.load('glove-wiki-gigaword-50', return_path=False)
    return w2v_model.wv.similarity_matrix(dictionary, nonzero_limit=100)

Expected Results

Term similarity matrix in scipy.sparse.csc_matrix format

Actual Results

    return w2v_model.wv.similarity_matrix(dictionary, nonzero_limit=100)
  File "/home/jen/anaconda2/lib/python2.7/site-packages/gensim/models/keyedvectors.py", line 510, in similarity_matrix
    w1 = dictionary[w1_index]
  File "/home/jen/anaconda2/lib/python2.7/site-packages/gensim/corpora/dictionary.py", line 104, in __getitem__
    return self.id2token[tokenid]  # will throw for non-existent ids
KeyError: 0
ERROR: Non-zero return code '1' from command: Process exited with status 1
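
The failure can be illustrated without gensim or the GloVe vectors. The sketch below is a plain-Python stand-in, assuming (as the traceback suggests) that the matrix-building code looks up every id from 0 to len(dictionary) - 1. The names `id2token` and `lookup_all_ids` are hypothetical and used for illustration only; they are not gensim's code.

```python
# Plain-Python stand-in for a dictionary whose ids have gaps (no id 0, no id 3).
id2token = {1: "car", 2: "so", 4: "nice"}


def lookup_all_ids(id2token):
    """Look up ids 0..n-1, the access pattern assumed from the traceback."""
    return [id2token[index] for index in range(len(id2token))]


try:
    lookup_all_ids(id2token)
    error = None
except KeyError as exc:
    # The very first lookup fails, because id 0 is not in the mapping.
    error = exc

print("missing id:", error.args[0])  # missing id: 0
```

The lookup fails on id 0, which matches the `KeyError: 0` reported above.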

Versions

('Python', '2.7.14 |Anaconda custom (64-bit)| (default, Mar 27 2018, 17:29:31) \n[GCC 7.2.0]')
('NumPy', '1.14.2')
('SciPy', '1.0.1')
('gensim', '3.4.0')
('FAST_VERSION', 1)

Source

quandb


All 12 comments

I have the same issue.

My current system already has a trained vocabulary (word-index, word-term) whose word indices are not a continuous series (1, 2, 3, 4, 5, ...) but have random gaps (1, 4, 5, ...). We tried to convert this vocabulary to the gensim Dictionary format and got this error.

Thanks

tanthml on 3 May 2018

I could be completely wrong but this is an idea that comes to mind:

Words in the dictionary are indeed assigned continuous IDs, but some of them are filtered out by the default filters (i.e. their occurrence count is lower than the threshold). Could you check if the issue persists when disabling the filters?

steremma on 9 May 2018

Looks like a bug to me. There is no expectation that the Dictionary id range must be contiguous.

CC @Witiko similarity_matrix seems to be your code -- can you please fix + add a test?

In the meantime: @quandb @tanthml can you call .compactify() on your dictionary, to remove the gaps? Why does your dictionary contain gaps in the first place?

piskvorky on 10 May 2018

@piskvorky All my contributed code assumes that dictionaries are contiguous. I discussed this with @menshikh-iv to confirm that this was a correct assumption. The response was that in the current version of Gensim, dictionaries were automatically compactified on construction and that this assumption was therefore valid. I assume that there are still ways to obtain an uncompactified dictionary, then?

Witiko on 10 May 2018

@Witiko Yes, there are. For example, filter_extremes() (the most common way to filter a Dictionary) will call compactify() automatically. But if the user implements their own filtering, it is quite possible there will be gaps.

There is no API contract that the ids must be contiguous (though they typically are). Let's wait for @quandb's and @tanthml's response regarding their use case for having gaps.

piskvorky on 10 May 2018
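
The point about custom filtering can be sketched in plain Python. This is only an illustration under stated assumptions: the `token2id` map, the stop-word list, and the `compact` helper are hypothetical, and gensim's own compactify() may assign the new ids in a different order.

```python
# Hypothetical token -> id map with contiguous ids 0..4.
token2id = {"car": 0, "is": 1, "so": 2, "very": 3, "nice": 4}

# Custom filtering: drop stop words directly, without renumbering afterwards.
stop_words = {"is", "very"}
filtered = {tok: idx for tok, idx in token2id.items() if tok not in stop_words}
# The surviving ids are 0, 2 and 4, so there are gaps at 1 and 3.


def compact(token2id):
    """Reassign ids 0..n-1, keeping the relative order of the old ids."""
    ordered = sorted(token2id.items(), key=lambda item: item[1])
    return {tok: new_id for new_id, (tok, _) in enumerate(ordered)}


compacted = compact(filtered)
print(sorted(filtered.values()))   # [0, 2, 4]
print(sorted(compacted.values()))  # [0, 1, 2]
```

Renumbering removes the gaps, but it also changes the id of every token after the first gap, which matters when other components rely on the original ids.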


Hi all,

In summary: another app performs the mapping between words and ids, and some tokens have been removed for various reasons. I then have to build the Gensim dictionary from the processed vocabulary (which already has words and ids).

In detail: our use case is a product that includes multiple independent apps for data ingestion, pre-processing, vectorization, topic modeling, etc.
The task I'm working on is to build a similarity matrix, and the similarity_matrix function has to consume a dictionary. The constraint is that we have to keep each token and its token_id consistent across the system. We can't build the dictionary from the original corpus, since another mapper has already done that.
So I built a dictionary by adapting the Dictionary.load_from_text() function to load the vocabulary in our format. We can't even use compactify(), because the token_id must stay consistent for the other apps.

Let me know if you need more information,

quandb on 11 May 2018

@quandb As a workaround, you can have both the original, and a compactified dictionary and translate documents between them.

>>> import gensim.downloader as api
>>> from gensim.corpora import Dictionary
>>>
>>> text = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
>>>
>>> model = api.load("word2vec-google-news-300")
>>> dict1 = Dictionary([[term for term in model.vocab.keys()]])
>>> dict2 = Dictionary([text])
>>>
>>> doc1 = dict1.doc2bow(text)
>>> doc2 = dict2.doc2bow(text)
>>>
>>> def translate(document, dict1, dict2):
...     return [
...         (dict2.token2id[dict1[term_id]], term_frequency)
...         for (term_id, term_frequency) in document]
...
>>> doc1
[(2430835, 1),
 (2522359, 1),
 (2573330, 1),
 (2652431, 1),
 (2665091, 1),
 (2746533, 1),
 (2799299, 1),
 (2912713, 2)]
>>> translate(doc2, dict2, dict1)
[(2430835, 1),
 (2522359, 1),
 (2573330, 1),
 (2652431, 1),
 (2665091, 1),
 (2746533, 1),
 (2799299, 1),
 (2912713, 2)]
>>> doc2
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2)]
>>> translate(doc1, dict1, dict2)
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2)]

@piskvorky If we decide we want to support this, I will add proper support for non-contiguous dictionaries to the methods related to the soft cosine measure. It will make the code a little less maintainable.

Witiko on 11 May 2018


I think so, thanks. The other option is to enforce contiguous ids, but I don't think that's a good idea.

piskvorky on 11 May 2018


Hi @piskvorky ,

In my case, I tried an ad-hoc solution: re-indexing the terms into a new dictionary with continuous term indices.

Thanks,

tanthml on 16 May 2018
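
A re-indexing approach of this kind can be sketched in plain Python. This is a minimal illustration, not the code used by anyone in this thread: `build_reindex`, `translate_bow`, and the sample vocabulary are hypothetical names and data. It keeps a reverse map so that results can be translated back to the original ids.

```python
def build_reindex(original_ids):
    """Return (old -> new, new -> old) maps, with new ids running 0..n-1."""
    old_to_new = {old: new for new, old in enumerate(sorted(original_ids))}
    new_to_old = {new: old for old, new in old_to_new.items()}
    return old_to_new, new_to_old


def translate_bow(bow, id_map):
    """Rewrite the ids of a bag-of-words list of (id, count) pairs."""
    return [(id_map[term_id], count) for term_id, count in bow]


# Hypothetical external vocabulary with gaps in its ids.
external_vocab = {1: "car", 2: "so", 4: "nice"}
old_to_new, new_to_old = build_reindex(external_vocab)

doc = [(1, 2), (4, 1)]  # a document expressed with the external ids
contiguous_doc = translate_bow(doc, old_to_new)
print(contiguous_doc)  # [(0, 2), (2, 1)]
print(translate_bow(contiguous_doc, new_to_old) == doc)  # True
```

The external ids stay untouched for the other apps; only the copy handed to the similarity code uses the contiguous ids.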

@tanthml Please see if #2047 fixes your problem.

Witiko on 16 May 2018


@Witiko, thanks for your solution. How can I test your new update?

tanthml on 21 May 2018

@tanthml, if you are using Git on the command line, then as follows:

git clone https://github.com/witiko/gensim
cd gensim
git checkout fix-similarity-matrix-contiguous-dict
python setup.py install

Witiko on 21 May 2018

