Gensim: Cannot build similarity matrix when the dictionary's word indices are not contiguous

Created on 3 May 2018 · 12 Comments · Source: RaRe-Technologies/gensim

Description

For some reason, my gensim.corpora.Dictionary contains word indices that are not contiguous. I get a KeyError in the _similarity_matrix()_ function when it tries to re-index from 0 to n-1.

Steps/Code/Corpus to Reproduce

import gensim.downloader as api

def get_similarity_matrix(dictionary):
    '''
    In some cases, the word indices in the dictionary are not contiguous.
    For example, the dictionary contains these items: (1, u'car'), (2, u'so'), (4, 'nice')
    '''
    w2v_model = api.load('glove-wiki-gigaword-50', return_path=False)
    return w2v_model.wv.similarity_matrix(dictionary, nonzero_limit=100)

Expected Results

Term similarity matrix in scipy.sparse.csc_matrix format

Actual Results

    return w2v_model.wv.similarity_matrix(dictionary, nonzero_limit=100)
  File "/home/jen/anaconda2/lib/python2.7/site-packages/gensim/models/keyedvectors.py", line 510, in similarity_matrix
    w1 = dictionary[w1_index]
  File "/home/jen/anaconda2/lib/python2.7/site-packages/gensim/corpora/dictionary.py", line 104, in __getitem__
    return self.id2token[tokenid]  # will throw for non-existent ids
KeyError: 0
ERROR: Non-zero return code '1' from command: Process exited with status 1
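
The failure mode in the traceback above can be illustrated without gensim. This is a minimal sketch, assuming the lookup walks ids 0 to n-1 as the `KeyError: 0` suggests; the plain `id2token` dict and both helper functions are hypothetical stand-ins, not gensim code.

```python
# Hypothetical stand-in for a dictionary's id -> token mapping.
# Ids 0 and 3 are missing, as in the (1, 'car'), (2, 'so'), (4, 'nice') example.
id2token = {1: "car", 2: "so", 4: "nice"}


def lookup_by_position(id2token):
    """Walk ids 0..n-1, as code that assumes contiguous ids would."""
    return [id2token[index] for index in range(len(id2token))]


def lookup_by_id(id2token):
    """Walk only the ids that are actually present, in sorted order."""
    return [id2token[index] for index in sorted(id2token)]


try:
    lookup_by_position(id2token)
    failed_key = None
except KeyError as error:
    # The first id in 0..n-1 that is absent from the mapping.
    failed_key = error.args[0]

tokens = lookup_by_id(id2token)
print(failed_key)
print(tokens)
```

Iterating over the ids that exist, rather than over a range of positions, avoids the error in this toy setting.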

Versions

('Python', '2.7.14 |Anaconda custom (64-bit)| (default, Mar 27 2018, 17:29:31) \n[GCC 7.2.0]')
('NumPy', '1.14.2')
('SciPy', '1.0.1')
('gensim', '3.4.0')
('FAST_VERSION', 1)

All 12 comments

Have same issue.

In my current system I already have a trained vocabulary (word-index, word-term) in which the word indices are not a continuous series (1, 2, 3, 4, 5, ...) but have arbitrary gaps (1, 4, 5, ...). When we try to convert this vocabulary to the gensim dictionary format, we get this error.

Thanks

I could be completely wrong but this is an idea that comes to mind:

Words in the dictionary are indeed assigned continuous IDs, but some of them are filtered out by the default filters (i.e. their occurrence count is lower than the threshold). Could you check if the issue persists when disabling the filters?

Looks like a bug to me. There is no expectation that the Dictionary id range must be contiguous.

CC @Witiko similarity_matrix seems to be your code -- can you please fix + add a test?

In the meanwhile: @quandb @tanthml can you call .compactify() on your dictionary, to remove the gaps? Why does your dictionary contain gaps in the first place?

@piskvorky All my contributed code assumes that dictionaries are contiguous. I discussed this with @menshikh-iv to confirm that this was a correct assumption. The response was that in the current version of Gensim, dictionaries were automatically compactified on construction and that this assumption was therefore valid. I assume that there are still ways to obtain an uncompactified dictionary, then?

@Witiko Yes, there are. For example, filter_extremes() (the most common way to filter a Dictionary) will call compactify() automatically. But if the user implements their own filtering, it is quite possible there will be gaps.

There is no API contract that the ids must be contiguous (though they typically are). Let's wait for @quandb's and @tanthml's response regarding their use case for having gaps.

Hi all,

In summary: another app performs the mapping between words and ids, and some tokens have been removed for various reasons. I then have to build the Gensim dictionary from the processed vocab (which already has words and ids).

In detail: we are building a product that includes multiple independent apps, such as data ingestion, pre-processing, vectorization, topic modeling, etc.
The task I'm working on is to build a similarity matrix, and the similarity_matrix function has to consume a dictionary. We have to keep tokens and token ids consistent across the system, so we can't build the dictionary from the original corpus, because another mapper already did that.
So I built a dictionary by adapting the Dictionary.load_from_text() function to load the vocabulary in our format. We can't even use compactify(), because the token ids must stay consistent for the other apps.

Let me know if you need more information,

@quandb As a workaround, you can have both the original, and a compactified dictionary and translate documents between them.

>>> import gensim.downloader as api
>>> from gensim.corpora import Dictionary
>>>
>>> text = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
>>>
>>> model = api.load("word2vec-google-news-300")
>>> dict1 = Dictionary([[term for term in model.vocab.keys()]])
>>> dict2 = Dictionary([text])
>>>
>>> doc1 = dict1.doc2bow(text)
>>> doc2 = dict2.doc2bow(text)
>>>
>>> def translate(document, dict1, dict2):
...     return [
...         (dict2.token2id[dict1[term_id]], term_frequency)
...         for (term_id, term_frequency) in document]
...
>>> doc1
[(2430835, 1),
 (2522359, 1),
 (2573330, 1),
 (2652431, 1),
 (2665091, 1),
 (2746533, 1),
 (2799299, 1),
 (2912713, 2)]
>>> translate(doc2, dict2, dict1)
[(2430835, 1),
 (2522359, 1),
 (2573330, 1),
 (2652431, 1),
 (2665091, 1),
 (2746533, 1),
 (2799299, 1),
 (2912713, 2)]
>>> doc2
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2)]
>>> translate(doc1, dict1, dict2)
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2)]

@piskvorky If we decide we want to support this, I will add proper support for non-contiguous dictionaries to the methods related to the soft cosine measure. It will make the code a little less maintainable.

I think so, thanks. The other option is to enforce contiguous ids, but I don't think that's a good idea.

Hi @piskvorky ,

In my case, I tried an ad-hoc solution: re-indexing the terms into a new dictionary with contiguous term indices.

Tks,
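
The re-indexing idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the commenter's actual code: `build_contiguous_index` and `remap_bow` are hypothetical helpers, and `token2id` stands in for a vocabulary whose ids have gaps. Keeping the reverse map lets results be translated back to the original ids that the other apps rely on.

```python
def build_contiguous_index(token2id):
    """Return (old_to_new, new_to_old) id maps that close the gaps.

    The relative order of the original ids is preserved.
    """
    old_ids = sorted(token2id.values())
    old_to_new = {old: new for new, old in enumerate(old_ids)}
    new_to_old = {new: old for old, new in old_to_new.items()}
    return old_to_new, new_to_old


def remap_bow(bow, id_map):
    """Translate a bag-of-words list of (term_id, frequency) pairs."""
    return [(id_map[term_id], frequency) for term_id, frequency in bow]


# Hypothetical vocabulary with gaps at ids 0 and 3.
token2id = {"car": 1, "so": 2, "nice": 4}
old_to_new, new_to_old = build_contiguous_index(token2id)

bow = [(1, 2), (4, 1)]                      # document in original ids
compact = remap_bow(bow, old_to_new)        # document in contiguous ids
restored = remap_bow(compact, new_to_old)   # back to original ids
print(compact)
print(restored)
```

The contiguous ids would be used only for the similarity computation, while the rest of the system keeps the original ids.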

@tanthml Please see if #2047 fixes your problem.

@Witiko, Thanks for your solution, how can I test your new update?

@tanthml, if you are using Git on the command line, then as follows:

git clone https://github.com/witiko/gensim
cd gensim
git checkout fix-similarity-matrix-contiguous-dict 
python setup.py install