gensim.similarities.SparseMatrixSimilarity causes a segmentation fault

Created on 8 Feb 2018  ·  4 Comments  ·  Source: RaRe-Technologies/gensim

I want to compute the similarity of one document to other documents using gensim. The program starts correctly, but after a few steps it exits with a segmentation fault. The gensim version is 3.3.0 and the Python version is 2.7.6.

Below is my code:

from gensim import corpora, models, similarities
docs = [['Looking', 'for', 'the', 'meanings', 'of', 'words'],
        ['phrases'],
        ['and', 'expressions'],
        ['We', 'provide', 'hundreds', 'of', 'thousands', 'of', 'definitions'],
        ['synonyms'],
        ['antonyms'],
        ['and', 'pronunciations', 'for', 'English', 'and', 'other', 'languages'],
        ['derived', 'from', 'our', 'language', 'research', 'and', 'expert', 'analysis'],
        ['We', 'also', 'offer', 'a', 'unique', 'set', 'of', 'examples', 'of', 'real', 'usage'],
        ['as', 'well', 'as', 'guides', 'to:']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(text) for text in docs]
nf=len(dictionary.dfs)
index = similarities.SparseMatrixSimilarity(corpus, num_features=nf)
phrases = [['This',
            'section',
            'gives',
            'guidelines',
            'on',
            'writing',
            'in',
            'everyday',
            'situations'],
           ['from',
            'applying',
            'for',
            'a',
            'job',
            'to',
            'composing',
            'letters',
            'of',
            'complaint',
            'or',
            'making',
            'an',
            'insurance',
            'claim'],
           ['There',
            'are',
            'plenty',
            'of',
            'sample',
            'documents',
            'to',
            'help',
            'you',
            'get',
            'it',
            'right',
            'every',
            'time'],
           ['create',
            'a',
            'good',
            'impression'],
           ['and',
            'increase',
            'the',
            'likelihood',
            'of',
            'achieving',
            'your',
            'desired',
            'outcome']]
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
sims=index[phrase2word]

Everything runs normally up to the last line, but computing sims crashes. Running the program under gdb gives the following:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffd881d809 in csr_tocsc (n_row=5, n_col=39,
Ap=0x4a4eb10, Aj=0x9fc6ec0, Ax=0x1be4a00, Bp=0xa15f6a0, Bi=0x9f3ee80,
Bx=0x9f85f60) at scipy/sparse/sparsetools/csr.h:411
411 scipy/sparse/sparsetools/csr.h: No such file or directory.

All 4 comments

Hi @dancinghui,

the problem comes from this line:
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
You shouldn't update the dictionary after you have fixed the number of features in
index = similarities.SparseMatrixSimilarity(corpus,num_features=nf)

Everything works fine if you replace
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
with
phrase2word=[dictionary.doc2bow(text) for text in phrases]
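
To make the mismatch concrete: the gdb trace reports n_col=39, which matches the 39 distinct tokens in docs, while tokens added afterwards through allow_update=True should receive ids of 39 and up, outside the width the index was built with. Below is a minimal sketch of a check to run before querying, assuming the usual bag-of-words layout of (feature_id, count) pairs. The helper name out_of_range_ids and the sample ids are hypothetical; this is not a gensim function.

```python
def out_of_range_ids(bow_corpus, num_features):
    """Return the sorted feature ids in bow_corpus that are >= num_features.

    Hypothetical helper for illustration only, not part of gensim.
    """
    return sorted({fid for bow in bow_corpus for fid, _ in bow
                   if fid >= num_features})


# Made-up query vectors: ids 39 and 40 stand for tokens added after
# an index of width 39 had already been created.
index_width = 39
queries = [[(1, 1), (29, 1), (39, 1)], [(4, 1), (40, 2)]]

bad = out_of_range_ids(queries, index_width)
if bad:
    print("rebuild the index, ids outside num_features:", bad)
```

A check like this turns the crash into a readable message; it does not replace rebuilding the index with the new number of features.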

Hi @menshikh-iv,
I need to update the dictionary, so I set allow_update=True.
If I replace
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
with
phrase2word=[dictionary.doc2bow(text) for text in phrases]
the dictionary will not add the new words that appear in text.

@dancinghui after this operation you need to re-create the index (otherwise you'll get a segfault).

I advise you to fill your dictionary first and create the index afterwards.
Also, you can try HashDictionary: it is based on the hashing trick and always has a constant size (it accepts absolutely all tokens).
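
The hashing trick mentioned above can be sketched in plain Python. This illustrates the idea only and is not gensim's implementation: the function name hashed_bow is hypothetical, and using zlib.adler32 with id_range=32000 is an assumption made for the example. Every token maps to an id below id_range, so the number of features is known before any document is seen, at the price of possible collisions between tokens.

```python
import zlib


def hashed_bow(tokens, id_range=32000):
    """Map tokens to (feature_id, count) pairs with ids in [0, id_range).

    Hypothetical sketch of the hashing trick, not gensim's code.
    """
    counts = {}
    for token in tokens:
        # Hash the token and fold it into the fixed id range.
        fid = zlib.adler32(token.encode("utf-8")) % id_range
        counts[fid] = counts.get(fid, 0) + 1
    return sorted(counts.items())


# Tokens never seen before still land inside the same fixed range.
bow_a = hashed_bow(["insurance", "claim", "claim"], id_range=1000)
bow_b = hashed_bow(["never", "seen", "before"], id_range=1000)
```

With a scheme like this, num_features can be set to id_range once and never needs to change when new words appear.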

Hi @menshikh-iv,
I tried your advice and put

nf=len(dictionary.dfs)
index = similarities.SparseMatrixSimilarity(corpus, num_features=nf)

after
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
and now it works normally. The main reason is that num_features has to match the size of dictionary.dfs.
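
Why the order matters can be shown with a toy stand-in for the dictionary. This is a sketch under stated assumptions: the doc2bow function below is a hypothetical plain-Python imitation written for this example, not gensim's code. It assumes that unknown tokens are dropped unless allow_update is set, and that each new token gets the next free id.

```python
def doc2bow(token2id, tokens, allow_update=False):
    """Toy imitation of a dictionary's doc2bow: sorted (id, count) pairs."""
    counts = {}
    for tok in tokens:
        if tok not in token2id:
            if not allow_update:
                continue  # unknown tokens are silently dropped
            token2id[tok] = len(token2id)  # a new id grows the vocabulary
        fid = token2id[tok]
        counts[fid] = counts.get(fid, 0) + 1
    return sorted(counts.items())


token2id = {}
corpus = [doc2bow(token2id, d, allow_update=True)
          for d in [["a", "b"], ["b", "c"]]]
size_before = len(token2id)  # 3: too small once the queries add words

queries = [doc2bow(token2id, q, allow_update=True)
           for q in [["c", "d"], ["e"]]]
size_after = len(token2id)  # 5: the width the index actually needs

# Compute num_features only after the last update, as in the fix above.
num_features = size_after
```

Taking the size before the queries are processed gives 3, while the queries already use ids 3 and 4; taking it afterwards gives 5, which covers every id.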
