gensim.similarities.SparseMatrixSimilarity causes a segmentation fault

Created on 8 Feb 2018 · 4 Comments · Source: RaRe-Technologies/gensim

I want to compute the similarity of one document to other documents using gensim. The program starts correctly, but after a few steps it exits with a segmentation fault. The gensim version is 3.3.0 and the Python version is 2.7.6.

Below is my code:

from gensim import corpora, models, similarities
docs = [['Looking', 'for', 'the', 'meanings', 'of', 'words'],
        ['phrases'],
        ['and', 'expressions'],
        ['We', 'provide', 'hundreds', 'of', 'thousands', 'of', 'definitions'],
        ['synonyms'],
        ['antonyms'],
        ['and', 'pronunciations', 'for', 'English', 'and', 'other', 'languages'],
        ['derived', 'from', 'our', 'language', 'research', 'and', 'expert', 'analysis'],
        ['We', 'also', 'offer', 'a', 'unique', 'set', 'of', 'examples', 'of', 'real', 'usage'],
        ['as', 'well', 'as', 'guides', 'to:']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(text) for text in docs]
nf=len(dictionary.dfs)
index = similarities.SparseMatrixSimilarity(corpus, num_features=nf)
phrases = [['This', 'section', 'gives', 'guidelines', 'on', 'writing', 'in', 'everyday', 'situations'],
           ['from', 'applying', 'for', 'a', 'job', 'to', 'composing', 'letters', 'of', 'complaint', 'or', 'making', 'an', 'insurance', 'claim'],
           ['There', 'are', 'plenty', 'of', 'sample', 'documents', 'to', 'help', 'you', 'get', 'it', 'right', 'every', 'time'],
           ['create', 'a', 'good', 'impression'],
           ['and', 'increase', 'the', 'likelihood', 'of', 'achieving', 'your', 'desired', 'outcome']]
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
sims=index[phrase2word]

Everything runs normally up to the last line, but computing sims crashes. Running it under gdb gives the following:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffd881d809 in csr_tocsc (n_row=5, n_col=39,
Ap=0x4a4eb10, Aj=0x9fc6ec0, Ax=0x1be4a00, Bp=0xa15f6a0, Bi=0x9f3ee80,
Bx=0x9f85f60) at scipy/sparse/sparsetools/csr.h:411
411 scipy/sparse/sparsetools/csr.h: No such file or directory.

dancinghui

All 4 comments

Hi @dancinghui,

The problem comes from this line:
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
You shouldn't update the dictionary after specifying the number of features in
index = similarities.SparseMatrixSimilarity(corpus,num_features=nf)

Everything works fine if you replace
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
with
phrase2word=[dictionary.doc2bow(text) for text in phrases]
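
To illustrate why updating the dictionary after fixing num_features is a problem, here is a minimal Python 3 sketch. TinyDictionary is a hypothetical, simplified stand-in written for this example; it is not gensim's Dictionary and only mimics the effect of allow_update on feature ids.

```python
# Simplified stand-in for a token-to-id dictionary; NOT gensim's implementation.
class TinyDictionary:
    def __init__(self):
        self.token2id = {}

    def doc2bow(self, tokens, allow_update=False):
        counts = {}
        for token in tokens:
            if token not in self.token2id:
                if not allow_update:
                    continue  # unknown tokens are silently dropped
                # A new token gets the next free id, past the old id range.
                self.token2id[token] = len(self.token2id)
            token_id = self.token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
        return sorted(counts.items())


dictionary = TinyDictionary()
dictionary.doc2bow(['and', 'expressions'], allow_update=True)
num_features = len(dictionary.token2id)  # fixed when the index is built: 2

# Without allow_update, 'increase' is ignored and all ids stay below num_features.
frozen = dictionary.doc2bow(['and', 'increase'])
# With allow_update, 'increase' receives id 2, which is >= num_features.
grown = dictionary.doc2bow(['and', 'increase'], allow_update=True)

print(frozen)  # [(0, 1)]
print(grown)   # [(0, 1), (2, 1)]
```

An index built with the old num_features has no column for the new id, which is consistent with the crash reported inside the sparse matrix conversion.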

menshikh-iv on 8 Feb 2018

Hi @menshikh-iv,
I need to update the dictionary, so I set allow_update=True.
If I replace
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
with
phrase2word=[dictionary.doc2bow(text) for text in phrases]
the dictionary will not add the new words that appear in the text.

dancinghui on 8 Feb 2018

@dancinghui after this operation you need to re-create the index (otherwise you'll get a segfault).

I advise you to fill your dictionary first and only then create the index.
Alternatively, you can try HashDictionary, which is based on the hashing trick and always has a constant size (it accepts absolutely all tokens).
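
The hashing trick mentioned above can be sketched in plain Python 3 as follows. This is only an illustration of the idea, not gensim's HashDictionary code: the helper names, the choice of zlib.adler32 as the hash, and the ID_RANGE value are assumptions made for this example.

```python
import zlib

ID_RANGE = 32000  # assumed size of the id space; a fixed num_features could use the same value


def hashed_id(token, id_range=ID_RANGE):
    # Hash the UTF-8 bytes of the token and fold the result into [0, id_range).
    return zlib.adler32(token.encode('utf-8')) % id_range


def hashed_doc2bow(tokens, id_range=ID_RANGE):
    # Count tokens by hashed id; distinct tokens may collide on the same id.
    counts = {}
    for token in tokens:
        token_id = hashed_id(token, id_range)
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


bow = hashed_doc2bow(['create', 'a', 'good', 'impression'])
print(all(0 <= token_id < ID_RANGE for token_id, _ in bow))  # True
```

Because every token, seen before or not, maps into the same fixed range, the number of features never changes after the index is built. The trade-off is that unrelated tokens can share an id.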

menshikh-iv on 8 Feb 2018

👍1

Hi @menshikh-iv,
I tried your advice and put

nf=len(dictionary.dfs)
index = similarities.SparseMatrixSimilarity(corpus, num_features=nf)

after
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
and now it works normally. The main reason is that num_features must match the size of dictionary.dfs.
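
Since num_features has to cover every feature id in the query, a cheap check before querying can turn a crash like this into an ordinary Python exception. The sketch below is Python 3 and works on plain lists of (id, weight) pairs, the bag-of-words format that doc2bow returns. The helper names max_feature_id and check_fits, and the example ids, are hypothetical.

```python
def max_feature_id(bows):
    """Return the largest feature id in a list of (id, weight) vectors, or -1 if there are none."""
    return max((feature_id for bow in bows for feature_id, _ in bow), default=-1)


def check_fits(bows, num_features):
    """Raise ValueError if any feature id does not fit into num_features columns."""
    largest = max_feature_id(bows)
    if largest >= num_features:
        raise ValueError(
            'feature id %d does not fit num_features=%d; rebuild the index'
            % (largest, num_features)
        )


# Illustrative query vectors: ids 38 and below fit an index with 39 features, id 41 does not.
query = [[(0, 1), (38, 2)], [(41, 1)]]
try:
    check_fits(query, num_features=39)
    message = None
except ValueError as err:
    message = str(err)

print(message)  # feature id 41 does not fit num_features=39; rebuild the index
```

Calling check_fits(phrase2word, nf) just before sims=index[phrase2word] would flag the stale index without needing gdb.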

dancinghui on 9 Feb 2018

👍3
