gensim.summarization.keywords fetching different results

Created on 29 Aug 2019 · 7 Comments · Source: RaRe-Technologies/gensim

Problem description

Hi, I am seeing a weird issue: when I pass the exact same text to the following function,

gensim.summarization.keywords(text1, ratio=0.9, pos_filter=('NP')).split("\n")

I get two different result sets for the exact same parameters when I run it multiple times. The output should be the same for a given text.
How is it possible that it excludes or includes a few extracted phrases from one run to the next?
The difference is shown below: ['data'] in one run vs. ['static data'] in the other, and ['dynamic'] was not returned at all in the second run.
A screenshot is attached for reference. Any guidance will be appreciated.
[Screenshot: gensim_summarization_diffresults]

Steps/code/corpus to reproduce

import gensim 
text1 = 'The method according to claim3, wherein the step of collecting further comprises: receiving the static data in the management data through a notification about change of the at least one cloud server being reported by a protocol agent which is configured to collect the management data from the at least one cloud server; and requesting and receiving the dynamic data in the management data from the protocol agent.'
phrase_token = gensim.summarization.keywords(text1, ratio=0.9, pos_filter=('NP')).split("\n")
print(phrase_token)
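
One way to confirm the behaviour is to call the extractor repeatedly and count the distinct results. The helper below is only a sketch and not part of gensim: `count_distinct_outputs`, `flaky_keywords` and `stable_keywords` are hypothetical names, and the two stand-in functions merely imitate a nondeterministic and a deterministic extractor.

```python
import random
from collections import Counter


def count_distinct_outputs(fn, runs=20):
    """Call fn() `runs` times and count how often each distinct result appears."""
    results = Counter()
    for _ in range(runs):
        # Convert to a tuple so list results can be used as dictionary keys.
        results[tuple(fn())] += 1
    return results


# Stand-ins for keyword extractors (hypothetical, for illustration only).
rng = random.Random(0)


def flaky_keywords():
    # Imitates an extractor whose output changes between calls.
    return ["static data"] if rng.random() < 0.5 else ["data", "dynamic"]


def stable_keywords():
    # Imitates an extractor that always returns the same phrases.
    return ["static data"]


flaky_counts = count_distinct_outputs(flaky_keywords)
stable_counts = count_distinct_outputs(stable_keywords)
print(len(flaky_counts), len(stable_counts))
```

With gensim 3.x installed, `fn` could instead be a lambda wrapping the `gensim.summarization.keywords(...)` call from the reproduction; more than one key in the returned counter would indicate nondeterministic output.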

Versions

Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.16.4
SciPy 1.2.1
gensim 3.7.3
FAST_VERSION 1

Source

JayeetaP


All 7 comments

I can reproduce this issue.

@piskvorky @menshikh-iv Is summarization supposed to be deterministic?

mpenkov on 7 Sep 2019

This was a student / contributed project; unfortunately, I'm not familiar with the algo or code at all.

piskvorky on 7 Sep 2019

I think given the momentum behind deprecating and eventually removing summarisation, we can sweep this one under the rug, right @piskvorky?

mpenkov on 7 Sep 2019

I'm not familiar with the module, so it's hard for me to make the call.

If it's something useful to users, I'd prefer to fix it. IIRC the summarization algo was standard (blog post). But if it's one of the badly-motivated-badly-executed student projects, then yeah, let's cut it.

piskvorky on 7 Sep 2019

gensim.summarization.keywords is not deterministic due to non-determinism of eig(s) in numpy (for example https://github.com/numpy/numpy/issues/6378).

horpto on 8 Sep 2019

👍4
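
As a rough illustration of how such ranking can be made repeatable: PageRank-style scores can be computed with plain power iteration from a fixed starting vector, rather than with an iterative eigensolver whose starting vector may be random by default. This is a sketch under that assumption and not gensim's code; `pagerank_power_iteration`, the damping value and the toy graph are all illustrative.

```python
def pagerank_power_iteration(adjacency, damping=0.85, tol=1e-12, max_iter=1000):
    """Return PageRank scores for a weighted adjacency matrix (list of lists).

    adjacency[i][j] is the weight of the edge from node i to node j.
    """
    n = len(adjacency)
    out_weight = [sum(row) for row in adjacency]
    scores = [1.0 / n] * n  # fixed uniform start, so every run is identical
    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if out_weight[i] > 0:
                    rank += scores[i] * adjacency[i][j] / out_weight[i]
                else:
                    # Dangling node: spread its score evenly over all nodes.
                    rank += scores[i] / n
            new_scores.append((1.0 - damping) / n + damping * rank)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Toy undirected word graph with edges 0-1, 1-2, 1-3 (node 1 is the hub).
graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
first = pagerank_power_iteration(graph)
second = pagerank_power_iteration(graph)
print(first == second)
```

Because the start vector and the order of operations are fixed, repeated runs on the same machine produce identical scores, so ties between candidate words are always broken the same way.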

Okay, so unless it is updated, using summarization can lead to different results. Are there any similar techniques for generating summary phrases (probably only noun phrases) from long texts that I can test instead? I guess a workaround could be to use the nltk regex parser to find the subtrees labeled 'NP' in a sentence and then join their leaves to get phrases? I appreciate all the assistance.

JayeetaP on 10 Sep 2019
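
The chunking idea in the comment above can be sketched without any dependencies. The function below is a plain-Python stand-in, not nltk: it groups runs of adjective and noun tags into phrases, which is roughly what an nltk `RegexpParser` grammar such as `NP: {<JJ>*<NN.*>+}` is meant to express. `noun_phrases` is a hypothetical name, and the part-of-speech tags in the sample are hand-assigned assumptions rather than tagger output.

```python
def noun_phrases(tagged_tokens):
    """Group maximal runs of adjectives/nouns that end in a noun into phrases.

    tagged_tokens is a list of (word, tag) pairs using Penn Treebank tags.
    """
    phrases, current = [], []

    def flush():
        # Drop trailing adjectives so every phrase ends with a noun.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("NN") or tag.startswith("JJ"):
            current.append((word, tag))
        else:
            # Any other tag ends the current candidate phrase.
            flush()
    flush()
    return phrases


# Hand-tagged fragment of the claim text (tags are illustrative assumptions).
tagged = [
    ("receiving", "VBG"), ("the", "DT"), ("static", "JJ"), ("data", "NNS"),
    ("in", "IN"), ("the", "DT"), ("management", "NN"), ("data", "NNS"),
    ("through", "IN"), ("a", "DT"), ("notification", "NN"),
]
print(noun_phrases(tagged))  # ['static data', 'management data', 'notification']
```

Given the same tagged input, this always returns the same phrases in the same order; any remaining variation would have to come from the tagger.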

@JayeetaP We use github tickets for error reports only, so I think your questions are out of scope for this ticket. Could you please ask on the mailing list instead?

mpenkov on 28 Sep 2019

👍1
