Gensim: Correct way to use pos_tagger option in /summarization/keywords.py

Created on 16 May 2018 · 10 Comments · Source: RaRe-Technologies/gensim

Description

When calling keywords() from summarization/keywords.py, I get the same set of keywords no matter which value I pass for pos_filter: ['NN'], ['JJ'] or ['NN','JJ'].

Steps/Code/Corpus to Reproduce

Example:

from __future__ import print_function  # lets print() behave the same on Python 2.7 and 3

import requests
from gensim.summarization import keywords

url = 'https://www.nytimes.com/2018/05/16/opinion/ramadan-spirit-america.html'
text = requests.get(url).text

# ('NN') is just the string 'NN'; a one-element tuple needs a trailing comma.
print(keywords(text, words=15, pos_filter=('NN',), lemmatize=True, scores=True))
print()
print(keywords(text, words=15, pos_filter=('NN', 'JJ'), lemmatize=True, scores=True))
print()
print(keywords(text, words=15, pos_filter=('JJ',), lemmatize=True, scores=True))

Expected Results

If I restrict pos_filter to 'NN', only nouns should be returned as keywords; however, words like "started" and "looking" also appear in the output.
Similarly, the output is the same irrespective of whether pos_filter is set to 'NN', to 'NN' and 'JJ', or to 'JJ'.

What is the correct way of using pos_filter to reflect appropriate output?
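One detail worth ruling out first is a plain Python pitfall: parentheses alone do not create a tuple. The sketch below uses only built-ins. It assumes that the filter values are eventually iterated or placed in a set, which would turn a bare string into single characters. It does not by itself explain why all three calls give the same output.

```python
# Parentheses around a single value are just grouping, not a tuple.
single = ('NN')
# A trailing comma is what makes a one-element tuple.
proper = ('NN',)

print(type(single).__name__)  # str
print(type(proper).__name__)  # tuple

# Iterating a string yields characters, so a set built from it holds 'N', not 'NN'.
print(set(single))
print(set(proper))
```

Passing `('NN',)` or `['NN']` avoids this ambiguity.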

Actual Results

student:0.20870111939889552, muslims:0.18960896637225794, americans:0.18895097005190414, ramadan:0.17605599898176202, month:0.12130699512494893, started:0.11817668681654464, community:0.11691583075245701, places:0.1117677772315554, spirituality:0.103727092629442, car:0.09988305780275739, white:0.09747271853405554, trump:0.09747271853405551, looking:0.09538360210000996, president:0.09538360210000986, black:0.0920316444206821

student:0.2087011193988958, muslims:0.18960896637225758, americans:0.1889509700519042, ramadan:0.17605599898176225, month:0.12130699512494901, started:0.11817668681654461, community:0.11691583075245732, places:0.11176777723155559, spirituality:0.10372709262944187, car:0.099883057802757, trump:0.09747271853405544, white:0.09747271853405512, president:0.0953836021000099, looking:0.09538360210000954, black:0.09203164442068222

student:0.20870111939889593, muslims:0.1896089663722575, americans:0.1889509700519037, ramadan:0.17605599898176255, month:0.1213069951249494, started:0.11817668681654483, community:0.11691583075245665, places:0.11176777723155547, spirituality:0.10372709262944207, car:0.09988305780275722, white:0.09747271853405541, trump:0.09747271853405526, looking:0.09538360210000975, president:0.0953836021000096, black:0.09203164442068222

Versions

Please run the following snippet and paste the output below.
Linux-4.13.0-39-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Dec 4 2017, 14:50:18) \n[GCC 5.4.0 20160609]')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('gensim', '3.4.0')
('FAST_VERSION', 1)

Source

NandanIITM

All 10 comments

I have the same issue

romanovzky on 24 Jul 2018

Found the reason why it doesn't work. The relevant code is in the _get_words_for_graph function:

for word, unit in iteritems(tokens):
    if exclude_filters and unit.tag in exclude_filters:
        continue
    if (include_filters and unit.tag in include_filters) or not include_filters or not unit.tag:
        result.append(unit.token)

Here, "tokens" comes from the _clean_text_by_word function, which does not provide any POS tags, as can be seen with the following code:

from six import iteritems
from gensim.summarization.textcleaner import clean_text_by_word

test_text = "This is an example text. It talks about dogs and God."

# Maps each word to a unit object; the loop below prints each unit's tag.
tokens = clean_text_by_word(test_text)

for word, unit in iteritems(tokens):
    print(word)
    print(unit.tag)

This returns:

example
None
text
None
talks
None
dogs
None
god
None

Looking at the clean_text_by_word documentation, I see no mention of any POS tagging functionality.

So either there is functionality missing, there was some functionality removed with refactoring, or the wrong function is being used to clean the text.
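The effect of untagged tokens on the condition quoted earlier in this comment can be sketched in isolation. This is a minimal sketch and not gensim's code: `Unit`, `words_for_graph` and the example tags are hypothetical stand-ins, and only the filtering condition is copied from the snippet above.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's unit object; only token and tag are modeled.
Unit = namedtuple("Unit", ["token", "tag"])


def words_for_graph(tokens, include_filters, exclude_filters=frozenset()):
    """Apply the same condition as the loop quoted above."""
    result = []
    for word, unit in tokens.items():
        if exclude_filters and unit.tag in exclude_filters:
            continue
        if (include_filters and unit.tag in include_filters) or not include_filters or not unit.tag:
            result.append(unit.token)
    return result


# Tokens as observed above: every tag is None.
untagged = {w: Unit(w, None) for w in ["dogs", "talks", "example"]}
# Made-up tags, only to show what the filter does once tags exist.
tagged = {
    "dogs": Unit("dogs", "NN"),
    "talks": Unit("talks", "VB"),
    "example": Unit("example", "JJ"),
}

print(sorted(words_for_graph(untagged, {"NN"})))  # ['dogs', 'example', 'talks']
print(sorted(words_for_graph(tagged, {"NN"})))    # ['dogs']
```

Because of the trailing `or not unit.tag`, a token whose tag is None always passes, so any value of the include filter yields the same word list.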

romanovzky on 24 Jul 2018

I would also like to point out that the keyword tutorial is wrong for the same reason: by default the keywords function is supposed to keep only words tagged ['NN', 'JJ'], yet in The Matrix movie synopsis example we can see "says", which is neither.

romanovzky on 24 Jul 2018

Thank you so much romanovzky for the explanation.

NandanIITM on 25 Jul 2018

Why close it? The problem persists, no? There is some functionality that is not working as intended and contrary to the documentation.

romanovzky on 25 Jul 2018

@romanovzky your example will work if you install the pattern package. I mean the example from https://github.com/RaRe-Technologies/gensim/issues/2053#issuecomment-407524565

Regarding "as by default the keyword function is supposed to tag only ['NN', 'JJ'] and in The Matrix movie synopsis example we can see says, which is neither":

The tutorial was probably run without the pattern package installed -> no tags -> no filtering.

menshikh-iv on 30 Jul 2018

Hi there. I do have Pattern installed, pattern3 to be exact. Or does it have to be pattern? I can't install that one on Python 3.6...

romanovzky on 30 Jul 2018

@romanovzky I checked with 2.7 and pattern (not pattern3), I'm not sure about python3 in this case.

menshikh-iv on 30 Jul 2018

I just found a way to install pattern. However, I still have the same problem afterwards. How can I debug this further?

romanovzky on 30 Jul 2018

That's really strange. Try running this code first:

from gensim.summarization.textcleaner import HAS_PATTERN
from gensim.utils import has_pattern

assert HAS_PATTERN
assert has_pattern()

If an assert fails, pattern isn't installed correctly. If both pass, run your code under a debugger (for example ipdb) and check that this line executes correctly (i.e. that it assigns tags):
https://github.com/RaRe-Technologies/gensim/blob/f09b7db9c2d21cde85ee338deac26f74bdb1b000/gensim/summarization/textcleaner.py#L277
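If the asserts fail, it may help to see whether the interpreter can import pattern at all and from where. The following is a rough sketch using only the standard library; `describe_module` is a hypothetical helper, and the choice of `pattern.en` as the module to probe is an assumption about what gensim imports.

```python
import importlib


def describe_module(name):
    """Return the file a module was loaded from, or a message describing the failure."""
    try:
        module = importlib.import_module(name)
    except Exception as exc:
        # Caught broadly: a broken install may fail with something other than ImportError.
        return "not importable: {!r}".format(exc)
    return getattr(module, "__file__", None) or "<no file>"


# Shows either the install location or the error raised during import.
print(describe_module("pattern.en"))
```

Running this with the same interpreter that runs gensim shows whether pattern resolves to the expected environment, for example a different site-packages directory than the one pip installed into.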

menshikh-iv on 31 Jul 2018
