Gensim: Correct way to use pos_tagger option in /summarization/keywords.py

Created on 16 May 2018 · 10 Comments · Source: RaRe-Technologies/gensim

Description

While using keywords() from summarization/keywords.py, I get the same set of tags no matter which value I choose for pos_filter: ['NN'], ['JJ'] or ['NN','JJ'].

Steps/Code/Corpus to Reproduce

Example:

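from __future__ import print_function  # keeps print() working on Python 2.7 as well

from gensim.summarization import keywords
import requests

# Fetch the article used as the test text.
text = requests.get('https://www.nytimes.com/2018/05/16/opinion/ramadan-spirit-america.html').text

# pos_filter is passed as a tuple; ('NN') without the trailing comma is a plain string.
print(keywords(text, words=15, pos_filter=('NN',), lemmatize=True, scores=True))
print()
print(keywords(text, words=15, pos_filter=('NN', 'JJ'), lemmatize=True, scores=True))
print()
print(keywords(text, words=15, pos_filter=('JJ',), lemmatize=True, scores=True))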
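A side note on the pos_filter argument used above: in Python, parentheses alone do not make a tuple, so ('NN') is just the string 'NN', while ('NN',) is a one-element tuple. If the library turns pos_filter into a set of tags (an assumption about the implementation, not something confirmed in this thread), a bare string would be split into single characters. The helper name as_tag_set below is hypothetical and only illustrates the pitfall:

```python
def as_tag_set(pos_filter):
    """Hypothetical helper: normalise a pos_filter value into a set of tags."""
    if isinstance(pos_filter, str):
        # Wrap a lone string so it is not split into single characters.
        return {pos_filter}
    return set(pos_filter)


print(type(('NN')).__name__)             # str
print(type(('NN',)).__name__)            # tuple
print(set(('NN')))                       # {'N'}
print(set(('NN',)))                      # {'NN'}
print(sorted(as_tag_set('NN')))          # ['NN']
print(sorted(as_tag_set(('NN', 'JJ'))))  # ['JJ', 'NN']
```

This alone would not explain identical output for all three calls, but it is worth ruling out before debugging further.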

Expected Results

If I pass pos_filter as 'NN', only nouns should be returned as tags. However, tags like "started" and "looking" also appear in the output.
Similarly, there is no difference in the output irrespective of whether I use pos_filter='NN', pos_filter='NN','JJ' or pos_filter='JJ'.

What is the correct way of using pos_filter to reflect appropriate output?

Actual Results

student:0.20870111939889552, muslims:0.18960896637225794, americans:0.18895097005190414, ramadan:0.17605599898176202, month:0.12130699512494893, started:0.11817668681654464, community:0.11691583075245701, places:0.1117677772315554, spirituality:0.103727092629442, car:0.09988305780275739, white:0.09747271853405554, trump:0.09747271853405551, looking:0.09538360210000996, president:0.09538360210000986, black:0.0920316444206821

student:0.2087011193988958, muslims:0.18960896637225758, americans:0.1889509700519042, ramadan:0.17605599898176225, month:0.12130699512494901, started:0.11817668681654461, community:0.11691583075245732, places:0.11176777723155559, spirituality:0.10372709262944187, car:0.099883057802757, trump:0.09747271853405544, white:0.09747271853405512, president:0.0953836021000099, looking:0.09538360210000954, black:0.09203164442068222

student:0.20870111939889593, muslims:0.1896089663722575, americans:0.1889509700519037, ramadan:0.17605599898176255, month:0.1213069951249494, started:0.11817668681654483, community:0.11691583075245665, places:0.11176777723155547, spirituality:0.10372709262944207, car:0.09988305780275722, white:0.09747271853405541, trump:0.09747271853405526, looking:0.09538360210000975, president:0.0953836021000096, black:0.09203164442068222

Versions

Linux-4.13.0-39-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Dec 4 2017, 14:50:18) \n[GCC 5.4.0 20160609]')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('gensim', '3.4.0')
('FAST_VERSION', 1)

All 10 comments

I have the same issue

Found the reason why it doesn't work. Basically, in the _get_words_for_graph function, namely at

for word, unit in iteritems(tokens):
    if exclude_filters and unit.tag in exclude_filters:
        continue
    if (include_filters and unit.tag in include_filters) or not include_filters or not unit.tag:
        result.append(unit.token)

Here "tokens" comes from the _clean_text_by_word function, which in turn does not provide any POS tagging, as can be seen with the following code

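import gensim.summarization.textcleaner
from six import iteritems

test_text = "This is an example text. It talks about dogs and God."

# clean_text_by_word returns a mapping from each word to its unit object.
tokens = gensim.summarization.textcleaner.clean_text_by_word(test_text)

# Print every word together with the POS tag stored on its unit.
for word, unit in iteritems(tokens):
    print(word)
    print(unit.tag)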

Returning

example
None
text
None
talks
None
dogs
None
god
None

Looking at the clean_text_by_word documentation, I see no mention of any POS tagging functionality.

So either there is functionality missing, there was some functionality removed with refactoring, or the wrong function is being used to clean the text.
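To make the effect of the quoted condition concrete, here is a minimal self-contained sketch. The Unit namedtuple and the filter_tokens function are hypothetical stand-ins, not gensim's own objects; only the if-conditions are copied from the snippet above. Because of the trailing "or not unit.tag", any token whose tag is None passes regardless of include_filters:

```python
from collections import namedtuple

# Hypothetical stand-in for the unit objects returned by the text cleaner.
Unit = namedtuple("Unit", ["token", "tag"])


def filter_tokens(tokens, include_filters, exclude_filters=frozenset()):
    """Apply the condition quoted above to a dict of word -> Unit."""
    result = []
    for word, unit in tokens.items():
        if exclude_filters and unit.tag in exclude_filters:
            continue
        if (include_filters and unit.tag in include_filters) or not include_filters or not unit.tag:
            result.append(unit.token)
    return result


untagged = {"dogs": Unit("dog", None), "talks": Unit("talk", None)}
tagged = {"dogs": Unit("dog", "NN"), "talks": Unit("talk", "VB")}

print(sorted(filter_tokens(untagged, {"NN"})))  # ['dog', 'talk']
print(sorted(filter_tokens(tagged, {"NN"})))    # ['dog']
```

With untagged input the filter is effectively a no-op, which matches the identical output reported for every pos_filter value.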

I would also like to point out that the keyword tutorial is wrong for the same reason: by default the keyword function is supposed to keep only words tagged ['NN', 'JJ'], yet in The Matrix movie synopsis example the output contains "says", which is neither.

Thank you so much romanovzky for the explanation.

Why close it? The problem persists, no? There is some functionality that is not working as intended and contrary to the documentation.

@romanovzky your example will work if you install the pattern package. I am referring to the example from https://github.com/RaRe-Technologies/gensim/issues/2053#issuecomment-407524565

Regarding "as by default the keyword function is supposed to tag only ['NN', 'JJ'] and in The Matrix movie synopsis example we can see says, which is neither":

The tutorial was probably run without the pattern package installed, so there were no tags and therefore no filtering.

Hi there. I do have Pattern installed, pattern3 to be exact. Or does it have to be pattern? That one I can't install on Python 3.6...

@romanovzky I checked with 2.7 and pattern (not pattern3), I'm not sure about python3 in this case.

I just found a way to install pattern. However, I still have the same problem afterwards. How can I debug this further?

That's really strange. Try running this code first:

from gensim.summarization.textcleaner import HAS_PATTERN
from gensim.utils import has_pattern

assert HAS_PATTERN
assert has_pattern()

If an assert fails, pattern isn't installed correctly. If both pass, run your code with a debugger (for example ipdb) and check that this line executes correctly (i.e. that it assigns tags):
https://github.com/RaRe-Technologies/gensim/blob/f09b7db9c2d21cde85ee338deac26f74bdb1b000/gensim/summarization/textcleaner.py#L277
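As a complementary check that does not depend on gensim at all, the sketch below asks the import system of the running interpreter whether a module can be found. It assumes Python 3 (importlib.util is not available on 2.7) and assumes that gensim's tagging relies on a module importable as pattern, which this thread suggests but does not spell out. The function name is_importable is made up for illustration:

```python
import importlib.util


def is_importable(module_name):
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Run this with the same interpreter that runs gensim.
for name in ("gensim", "pattern"):
    print(name, "importable:", is_importable(name))
```

If pattern shows up as not importable here, the package was likely installed into a different environment than the one running the code.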
