Gensim: lemmatize: StopIteration error in Python 3.7

Created on 8 Apr 2019 · 12 comments · Source: RaRe-Technologies/gensim

Problem description

Trying to run simple lemmatization as described in the documentation. Getting:
RuntimeError: generator raised StopIteration

Steps/code/corpus to reproduce

from gensim.utils import lemmatize
lemmatize('Hello World! How is it going?! Nonexistentword, 21')

Versions

Darwin-17.7.0-x86_64-i386-64bit
Python 3.7.2 (default, Dec 29 2018, 00:00:04) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 1

Most helpful comment

There is no need to raise StopIteration inside Python generators. Remove every raise StopIteration, or add a try/except around the code that consumes such a generator. In fact, you can simply comment out the raise StopIteration inside the _read method (pattern/text/__init__.py, line 609, in _read):

**BEFORE**

...
raise StopIteration
# return

**AFTER**

...
# raise StopIteration
return

TL;DR
The problem is that when pattern lemmatizes, it relies on files and dictionaries that are loaded lazily: only when you first call the lemma function are they actually read.

The method that raises the StopIteration exception fails while creating an instance of the Verbs class, which uses a lazy dictionary, i.e. one that loads its data only when it is first accessed.
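The lazy-loading behaviour described above can be sketched in a few lines (a minimal illustration; `LazyDict` and its loader are hypothetical stand-ins, not pattern's actual lazydict):

```python
# Minimal sketch of a lazily-loaded dictionary: nothing is read until the
# first lookup, mirroring how pattern only loads its verb file when lemma()
# first needs it. LazyDict and the loader below are illustrative only.
class LazyDict(dict):
    def __init__(self, loader):
        super().__init__()
        self._loader = loader   # callable producing the real data
        self._loaded = False

    def _load(self):
        if not self._loaded:
            self._loaded = True
            self.update(self._loader())

    def __getitem__(self, key):
        self._load()            # load on first access
        return super().__getitem__(key)

verbs = LazyDict(lambda: {"has": "have", "had": "have"})
# Nothing has been loaded yet; the first lookup triggers the loader.
print(verbs["has"])  # -> have
```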

This is the docstring of the Verbs class inside pattern:

"""
    A dictionary of verb infinitives, each linked to a list of conjugated forms.
    Each line in the file at the given path is one verb, with the tenses separated by a comma.
    The format defines the order of tenses (see TENSES).
    The default dictionary defines default tenses for omitted tenses.
"""

The real problem lies in the _read method, whose generator is poorly implemented, as the code shows:

if path:
    if isinstance(path, str) and os.path.exists(path):
        # From file path.
        f = open(path, "r", encoding="utf-8")
    elif isinstance(path, str):
        # From string.
        f = path.splitlines()
    else:
        # From file or buffer.
        f = path
    for i, line in enumerate(f):
        line = line.strip(BOM_UTF8) if i == 0 and isinstance(line, str) else line
        line = line.strip()
        line = decode_utf8(line, encoding)
        if not line or (comment and line.startswith(comment)):
            continue
        yield line
raise StopIteration

That final raise StopIteration sits outside the if, so even when the path parameter is null or empty, the method raises StopIteration unconditionally. The fix is to add an else branch after if path: and move the raise StopIteration into it. In addition, the StopIteration should be caught with a try/except at the call site, so that a missing file at that path is reported cleanly, while a found file lets the generator complete normally.

if path:
    if isinstance(path, str) and os.path.exists(path):
        # From file path.
        f = open(path, "r", encoding="utf-8")
    elif isinstance(path, str):
        # From string.
        f = path.splitlines()
    else:
        # From file or buffer.
        f = path
    for i, line in enumerate(f):
        line = line.strip(BOM_UTF8) if i == 0 and isinstance(line, str) else line
        line = line.strip()
        line = decode_utf8(line, encoding)
        if not line or (comment and line.startswith(comment)):
            continue
        yield line
else:
    raise StopIteration

...
class Verbs(lazydict):
    def __init__
    ...
    def load(self):
        # have,,,has,,having,,,,,had,had,haven't,,,hasn't,,,,,,,hadn't,hadn't
        id = self._format[TENSES_ID[INFINITIVE]]
        try:
            for v in _read(self._path):
                v = v.split(",")
                dict.__setitem__(self, v[id], v)
                for x in (x for x in v if x):
                    self._inverse[x] = v[id]
        except StopIteration:
            raise ValueError("The path is empty or False")
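One caveat about the try/except above: on Python 3.7+, a StopIteration raised inside a generator surfaces at the consuming for-loop as RuntimeError (PEP 479), so the except StopIteration branch would never fire. A self-contained demonstration (the generator below is a hypothetical stand-in for _read):

```python
# PEP 479 in action: the StopIteration raised inside the generator is
# converted into RuntimeError by the time it reaches the consuming loop,
# so the `except StopIteration` clause is dead code on Python 3.7+.
def gen_with_bad_exit():
    yield "have,has"
    raise StopIteration  # becomes RuntimeError at the for-loop below

rows, caught = [], None
try:
    for v in gen_with_bad_exit():
        rows.append(v.split(","))
except StopIteration:
    caught = "StopIteration"  # never reached on Python 3.7+
except RuntimeError:
    caught = "RuntimeError"   # this branch fires instead

print(rows)    # -> [['have', 'has']]
print(caught)  # -> RuntimeError
```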

All 12 comments

@ajdapretnar what's your version of Pattern? (import pattern; print(pattern.__version__))

Also, please include the full stack trace.

'3.6'

And what's the stack trace?

Traceback (most recent call last):
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 609, in _read
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/gensim/utils.py", line 1692, in lemmatize
    parsed = parse(content, lemmata=True, collapse=False)
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/en/__init__.py", line 169, in parse
    return parser.parse(s, *args, **kwargs)
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 1183, in parse
    s[i] = self.find_lemmata(s[i], **kwargs)
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/en/__init__.py", line 107, in find_lemmata
    return find_lemmata(tokens)
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/en/__init__.py", line 99, in find_lemmata
    lemma = conjugate(word, INFINITIVE) or word
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 2208, in conjugate
    b = self.lemma(verb, parse=kwargs.get("parse", True))
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 2172, in lemma
    self.load()
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 2127, in load
    for v in _read(self._path):
RuntimeError: generator raised StopIteration

Seems unrelated to Gensim; try contacting the Pattern maintainers. FWIW, I have Pattern 2.6 and lemmatize works without problems there.

Strangely, it works when I am using just Pattern itself.

from pattern.en import lemma
s = 'Hello World! How is it going?!'  # example sentence
[lemma(w) for w in s.split(' ')]

I will investigate this a bit further, but it seems like a common Python 3.7 problem.

Great, thanks. Let us know if Pattern changed its APIs recently, and the issue is somehow connected to Gensim after all.

Note that Gensim is using pattern.en.parse(content, lemmata=True, collapse=False) to get the lemmata & POS tags.

Actually, it's related to new python 3.7 behavior:

PEP 479 is enabled for all code in Python 3.7, meaning that StopIteration exceptions raised directly or indirectly in coroutines and generators are transformed into RuntimeError exceptions. (Contributed by Yury Selivanov in bpo-32670.)

https://stackoverflow.com/questions/51700960/runtimeerror-generator-raised-stopiteration-every-time-i-try-to-run-app

Switching to Python 3.6 should solve the issue.
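The behaviour change quoted above can be verified in a few lines (a minimal sketch of PEP 479 semantics on Python 3.7+):

```python
# Under PEP 479 (mandatory from Python 3.7), raising StopIteration inside a
# generator no longer silently ends iteration: it is converted into
# RuntimeError. A plain `return` is the correct way to end a generator.
def bad_gen():
    yield 1
    raise StopIteration  # becomes RuntimeError on Python 3.7+

def good_gen():
    yield 1
    return  # idiomatic early exit; ends iteration cleanly

try:
    list(bad_gen())
except RuntimeError as e:
    print("bad_gen raised:", type(e).__name__)  # -> bad_gen raised: RuntimeError

print(list(good_gen()))  # -> [1]
```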

Sounds like something the pattern guys should fix, right?

I tried to open a PR fixing this issue, but pattern seems to have been abandoned since August 2018. I hope it gets fixed one day, because it leaves gensim's lemmatizer unusable on all Python versions >= 3.7.

@NicolasBizzozzero Can you please elaborate? I thought gensim has only a soft dependency on pattern, and if that library is not available, then things still work?

