Gensim: lemmatize: generator raised StopIteration

Created on 29 Dec 2019 · 14 Comments · Source: RaRe-Technologies/gensim

Problem description

I'm trying to apply the lemmatize function to my text, but I get a StopIteration exception.

Steps/code/corpus to reproduce

from gensim.utils import lemmatize


s = lemmatize('eight')
print(s)

Result:

python3 lem.py 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pattern/text/__init__.py", line 609, in _read
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "lem.py", line 4, in <module>
    s = lemmatize('eight')
  File "/usr/local/lib/python3.7/site-packages/gensim/utils.py", line 1692, in lemmatize
    parsed = parse(content, lemmata=True, collapse=False)
  File "/usr/local/lib/python3.7/site-packages/pattern/text/en/__init__.py", line 169, in parse
    return parser.parse(s, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pattern/text/__init__.py", line 1172, in parse
    s[i] = self.find_tags(s[i], **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pattern/text/en/__init__.py", line 114, in find_tags
    return _Parser.find_tags(self, tokens, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pattern/text/__init__.py", line 1113, in find_tags
    lexicon = kwargs.get("lexicon", self.lexicon or {}),
  File "/usr/local/lib/python3.7/site-packages/pattern/text/__init__.py", line 376, in __len__
    return self._lazy("__len__")
  File "/usr/local/lib/python3.7/site-packages/pattern/text/__init__.py", line 368, in _lazy
    self.load()
  File "/usr/local/lib/python3.7/site-packages/pattern/text/__init__.py", line 625, in load
    dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if len(x.split(" ")) > 1))
  File "/usr/local/lib/python3.7/site-packages/pattern/text/__init__.py", line 625, in <genexpr>
    dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if len(x.split(" ")) > 1))
RuntimeError: generator raised StopIteration

Versions

I'm using MacOS, Python3:

```
>>> import platform; print(platform.platform())
Darwin-18.7.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.7.4 (default, Sep 7 2019, 18:27:02)
[Clang 10.0.1 (clang-1001.0.46.4)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.18.0
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.4.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.1
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

$ pip3 freeze | grep pattern
pattern3==3.0.0
$ pip3 freeze | grep gensim
gensim==3.8.1
```

All 14 comments

Note: the issue reproduces with any text I try to lemmatize.

The same issue with "Pattern==3.6" library.

Thanks for the clear report. IIRC this was a bug inside the pattern library, nothing we can do on our side (I think).

If so, it's best you report the issue there and (assuming the pattern project is still maintained) let us know when pattern is fixed so we can update its dependency version in gensim. Thanks!

I recall issues with pattern coming up from time to time. From what I can tell, we only use it for one thing: lemmatization in our gensim.utils submodule.

It doesn't look like we need it all that much:

$ grep -R lemmatize gensim --files-with-match --exclude *.pyc
gensim/test/test_corpora.py
gensim/summarization/textcleaner.py
gensim/summarization/keywords.py
gensim/utils.py
gensim/scripts/make_wikicorpus.py
gensim/scripts/make_wiki_online_lemma.py
gensim/scripts/segment_wiki.py
gensim/scripts/make_wiki_online.py
gensim/scripts/make_wiki.py
gensim/scripts/make_wiki_lemma.py
gensim/scripts/make_wiki_online_nodebug.py
gensim/corpora/wikicorpus.py

I think in the majority of cases, we can avoid doing lemmatization ourselves, and leave it up to the user. Ideally, it should happen outside of gensim. If it needs to happen within gensim for some reason (I really can't think of one right now, but it's possible such use cases exist), we can always do dependency injection via a callback.

This has several benefits. First, we get rid of these problems with the pattern library. Second, we reduce the amount of code we have to maintain. Finally, we enable users to use other lemmatization methods (e.g. from NLTK).
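The callback idea above could look roughly like the following sketch. The function name `tokenize_and_normalize` and its `lemmatizer` parameter are hypothetical and not part of gensim, and `toy_lemmatizer` is only a stand-in for whatever lemmatizer the user would bring (e.g. one from NLTK).

```python
from typing import Callable, List, Optional


def tokenize_and_normalize(
    text: str,
    lemmatizer: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Split text on whitespace and optionally lemmatize each token.

    Hypothetical helper, not a gensim API: the caller injects the lemmatizer.
    """
    tokens = text.lower().split()
    if lemmatizer is None:
        # No callback supplied: return the plain tokens untouched.
        return tokens
    return [lemmatizer(token) for token in tokens]


def toy_lemmatizer(token: str) -> str:
    """Stand-in for a real lemmatizer: strips a trailing 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


print(tokenize_and_normalize("Eight cats chase mice", lemmatizer=toy_lemmatizer))
```

With this shape the library never imports a lemmatization package itself; swapping in another lemmatizer only means passing a different callable.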

@menshikh-iv @piskvorky @gojomo Thoughts?

Yes, Gensim explicitly does not concern itself with text preprocessing. It's mentioned explicitly multiple times in our documentation.

I originally included a few simple functions (simple_preprocess, porter stemmer, lemmatize), for illustration purposes and tutorials and tests. I should have known better… there still doesn't exist any reliable Python lib for NLP, after a decade.

I'm -1 on breaking backward compatibility though. Some people may rely on lemmatize(). Is there an option that doesn't break existing code, while letting users know they "shouldn't be doing this"? Maybe a deprecation warning?
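A deprecation warning of the kind suggested above might be sketched as follows. This is an illustration only: the function body is a placeholder and the message wording is an assumption, not gensim's actual code.

```python
import warnings


def lemmatize(content):
    """Hypothetical stand-in for a deprecated function; not gensim's actual code."""
    warnings.warn(
        "lemmatize() is deprecated; lemmatize text with a dedicated NLP library "
        "before passing it to gensim.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this function
    )
    # Placeholder for the existing behaviour, which would stay unchanged.
    return content.split()


# DeprecationWarning is often filtered out by default, so force it on for the demo.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = lemmatize("eight")

print(result)
print(caught[0].category.__name__)
```

Existing callers keep working, while anyone running with warnings enabled is told to move the lemmatization step elsewhere.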

I removed lemmatization and still get the error. Gensim checks whether "pattern" is installed and uses it for other tasks. If pattern does not work properly, then I suggest removing the dependency.

There is no need to raise StopIteration inside Python generators. Either remove every raise StopIteration, or add a try/except around all calls to functions that return generators. In fact, you can simply comment out the raise StopIteration inside the _read method (pattern/text/__init__.py, line 609, in _read) and use a plain return instead:

BEFORE

...
raise StopIteration
#return

AFTER

...
#raise StopIteration
return
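Why the plain return helps can be shown with a minimal sketch, independent of pattern. The two generator names below are made up for illustration. On Python 3.7 or later, PEP 479 converts a StopIteration raised inside a generator body into a RuntimeError, which is the error at the bottom of the traceback in the report.

```python
def broken_lines(lines):
    """Mimics the problematic shape: an explicit raise at the end of a generator."""
    for line in lines:
        yield line
    raise StopIteration  # converted to RuntimeError on Python 3.7+ (PEP 479)


def fixed_lines(lines):
    """Same generator, ending with a plain return instead."""
    for line in lines:
        yield line
    return


try:
    list(broken_lines(["a", "b"]))
    outcome = "no error"
except RuntimeError as exc:
    outcome = f"RuntimeError: {exc}"

print(outcome)
print(list(fixed_lines(["a", "b"])))
```

A generator signals exhaustion simply by returning (or by falling off the end of its body), so the explicit raise is never needed.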

TL;DR
When pattern lemmatizes, it relies on data files that are loaded lazily, meaning they are read only the first time the lemma function actually needs them.

The StopIteration exception surfaces during that lazy load. Specifically, it fails when an instance of the Verbs class, which is a lazy dictionary, loads its contents on first use.
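The lazy-loading behaviour described above can be illustrated with a simplified sketch. `LazyDict` and `load_verbs` are hypothetical names and this is not pattern's actual lazydict implementation; it only shows why the failure appears on first access (for example through `__len__`, as in the traceback) rather than at import time.

```python
class LazyDict(dict):
    """Simplified illustration of lazy loading; not pattern's actual lazydict."""

    def __init__(self, loader):
        super().__init__()
        self._loader = loader  # callable that returns the real contents
        self._loaded = False

    def _ensure_loaded(self):
        # Run the loader once, the first time the contents are needed.
        if not self._loaded:
            self._loaded = True
            dict.update(self, self._loader())

    def __getitem__(self, key):
        self._ensure_loaded()
        return dict.__getitem__(self, key)

    def __len__(self):
        self._ensure_loaded()
        return dict.__len__(self)


load_calls = []


def load_verbs():
    """Hypothetical loader; records each call so the laziness is visible."""
    load_calls.append("loaded")
    return {"has": "have", "had": "have"}


verbs = LazyDict(load_verbs)
print(load_calls)    # the loader has not run yet
print(verbs["had"])  # first access triggers the load
print(load_calls)
```

Any exception raised by the loader therefore shows up at the first lookup, far from where the dictionary was created.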

This is the docstring of the Verbs class in pattern:

"""
    A dictionary of verb infinitives, each linked to a list of conjugated forms.
    Each line in the file at the given path is one verb, with the tenses separated by a comma.
    The format defines the order of tenses (see TENSES).
    The default dictionary defines default tenses for omitted tenses.
"""

The real problem is inside the _read method, which is a poorly implemented generator, as the code shows:

if path:
    if isinstance(path, str) and os.path.exists(path):
        # From file path.
        f = open(path, "r", encoding="utf-8")
    elif isinstance(path, str):
        # From string.
        f = path.splitlines()
    else:
        # From file or buffer.
        f = path
    for i, line in enumerate(f):
        line = line.strip(BOM_UTF8) if i == 0 and isinstance(line, str) else line
        line = line.strip()
        line = decode_utf8(line, encoding)
        if not line or (comment and line.startswith(comment)):
            continue
        yield line
raise StopIteration

The final raise StopIteration should sit inside an else branch, so that it is raised only when the path parameter is None or empty. As written, it also runs after the loop has yielded every line, and since Python 3.7 (PEP 479) a StopIteration raised inside a generator is turned into a RuntimeError. So it would be necessary to add an else after the if path and put the raise StopIteration inside it. In addition, the caller would have to wrap the iteration in a try/except to detect that no file was found at that path; this way the generator returns normally whenever it does find the file.

if path:
    if isinstance(path, str) and os.path.exists(path):
        # From file path.
        f = open(path, "r", encoding="utf-8")
    elif isinstance(path, str):
        # From string.
        f = path.splitlines()
    else:
        # From file or buffer.
        f = path
    for i, line in enumerate(f):
        line = line.strip(BOM_UTF8) if i == 0 and isinstance(line, str) else line
        line = line.strip()
        line = decode_utf8(line, encoding)
        if not line or (comment and line.startswith(comment)):
            continue
        yield line
else:
    # Only reached when path is None or empty.
    # On Python 3.7+ this reaches the caller as RuntimeError (PEP 479).
    raise StopIteration

...
class Verbs(lazydict):
    def __init__
    ...
    def load(self):
        # have,,,has,,having,,,,,had,had,haven't,,,hasn't,,,,,,,hadn't,hadn't
        id = self._format[TENSES_ID[INFINITIVE]]
        try:
            for v in _read(self._path):
                v = v.split(",")
                dict.__setitem__(self, v[id], v)
                for x in (x for x in v if x):
                    self._inverse[x] = v[id]
        except RuntimeError as no_path:
            # On Python 3.7+ a StopIteration raised inside the _read generator
            # reaches the caller as RuntimeError (PEP 479).
            raise ValueError("The path is empty or False") from no_path

Gensim checks if "pattern" is installed and uses it for other tasks.

@simonm3 What other tasks? Can you post your error traceback?

@vquilon you're in the wrong repo – you want to post that in pattern, not gensim.

@piskvorky gensim can use pattern, if you have it

Backward compatibility is nice when practical, but if this has truly been broken for all inputs since the Pattern 3.6 release – about 2 years ago – it can't be that important.

Also, at this point the Pattern package has:

It's all fishy enough I wouldn't blindly-install the package, on the chance it may have been hijacked for malicious purposes. And, the project appears orphaned, so any plan predicated on "Pattern needs to fix something" seems unwise.

Finally, the gensim.utils.lemmatize() function is somewhat peculiar: it does POS-tagging too, already has to catch and raise an error for a parameter Pattern no longer supports, and includes a comment complaining about Pattern's 'weird' tokenization, with a "FIXME" workaround that throws away possibly valuable characters.

People who truly need lemmatization can try NLTK or Spacy's options.

I'd suggest removing all mentions/uses of the Pattern package, and replacing the lemmatize() function with a stub that shows an error suggesting the use of other libraries directly.
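Such a stub might look like the sketch below. The exception type and the message wording are assumptions made for illustration, not what gensim actually ships.

```python
def lemmatize(*args, **kwargs):
    """Hypothetical stub; the message wording is illustrative, not gensim's."""
    raise NotImplementedError(
        "lemmatize() has been removed. Use a dedicated library such as "
        "NLTK or spaCy to lemmatize text before passing it to gensim."
    )


try:
    lemmatize("eight")
except NotImplementedError as exc:
    message = str(exc)

print(message)
```

Accepting arbitrary arguments means every old call site reaches the explanatory error instead of failing earlier with a TypeError about the signature.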

You're right. Let's remove pattern (and Porter stemmer… and possibly even the "tokenizer") from gensim, with a helpful error message in case someone stumbles upon them.

You are right. I had been looking into it, and I posted the fix here in case someone hit the problem while using pattern within gensim and wanted to know how to fix it. Looking at the pattern repo, I have seen that there is already a branch that solves it, named "develop fix" or something like that.
