Trying to run simple lemmatization as described in the documentation. Getting:
RuntimeError: generator raised StopIteration
```python
from gensim.utils import lemmatize
lemmatize('Hello World! How is it going?! Nonexistentword, 21')
```
Darwin-17.7.0-x86_64-i386-64bit
Python 3.7.2 (default, Dec 29 2018, 00:00:04)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 1
@ajdapretnar what's your version of Pattern? (import pattern; print(pattern.__version__))
Also, please include the full stack trace.
'3.6'
And what's the stack trace?
```
Traceback (most recent call last):
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 609, in _read
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/gensim/utils.py", line 1692, in lemmatize
    parsed = parse(content, lemmata=True, collapse=False)
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/en/__init__.py", line 169, in parse
    return parser.parse(s, *args, **kwargs)
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 1183, in parse
    s[i] = self.find_lemmata(s[i], **kwargs)
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/en/__init__.py", line 107, in find_lemmata
    return find_lemmata(tokens)
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/en/__init__.py", line 99, in find_lemmata
    lemma = conjugate(word, INFINITIVE) or word
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 2208, in conjugate
    b = self.lemma(verb, parse=kwargs.get("parse", True))
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 2172, in lemma
    self.load()
  File "/Users/ajda/miniconda3/envs/o3/lib/python3.7/site-packages/pattern/text/__init__.py", line 2127, in load
    for v in _read(self._path):
RuntimeError: generator raised StopIteration
```
Seems unrelated to Gensim; try contacting the Pattern maintainers. FWIW, I have Pattern 2.6 and lemmatize works without problems there.
Strangely, it works when I am using just Pattern itself.
```python
from pattern.en import lemma
s = 'Hello World! How is it going?! Nonexistentword, 21'  # the same test sentence as above
[lemma(w) for w in s.split(' ')]
```
I will investigate this a bit further, but it seems like a common Python 3.7 problem.
Great, thanks. Let us know if Pattern changed its APIs recently, and the issue is somehow connected to Gensim after all.
Note that Gensim is using pattern.en.parse(content, lemmata=True, collapse=False) to get the lemmata & POS tags.
Actually, it's related to new Python 3.7 behavior (from the Python 3.7 release notes):

> PEP 479 is enabled for all code in Python 3.7, meaning that StopIteration exceptions raised directly or indirectly in coroutines and generators are transformed into RuntimeError exceptions. (Contributed by Yury Selivanov in bpo-32670.)
Switching to Python 3.6 should solve the issue.
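A minimal reproduction of the PEP 479 behavior change, independent of pattern:

```python
# On Python >= 3.7 (PEP 479), a StopIteration raised inside a generator
# body is converted into a RuntimeError instead of silently ending it.
def broken_gen():
    yield 1
    raise StopIteration  # pre-3.7: ends the generator; 3.7+: RuntimeError

try:
    list(broken_gen())
except RuntimeError as exc:
    # The original StopIteration is kept as the cause of the RuntimeError.
    print(type(exc.__cause__).__name__)  # -> StopIteration
```

This is exactly what happens inside pattern's `_read` generator when gensim's `lemmatize` triggers it.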
Sounds like something the pattern guys should fix, right?
I tried to open a PR fixing this issue, but pattern seems to have been abandoned since August 2018. I hope it will be solved one day, because it leaves gensim's lemmatizer unusable on all Python versions >= 3.7.
@NicolasBizzozzero Can you please elaborate? I thought gensim has only a soft dependency on pattern, and if that library is not available, then things still work?
There is no need to raise StopIteration in Python generators. Remove all the StopIteration raises, or add a try/except around every function that returns a generator. In fact, you can comment out the raise StopIteration inside the _read method (pattern/text/__init__.py, line 609, in _read):
**BEFORE**

```python
...
raise StopIteration
# return
```

**AFTER**

```python
...
# raise StopIteration
return
```
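As a runnable illustration of that edit, using a toy line reader rather than pattern's actual code:

```python
def read_lines_before(text):
    # Mirrors the old _read: ends with an explicit raise StopIteration,
    # which Python >= 3.7 turns into RuntimeError (PEP 479).
    for line in text.splitlines():
        if line.strip():
            yield line.strip()
    raise StopIteration

def read_lines_after(text):
    # The fixed version: a generator ends with a plain return.
    for line in text.splitlines():
        if line.strip():
            yield line.strip()
    return

print(list(read_lines_after("have\n\nhad")))  # -> ['have', 'had']
# list(read_lines_before("have\n\nhad")) raises RuntimeError on 3.7+
```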
TL;DR
The problem is that when pattern tries to lemmatize, it uses files and libraries that are loaded in lazy mode, meaning they are only loaded when the lemma function is first used.
The method that raises the StopIteration exception fails, specifically, while creating an instance of the Verbs class, which uses a lazy dictionary, i.e. one that loads its contents only when it is first accessed.
This is the docstring of the Verbs class inside pattern:

```python
"""
A dictionary of verb infinitives, each linked to a list of conjugated forms.
Each line in the file at the given path is one verb, with the tenses separated by a comma.
The format defines the order of tenses (see TENSES).
The default dictionary defines default tenses for omitted tenses.
"""
```
The real problem is inside the _read method, whose generator is poorly implemented, as we can see in the code:
```python
if path:
    if isinstance(path, str) and os.path.exists(path):
        # From file path.
        f = open(path, "r", encoding="utf-8")
    elif isinstance(path, str):
        # From string.
        f = path.splitlines()
    else:
        # From file or buffer.
        f = path
    for i, line in enumerate(f):
        line = line.strip(BOM_UTF8) if i == 0 and isinstance(line, str) else line
        line = line.strip()
        line = decode_utf8(line, encoding)
        if not line or (comment and line.startswith(comment)):
            continue
        yield line
raise StopIteration
```
The last raise StopIteration should be inside the if/else: if the path parameter is None or empty, the method should raise StopIteration to signal that, but as written it raises unconditionally after the loop. It would be necessary to add an else after the if path: and move the raise StopIteration into it. In addition, the StopIteration would have to be caught with a try/except to report that no file was found at that path; this way the method returns normally when it does find the file.
```python
if path:
    if isinstance(path, str) and os.path.exists(path):
        # From file path.
        f = open(path, "r", encoding="utf-8")
    elif isinstance(path, str):
        # From string.
        f = path.splitlines()
    else:
        # From file or buffer.
        f = path
    for i, line in enumerate(f):
        line = line.strip(BOM_UTF8) if i == 0 and isinstance(line, str) else line
        line = line.strip()
        line = decode_utf8(line, encoding)
        if not line or (comment and line.startswith(comment)):
            continue
        yield line
else:
    raise StopIteration
```
```python
...

class Verbs(lazydict):

    def __init__
        ...

    def load(self):
        # have,,,has,,having,,,,,had,had,haven't,,,hasn't,,,,,,,hadn't,hadn't
        id = self._format[TENSES_ID[INFINITIVE]]
        try:
            for v in _read(self._path):
                v = v.split(",")
                dict.__setitem__(self, v[id], v)
                for x in (x for x in v if x):
                    self._inverse[x] = v[id]
        except StopIteration as no_path:
            raise ValueError("The path is empty or False") from no_path
```
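Note that because _read is a generator, even a raise StopIteration placed in the else branch only runs when iteration starts, and PEP 479 still rewraps it as RuntimeError; signalling the missing path with an ordinary exception sidesteps that. A minimal sketch with illustrative names, not pattern's actual code:

```python
def _read_lines(path):
    # Stand-in for pattern's _read: yields non-empty lines from a string.
    if not path:
        # An ordinary exception propagates out of a generator unchanged,
        # unlike StopIteration, which PEP 479 converts to RuntimeError.
        raise ValueError("The path is empty or False")
    for line in path.splitlines():
        if line.strip():
            yield line.strip()

try:
    list(_read_lines(""))
except ValueError as exc:
    print(exc)  # -> The path is empty or False
```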