After updating to spaCy 2.1.0, loading a model fails with an error:
import spacy
nlp = spacy.load('en_core_web_sm')
Error:
Traceback (most recent call last):
  File "ken.py", line 56, in <module>
    NLP_EN = spacy.load('en_core_web_sm')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/__init__.py", line 22, in load
    return util.load_model(name, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 131, in load_model
    return load_model_from_package(name, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 152, in load_model_from_package
    return cls.load(**overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/en_core_web_sm/__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 190, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 161, in load_model_from_path
    nlp = cls(meta=meta, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/language.py", line 163, in __init__
    make_doc = factory(self, **meta.get("tokenizer", {}))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/language.py", line 64, in create_tokenizer
    util.compile_prefix_regex(cls.prefixes).search if cls.prefixes else None
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 343, in compile_prefix_regex
    return re.compile(expression)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range
python -m spacy info
============================== Info about spaCy ==============================
Python version 2.7.16
Platform Darwin-15.6.0-x86_64-i386-64bit
spaCy version 2.1.0
Location /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy
Models
python -m spacy validate
✔ Loaded compatibility table
====================== Installed models (spaCy v2.1.0) ======================
ℹ spaCy installation:
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy
TYPE NAME MODEL VERSION
package en-core-web-sm en_core_web_sm 2.1.0 ✔
Could you double-check that you're not running a narrow unicode build of Python? With the new updates to the regular expressions, it's important that you're using a wide unicode build.
import sys
print(sys.maxunicode)
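For anyone wondering why a narrow build would end in "bad character range": the sketch below shows one likely mechanism. It is an illustration only. The emoji range is a made-up example, not the actual character class from spaCy's prefix regex, and `compiles` is a hypothetical helper. On a narrow build, a character above U+FFFF is stored as a surrogate pair, so the range endpoints get split and the middle of the class reads as a reversed range. The second pattern spells out the surrogates by hand so the effect can be reproduced on a wide build (assuming Python 3).

```python
import re

# Hypothetical example range (emoticons block), not taken from spaCy's source.
WIDE_PATTERN = u"[\U0001F600-\U0001F64F]"

# Roughly what a narrow (UCS-2) build sees: each endpoint is split into a
# surrogate pair, so the class contains the range \ude00-\ud83d, which is
# reversed (0xDE00 > 0xD83D) and therefore invalid.
NARROW_VIEW = u"[\ud83d\ude00-\ud83d\ude4f]"


def compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```

With these definitions, `compiles(WIDE_PATTERN)` succeeds while `compiles(NARROW_VIEW)` fails with the same class of error as in the traceback above.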
I checked it and that's what I got:
>>> import sys
>>> print(sys.maxunicode)
65535
This means my Python was built with --enable-unicode=ucs2 ('narrow').
So I need to rebuild Python with --enable-unicode=ucs4 ('wide'),
and then I should get:
>>> import sys
>>> print sys.maxunicode
1114111
I will do that and add updates here.
Thanks for trying and yes, pretty sure this should solve it! 👍 We should probably add a check for this when you import spaCy and raise an error if a narrow unicode build is detected. If you're on a narrow unicode build, you'll likely also run into various other issues when working with text in Python, so running a wide build is always good, even aside from spaCy.
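A minimal sketch of what such an import-time check could look like, based only on the `sys.maxunicode` values discussed in this thread. The function names and the error message are hypothetical, and this is not spaCy's actual implementation.

```python
import sys

# sys.maxunicode is 65535 (0xFFFF) on a narrow build and 1114111 on a wide one.
NARROW_MAXUNICODE = 0xFFFF


def is_narrow_unicode_build(maxunicode=None):
    """Return True if the given (or current) interpreter is a narrow build."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode <= NARROW_MAXUNICODE


def require_wide_unicode_build(maxunicode=None):
    """Raise RuntimeError on a narrow build; hypothetical import-time guard."""
    if is_narrow_unicode_build(maxunicode):
        raise RuntimeError(
            "Narrow unicode build detected (sys.maxunicode == 65535). "
            "Rebuild Python with --enable-unicode=ucs4."
        )
```

Calling `require_wide_unicode_build()` at the top of a package's `__init__.py` would fail early with a readable message instead of a regex error deep inside model loading. The optional `maxunicode` argument is there only so the check can be exercised without a narrow interpreter.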
Thanks for your attention! I rebuilt Python with --enable-unicode=ucs4 ('wide').
And yes, please add a check that runs when spaCy is imported and raises an error if a narrow unicode build is detected.