Spacy: sre_constants.error: bad character range

Created on 18 Mar 2019  ·  5 Comments  ·  Source: explosion/spaCy

Can't load model

After updating to spaCy 2.1.0, loading a model raises an error:

import spacy
nlp = spacy.load('en_core_web_sm')

error:

Traceback (most recent call last):
  File "ken.py", line 56, in <module>
    NLP_EN = spacy.load('en_core_web_sm')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/__init__.py", line 22, in load
    return util.load_model(name, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 131, in load_model
    return load_model_from_package(name, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 152, in load_model_from_package
    return cls.load(**overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/en_core_web_sm/__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 190, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 161, in load_model_from_path
    nlp = cls(meta=meta, **overrides)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/language.py", line 163, in __init__
    make_doc = factory(self, **meta.get("tokenizer", {}))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/language.py", line 64, in create_tokenizer
    util.compile_prefix_regex(cls.prefixes).search if cls.prefixes else None
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/util.py", line 343, in compile_prefix_regex
    return re.compile(expression)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Environment

  • Operating System: macOS 10.11.6
  • Python Version Used: 2.7.16
  • spaCy Version Used: 2.1.0
  • Environment Information: en_core_web_sm
python -m spacy info

============================== Info about spaCy ==============================

Python version   2.7.16                        
Platform         Darwin-15.6.0-x86_64-i386-64bit
spaCy version    2.1.0                         
Location         /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy
Models    
python -m spacy validate
✔ Loaded compatibility table

====================== Installed models (spaCy v2.1.0) ======================
ℹ spaCy installation:
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy

TYPE      NAME             MODEL            VERSION                            
package   en-core-web-sm   en_core_web_sm   2.1.0   ✔

All 5 comments

Could you double-check that you're not running a narrow unicode build of Python? With the new updates to the regular expressions, it's important that you're using a wide unicode build.

import sys
print(sys.maxunicode)
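
For context, here is a rough sketch of why a narrow build is likely to produce "bad character range". On a narrow (UCS-2) build, every character above U+FFFF is stored as two surrogate code units, so a character class covering an astral range turns into a class whose middle part is a reversed range. The snippet simulates that on Python 3 by spelling out the surrogates by hand. The helper names (`to_surrogates`, `narrow_build_class`, `compiles`) and the emoji range are made up for illustration and are not taken from spaCy's actual prefix patterns.

```python
import re


def to_surrogates(ch):
    """Split one astral character into its UTF-16 surrogate pair,
    which is how a narrow build stores it internally."""
    cp = ord(ch) - 0x10000
    high = 0xD800 + (cp >> 10)
    low = 0xDC00 + (cp & 0x3FF)
    return chr(high) + chr(low)


def narrow_build_class(start, end):
    """Character class as a narrow build would see it (hypothetical helper)."""
    return "[" + to_surrogates(start) + "-" + to_surrogates(end) + "]"


def compiles(pattern):
    """Return True if the pattern compiles, False on a regex error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# On a wide build this is a single, valid range of code points.
wide_pattern = "[\U0001F600-\U0001F64F]"

# On a narrow build the same class becomes four code units:
# \ud83d, then the range \ude00-\ud83d, then \ude4f.
# The range start (0xDE00) is greater than its end (0xD83D).
narrow_pattern = narrow_build_class("\U0001F600", "\U0001F64F")

wide_ok = compiles(wide_pattern)
narrow_ok = compiles(narrow_pattern)
```

Here `wide_ok` is true and `narrow_ok` is false, because the regex parser rejects the reversed range in the middle of the surrogate version.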

I checked it and that's what I got:

>>> import sys
>>> print(sys.maxunicode)
65535

This means Python was built with --enable-unicode=ucs2 ('narrow').
So I need to rebuild Python with --enable-unicode=ucs4 ('wide'),
after which I should get:

>>> import sys
>>> print sys.maxunicode
1114111

I will do that and add updates here.

Thanks for trying and yes, pretty sure this should solve it! 👍 We should probably add a check for this when you import spaCy and raise an error if a narrow unicode build is detected. If you're on a narrow unicode build, you'll likely also run into various other issues when working with text in Python, so running a wide build is always good, even aside from spaCy.
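
A minimal sketch of what such an import-time check could look like, based only on the `sys.maxunicode` test from earlier in this thread. The function names, the exception type and the message are assumptions for illustration; this is not spaCy's actual implementation.

```python
import sys

# sys.maxunicode is 0x10FFFF (1114111) on a wide build
# and 0xFFFF (65535) on a narrow build.
WIDE_MAXUNICODE = 0x10FFFF


def is_narrow_build(maxunicode=None):
    """Return True if the given (or current) interpreter is a narrow unicode build."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode < WIDE_MAXUNICODE


def check_unicode_build(maxunicode=None):
    """Raise early with a readable message instead of failing later
    inside re.compile (hypothetical helper)."""
    if is_narrow_build(maxunicode):
        raise RuntimeError(
            "Narrow unicode build detected (sys.maxunicode < 0x10FFFF). "
            "Rebuild Python with --enable-unicode=ucs4."
        )


# Would run once at import time; it is a no-op on a wide build.
if not is_narrow_build():
    check_unicode_build()
```

The optional `maxunicode` argument exists only so the check can be exercised without a narrow interpreter, for example `check_unicode_build(65535)` raises while `check_unicode_build(1114111)` does not.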

Thanks for your attention! I rebuilt Python with --enable-unicode=ucs4 ('wide').
And yes, please add a check for this when spaCy is imported, and raise an error if a narrow unicode build is detected.
