Spacy: Tokenization not working using v2.1

Created on 5 Mar 2019  ·  29 Comments  ·  Source: explosion/spaCy

How to reproduce the behaviour

I found a bug where tokenization is completely broken in version 2.1.0a10 on Python 2.7. I have reproduced it on three of my machines.

$ conda create -n py27_spacy2 python=2.7
$ source activate py27_spacy2
$ pip install -U spacy-nightly
$ python -m spacy download en_core_web_sm
$ python -c "import spacy; nlp=spacy.load('en_core_web_sm'); doc=nlp(u'hello world'); print ','.join([t.text for t in doc])"
h,e,ll,o,w,o,r,l,d

Your Environment

  • Operating System: Ubuntu
  • Python Version Used: 2.7
  • spaCy Version Used: 2.1.0a10
Labels: bug, compat, feat / tokenizer, help wanted, upgrade

All 29 comments

Hmm that's very confusing. I don't see how our tests could be passing if this is the case. I don't suppose you're somehow getting a narrow unicode build of Python?

Another thing to check is that your $ pip install -U spacy-nightly line is actually running the version of pip you expect. But then if it weren't, I don't see how spaCy would be in your environment. Hmm.
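For the narrow-build question above, a quick check can be sketched as follows. It relies only on sys.maxunicode; the helper name is_narrow_build is made up for illustration. On Python 3.3 and later every build reports 1114111, so the check is only informative on Python 2.

```python
import sys

NARROW_MAX = 0xFFFF    # sys.maxunicode on a narrow (UCS-2) build
WIDE_MAX = 0x10FFFF    # sys.maxunicode on a wide (UCS-4) build


def is_narrow_build(maxunicode=None):
    """Return True if the interpreter looks like a narrow unicode build."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode <= NARROW_MAX


if __name__ == "__main__":
    # sys.executable shows which interpreter is actually running.
    print("interpreter: %s" % sys.executable)
    print("sys.maxunicode = %d, narrow build: %s" % (sys.maxunicode, is_narrow_build()))
```

A value of 1114111 means a wide build, which would rule this explanation out.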

It's quite easy to reproduce for me. The following is another run. I also checked my python and pip and they are correct.

user@alienware:~/dev$ conda --version
conda 4.6.1

user@alienware:~/dev$ conda create -y -n py27_spacy2 python=2.7
Collecting package metadata: done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.6.1
  latest version: 4.6.7

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/user/miniconda3/envs/py27_spacy2

  added / updated specs:
    - python=2.7


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.1.23  |                0         126 KB
    certifi-2018.11.29         |           py27_0         146 KB
    openssl-1.1.1b             |       h7b6447c_0         4.0 MB
    pip-19.0.3                 |           py27_0         1.8 MB
    python-2.7.15              |       h9bab390_6        12.8 MB
    setuptools-40.8.0          |           py27_0         635 KB
    wheel-0.33.1               |           py27_0          39 KB
    ------------------------------------------------------------
                                           Total:        19.5 MB

The following NEW packages will be INSTALLED:

  ca-certificates    pkgs/main/linux-64::ca-certificates-2019.1.23-0
  certifi            pkgs/main/linux-64::certifi-2018.11.29-py27_0
  libedit            pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0
  libffi             pkgs/main/linux-64::libffi-3.2.1-hd88cf55_4
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-8.2.0-hdf63c60_1
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-8.2.0-hdf63c60_1
  ncurses            pkgs/main/linux-64::ncurses-6.1-he6710b0_1
  openssl            pkgs/main/linux-64::openssl-1.1.1b-h7b6447c_0
  pip                pkgs/main/linux-64::pip-19.0.3-py27_0
  python             pkgs/main/linux-64::python-2.7.15-h9bab390_6
  readline           pkgs/main/linux-64::readline-7.0-h7b6447c_5
  setuptools         pkgs/main/linux-64::setuptools-40.8.0-py27_0
  sqlite             pkgs/main/linux-64::sqlite-3.26.0-h7b6447c_0
  tk                 pkgs/main/linux-64::tk-8.6.8-hbc83047_0
  wheel              pkgs/main/linux-64::wheel-0.33.1-py27_0
  zlib               pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3



Downloading and Extracting Packages
python-2.7.15        | 12.8 MB   | #################################################################################################################### | 100% 
certifi-2018.11.29   | 146 KB    | #################################################################################################################### | 100% 
pip-19.0.3           | 1.8 MB    | #################################################################################################################### | 100% 
openssl-1.1.1b       | 4.0 MB    | #################################################################################################################### | 100% 
ca-certificates-2019 | 126 KB    | #################################################################################################################### | 100% 
wheel-0.33.1         | 39 KB     | #################################################################################################################### | 100% 
setuptools-40.8.0    | 635 KB    | #################################################################################################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > source activate py27_spacy2
#
# To deactivate an active environment, use:
# > source deactivate
#

user@alienware:~/dev$ source activate py27_spacy2
(py27_spacy2) user@alienware:~/dev$ pip install -U spacy-nightly
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting spacy-nightly
  Downloading https://files.pythonhosted.org/packages/e3/18/000e94b05aa86b8a414ad34c043fd30d242a54b170363f13f53a92036f2b/spacy_nightly-2.1.0a10-cp27-cp27mu-manylinux1_x86_64.whl (27.6MB)
    100% |████████████████████████████████| 27.6MB 1.4MB/s 
Collecting thinc<7.1.0,>=7.0.2 (from spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/7f/f5/e37a2e5167be02b417046fd00980d1e74bce02d0dff635416cb2c75e8286/thinc-7.0.2-cp27-cp27mu-manylinux1_x86_64.whl (2.0MB)
    100% |████████████████████████████████| 2.1MB 8.3MB/s 
Collecting pathlib==1.0.1; python_version < "3.4" (from spacy-nightly)
Collecting blis<0.3.0,>=0.2.2 (from spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/d7/46/19c4b2cef1c210e99db89bc0e2fc591c0d5e6fb230aee54b5993c57f0b73/blis-0.2.3.tar.gz (1.5MB)
    100% |████████████████████████████████| 1.5MB 7.5MB/s 
Collecting plac<1.0.0,>=0.9.6 (from spacy-nightly)
  Using cached https://files.pythonhosted.org/packages/9e/9b/62c60d2f5bc135d2aa1d8c8a86aaf84edb719a59c7f11a4316259e61a298/plac-0.9.6-py2.py3-none-any.whl
Collecting wasabi<1.1.0,>=0.0.12 (from spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/8d/aa/41ecccdbf9a0b1a5f9a9cc54fc72e4744eb6b86ed73bc4b3ca941cad945d/wasabi-0.1.2.tar.gz
Collecting srsly<1.1.0,>=0.0.5 (from spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/db/65/066ab96cc14ccc2fe908663fab329df96bd92a168f74d6249473b4fb0a55/srsly-0.0.5-cp27-cp27mu-manylinux1_x86_64.whl (174kB)
    100% |████████████████████████████████| 184kB 11.9MB/s 
Collecting numpy>=1.15.0 (from spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/c4/33/8ec8dcdb4ede5d453047bbdbd01916dbaccdb63e98bba60989718f5f0876/numpy-1.16.2-cp27-cp27mu-manylinux1_x86_64.whl (17.0MB)
    100% |████████████████████████████████| 17.0MB 1.9MB/s 
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/ed/31/247b34db5ab06afaf5512481e77860fb4cd7a0c0ddff9d2566651c8c2f07/murmurhash-1.0.2-cp27-cp27mu-manylinux1_x86_64.whl
Collecting requests<3.0.0,>=2.13.0 (from spacy-nightly)
  Using cached https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl
Collecting jsonschema<3.0.0,>=2.6.0 (from spacy-nightly)
  Using cached https://files.pythonhosted.org/packages/77/de/47e35a97b2b05c2fadbec67d44cfcdcd09b8086951b331d82de90d2912da/jsonschema-2.6.0-py2.py3-none-any.whl
Collecting cymem<2.1.0,>=2.0.2 (from spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/df/b1/4ff2cbd423184bd68e85f1daa6692753cd7710b0ba68552eb64542906a57/cymem-2.0.2-cp27-cp27mu-manylinux1_x86_64.whl
Collecting preshed<2.1.0,>=2.0.1 (from spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/25/b1/9098d07e70b960001a8a9b99435c6987006d0d7bcbf20523adce9272f66e/preshed-2.0.1-cp27-cp27mu-manylinux1_x86_64.whl (80kB)
    100% |████████████████████████████████| 81kB 34.3MB/s 
Collecting tqdm<5.0.0,>=4.10.0 (from thinc<7.1.0,>=7.0.2->spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/6c/4b/c38b5144cf167c4f52288517436ccafefe9dc01b8d1c190e18a6b154cd4a/tqdm-4.31.1-py2.py3-none-any.whl (48kB)
    100% |████████████████████████████████| 51kB 39.1MB/s 
Collecting thinc-gpu-ops<0.1.0,>=0.0.1 (from thinc<7.1.0,>=7.0.2->spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/a4/ad/11ab80a24bcedd7dd0cfabaedba2ceaeca11f1aaeeff432a3d2e63ca7d02/thinc_gpu_ops-0.0.4.tar.gz (483kB)
    100% |████████████████████████████████| 491kB 7.1MB/s 
Collecting urllib3<1.25,>=1.21.1 (from requests<3.0.0,>=2.13.0->spacy-nightly)
  Using cached https://files.pythonhosted.org/packages/62/00/ee1d7de624db8ba7090d1226aebefab96a2c71cd5cfa7629d6ad3f61b79e/urllib3-1.24.1-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests<3.0.0,>=2.13.0->spacy-nightly)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting idna<2.9,>=2.5 (from requests<3.0.0,>=2.13.0->spacy-nightly)
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /home/user/miniconda3/envs/py27_spacy2/lib/python2.7/site-packages (from requests<3.0.0,>=2.13.0->spacy-nightly) (2018.11.29)
Collecting functools32; python_version == "2.7" (from jsonschema<3.0.0,>=2.6.0->spacy-nightly)
  Downloading https://files.pythonhosted.org/packages/c5/60/6ac26ad05857c601308d8fb9e87fa36d0ebf889423f47c3502ef034365db/functools32-3.2.3-2.tar.gz
Building wheels for collected packages: blis, wasabi, thinc-gpu-ops, functools32
  Building wheel for blis (setup.py) ... done
  Stored in directory: /home/user/.cache/pip/wheels/d8/bc/d8/fa2e772b04b368972ed51759dc69275e020649a14ff30cf853
  Building wheel for wasabi (setup.py) ... done
  Stored in directory: /home/user/.cache/pip/wheels/64/6b/79/6d2b350fba2e7fd0ec125a929c8b67b022731bea36a87bcc87
  Building wheel for thinc-gpu-ops (setup.py) ... done
  Stored in directory: /home/user/.cache/pip/wheels/eb/ba/a3/9af9f326ed0d75a4540378af64a05a0e42be39d9b8513f3aea
  Building wheel for functools32 (setup.py) ... done
  Stored in directory: /home/user/.cache/pip/wheels/b5/18/32/77a1030457155606ba5e3ec3a8a57132b1a04b1c4f765177b2
Successfully built blis wasabi thinc-gpu-ops functools32
Installing collected packages: pathlib, tqdm, numpy, blis, thinc-gpu-ops, plac, srsly, murmurhash, cymem, preshed, wasabi, thinc, urllib3, chardet, idna, requests, functools32, jsonschema, spacy-nightly
Successfully installed blis-0.2.3 chardet-3.0.4 cymem-2.0.2 functools32-3.2.3.post2 idna-2.8 jsonschema-2.6.0 murmurhash-1.0.2 numpy-1.16.2 pathlib-1.0.1 plac-0.9.6 preshed-2.0.1 requests-2.21.0 spacy-nightly-2.1.0a10 srsly-0.0.5 thinc-7.0.2 thinc-gpu-ops-0.0.4 tqdm-4.31.1 urllib3-1.24.1 wasabi-0.1.2
(py27_spacy2) user@alienware:~/dev$ python -m spacy download en_core_web_sm
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting en_core_web_sm==2.1.0a7 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0a7/en_core_web_sm-2.1.0a7.tar.gz#egg=en_core_web_sm==2.1.0a7
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0a7/en_core_web_sm-2.1.0a7.tar.gz (11.0MB)
    100% |████████████████████████████████| 11.1MB 10.6MB/s 
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.1.0a7
✔ Linking successful
/home/user/miniconda3/envs/py27_spacy2/lib/python2.7/site-packages/en_core_web_sm
-->
/home/user/miniconda3/envs/py27_spacy2/lib/python2.7/site-packages/spacy/data/en_core_web_sm
You can now load the model via spacy.load('en_core_web_sm')
(py27_spacy2) user@alienware:~/dev$ python -c "import spacy; nlp=spacy.load('en_core_web_sm'); doc=nlp(u'hello world'); print ','.join([t.text for t in doc])"
h,e,ll,o,w,o,r,l,d
(py27_spacy2) user@alienware:~/dev$ python -c "import sys; print sys.maxunicode"
1114111
(py27_spacy2) user@alienware:~/dev$ which pip
/home/user/miniconda3/envs/py27_spacy2/bin/pip
(py27_spacy2) user@alienware:~/dev$ which python
/home/user/miniconda3/envs/py27_spacy2/bin/python

Okay, thanks! Definitely seems bad --- will look into it.

I can't reproduce either :(
@rulai-huajunzeng : what happens if you use the default English model instead?

from spacy.lang.en import English
nlp = English()

With the default English model the result is correct, but with spacy.load('en_core_web_sm') I get the same broken output as before. Maybe there is a conflict with previously installed spaCy models?

>>> from spacy.lang.en import English
>>> nlp = English()
>>> doc=nlp(u'hello world')
>>> print ','.join([t.text for t in doc])
hello,world
>>> import spacy
>>> nlp=spacy.load('en_core_web_sm')
>>> doc=nlp(u'hello world')
>>> print ','.join([t.text for t in doc])
h,e,ll,o,w,o,r,l,d
>>> 

@rulai-huajunzeng @handsomezebra : Any chance you could run this entire sentence through the nlp pipeline?
Well I wonder how this will / shall look after tokenization with the model - ill or not ?

>>> from spacy.lang.en import English
>>> nlp = English()
>>> doc=nlp(u'Well I wonder how this will / shall look after tokenization with the model - ill or not ?')
>>> print ','.join([t.text for t in doc])
Well,I,wonder,how,this,will,/,shall,look,after,tokenization,with,the,model,-,ill,or,not,?
>>> import spacy
>>> nlp=spacy.load('en_core_web_sm')
>>> doc=nlp(u'Well I wonder how this will / shall look after tokenization with the model - ill or not ?')
>>> print ','.join([t.text for t in doc])
W,e,l,l,I,w,o,n,d,e,r,h,o,w,t,h,i,s,w,i,l,l,/,s,h,a,l,l,l,o,o,k,a,f,t,e,r,t,o,k,e,n,i,z,a,t,i,o,n,w,i,t,h,t,h,e,m,o,d,e,l,-,i,ll,o,r,n,o,t,?

Thanks for your patience on this, this is definitely mysterious!

And yeah, maybe what you're loading as en_core_web_sm here isn't actually the correct model package? Could you print nlp.meta and see what that returns?

>>> import spacy
>>> nlp=spacy.load('en_core_web_sm')
>>> print nlp.meta
{u'description': u'English multi-task CNN trained on OntoNotes. Assigns context-specific token vectors, POS tags, dependency parse and named entities.', u'sources': [u'OntoNotes 5'], u'speed': {u'gpu': None, u'nwords': 291344, u'cpu': 7436.9320951768}, u'lang': u'en', u'pipeline': [u'tagger', u'parser', u'ner'], u'name': u'core_web_sm', u'license': u'MIT', u'author': u'Explosion AI', u'url': u'https://explosion.ai', u'vectors': {u'keys': 0, u'width': 0, u'vectors': 0, u'name': None}, u'version': u'2.1.0a7', u'spacy_version': u'>=2.1.0a4', u'parent_package': u'spacy-nightly', u'email': u'[email protected]', u'accuracy': {u'token_acc': 99.0646795541, u'ents_p': 85.4670314332, u'ents_r': 85.5530095568, u'uas': 91.5689404737, u'tags_acc': 96.8320691599, u'ents_f': 85.5099988828, u'las': 89.682797403}}
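The fields that matter for the "is this the right package?" question can be pulled out of that dict. This is only a sketch: summarize_meta is a made-up helper, and joining lang and name with an underscore to get the package name is an assumption based on how the package is named in this thread.

```python
def summarize_meta(meta):
    """Build a one-line summary of the fields that identify a model package."""
    # Assumption: the package name is "<lang>_<name>", e.g. en_core_web_sm.
    package = "%s_%s" % (meta.get("lang", "?"), meta.get("name", "?"))
    return "%s %s (built for %s %s)" % (
        package,
        meta.get("version", "?"),
        meta.get("parent_package", "spacy"),
        meta.get("spacy_version", "?"),
    )


# Values copied from the nlp.meta output above.
meta = {
    "lang": "en",
    "name": "core_web_sm",
    "version": "2.1.0a7",
    "spacy_version": ">=2.1.0a4",
    "parent_package": "spacy-nightly",
}
print(summarize_meta(meta))
```

For the output above this gives en_core_web_sm 2.1.0a7 built for spacy-nightly >=2.1.0a4, so the loaded package at least claims to be the expected nightly model.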

I still get the same issue, even with the stable 2.1.0 release.

I suspect there might actually be a connection to #3432, especially since you and @jfelectron both had the same problems and we weren't able to reproduce either of them yet on any of our test machines and CI... So there must be some configuration here that we might be missing? I'm not sure, this is so confusing 🤔

The tests are green in the exact same environment as produces the tokenization issues:

(.env) [ 6:30PM ] [ jfelectron@madmax:~/Documents/Repos/spaCy/spacy/tests(master✔) ]
$ pytest -svv
==================================================================================================================================================================================== test session starts =====================================================================================================================================================================================
platform linux2 -- Python 2.7.15+, pytest-4.0.2, py-1.7.0, pluggy-0.8.0 -- /usr/bin/python
cachedir: .pytest_cache
metadata: {'Python': '2.7.15+', 'Platform': 'Linux-4.18.0-10-generic-x86_64-with-Ubuntu-18.10-cosmic', 'Packages': {'py': '1.7.0', 'pytest': '4.0.2', 'pluggy': '0.8.0'}, 'Plugins': {'html': '1.19.0', 'xdist': '1.15.0', 'timeout': '1.3.3', 'metadata': '1.7.0'}}
rootdir: /home/jfelectron/Documents/Repos/spaCy, inifile:
plugins: xdist-1.15.0, timeout-1.3.3, metadata-1.7.0, html-1.19.0
collected 2123 items
...
1482 passed, 585 skipped, 56 xfailed in 62.04 seconds

In [2]: import spacy

In [3]: spacy.__version__
Out[3]: '2.1.1'

In [4]: spacy_nlp = spacy.load('en_core_web_sm')

In [5]: [d.text for d in spacy_nlp(u'I have a dog named spot')]
Out[5]:
[u'I',
u'h',
u'a',
u'v',
u'e',
u'a',
u'd',
u'o',
u'g',
u'n',
u'a',
u'm',
u'e',
u'd',
u's',
u'p',
u'o',
u't']

> The tests are green in the exact same environment as produces the tokenization issues:

Thanks for running more tests! This is consistent with what's been reported before. The problem seems to occur when deserializing the tokenization rules from the model, not when using a blank language class like English. The tests shipped with the library do not test the models, so it makes sense they succeed.
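One possible reading of the symptom, offered as a guess and not as a diagnosis: if the prefix and suffix rules came back from deserialization matching every character, a peel-off loop would produce exactly the reported output, with "ll" surviving only because it is matched as a special case. The sketch below is a toy and not spaCy's tokenizer; toy_split, toy_tokenize, all four patterns and the special-case set are hypothetical.

```python
import re


def toy_split(chunk, prefix_re, suffix_re, specials):
    """Peel prefixes and suffixes off one whitespace-delimited chunk."""
    prefixes, suffixes = [], []
    while chunk and chunk not in specials:
        before = chunk
        m = prefix_re.match(chunk)
        if m and m.end() > 0:
            prefixes.append(m.group())
            chunk = chunk[m.end():]
            if not chunk or chunk in specials:
                break
        m = suffix_re.search(chunk)
        if m and m.start() < len(chunk):
            suffixes.append(m.group())
            chunk = chunk[:m.start()]
        if chunk == before:
            break  # no rule applied, keep the rest as one token
    middle = [chunk] if chunk else []
    return prefixes + middle + suffixes[::-1]


def toy_tokenize(text, prefix_re, suffix_re, specials=frozenset()):
    """Split on whitespace, then apply the prefix/suffix rules per chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(toy_split(chunk, prefix_re, suffix_re, specials))
    return tokens


# Hypothetical "healthy" rules: only brackets and punctuation are peeled off.
GOOD_PREFIX = re.compile(r"[(\[{]")
GOOD_SUFFIX = re.compile(r"[)\]},.!?]$")
# Hypothetical "damaged" rules: every character matches.
BAD_PREFIX = re.compile(r".")
BAD_SUFFIX = re.compile(r".$")
# Assumption: stands in for a tokenizer exception entry.
SPECIALS = frozenset(["ll"])

print(toy_tokenize(u"hello world", GOOD_PREFIX, GOOD_SUFFIX, SPECIALS))
print(toy_tokenize(u"hello world", BAD_PREFIX, BAD_SUFFIX, SPECIALS))
```

With the healthy rules the toy returns hello, world. With the damaged rules it returns h, e, ll, o, w, o, r, l, d, the same sequence as in the report. That makes over-matching prefix and suffix patterns a plausible place to look, nothing more.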

If you could run the env command in your environment and post the environment variables that you have set here, that might be helpful (just double-check and make sure to remove stuff like secrets etc.). We're trying to pin down the exact configuration that causes the problems – we tried lots of different combinations but we haven't been able to reproduce it yet.
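To share only the settings that are likely to matter, without pasting secrets, the environment can be filtered first. A minimal sketch; which variables count as relevant (LC_*, LANG, LANGUAGE, PYTHONIOENCODING, PYTHONPATH) is an assumption, and locale_related is a made-up helper name.

```python
import os

# Assumption: besides LC_*, these are the variables worth reporting.
EXTRA_KEYS = ("LANG", "LANGUAGE", "PYTHONIOENCODING", "PYTHONPATH")


def locale_related(environ=None):
    """Keep only LC_* and a few other variables that may affect text handling."""
    if environ is None:
        environ = os.environ
    return {
        key: value
        for key, value in environ.items()
        if key.startswith("LC_") or key in EXTRA_KEYS
    }


if __name__ == "__main__":
    for key, value in sorted(locale_related().items()):
        print("%s=%s" % (key, value))
```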

How about compiler tool chain or Cython? It looks like spacy leaves many of its deps unpinned to a specific version.

Or how about compression/serialization tool chain?

Now I can reproduce it using miniconda docker image.

First create a Dockerfile as below:

FROM continuumio/miniconda:4.5.12

RUN apt-get update && apt-get -y install build-essential python-dev
ENV WASABI_NO_PRETTY=1
RUN pip install spacy
RUN spacy download en_core_web_sm
RUN python -c "import spacy; nlp=spacy.load('en_core_web_sm'); doc=nlp(u'hello world'); print ','.join([t.text for t in doc])"

Then run docker build -t my_image . and you will see:

Step 6/6 : RUN python -c "import spacy; nlp=spacy.load('en_core_web_sm'); doc=nlp(u'hello world'); print ','.join([t.text for t in doc])"
 ---> Running in d6f82843b94e
h,e,ll,o,w,o,r,l,d

I also tried the following base images; both reproduce the problem easily.

FROM continuumio/anaconda:5.3.0

FROM python:2.7

@ines what environments are you using? Linux or OS X?

LC_ADDRESS=en_US.UTF-8
XDG_CONFIG_DIRS=/etc/xdg/xdg-pop:/etc/xdg
LC_TELEPHONE=en_US.UTF-8
LANG=en_US.UTF-8
DISPLAY=:1
SHLVL=1
LOGNAME=jfelectron
LANGUAGE=en_US:en
MANDATORY_PATH=/usr/share/gconf/pop.mandatory.path
LC_NAME=en_US.UTF-8
XDG_VTNR=2
USER=jfelectron
XAUTHORITY=/run/user/1000/gdm/Xauthority
PWD=/home/jfelectron/Documents/Repos/spaCy
GTK_IM_MODULE=ibus
GJS_DEBUG_TOPICS=JS ERROR;JS LOG
XDG_SESSION_ID=3
COLORTERM=truecolor
GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/aa51ec0d_5220_4740_957e_4aa554ffa884
DESKTOP_SESSION=pop
XDG_SESSION_DESKTOP=pop
GDMSESSION=pop
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
DEFAULTS_PATH=/usr/share/gconf/pop.default.path
WINDOWPATH=2
PAPERSIZE=letter
LC_MEASUREMENT=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_PAPER=en_US.UTF-8
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
VTE_VERSION=5401
USERNAME=jfelectron
CLUTTER_IM_MODULE=xim
GNOME_TERMINAL_SERVICE=:1.114
QT4_IM_MODULE=xim
XDG_DATA_DIRS=/usr/share/pop:/usr/local/share/:/usr/share/
XDG_MENU_PREFIX=gnome-
LC_IDENTIFICATION=en_US.UTF-8
IM_CONFIG_PHASE=2
SHELL=/usr/bin/zsh
GNOME_SHELL_SESSION_MODE=pop
QT_IM_MODULE=xim
LC_TIME=en_US.UTF-8
TERM=xterm-256color
SESSION_MANAGER=local/madmax:@/tmp/.ICE-unix/2674,unix/madmax:/tmp/.ICE-unix/2674
GJS_DEBUG_OUTPUT=stderr
XDG_SESSION_TYPE=x11
GTK_MODULES=gail:atk-bridge
XDG_CURRENT_DESKTOP=pop:GNOME
SSH_AGENT_PID=2818
PATH=/home/jfelectron/Documents/Repos/spaCy/.env/bin:/home/jfelectron/.local/bin:/home/jfelectron/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/cuda/bin:/home/jfelectron/bin:/usr/local/bin:/usr/share/idea/bin
SSH_AUTH_SOCK=/run/user/1000/keyring/ssh
XMODIFIERS=@im=ibus
HOME=/home/jfelectron
XDG_RUNTIME_DIR=/run/user/1000
GPG_AGENT_INFO=/run/user/1000/gnupg/S.gpg-agent:0:1
XDG_SEAT=seat0
QT_ACCESSIBILITY=1
OLDPWD=/home/jfelectron
PYTHONPATH=:/home/jfelectron/Documents/Repos/phoenix/shared:/home/jfelectron/Documents/Repos/phoenix/plato:/home/jfelectron/Documents/Repos/phoenix/sourcerer:/home/jfelectron/Documents/Repos/phoenix/vulcan:/home/jfelectron/Documents/Repos/phoenix/zoltar:/home/jfelectron/Documents/Repos/phoenix/tests:/home/jfelectron/Documents/Repos/phoenix/olympus:/home/jfelectron/Documents/Repos/flashtext
ZSH=/home/jfelectron/.oh-my-zsh
PAGER=less
LESS=-R
LC_CTYPE=en_US.UTF-8
LSCOLORS=Gxfxcxdxbxegedabagacad
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:
VIRTUAL_ENV=/home/jfelectron/Documents/Repos/spaCy/.env
PS1=(.env) $fg_bold[blue][ $fg[red]%t $fg_bold[blue]] $fg_bold[blue] [ $fg[red]%n@%m:%~$(git_prompt_info)$fg[yellow]$(rvm_prompt_info)$fg_bold[blue] ]$reset_color
$
_=/usr/bin/env

$ pip freeze
atomicwrites==1.3.0
attrs==19.1.0
blis==0.2.4
certifi==2019.3.9
chardet==3.0.4
configparser==3.7.3
cymem==2.0.2
Cython==0.29.6
en-core-web-sm==2.1.0
enum34==1.1.6
flake8==3.5.0
funcsigs==1.0.2
functools32==3.2.3.post2
idna==2.8
jsonschema==2.6.0
mccabe==0.6.1
mock==2.0.0
more-itertools==5.0.0
murmurhash==1.0.2
numpy==1.16.2
pathlib==1.0.1
pathlib2==2.3.3
pbr==5.1.3
plac==0.9.6
pluggy==0.9.0
preshed==2.0.1
py==1.8.0
pycodestyle==2.3.1
pyflakes==1.6.0
pytest==4.0.2
pytest-timeout==1.3.3
requests==2.21.0
scandir==1.10.0
six==1.12.0
spacy==2.1.1
srsly==0.0.5
thinc==7.0.4
tqdm==4.31.1
uncommon==1.0.2.14072
uncommon-testing==1.0.2.11140
urllib3==1.24.1
vulcan==1.0.2.14072
wasabi==0.1.4

@ines we are both using py27, my sense is you and @honnibal are using Py3, have you tried a py2.7 venv?

Otherwise, what is the lowest-level thing we can check to identify the root cause here?

@jfelectron Yes, of course we've tried Python 2... We've tried configuring the locale to ASCII, setting PYTHONIOENCODING, and different combinations of the two. We have Python 2.7 in our test suite, so the CI tests Python 2.7 for every pull request and every release.
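When comparing a machine that reproduces the problem with one that does not, it can help to record which encodings the interpreter actually settled on after the locale and PYTHONIOENCODING settings were applied. A small sketch using only the standard library; encoding_report is a made-up helper name.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings the running interpreter has settled on."""
    return {
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        # stdout may be replaced by an object without an encoding attribute.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("%s: %s" % (name, value))
```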

@rulai-huajunzeng Thanks, I'll try that container.

Your test suite doesn't cover this behavior. The tests are green for me too, which is irrelevant unless your CI environment runs integration tests that actually load models. If so, how do we invoke them?
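A model-level smoke check of the kind asked about here could look roughly like the sketch below. It is not part of spaCy's test suite; assert_word_level, FakeToken and fake_nlp are hypothetical names, and the stand-in pipeline exists only so the sketch runs without spaCy installed.

```python
def assert_word_level(nlp, text, expected):
    """Fail loudly if the pipeline's tokens differ from the expected words."""
    tokens = [token.text for token in nlp(text)]
    if tokens != expected:
        raise AssertionError("expected %r, got %r" % (expected, tokens))
    return tokens


class FakeToken(object):
    """Stand-in for a token; only the .text attribute is needed."""

    def __init__(self, text):
        self.text = text


def fake_nlp(text):
    """Stand-in pipeline that splits on whitespace."""
    return [FakeToken(word) for word in text.split()]


assert_word_level(fake_nlp, u"hello world", [u"hello", u"world"])
```

With spaCy installed, passing the object returned by spacy.load('en_core_web_sm') in place of fake_nlp would exercise the deserialized tokenizer, which is the code path the blank English() class does not cover.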

Haven't run the container yet, but I had a look at the definition. I think it's the use of the C.utf8 locale https://community.hpe.com/t5/General/Difference-between-C-utf8-and-en-us-utf8-points/td-p/4418194#.XJLCEaQo_Zs:

C = POSIX standards-compliant default locale. Only strict ASCII characters are valid.

C.utf8 = POSIX standards-compliant locale, extended to allow the basic use of UTF-8. No character upper-lower case relationships and collation orders defined beyond ASCII.

(In other words: this sorts non-ASCII characters strictly according to their Unicode character encoding value. It does not understand that upper and lower case "A with diaeresis" are two versions of the same character and should be sorted near each other. For non-Latin alphabets, your guess is as good as mine.)

For all C.* locales, the default currency symbol is undefined -> POSIX default "$" is used. Thousands separators are not used in large numbers.

@jfelectron I'm not 100% sure but I think your locale looks broken too. Your LC_ALL environment variable isn't set, which looks suspicious: https://perlgeek.de/en/article/set-up-a-clean-utf8-environment

LC_ALL isn't set on ANY Ubuntu-based system I have access to. So either it's a Debian/Ubuntu problem or spaCy is expecting something that most environments simply don't provide. Why are we only seeing this on 2.1?
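For reference, POSIX resolves each locale category from LC_ALL first, then from the category's own variable, then from LANG, so an unset LC_ALL is not by itself a sign of a broken locale. A minimal sketch of that lookup for LC_CTYPE; effective_ctype is a made-up helper name.

```python
def effective_ctype(environ):
    """Return the (variable, value) pair that decides LC_CTYPE."""
    for key in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(key)
        if value:  # unset or empty variables are skipped
            return key, value
    # Nothing set: the implementation-defined default locale applies.
    return None, None


# Values taken from the env dump earlier in the thread (LC_ALL not set).
posted_env = {"LANG": "en_US.UTF-8", "LC_CTYPE": "en_US.UTF-8"}
print(effective_ctype(posted_env))
```

For the environment posted earlier, the lookup lands on LC_CTYPE with the value en_US.UTF-8.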

I fixed my locale, seems to have no impact.

In [1]: from vulcan.text_spacy import init_spacy

In [2]: spacy_nlp = init_spacy()

In [3]: [d.text for d in spacy_nlp(u'I have a dog named spot')]
Out[3]:
[u'I',
u'h',
u'a',
u'v',
u'e',
u'a',
u'd',
u'o',
u'g',
u'n',
u'a',
u'm',
u'e',
u'd',
u's',
u'p',
u'o',
u't']

In [4]: import os

In [5]: os.environ
Out[5]: {'LC_NUMERIC': 'en_US.UTF-8', 'LESS': '-R', 'GNOME_DESKTOP_SESSION_ID': 'this-is-deprecated', 'GJS_DEBUG_OUTPUT': 'stderr', 'LC_CTYPE': 'en_US.UTF-8', 'XDG_CURRENT_DESKTOP': 'pop:GNOME', 'LC_PAPER': 'en_US.UTF-8', 'QT_IM_MODULE': 'xim', 'LOGNAME': 'jfelectron', 'USER': 'jfelectron', 'HOME': '/home/jfelectron', 'XDG_VTNR': '2', 'PATH': '/home/jfelectron/.local/bin:/home/jfelectron/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/cuda/bin:/home/jfelectron/bin:/usr/local/bin:/usr/share/idea/bin:/home/jfelectron/bin:/usr/local/bin:/usr/share/idea/bin', 'ZSH': '/home/jfelectron/.oh-my-zsh', 'DISPLAY': ':1', 'SSH_AGENT_PID': '2727', 'LANG': 'en_US.UTF-8', 'TERM': 'xterm-256color', 'SHELL': '/usr/bin/zsh', 'XAUTHORITY': '/run/user/1000/gdm/Xauthority', 'LANGUAGE': 'en_US.UTF-8', 'SESSION_MANAGER': 'local/madmax:@/tmp/.ICE-unix/2559,unix/madmax:/tmp/.ICE-unix/2559', 'SHLVL': '1', 'MANDATORY_PATH': '/usr/share/gconf/pop.mandatory.path', 'QT_ACCESSIBILITY': '1', 'QT4_IM_MODULE': 'xim', 'CLUTTER_IM_MODULE': 'xim', 'WINDOWPATH': '2', 'IM_CONFIG_PHASE': '2', 'GPG_AGENT_INFO': '/run/user/1000/gnupg/S.gpg-agent:0:1', 'LC_MONETARY': 'en_US.UTF-8', 'USERNAME': 'jfelectron', 'XDG_SESSION_DESKTOP': 'pop', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LC_IDENTIFICATION': 'en_US.UTF-8', 'LC_ADDRESS': 'en_US.UTF-8', 'PYTHONPATH': 
':/home/jfelectron/Documents/Repos/phoenix/shared:/home/jfelectron/Documents/Repos/phoenix/plato:/home/jfelectron/Documents/Repos/phoenix/sourcerer:/home/jfelectron/Documents/Repos/phoenix/vulcan:/home/jfelectron/Documents/Repos/phoenix/zoltar:/home/jfelectron/Documents/Repos/phoenix/tests:/home/jfelectron/Documents/Repos/phoenix/olympus:/home/jfelectron/Documents/Repos/flashtext:/home/jfelectron/Documents/Repos/phoenix/shared:/home/jfelectron/Documents/Repos/phoenix/plato:/home/jfelectron/Documents/Repos/phoenix/sourcerer:/home/jfelectron/Documents/Repos/phoenix/vulcan:/home/jfelectron/Documents/Repos/phoenix/zoltar:/home/jfelectron/Documents/Repos/phoenix/tests:/home/jfelectron/Documents/Repos/phoenix/olympus:/home/jfelectron/Documents/Repos/flashtext', 'SSH_AUTH_SOCK': '/run/user/1000/keyring/ssh', 'VTE_VERSION': '5401', 'GDMSESSION': 'pop', 'XMODIFIERS': '@im=ibus', 'GNOME_SHELL_SESSION_MODE': 'pop', 'XDG_DATA_DIRS': '/usr/share/pop:/usr/local/share/:/usr/share/', 'LC_ALL': 'en_US.UTF-8',

Dude, I said I wasn't sure. I'm working on it. Back off okay?

Don't get me wrong, I think we want to help. The tests are uninformative, so I'm looking for the lowest entry point to break in code (Python or not) where we can figure out what's going on here.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
