spaCy: Unicode problem - Python 2

Created on 22 Nov 2016 · 2 Comments · Source: explosion/spaCy

Hi

I want to understand where I am going wrong with my pipeline.
Whenever I feed spaCy a clean text file, I get this Unicode error.
Any ideas what I'm doing wrong?

Traceback (most recent call last):
  File "spacypipeline3.py", line 21, in <module>
    parsedData = parser(text)
  File "/usr/local/lib/python2.7/dist-packages/spacy/language.py", line 314, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python2.7/dist-packages/spacy/language.py", line 288, in <lambda>
    self.make_doc = lambda text: self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected unicode, got str)

Your Environment

  • Operating System: Ubuntu - 16.04
  • Python Version Used: 2.7
  • spaCy Version Used: Latest
  • Environment Information:

Most helpful comment

You need to convert the text to unicode before passing it to spaCy, for example with text.decode().
You can pass decode() a codec if the text is not ASCII:

en_nlp('asçeptique is not a word'.decode('utf-8'))
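
The same conversion can be sketched so that it behaves identically on Python 2 and Python 3. This is only an illustration: the helper name to_unicode is hypothetical (it is not part of spaCy), UTF-8 is an assumed encoding, and en_nlp is assumed to be an already loaded spaCy pipeline, so that call is left commented out.

```python
# -*- coding: utf-8 -*-

def to_unicode(value, encoding="utf-8"):
    """Return `value` as a unicode string, decoding it if it is a byte string.

    Hypothetical helper for illustration; not part of spaCy.
    """
    # On Python 2, `bytes` is an alias for `str`, so this catches plain
    # byte strings there and `bytes` objects on Python 3.
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already unicode: return unchanged.
    return value


# UTF-8 encoded bytes for "asçeptique is not a word".
raw = b"as\xc3\xa7eptique is not a word"
text = to_unicode(raw)

# Assumed usage with a loaded pipeline:
# doc = en_nlp(text)
```

Text that is already unicode passes through untouched, so the helper can be called on every input without decoding twice.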

This is something you really need to know about if you are using Python 2 rather than Python 3. The documentation is here:
https://docs.python.org/2/howto/unicode.html
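
Since the text in the question comes from a file, another option is to decode it while reading. A minimal sketch, assuming the file is UTF-8 encoded: read_text and "input.txt" are hypothetical names, and parser stands for the pipeline object from the traceback, so the usage lines are left commented out.

```python
import io


def read_text(path, encoding="utf-8"):
    """Read a whole file and return its contents as a unicode string.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    # io.open decodes while reading on both Python 2 and Python 3,
    # unlike the built-in open() on Python 2, which returns byte strings.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Assumed usage, mirroring the script in the traceback:
# text = read_text("input.txt")
# parsedData = parser(text)
```

If the file is not actually UTF-8, the read raises UnicodeDecodeError, which is usually easier to diagnose than a TypeError deep inside the pipeline.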
