spaCy: Japanese (MeCab): the very first call of `nlp` fails

Created on 5 Nov 2018  ·  8 Comments  ·  Source: explosion/spaCy

How to reproduce the behaviour

The very first call of `nlp` (probably with any input string) results in an error:

>>> import spacy
>>> nlp = spacy.blank('ja')
>>> nlp('pythonが大好きです')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 340, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 117, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 81, in __call__
    doc = Doc(self.vocab, words=words, spaces=[False]*len(words))
  File "doc.pyx", line 176, in spacy.tokens.doc.Doc.__init__
  File "doc.pyx", line 559, in spacy.tokens.doc.Doc.push_back
ValueError: [E031] Invalid token: empty string ('') at position 0.
>>> nlp('pythonが大好きです')
pythonが大好きです
>>> 

MeCab is installed according to https://github.com/SamuraiT/mecab-python3#user-content-installation-and-usage. The example from there works:

>>> import MeCab
>>> mecab = MeCab.Tagger ("-Ochasen")
>>> print(mecab.parse("pythonが大好きです"))
python  python  python  名詞-固有名詞-組織      
が   ガ   が   助詞-格助詞-一般       
大好き ダイスキ    大好き 名詞-形容動詞語幹       
です  デス  です  助動詞 特殊・デス   基本形
EOS

>>>

Info about spaCy

  • spaCy version: 2.0.16
  • Platform: Linux-3.16.0-4-amd64-x86_64-with-debian-8.10
  • Python version: 3.6.4
Labels: bug, help wanted, lang / ja, third-party

All 8 comments

I can't reproduce this, but it seems like you might not have Unidic installed. That wouldn't explain why it would fail once and then work afterwards, but could you verify that you have Unidic installed? It's required for Universal Dependencies support. The documentation should be clearer about this soon (see #3001).

(I assume you are using IPAdic because IPAdic has a chasen output format in the config file and Unidic does not.)
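If it helps to verify which dictionary is actually being loaded, here is a small sketch that prints the dictionaries the tagger picked up. It assumes the mecab-python3 bindings expose Tagger.dictionary_info() like the underlying C++ API does; the printed path should point at an ipadic or unidic directory:

import MeCab

tagger = MeCab.Tagger()
info = tagger.dictionary_info()
while info:
    # filename shows which system/user dictionary was loaded
    print(info.filename, info.charset, info.size)
    info = info.next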

@polm Thanks for the reply. I'm indeed using IPAdic: as I said, MeCab was installed according to https://github.com/SamuraiT/mecab-python3#user-content-installation-and-usage, i.e. the command apt-get install mecab mecab-ipadic-utf8 libmecab-dev swig was run before pip install mecab-python3.

Unfortunately, using unidic-mecab instead of mecab-ipadic-utf8 doesn't help. The error is the same:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 340, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 117, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 81, in __call__
    doc = Doc(self.vocab, words=words, spaces=[False]*len(words))
  File "doc.pyx", line 176, in spacy.tokens.doc.Doc.__init__
  File "doc.pyx", line 566, in spacy.tokens.doc.Doc.push_back
ValueError: [E031] Invalid token: empty string ('') at position 0.

Here's a Dockerfile to reproduce (run docker build . from the directory containing it):

FROM python@sha256:a837aefef8f2553789bc70436621160af4da65d95b0fb02d5032557f887d0ca5
# python:3.7.1 (Debian Stretch), Sun Nov 11 17:29:43 +05 2018

RUN apt-get update
RUN apt-get install -y --no-install-recommends \
  mecab=0.996-3.1 \
  unidic-mecab=2.1.2~dfsg-6 \
  libmecab-dev=0.996-3.1 \
  swig=3.0.10-1.1

RUN pip install spacy==2.0.18 mecab-python3==0.996.1

RUN python -c 'import spacy; nlp = spacy.blank("ja"); \
  doc = nlp("pythonが大好きです"); print(doc)'

Thanks for the Dockerfile and sorry for my late reply - was hoping this was related to the recent issue with the mecab-python3 upgrade. I tried downgrading the version of that used in the Dockerfile and it didn't help though, so it seems it's unrelated.

The good news is I found a minimal example of the issue, the bad news is it makes no sense. Here's code that shows the issue:

import MeCab
tagger = MeCab.Tagger()

def print_tokens(text):
    node = tagger.parseToNode(text).next  # skip the BOS node
    while node.posid != 0:  # stop at the EOS node (posid 0 with this dictionary)
        print(node.surface[:])
        node = node.next

print('-----')
print_tokens("日本語だよ")
print_tokens("日本語だよ")

Output

-----


だ
よ
日本
語
だ
よ

Please note the first two lines are blank, which is not only wrong but very strange. Poking around internally, data besides the literal token (like POS and lemma) seem OK, just the surface disappears. This issue does not affect the .parse function, only .parseToNode.
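For anyone who wants to check this on their own setup, here is a small comparison of the two calls, sketched from the snippet above (on an unaffected MeCab build both should print sensible surfaces):

import MeCab

tagger = MeCab.Tagger()
text = "日本語だよ"

# .parse returns the fully formatted analysis and is not affected
print(tagger.parse(text))

# .parseToNode walks the lattice nodes; on affected MeCab builds the
# surfaces come back empty on the very first call
node = tagger.parseToNode(text).next  # skip the BOS node
while node.next:                      # stop before the EOS node
    print(repr(node.surface))
    node = node.next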

As the sample code above indicates, this isn't a problem with spaCy - it's an issue either with the Python library or MeCab itself, though since neither of them is updated often, and I can't reproduce it on my Arch machine, it might be Debian-specific.

I'll keep looking into this and see if I can figure out what's up.

I think I found the cause of the issue - it looks like it's the same as SamuraiT/mecab-python3#3, which is caused by taku910/mecab#5. The problem is in MeCab itself and is fixed in the latest git version, but not in the latest release (which is from 2013).

@kbulygin For you individually, I think the best solution is to install MeCab from source. Sorry I can't provide a better solution, but given how long MeCab has been without a release, it's hard to say when the next one will be or how long it will take distribution packages to be updated.

For spaCy we might want to post a warning somewhere... not sure about the best place to do that. Since this is an issue in the C++ library, we can't really handle it by pinning a version in requirements.txt or anything like that.

I will poke the main MeCab project and maybe some distribution maintainers.

@polm Thanks for the investigation. Personally, I'm not much affected by the error now: I just thought the behaviour was worth reporting, as the error seemed related to how initialization is done on the spaCy side.

As far as I understand from https://github.com/taku910/mecab/issues/5#issuecomment-80528189, another safe workaround is probably just to do a warm-up like:

import spacy
nlp = spacy.blank('ja')
nlp('')  # no exception is raised
# `nlp` is ready now.

This could also be done in spaCy itself, of course, unless that's considered too hacky.

Thanks for getting to the bottom of this!

Running warm-up code on initialization of the Japanese language class sounds fine. It could process an empty string, or just make the most minimal call into MeCab required to prevent this issue. We can then add a comment pointing to this issue and remove the hack once it's fixed in MeCab itself.
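For illustration, a minimal sketch of what such a warm-up could look like; the class and attribute names below are simplified stand-ins, not the actual code in spacy/lang/ja/__init__.py:

import MeCab

class JapaneseTokenizer(object):
    """Simplified stand-in for the MeCab-based tokenizer in spacy/lang/ja."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.tokenizer = MeCab.Tagger()
        # Warm-up call: on affected MeCab builds the very first parseToNode()
        # returns nodes with empty surfaces (taku910/mecab#5), so trigger that
        # first call here and throw the result away. Remove once a fixed MeCab
        # release is widely available.
        self.tokenizer.parseToNode("")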

The next release of mecab-python3 from PyPI will include pre-built wheels that bundle a version of MeCab with this bug fixed. Unfortunately I can't promise exactly when that will happen, but I'm shooting for "before the end of December".

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
