spaCy: cannot analyze ` 虅 虅` with Japanese models

Created on 24 Aug 2020 · 10 comments · Source: explosion/spaCy

How to reproduce the behaviour

When I ran the following very small script

import spacy
nlp = spacy.load('ja_core_news_sm')
nlp(' 虅 虅')

I got the following error

>>> nlp(' 虅 虅')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 106, in get_dtokens_and_spaces
    word_start = text[text_pos:].index(word)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 145, in __call__
    dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 108, in get_dtokens_and_spaces
    raise ValueError(Errors.E194.format(text=text, words=words))
ValueError: [E194] Unable to aligned mismatched text ' 虅 虅' and words '[' ', '虅', ' 虅']'.

The minimal Dockerfile is here

FROM python:3.7

RUN pip install spacy
RUN python -m spacy download ja_core_news_sm

Your Environment

  • Operating System: Linux 04a7a76544e5 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 GNU/Linux
  • Python Version Used: 3.7.7
  • spaCy Version Used: 2.3.2
  • Environment Information: minimal Dockerfile as shown above
Labels: bug, lang / ja

All 10 comments

Thanks for the report! This is definitely a bug.

@hiroshi-matsuda-rit: I don't know whether you'd have time to look into this? I don't speak Japanese, so I'm not sure about the tokenization issues. From a first inspection, it looks to me like the self._get_dtokens function includes the space within the third token, but get_dtokens_and_spaces then skips over that space, which ultimately results in an error because the third token can no longer be found in the string. I feel like self._get_dtokens is probably what should be fixed?

This behavior might be coming from SudachiPy.
I'd like to look into it soon.
@polm Have you encountered this kind of problem before?

@sorami Could you help us?

Looks like it's a macron character? It wouldn't be used in normal Japanese, but it might be used in romaji.

https://www.fileformat.info/info/unicode/char/0304/index.htm
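For reference (my own quick check, not from the original report): U+0304 is a combining mark, so on its own it has nothing to attach to, while composed with a base letter it normalizes into a single precomposed character:

```python
import unicodedata

# U+0304 by itself is just a floating combining mark.
print(unicodedata.name("\u0304"))               # COMBINING MACRON

# Attached to a base letter, NFC composes it into one character.
print(unicodedata.normalize("NFC", "a\u0304"))  # 'ā' (U+0101)
```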

I suspect this has to do with how SudachiPy normalizes characters; this was a vaguely similar issue:

https://github.com/WorksApplications/SudachiPy/issues/120

The SudachiPy analysis didn't look obviously incorrect to me, either. I suspect the problem is that the third token returned by SudachiPy starts with whitespace, which throws the alignment off as Sofie described. But I don't know enough about how SudachiPy is supposed to work to say where the bug is.

@adrianeboyd The third token in SudachiPy's output for the example sentence starts with whitespace, which is unexpected behavior for the current Japanese language model.
In such cases, we should divide the token into a whitespace part and the remaining part.
I'll make a quick fix for the master branch.
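The split described above could look roughly like this. This is a minimal sketch of the idea, not the actual spaCy patch; the helper name is hypothetical:

```python
def split_leading_space(surface):
    """Split a token surface into a leading-whitespace part and the rest.

    Hypothetical helper illustrating the fix described above, so that each
    piece can be aligned with the original text separately.
    """
    stripped = surface.lstrip()
    n_space = len(surface) - len(stripped)
    if n_space == 0:
        return [surface]
    return [surface[:n_space], stripped]

print(split_leading_space(" x"))  # [' ', 'x']
print(split_leading_space("x"))   # ['x']
```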

After trying some workarounds, I decided to set the space_after field of each token by referring to the surface of the next token instead of the next character in the text.
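In sketch form, that approach might look like the following. This is an illustration of the idea under my own naming, not the actual spaCy implementation:

```python
def spaces_from_surfaces(surfaces):
    """Decide each token's space_after flag from whether the *next* token's
    surface begins with whitespace, rather than peeking at the next
    character of the raw text. Hypothetical sketch only."""
    flags = []
    for i in range(len(surfaces)):
        nxt = surfaces[i + 1] if i + 1 < len(surfaces) else ""
        flags.append(nxt[:1].isspace())
    return flags

print(spaces_from_surfaces(["a", " b", "c"]))  # [True, False, False]
```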

@sorami It seems SudachiPy has some inconsistency in the dictionary_form and reading_form fields when analyzing contexts that include certain symbol characters after whitespace.

@svlandeg @adrianeboyd I think we can release a bug-fix version even if SudachiPy itself is not fixed yet.

Hello, I realize this topic is closed, but I recently ran into a similar problem when attempting to read text containing the character . I was wondering whether this is just malformed data on my part, or whether the bugfix described in this issue should take care of it? And if the latter, has the bugfix already been released? I didn't notice anything in 2.3.1. Thanks for your help!

My impression is that spaCy should not throw an exception on any text you throw at it; the flip side is that it will then process even garbage.

It looks like you have a COMBINING ACUTE ACCENT floating by itself, which is not really going to be useful. You might be able to fix it by applying Unicode NFKC normalization to your input text.
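That suggestion could be applied as a small preprocessing step before calling the pipeline. A sketch, assuming NFKC composition is acceptable for your data (it also changes width and compatibility characters, which may or may not be desired):

```python
import unicodedata

def preprocess(text):
    # Compose stray combining marks with their base characters where
    # possible before handing the text to the pipeline.
    return unicodedata.normalize("NFKC", text)

# 'e' followed by a combining acute accent becomes precomposed 'é'.
print(preprocess("e\u0301") == "\u00e9")  # True
```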
