I can't search for English words in Chinese search mode.

I set html_search_language = 'zh' to use Chinese search. My document contains the sentence 可以查看 FAQ 模块中 test 部分 ("you can check the test section in the FAQ module"). When I search for the keyword test, no results are displayed.
Expected behavior: documents containing the test string should appear in the search results.
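For reference, my setup is just this one option in conf.py (a minimal sketch; the project metadata below is a placeholder):

# conf.py -- minimal sketch; project metadata is a placeholder
project = 'demo'

# switch the HTML search index to the Chinese word splitter
html_search_language = 'zh'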
Why do space characters appear at the end of the word segmentation result?
I looked at the source code and found the following regular expression, which captures the trailing space:
sphinx/search/zh.py
Now test as follows:
In [3]: re.compile(u'(?u)\\w+[\u0000-\u00ff]').findall("可以查看 FAQ 模块中 test 部分")
Out[3]: ['可以查看 ', 'FAQ ', '模块中 ', 'test ']
The result clearly contains trailing space characters; this is the source of the problem.
Can we solve this problem by adjusting the regular expression?
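For example, would a lookahead work? It checks for the boundary character without consuming it, so the space is no longer captured (just a sketch, I have not tested it inside Sphinx itself):

In [4]: # sketch only: lookahead instead of consuming the boundary character
In [5]: re.compile(u'(?u)\\w+(?=[\u0000-\u00ff]|$)').findall("可以查看 FAQ 模块中 test 部分")
Out[5]: ['可以查看', 'FAQ', '模块中', 'test', '部分']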
Why do space characters appear at the end of the word segmentation result?
Sorry, I don't know why, because none of the maintainers can read Chinese. So any pull requests are welcome :-)
I did a test; this problem also occurs with English sentences.
In [1]: import re
In [2]: re.compile(u'\\w+[\u0000-\u00ff]').findall('this is a test string')
Out[2]: ['this ', 'is ', 'a ', 'test ', 'string']
@DaGeger I created a PR that removes the trailing white spaces for _Latin_ search index terms. However, I do not understand if/why these spaces should remain in Chinese. Can you confirm the search works as intended in Chinese now (check out my branch and install from source)?
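In essence, the change strips the trailing boundary whitespace from each extracted term before it goes into the index; a simplified sketch of the idea (not the literal diff):

import re

latin1_letters = re.compile(u'(?u)\\w+[\u0000-\u00ff]')

def extract_terms(text):
    # simplified sketch, not the literal PR diff: drop the trailing
    # boundary whitespace from each match before indexing it
    return [term.strip() for term in latin1_letters.findall(text)]

With this, the indexed term 'test ' becomes 'test', so the query matches again.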
Great! I tested your code and it solves the problem.

To reply to "why should these spaces remain in Chinese?": when writing documents, English words often appear embedded directly in Chinese text.
e.g.: 查找CAS service配置 ("look up the CAS service configuration")
In this case the spaces must be preserved.
By the way, I don't understand why searching for CAS gives no results.

Is this expected? How can such a term be made searchable?
Thanks for testing. The problem with CAS is that it is first stemmed and then considered too short to be relevant (only two letters). We've fixed a similar issue in the English search a while ago. I can fix this here as well and will ask you for another test once the fix is ready.
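For reference, the English fix worked along these lines (a sketch from memory, with hypothetical helper names stem and is_relevant, not the exact Sphinx code):

def stem_with_fallback(word, stem, is_relevant):
    # sketch with hypothetical helpers, not the exact Sphinx code:
    # if stemming makes the word too short to pass the relevance
    # filter, fall back to indexing the original word
    stemmed = stem(word)
    if not is_relevant(stemmed) and is_relevant(word):
        return word
    return stemmed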
I tested your commit in the same way, but there are still no results when searching for CAS.

@DaGeger In my tests, the search index includes cas. Can you run git pull --rebase and pip install -e . in the repo's root folder (make sure my branch is checked out) to ensure you have the latest changes? Also, try deleting the old output folder (typically: _build) manually before running the build. If this doesn't help, can you share the exact doc set you are testing with?
I tested as you described, several times, and the problem persists.
Testing with your code:
This sentence can be searched:
模块中 CAS service部分

But this one cannot:
模块中CAS service部分

The only difference is whether there is a space to the left of the word CAS.
I tried to track down the problem and reached a new conclusion: I think the cause is still this regular expression:
re.compile(u'(?u)\\w+[\u0000-\u00ff]')
In [1]: import re
In [2]: latin1_letters = re.compile(u'(?u)\\w+[\u0000-\u00ff]')
In [3]: latin1_letters.findall('模块中 CAS service部分')
Out[3]: ['模块中 ', 'CAS ', 'service']
In [4]: latin1_letters.findall('模块中CAS service部分')
Out[4]: ['模块中CAS ', 'service']
Look: when there is no space to the left of the word, the keyword cas is not matched as a separate term.
Is this the expected behavior?
Okay; splitting Chinese from "Latin" words is then an additional issue that needs fixing. I talked this over with a Chinese colleague to make sure I understand the implications. Will try to fix it asap.
@TimKam, how about this pattern? Are these the expected results?
>>> import re
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['可以查看', 'FAQ', '模块中', 'Chinesetest', '部分', '模块中', 'CAS', 'service', '部分', '取而代之的是它们通过', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, '模块中CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
For easy reading:
(?:
(?:
(?![\s.,])[\x00-\xFF]
)+
|
(?:
(?![\x00-\xFF])\w
)+
)
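This layout can even be used directly if the pattern is compiled with re.VERBOSE, for instance (the variable name is only for illustration):

import re

# the pattern above, compiled with re.VERBOSE so insignificant
# whitespace inside the pattern is ignored
latin_or_cjk = re.compile(r'''
    (?:
        (?: (?![\s.,]) [\x00-\xFF] )+   # run of Latin-1 chars, excluding whitespace, '.' and ','
        |
        (?: (?![\x00-\xFF]) \w )+       # run of word chars outside Latin-1 (e.g. CJK)
    )
''', re.VERBOSE)

latin_or_cjk.findall('模块中CAS service部分')  # ['模块中', 'CAS', 'service', '部分']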
Thanks for the suggestion, @animalize. Some days ago, I updated the PR and adjusted the regexp to the following: [\u0000-\u00ff]\w+\w+[\u0000-\u00ff]. This correctly extracts Latin "sub terms", as far as I can assess. Can you confirm this or find a test case for which your regexp is better? :-)
Please assess these test cases:
>>> import re
>>> pattern = u'[\u0000-\u00ff]\w+\w+[\u0000-\u00ff]'
>>> re.findall(pattern, '可以abc查看')
[]
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This ', ' test ', 'string.']
>>>
>>>
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> re.findall(pattern, '可以abc查看')
['可以', 'abc', '查看']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
If only Latin strings are needed, we can use this pattern:
>>> import re
>>> pattern = u'(?:(?![\s.,])[\x00-\xFF])+'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['FAQ', 'Chinesetest', 'CAS', 'service', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, '模块中CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
>>> re.findall(pattern, '可以abc查看')
['abc']
I have a better idea: we can use this pattern: r'[a-zA-Z0-9_]+'
Please see this code in https://github.com/sphinx-doc/sphinx/blob/1.8/sphinx/search/__init__.py:
_word_re = re.compile(r'(?u)\w+')
...
def split(self, input):
    # type: (unicode) -> List[unicode]
    """
    This method splits a sentence into words. Default splitter splits input
    at white spaces, which should be enough for most languages except CJK
    languages.
    """
    return self._word_re.findall(input)
BTW, this docstring is wrong according to the code: the splitter does not split only at white space, because (?u)\w+ matches runs of Unicode word characters and therefore also breaks at punctuation.
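A quick check confirms this:

>>> # my own quick test: (?u)\w+ splits at punctuation too, not only at spaces
>>> import re
>>> re.compile(r'(?u)\w+').findall("white-space isn't the only delimiter")
['white', 'space', 'isn', 't', 'the', 'only', 'delimiter']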
\w in the Python 3 docs means:
\w
For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
\w in the Python 2 docs means:
\w
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
So we can simply use r'[a-zA-Z0-9_]+' for Latin words.
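A quick check on the earlier test strings; note that, unlike my pattern above, it also drops punctuation such as the brackets in b[]:

>>> import re
>>> latin_re = re.compile(r'[a-zA-Z0-9_]+')   # variable name is just for illustration
>>> latin_re.findall('模块中CAS service部分')
['CAS', 'service']
>>> latin_re.findall('This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b', '1', '2']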
Thanks, seems to work. I will merge this when all tests pass!
@TimKam @tk0miya
You can @ me when you need help with htmlhelp or Chinese-related issues.