I can't search for English words in Chinese search mode.

I set html_search_language = 'zh' to use Chinese search. My document contains the sentence 可以查看 FAQ 模块中 test 部分 ("you can check the test section in the FAQ module"). When I search for the keyword test, no results are displayed.
Expected behavior: documents containing the test string should appear in the search results.
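For reference, my setup is just this one option in conf.py (a minimal sketch; the project metadata below is a placeholder):

# conf.py -- minimal sketch; project metadata is a placeholder
project = 'demo'

# switch the HTML search index to the Chinese word splitter
html_search_language = 'zh'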
Why do space characters appear at the end of the word segmentation result?
I looked at the source code and found the following regular expression, which captures the trailing space:
sphinx/search/zh.py
Now test as follows:
In [3]: re.compile(u'(?u)\\w+[\u0000-\u00ff]').findall("可以查看 FAQ 模块中 test 部分")
Out[3]: ['可以查看 ', 'FAQ ', '模块中 ', 'test ']
The result clearly contains trailing space characters; this is the source of the problem.
Can we solve this problem by adjusting the regular expression?
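For example, would a lookahead work? It checks for the boundary character without consuming it, so the space is no longer captured (just a sketch, I have not tested it inside Sphinx itself):

In [4]: # sketch only: lookahead instead of consuming the boundary character
In [5]: re.compile(u'(?u)\\w+(?=[\u0000-\u00ff]|$)').findall("可以查看 FAQ 模块中 test 部分")
Out[5]: ['可以查看', 'FAQ', '模块中', 'test', '部分']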
Why do space characters appear at the end of the word segmentation result?
Sorry, I don't know why, because none of the maintainers can read Chinese. So any pull requests are welcome :-)
I did a test; this problem also occurs with English sentences.
In [1]: import re
In [2]: re.compile(u'\\w+[\u0000-\u00ff]').findall('this is a test string')
Out[2]: ['this ', 'is ', 'a ', 'test ', 'string']
@DaGeger I created a PR that removes the trailing white spaces for _Latin_ search index terms. However, I do not understand if/why these spaces should remain in Chinese. Can you confirm the search works as intended in Chinese now (check out my branch and install from source)?
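In essence, the change strips the trailing boundary whitespace from each extracted term before it goes into the index; a simplified sketch of the idea (not the literal diff):

import re

latin1_letters = re.compile(u'(?u)\\w+[\u0000-\u00ff]')

def extract_terms(text):
    # simplified sketch, not the literal PR diff: drop the trailing
    # boundary whitespace from each match before indexing it
    return [term.strip() for term in latin1_letters.findall(text)]

With this, the indexed term 'test ' becomes 'test', so the query matches again.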
Great! I tested your code and it solves the problem.

To reply to "why should these spaces remain in Chinese?": when writing documents, English words often appear embedded directly in Chinese text.
e.g.: 查找CAS service配置 ("look up the CAS service configuration")
In this case the spaces must be preserved.
By the way, I don't understand why searching for CAS gives no results.

Is this expected? How can such a term be made searchable?
Thanks for testing. The problem with CAS is that it is first stemmed and then considered too short to be relevant (only two letters). We've fixed a similar issue in the English search a while ago. I can fix this here as well and will ask you for another test once the fix is ready.
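For reference, the English fix worked along these lines (a sketch from memory, with hypothetical helper names stem and is_relevant, not the exact Sphinx code):

def stem_with_fallback(word, stem, is_relevant):
    # sketch with hypothetical helpers, not the exact Sphinx code:
    # if stemming makes the word too short to pass the relevance
    # filter, fall back to indexing the original word
    stemmed = stem(word)
    if not is_relevant(stemmed) and is_relevant(word):
        return word
    return stemmed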
I tested your commit in the same way, but there are still no results when searching for CAS.

@DaGeger In my tests, the search index includes cas. Can you run git pull --rebase and pip install -e . in the repo's root folder (make sure my branch is checked out) to ensure you have the latest changes? Also, try deleting the old output folder (typically: _build) manually before running the build. If this doesn't help, can you share the exact doc set you are testing with?
I tested as you described, several times, and the problem persists.
Testing with your code:
This sentence can be searched:
模块中 CAS service部分

But this one cannot:
模块中CAS service部分

The only difference is whether there is a space to the left of the word CAS.
I tried to track down the problem and reached a new conclusion: I think the cause is still this regular expression:
re.compile(u'(?u)\\w+[\u0000-\u00ff]')
In [1]: import re
In [2]: latin1_letters = re.compile(u'(?u)\\w+[\u0000-\u00ff]')
In [3]: latin1_letters.findall('模块中 CAS service部分')
Out[3]: ['模块中 ', 'CAS ', 'service']
In [4]: latin1_letters.findall('模块中CAS service部分')
Out[4]: ['模块中CAS ', 'service']
Look: when there is no space to the left of the word, the keyword cas is not matched as a separate term.
Is this the expected behavior?
Okay; splitting Chinese from "Latin" words is then an additional issue that needs fixing. I talked this over with a Chinese colleague to make sure I understand the implications. Will try to fix it asap.
@TimKam, how about this pattern? Are these the expected results?
>>> import re
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['可以查看', 'FAQ', '模块中', 'Chinesetest', '部分', '模块中', 'CAS', 'service', '部分', '取而代之的是它们通过', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, '模块中CAS service部分')
['模块中', 'CAS', 'service', '部分']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
For easy reading:
(?:
(?:
(?![\s.,])[\x00-\xFF]
)+
|
(?:
(?![\x00-\xFF])\w
)+
)
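This layout can even be used directly if the pattern is compiled with re.VERBOSE, for instance (the variable name is only for illustration):

import re

# the pattern above, compiled with re.VERBOSE so insignificant
# whitespace inside the pattern is ignored
latin_or_cjk = re.compile(r'''
    (?:
        (?: (?![\s.,]) [\x00-\xFF] )+   # run of Latin-1 chars, excluding whitespace, '.' and ','
        |
        (?: (?![\x00-\xFF]) \w )+       # run of word chars outside Latin-1 (e.g. CJK)
    )
''', re.VERBOSE)

latin_or_cjk.findall('模块中CAS service部分')  # ['模块中', 'CAS', 'service', '部分']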
Thanks for the suggestion, @animalize. Some days ago, I updated the PR and adjusted the regexp to the following: [\u0000-\u00ff]\w+\w+[\u0000-\u00ff]. This correctly extracts Latin "sub terms", as far as I can assess. Can you confirm this or find a test case for which your regexp is better? :-)
Please assess these test cases:
>>> import re
>>> pattern = u'[\u0000-\u00ff]\w+\w+[\u0000-\u00ff]'
>>> re.findall(pattern, '可以abc查看')
[]
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This ', ' test ', 'string.']
>>>
>>>
>>> pattern = u'(?:(?:(?![\s.,])[\x00-\xFF])+|(?:(?![\x00-\xFF])\w)+)'
>>> re.findall(pattern, '可以abc查看')
['可以', 'abc', '查看']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
If only Latin strings are needed, we can use this pattern:
>>> import re
>>> pattern = u'(?:(?![\s.,])[\x00-\xFF])+'
>>> s = '''\
... 可以查看 FAQ 模块中 Chinesetest 部分
...
... 模块中 CAS service部分
...
... 取而代之的是它们通过ZigBee'''
>>> re.findall(pattern, s)
['FAQ', 'Chinesetest', 'CAS', 'service', 'ZigBee']
>>> re.findall(pattern, '模块中 CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, '模块中CAS service部分')
['CAS', 'service']
>>> re.findall(pattern, 'This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b[]', '1', '2']
>>> re.findall(pattern, '可以abc查看')
['abc']
I have a better idea: we can use this pattern: r'[a-zA-Z0-9_]+'
Please see this code in https://github.com/sphinx-doc/sphinx/blob/1.8/sphinx/search/__init__.py:
_word_re = re.compile(r'(?u)\w+')
...
def split(self, input):
    # type: (unicode) -> List[unicode]
    """
    This method splits a sentence into words. Default splitter splits input
    at white spaces, which should be enough for most languages except CJK
    languages.
    """
    return self._word_re.findall(input)
BTW, this docstring is wrong according to the code: the splitter does not split only at white space, because (?u)\w+ matches runs of Unicode word characters and therefore also breaks at punctuation.
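A quick check confirms this:

>>> # my own quick test: (?u)\w+ splits at punctuation too, not only at spaces
>>> import re
>>> re.compile(r'(?u)\w+').findall("white-space isn't the only delimiter")
['white', 'space', 'isn', 't', 'the', 'only', 'delimiter']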
\w in the Python 3 docs means:
\w
For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
\w in the Python 2 docs means:
\w
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
So we can simply use r'[a-zA-Z0-9_]+' for Latin words.
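A quick check on the earlier test strings; note that, unlike my pattern above, it also drops punctuation such as the brackets in b[]:

>>> import re
>>> latin_re = re.compile(r'[a-zA-Z0-9_]+')   # variable name is just for illustration
>>> latin_re.findall('模块中CAS service部分')
['CAS', 'service']
>>> latin_re.findall('This is a test string. a.b[].1.2')
['This', 'is', 'a', 'test', 'string', 'a', 'b', '1', '2']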
Thanks, seems to work. I will merge this when all tests pass!
@TimKam @tk0miya
You can @ me when you need help with htmlhelp or Chinese-related issues.