Sphinx: HTML Search results aren't reader friendly

Created on 3 Jan 2015 · 28 Comments · Source: sphinx-doc/sphinx

The HTML built-in search is very useful, especially for offline help, but the results content isn't reader friendly.

For example, I get such content:

*************** Report *************** .. important:: For method, 

which isn't understandable by the average user.

This happens because searchtools.js uses the files in _sources, which are copies of the reST sources.
But I think the search only needs text files, not the real source files.

After replacing the content of _sources with the output of the Sphinx text builder, and setting

text_sectionchars = '       '

I get a better result:

Report Important: For method,

This is a lot better, even though the * characters of bold markup are still visible.

It would be great if this rendering to text were automated as part of the HTML build.


enhancement html search low

All 28 comments

_From Andrea Cassioli on 2014-12-17 10:08:54+00:00_

I totally agree, using the reST text as output is misleading and not nice at all!

+1 for this feature.
I was surprised that the search displays source files in the search results.
Users expect to see the rendered HTML content. For now, I am using the workaround of replacing the source files with text files.

The search results would be much better if one:

  • removed all markup (i.e. headings, images, bold and italic print) when building the .txt files.
  • built the .txt files as part of the normal html build and placed them in the _sources directory of the build.

Is there any downside in doing this by default?

I'm pretty sure lots of machinery in Sphinx assumes there's only ever one build happening per runtime, so doing text as part of html might be both wasteful in terms of build time and complex to implement.

Thanks for the info. Regarding the build of the .txt files, a small customization of the Makefile can do the trick.

Still, it would be nice if one could produce more "search-friendly" .txt files. What do you think?

Currently, we run a custom script to remove the remaining markup.
I suppose this might be a relatively common problem Sphinx users have.
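
For readers who want to try the workaround described above, a post-build step could look roughly like the following sketch. It is not the script used by anyone in this thread. The function name, the directory layout, and the assumption that the copied sources are stored as either "<docname>.txt" or "<docname>.rst.txt" are all illustrative; check what your own build actually writes into _sources.

```python
from pathlib import Path
import shutil


def replace_sources_with_text(text_dir, sources_dir,
                              source_suffix=".rst", link_suffix=".txt"):
    """Overwrite the files in ``_sources`` with text builder output.

    Hypothetical helper: both directories must already exist, i.e. the
    text build and the HTML build have to be run beforehand.
    Returns the list of files that were overwritten.
    """
    text_dir, sources_dir = Path(text_dir), Path(sources_dir)
    replaced = []
    for text_file in sorted(text_dir.rglob("*.txt")):
        # "ext/extlinks.txt" -> "ext/extlinks"
        docname = text_file.relative_to(text_dir).with_suffix("").as_posix()
        # Assumption: the source copy is named "<docname>.txt" or
        # "<docname>.rst.txt", so both candidates are tried.
        candidates = [
            sources_dir / (docname + link_suffix),
            sources_dir / (docname + source_suffix + link_suffix),
        ]
        for target in candidates:
            if target.is_file():
                shutil.copyfile(text_file, target)
                replaced.append(target)
    return replaced
```

It would be called after both builds have finished, for example as `replace_sources_with_text("_build/text", "_build/html/_sources")`, where both paths are placeholders for your own output directories.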

In case this is a relevant issue for somebody else:

I wrote a - so far very basic - extension that fixes this issue and builds the search result snippets without markup.
GitHub: https://github.com/TimKam/sphinx-pretty-searchresults
PyPI: https://pypi.python.org/pypi/sphinxprettysearchresults

The extension should also provide a fix/workaround for issue #2369.

Of course, I welcome feedback & improvement suggestions.

There's one fairly simple way to fix this without adding additional build steps or output files. Currently, the search displays result snippets by requesting the corresponding source files from the server/local file system and extracting the text from them. It's possible to adjust this functionality so that it requests the HTML files instead of the source files. Then, it's fairly simple to extract the text from the HTML at runtime in the client/browser.
Of course, this increases load sizes and computing time a bit, but I don't think the change in performance is significant.
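
To make the idea concrete, here is a minimal sketch of "fetch the rendered page, keep only its text, cut a snippet around the search term". It is written in Python with the standard library purely for illustration; the real search code is JavaScript running in the browser, and the names `html_to_text` and `make_snippet` are hypothetical, not taken from searchtools.js.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def make_snippet(text, keyword, width=80):
    """Cut a window of ``width`` characters around the first hit."""
    pos = text.lower().find(keyword.lower())
    start = max(pos - width // 2, 0) if pos >= 0 else 0
    return text[start:start + width]
```

A real implementation would probably also restrict extraction to the main content area of the page so that navigation and sidebar text do not end up in the snippet; that part is left out here.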

What do you think @tk0miya ? I'll add a PR (which probably needs some refinement/discussion) later.

This would make the pretty search results extension obsolete, which is a good thing in my opinion, because the messed up search results are a hard bug in the eyes of the users and it shouldn't be necessary to install an extension to fix a bug :-)

Any updates on this issue?

It would be useful to know the current state of this issue. It's very confusing for users, as the main sphinx docs themselves don't seem to have this issue (e.g. see http://www.sphinx-doc.org/en/master/search.html?q=sphinx&check_keywords=yes&area=default)

@tk0miya Could we have your opinions on this?

There are two alternatives to the approach I use in my PR (requesting the HTML):

  • Creating plain text files at build time, as the sphinx-pretty-searchresults extension does: I don't think this is a good idea for standard Sphinx, because it increases the build time significantly.

  • Using regexps to remove the markup at run time: this is not elegant either, but it at least feels lightweight in comparison to loading the HTML files.

Can you live with any of these options?

I prefer the first option. It certainly increases build time, but it can remove markup perfectly and can also support translation.
However, I know this approach requires a big refactoring of the Sphinx core, so the latter one is useful enough, and we can
improve our search much sooner than with the first approach. The big problem with the second approach is my own skill: I'm not good at JavaScript, so I will not be able to maintain it. This approach needs new maintainers.

Okay, then I propose the following:

  • I add a PR that removes the markup (somewhat imperfectly) with regexps.
  • If you want to merge it, but are concerned about its maintenance, I can take a limited/junior maintainer role with a focus on JavaScript and docs.

I just remembered that @timhoffm sent us such a script in #4857. It might be a good workaround for this problem. Could you check it, please?

@shimizukawa What do you think about the workaround?

Worth noting that both sphinx-doc.org and readthedocs.org seem to have fixed this problem already, so there are already solutions to this that are being used in anger. They're presumably happy with their solutions, so would understanding them help inform which route is the most sensible?

@tstibbs afaik sphinx-doc.org is hosted by ReadTheDocs, and ReadTheDocs provides a custom search back end (using Haystack and Elasticsearch). But for the average self-hosted Sphinx project, a search back end is presumably too much work, and fixing the front-end-only search in one of the ways I described is necessary.

@tk0miya I will take a look at #4857 asap and compare it to the tests I wrote for my sphinx-pretty-searchresults extension.

As the original reporter of the issue, I would like to remind you that a search back end isn't always possible. In some cases, we need the documentation and the search to work offline, i.e. with the HTML opened directly in a browser on the same machine, without any web server (and also without Internet access).
Without this offline requirement, I would have long since switched to a better search engine.

I took a look at #4857 (regexp parsing) and compared it to #4022 (using HTML snippets).
Here is how the comparison looks:

Regexp:
[screenshot: search_regexp_parsing]

HTML:
[screenshot: search_html_snippets]

IMHO, the regexp approach requires quite some additional work. The only disadvantage of the HTML approach is that it loads significantly more data, but I personally still think it's feasible (probably better than implementing a reStructuredText parser in JavaScript).

We could make this configurable (opt-out).

Any other opinions on this?

Good to see that this topic gets attention.

I just did a minimum amount of work in #4857 (regexp-parsing) to get something readable.
The regexp approach gets you 80% of the way with no or little additional work on the minimal parser. A full reStructuredText parser wouldn't help that much more. Though there seem to be libraries for that, you'd still have to interpret the generated document tree.
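
For illustration only, a "minimal parser" of this kind might look like the sketch below. It is not the code from #4857 (which is JavaScript); it is a Python approximation with a hypothetical function name and a deliberately small rule set, and it will miss plenty of reST constructs.

```python
import re

# Each rule is (pattern, replacement); they are applied in order.
_RULES = [
    # Lines made up only of punctuation are section under/overlines.
    (re.compile(r"^[=\-~^\"'`#*+_.:]{3,}\s*$", re.MULTILINE), ""),
    # ".. directive:: argument" -> keep only the argument.
    (re.compile(r"^\s*\.\.\s+[\w-]+::\s*", re.MULTILINE), ""),
    # ":role:`text`" or ":role:`text <target>`" -> "text".
    (re.compile(r":[\w:-]+:`([^`<]+?)(?:\s*<[^`>]*>)?`"), r"\1"),
    # **bold**, *italic* and ``literal`` -> inner text.
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),
    (re.compile(r"\*(.+?)\*"), r"\1"),
    (re.compile(r"``(.+?)``"), r"\1"),
]


def strip_rest_markup(source):
    """Remove common reST markup and collapse whitespace."""
    text = source
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return " ".join(text.split())
```

Content pulled in by directives at build time never appears in the .rst file, so no amount of regexp work on the source can recover it.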

The HTML search clearly has an advantage because it operates on the target document. Can you quantify how much "significantly more data" is?

Depending on how large the difference in data is and how important we consider it to be, making this configurable would be a good way forward. Regexp parsing is a drop-in improvement on the current plain rst search with no disadvantage. If you need even better results and are willing to accept the data overhead, use the HTML search. Which one should be the default may depend on the numbers for the data overhead.

I compared the data load for a set of Sphinx documentation pages:

| Page | Content length HTML (in bytes) | Content length regexp (in bytes) | Ratio (HTML/regexp) |
| ------------------ | ------------------------------ | -------------------------------- | -------------------- |
| builders | 62208 | 16729 | 3.72 |
| config | 203764 | 85442 | 2.38 |
| ext/extlinks | 11756 | 2342 | 5.02 |
| theming | 33809 | 17427 | 1.94 |
| quickstart | 37713 | 9577 | 3.94 |
| Sum | 349250 | 131517 | 2.66 |

The differences in ratio can be explained by the following factors:

  • As the HTML pages contain fixed overhead (the general layout) for all pages, the ratio is worse for small pages.
  • Some pages contain directives that instruct Sphinx to include content from other sources (in particular docstrings, I suppose). This content is not included in the .rst files, which is a major flaw of the regexp approach. Note that I did not include pages that contain include directives, because this would have rendered the comparison useless.
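
The ratios in the table can be recomputed from the byte counts. The snippet below is just that arithmetic; the variable names are arbitrary and the numbers are copied from the table above.

```python
# (HTML bytes, regexp bytes) per page, as listed in the table.
sizes = {
    "builders": (62208, 16729),
    "config": (203764, 85442),
    "ext/extlinks": (11756, 2342),
    "theming": (33809, 17427),
    "quickstart": (37713, 9577),
}


def ratio(html_bytes, regexp_bytes):
    """HTML payload relative to the regexp payload, rounded to 2 places."""
    return round(html_bytes / regexp_bytes, 2)


ratios = {page: ratio(*pair) for page, pair in sizes.items()}
total_html = sum(html for html, _ in sizes.values())
total_regexp = sum(regexp for _, regexp in sizes.values())
overall = ratio(total_html, total_regexp)

for page, value in ratios.items():
    print(f"{page}: {value}")
print(f"Sum: {total_html} / {total_regexp} = {overall}")
```

The smallest page in the sample (ext/extlinks) has the highest ratio, which matches the fixed-overhead explanation in the first bullet.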

Considering this information, I suggest we use the HTML approach without any configuration to keep things simple. It's just text content; IMHO there is no need to optimize for data load.

Thanks for digging into the numbers.

Considering all aspects, I'm fine with an HTML-only approach.

@tk0miya Do you agree? Then we could move forward with my PR.

+1 to the HTML-only approach. I think it is sufficiently useful as a workaround until the Sphinx core can build dedicated output for search display.

I resolved the conflicts in the original PR (which is close to a year old). Could someone do the review?

+1 for HTML approach. It can support other source formats (markdown and others).

+1 for the HTML approach, since a source-based solution is almost useless for translated documentation: the search context is currently shown as the source text in the canonical language rather than in the translation, which was very disappointing. The patch available at https://github.com/sphinx-doc/sphinx/pull/4022/files worked very well for us.

@TimKam Congrats! Thank you for your work!

I am using version 1.8.3 and I am still facing this issue. I tried the code on the master branch and it is fixed. When is version 2 planned for public release?

2.0 is planned for mid-to-late March (without a definite date), see: https://github.com/sphinx-doc/sphinx/issues/5950. If you want the fix earlier, I recommend not switching to master, but instead adjusting your template to include the updated version of searchtools.js, as changed here: https://github.com/sphinx-doc/sphinx/pull/4022/files#diff-71eb2d907f122b85744ef4c3390903cbR59
