Wagtail: RichTextField / RichTextBlock should strip out HTML tags from search-indexable content

Created on 1 Jun 2020  路  3Comments  路  Source: wagtail/wagtail

Issue Summary

Currently, when indexing RichTextFields, or RichTextBlocks within StreamFields, the raw content including HTML tags is passed to the index. Since HTML tag names / attributes are not meaningful searchable content*, it would be better to provide a get_searchable_content method on these that applies Django's striptags filter.

More importantly, if you customise the Elasticsearch backend (or run raw Elasticsearch queries) to take advantage of Elasticsearch's highlighting support (see #5340), you currently end up with either stray HTML tags or spurious formatting in the results, depending on whether you passed 'encoder': 'html' in the query.

(* This isn't totally true - for example, you could make a good case for indexing the alt text on images - so an ideal solution would probably involve being able to specify custom rules at the richtext feature level to generate a plain text representation. However, on balance, I think stripping tags out is better than keeping them in.)

Steps to Reproduce

Search for 'h2' on https://www.rca.ac.uk/ . Observe that the results include many pages that contain an <h2> element but not the text "h2"...

  • I have confirmed that this issue can be reproduced as described on a fresh Wagtail project: no
Search Bug

All 3 comments

Is it OK if I work on a PR that resolves this issue? I've already implemented the same possible solution above in my project a long time ago.

@acarasimon96 Absolutely, yes please!

Resolved via #6099

Was this page helpful?
0 / 5 - 0 ratings