Elasticsearch: Extend built-in analyzers

Created on 7 Jun 2018 · 6 Comments · Source: elastic/elasticsearch

After discussion in https://discuss.elastic.co/t/extend-built-in-analyzers/134778, it appears that it's not possible to extend a built-in analyzer.

It would be cool to be able to do something like this to strip html tags and use the english analyzer without having to reimplement the english analyzer from scratch.

PUT /myindex
{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_html_strip": {
                    "type": "english",
                    "char_filter": [
                        "html_strip"
                    ]
                }
            }
        }
    }
}

:Search/Analysis >enhancement team-discuss

yansal

Most helpful comment

We've talked about this on and off for a few years now. The way these are built in Lucene makes doing what you want very difficult, even though it seems like it should be simple. This is why we made sure to document how to rebuild the analyzers, and why I recently took the time to add a special construct to make sure those docs are correct. Giving you instructions on how to rebuild the language analyzers as a custom analyzer is about the best we're going to be able to do for this.

nik9000 on 7 Jun 2018

👍3

All 6 comments

Pinging @elastic/es-search-aggs

elasticmachine on 7 Jun 2018

@yansal thanks for opening this issue. I don't think this is something that we are likely to pursue in the near future, given that building your own analyzer on top of an existing language analyzer is pretty straightforward and well documented (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html).
On the other hand, coming up with a robust and maintainable API for this seems to be quite complex at first glance. I will label this issue for discussion, but personally I don't think we should do this.

cbuescher on 7 Jun 2018
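
Editor's note: the workaround suggested in the comments above (rebuilding the english analyzer as a custom analyzer and adding html_strip) might look roughly like the following sketch. The index name myindex and the analyzer name english_html_strip come from the original request; the filter names (english_stop, english_stemmer, english_possessive_stemmer) are illustrative and follow the naming used in the language analyzer reference page linked above. The optional keyword_marker filter for stem exclusions is left out here on the assumption that no words need to be protected from stemming. Check the filter chain against the reference docs for your Elasticsearch version before relying on it.

```json
# Sketch only: a custom analyzer approximating the built-in english analyzer,
# with the html_strip character filter added in front of the tokenizer.
# Filter names are illustrative, not required by Elasticsearch.
PUT /myindex
{
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                }
            },
            "analyzer": {
                "english_html_strip": {
                    "type": "custom",
                    "char_filter": [
                        "html_strip"
                    ],
                    "tokenizer": "standard",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            }
        }
    }
}
```

Once the index exists, sending a short HTML snippet to POST /myindex/_analyze with "analyzer": "english_html_strip" is one way to confirm that tags are stripped before tokenization and that the remaining tokens are lowercased and stemmed.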

Thank you for your detailed explanations, closing the issue.

yansal on 8 Jun 2018

Once https://issues.apache.org/jira/browse/LUCENE-8352 gets in, this might be worth revisiting, as AnalyzerWrapper should then be able to arbitrarily add CharFilters or TokenFilters to existing analyzers without any risk, meaning that the complications at the Lucene level are resolved.

romseygeek on 18 Sep 2018

👍1

LUCENE-8352 is now merged. Can this be revisited?
It is a maintenance nightmare to copy-paste the analyzers from the documentation just to add icu_folding to all of them for accent-insensitive search.