After discussion in https://discuss.elastic.co/t/extend-built-in-analyzers/134778, it appears that it's not possible to extend a built-in analyzer.
It would be cool to be able to do something like this to strip HTML tags and use the english analyzer, without having to reimplement the english analyzer from scratch:
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_html_strip": {
          "type": "english",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}
Pinging @elastic/es-search-aggs
@yansal thanks for opening this issue. I don't think this is something that we are likely to pursue in the near future, given that building your own analyzer on top of an existing language analyzer is pretty straightforward and well documented (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html).
On the other hand, coming up with a robust and maintainable API for this seems to be quite complex at first glance. I will label this issue for discussion, but personally I don't think we should do this.
We've talked about this on and off for a few years now. The way these are built in Lucene makes doing what you want very difficult even though it seems like it should be simple. This is why we made sure to document how to rebuild the analyzers and why I recently took the time to add a special construct to make sure those docs are correct. Giving you instructions on how to rebuild the language analyzers as a custom analyzer is about the best we're going to be able to do for this.
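For reference, here is a rough sketch of what that rebuild could look like for the request in this issue. It follows the rebuilt `english` analyzer from the language analyzers documentation linked above and adds the `html_strip` character filter in front of it. The index name `myindex` and analyzer name `english_html_strip` are taken from the original request; the filter names `english_stop`, `english_stemmer` and `english_possessive_stemmer` are labels chosen for this example. The optional `keyword_marker` filter from the documented rebuild is left out on the assumption that no words need to be excluded from stemming. Treat it as a starting point and check it against the documentation for your version.

```
PUT /myindex
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_html_strip": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
```

The result can be spot-checked with the `_analyze` API on `myindex`, passing `"analyzer": "english_html_strip"` and a short piece of HTML as `text`.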
Thank you for your detailed explanations, closing the issue.
Once https://issues.apache.org/jira/browse/LUCENE-8352 gets in, this might be worth revisiting: AnalyzerWrapper should then be able to add arbitrary CharFilters or TokenFilters to existing analyzers without any risk, meaning that the complications at the Lucene level are resolved.
LUCENE-8352 is now merged. Can this be revisited?
It is a maintenance nightmare to copy-paste the analyzers from the documentation just to add icu_folding to all of them for accent-insensitive search.