Elasticsearch: Custom analyzer filter scope

Created on 14 Sep 2016  路  30Comments  路  Source: elastic/elasticsearch

Elasticsearch version: 2.4.0

Plugins installed: [kopf]

JVM version: 1.8.0_60

OS version: OSX 10.11.6 locally, reproducible on CentOS 7.2.1511 as well

Description of the problem including expected versus actual behavior:

After upgrading to 2.4.0 our templates are no longer functioning from 2.3.3.

We have our template split up into several templates so we can specify default analyzers per language, and we share analyzers/filters/tokenizers between the templates.

Steps to reproduce:

  1. Push a shared template with common analyzers/filters
POST /_template/analyzers
{
  "template": "*",
  "order": 20,
  "settings": {
    "analysis": {
      "analyzer": {
        "untouched_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "max_length", "lowercase" ]
        },
        "no_stopwords_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "max_length", "standard", "lowercase" ]
        }
      },
      "tokenizer": {
        "standard": {
          "type": "standard",
          "version": "4.4"
        }
      },
      "filter": {
        "max_length": {
          "type": "length",
          "max": "32766"
        }
      }
    }
  }
}
  1. Push a language specific analyzer template, which is a czech analyzer referencing a filter from the analyzers template.
POST /_template/analyzer-cs
{
  "template": "*-cs",
  "order": 30,
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "czech",
          "filter": [ "max_length" ]
        }
      }
    }
  }
}
  1. Push another language specific analyzer template which uses a custom analyzer, referencing some filters from the shared analyzers template.
POST /_template/analyzer-en
{
  "template": "*-en",
  "order": 30,
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "max_length", "standard", "lowercase", "stop_en" ]
        }
      },
      "filter": {
        "stop_en": {
          "type": "stop",
          "stopwords": "_english_",
          "ignore_case": "true"
        }
      }
    }
  }
}

This action fails, resulting in:

2016-09-13 09:52:00,174][DEBUG][action.admin.indices.template.put] [Zartra] failed to put template [analyzers-en]
[kOccXLKbR5CwAsXCt0v_3w] IndexCreationException[failed to create index]; nested: IllegalArgumentException[Custom Analyzer [default] failed to find filter under name [max_length]];

However, as you see the first analyzer-cs was able to reference the max_length filter without issue.

These templates worked fine in 2.3.3 and 2.3.4, haven't tried 2.3.5 but it sounds like there weren't any changes to that release. Reading over release notes for 2.4.0 don't seem to indicate anything that would prevent this functionality from being supported.

We were seeing similar issues with referencing the shared analyzers in our mapping template file.

Provide logs (if relevant):

    2016-09-13 09:52:00,174][DEBUG][action.admin.indices.template.put] [Zartra] failed to put template [analyzers-en]
    [kOccXLKbR5CwAsXCt0v_3w] IndexCreationException[failed to create index]; nested: IllegalArgumentException[Custom Analyzer [default] failed to find filter under name [max_length]];
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:360)
        at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.validateAndAddTemplate(MetaDataIndexTemplateService.java:196)
        at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.access$200(MetaDataIndexTemplateService.java:57)
        at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService$2.execute(MetaDataIndexTemplateService.java:157)
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
        at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.IllegalArgumentException: Custom Analyzer [default] failed to find filter under name [max_length]
        at org.elasticsearch.index.analysis.CustomAnalyzerProvider.build(CustomAnalyzerProvider.java:76)
        at org.elasticsearch.index.analysis.AnalysisService.<init>(AnalysisService.java:216)
        at org.elasticsearch.index.analysis.AnalysisService.<init>(AnalysisService.java:70)
        at sun.reflect.GeneratedConstructorAccessor6.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:50)
        at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
        at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:104)
        at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
        at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:886)
        at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
        at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
        at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:46)
        at org.elasticsearch.common.inject.SingleParameterInjector.inject(SingleParameterInjector.java:42)
        at org.elasticsearch.common.inject.SingleParameterInjector.getAll(SingleParameterInjector.java:66)
        at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:85)
        at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:104)
        at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
        at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:886)
        at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
        at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
        at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:46)
        at org.elasticsearch.common.inject.SingleParameterInjector.inject(SingleParameterInjector.java:42)
        at org.elasticsearch.common.inject.SingleParameterInjector.getAll(SingleParameterInjector.java:66)
        at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:85)
        at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:104)
        at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
        at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:886)
        at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
        at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
        at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:46)
        at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:201)
        at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
        at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:879)
        at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
        at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
        at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
        at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:154)
        at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:55)
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:358)
        ... 11 more
:CorFeatureIndices APIs >bug

Most helpful comment

As I explained in #20919, this is a design error: templates should be validated only as much as any _incomplete_ object can be validated. Elasticsearch shoud reject a template only if it contains plain inconsistencies or grammar violations.

On top of that, this is a breaking change which you failed to document!

I have a patch that fixes the too hasty conclusion that a referred analyzer should have been declared in the same template. I consider it elegant enough for production, but I didn't bother with updating the tests.
If anyone is interested, I'm willing to publish it.

All 30 comments

@johtani Could you take a look at this please?

Hi @mbarker

Before 2.4, we didn't validate any template on index template creation.
We added validation on index template creation in this PR.
In this PR, we validate each template as a complete mapping and settings,
we don't check combination because we cannot check all combination when template create.
Now, you should add the max_length filter setting each template for avoiding error.

@clintongormley Should we support to validate any template with only * template?
I think we can not predict templates combination except * template.

@johtani It seems to me like validating a template should be scoped to the context it should have available during indexing, unless I'm misunderstanding how ES supports multiple templates.

We certainly could replicate all of the analyzers/filters to all the templates where they are used, I was trying to avoid duplicating parts of the template.

If it would help I can give more details about how we are trying to compose the templates and the rationale behind it.

Thanks. I will think what the validation is more useful for users.

Validating template on index template creation is helpful for many users because users know the error earlier than validating creating index, especially logging use-case.
However in your situation, this validation is not useful because template is bigger than before 2.4.

e.g. If there is skip validation flag and simulate template with index name without creating index, is it useful?

There is something else going on here. I (like @mbarker ) didn't understand why this template succeeded:

POST /_template/analyzer-cs
{
  "template": "*-cs",
  "order": 30,
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "czech",
          "filter": [ "max_length" ]
        }
      }
    }
  }
}

while this template didn't:

POST /_template/analyzer-en
{
  "template": "*-en",
  "order": 30,
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "max_length"
          ]
        }
      }
    }
  }
}

The reason is that the first analyzer definition is invalid, but is not validated:

      "analyzer": {
        "default": {
          "type": "czech",
          "filter": [ "max_length" ]
        }
      }

You're using the czech analyzer and passing in a parameter filter, which the czech analyzer doesn't recognise but silently ignores. We should add validation to analyzers to complain about any left over parameters.

If there is skip validation flag and simulate template with index name without creating index, is it useful?

I like the idea of a skip evaluation flag. Not sure the simulate bit is needed - easy to test by just creating an index.

Or elasticsearch can create template always and if a template has some errors elasticsearch only return warning message. what do you think? @clintongormley

I'm wondering how hard it would be to figure the parent templates of a given template (because the wildcard is more general) and apply them as well at validation time.

You could always make up name for the index to validate based on the template matching, ie replace * with test and resolve dependencies that way.

The only caveat would be users have to upload templates in dependency order.

@jpountz @mbarker Thanks for your comment. Ah, you are right. I try to fix it.

If templates are validated using their dependencies when creating or updating, then what will happen when a template's dependency is deleted or updated? For example template A depends on B:

  1. template B created
  2. template A created
  3. template B deleted
  4. ?

Heres another thought: what if a template is invalid by itself, but the index is created with explicit settings that the template depends on?

Another possible issue: What if a template overrides something in a another template that results in a final mapping that is invalid?

@qwerty4030 Thanks for your comment!

In your 1st case, I think it is OK that elasticsearch does not validate at template deletion time.
We get the error only when user creates index using template B.

In your 3rd case, we can validate overriding only the parent templates.

I' not sure your 2nd case. Could you explain any examples?

Just tried this in 2.3.2:

  1. create template that is invalid by itself (references custom_analyzer that is not defined).
  2. attempt to create index that matches this template (fails with mapper_parsing_exception as expected).
  3. attempt to create index that matches this template and specifies the custom_analyzer in the settings (this works as expected).

To reproduce:

PUT /_template/test-template
{
  "template": "data-test-*",
  "order": 0,
  "mappings": {
    "test-type": {
      "properties": {
        "field": {
          "type": "string",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}

PUT data-test-1

PUT data-test-1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

Thanks. I think Elasticsearch should return the error in your cases, because that is tricky settings.

As I explained in #20919, this is a design error: templates should be validated only as much as any _incomplete_ object can be validated. Elasticsearch shoud reject a template only if it contains plain inconsistencies or grammar violations.

On top of that, this is a breaking change which you failed to document!

I have a patch that fixes the too hasty conclusion that a referred analyzer should have been declared in the same template. I consider it elegant enough for production, but I didn't bother with updating the tests.
If anyone is interested, I'm willing to publish it.

The code is in this branch. GitHub doesn't let me to submit a pull request against a tag, neither from a particular commit. :-1:

I'm wondering how hard it would be to figure the parent templates of a given template (because the wildcard is more general) and apply them as well at validation time.

It has occurred to me that we can never do this absolutely correctly. eg a template for logs-* may coincide with a template for *-2016-* or it may not. We can't tell without an index name.

I'm wondering if, instead of trying to be clever, we should just allow a flag to disable template validation and template PUT time.

The code is in this branch. GitHub doesn't let me to submit a pull request against a tag, neither from a particular commit. 馃憥

@acarstoiu there's a good reason for that... we're not planning on making a release from a patch applied to a tag.

@clintongormley let's try to focus on the matter, as I'm not interested in your SCM rules and I'm sure you can cherry-pick the fixing commit anytime you want. I merely justified the lack of a proper pull request.

I do not expect the commit to be accepted by itself, simply because it lacks the tests counterpart. I also did my best to preserve the coding style and the existing behaviour, but might not fit your taste. Other than that, it is a _good starting point_, we use it in production.

On the other hand, what you should do _urgently_ is to document this truely breaking change. Thank you.

@acarstoiu you don't seem to get how the open source world works. Frankly, your attitude makes me disinclined to converse with you further.

@clintongormley you do that, I agree.
As for the open source world, I already did two things: served this project a patch and urged repeatedly its team to publish this breaking change here for the benefit of _other users_.

And by the way, you can find in there lots of deprecations under the "breaking change" label, but when a true one is discovered, you keep it hidden.

@clintongormley I for one agree with @acarstoiu and frankly, your dismissal of this _extremely inconvenient non-documented breaking change that has completely affected our cluster and must now be downgraded across the board_ is very concerning in the least. You are literally costing us time and money.

you don't seem to get how the open source world works.

You don't seem to understand how businesses work and seem to have very little regard as to the users of your product when you inexplicably hide information and then dismiss users that hit the issue because you don't like their attitude when they're justifiably irritated?

Isn't one of the tenets of open source software transparency? If so, you've missed the mark on this one.

@clintongormley as mentioned in https://github.com/elastic/elasticsearch/issues/21105#issuecomment-263380122 I think this should be noted in the 2.4 migration docs as a breaking change. Took us a while to track down this issue while attempting to migrate our 2.3.x cluster.

@marshall007 Yes I agree - I've just been rather put off working on this issue by the comments of others. If you'd like to submit a PR adding the breaking changes docs, I'd be happy to merge it.

@clintongormley sure thing! I have it up in PR #21869.

@clintongormley sure thing! I have it up in PR #21869.

thanks @marshall007 - merged

@clintongormley @marshall007 is that PR supposed to update this page?
I'm getting a 404 for that URL. However 2.3 breaking changes URL works fine: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/breaking-changes-2.3.html

Thanks!

Thanks @qwerty4030 - that page wasn't being included in the docs. Should be fixed now: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/breaking-changes-2.4.html

@clintongormley that page still appears to be out of date, FYI. Looks like it hasn't been updated since https://github.com/elastic/elasticsearch/commit/537f4e1932b4d516cec8dcab1c942d3f31177dac.

Closing in favour of #21105

Was this page helpful?
0 / 5 - 0 ratings

Related issues

kul picture kul  路  72Comments

JagathJayasinghe picture JagathJayasinghe  路  105Comments

monken picture monken  路  160Comments

rbraley picture rbraley  路  67Comments

clintongormley picture clintongormley  路  55Comments