Currently, much of what _could_ be validated in queries and aggregations goes unchecked because of the fundamental assumption that we might be querying across heterogeneous indexes where certain fields may be "unmapped".
A mis-typed field name or an invalid nested reference goes unchecked and returns no results rather than throwing any kind of error or warning. Users who enforce strict mappings across indexes are penalized by Elasticsearch's need to support users with sloppier configurations.
Bad queries on strict schemas fail silently.
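For illustration (the index and field names here are made up), suppose the indices being searched map a `message` field; a query against a mis-typed `mesage` field today simply returns zero hits:

```json
GET /logs-*/_search
{
  "query": {
    "term": { "mesage": "connection_reset" }
  }
}
```

The response is a normal `200` with an empty hit list, so the typo is easy to miss.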
This issue proposes that we provide a new `strict=true` query option that checks all field names in query and agg clauses to ensure, at query time, that they actually refer to mapped fields. Any nested clauses will also check that their subclauses reference nested fields correctly.
If strict query mode is set to `true` then errors will be thrown for unmapped fields. If strict is `false` (the default) then warnings could be passed back with search results.
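As a sketch of how the proposed option might look (the exact name and placement of the parameter are illustrative only; none of this exists yet):

```json
GET /logs-*/_search?strict=true
{
  "query": {
    "term": { "mesage": "connection_reset" }
  }
}
```

With `strict=true` this request would fail with an error identifying `mesage` as unmapped; with the default it would return empty results, ideally with a warning attached.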
I agree it is a bit trappy to silently ignore unmapped fields. In my opinion, the following points are important:
- The default should be `true`. Otherwise nobody will think about turning it on since everything will "work" without it. I agree this might be controversial since it might mean that e.g. Kibana might have to turn it off automatically, but if the default is `false` then nobody will use this option and it won't be worth maintaining?
- It should be a per-request option, not a per-query option.
Bad English on my part :) That was what I intended.
Picking the default will certainly be controversial.
If we do this maybe we should undeprecate the `indices` query. It would provide a convenient marker for scoping nested clauses that should only be validated on certain indexes.
The suggested alternative when we deprecated the `indices` query was to add regular match queries on the `_index` field, which would make it harder for any new field-validation logic in query parsers to reason about which other clauses to validate.
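To illustrate the concern (index and field names are made up), the recommended replacement scopes clauses with a filter on `_index`:

```json
GET /index-a,index-b/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": [
              { "term": { "_index": "index-a" } },
              { "term": { "field_only_in_index_a": "some_value" } }
            ]
          }
        },
        { "term": { "field_in_both": "some_value" } }
      ]
    }
  }
}
```

From a parser's point of view the `_index` filter is just another leaf clause, so there is no structural marker telling validation logic that `field_only_in_index_a` should only be checked against `index-a`'s mappings.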
The indices query is removed in 6.0; I think we should rather find a different/better solution.
> I think we should rather find a different/better solution.
It feels like it needs to be something like `nested`, where a JSON container defines a particular scope/context for all of the clauses inside it.
@imotov had an additional suggestion, which is to fail requests _only_ if a user-supplied field is unmapped on all of the indices accessed. This would require each shard to return a list of all unmapped fields referenced in the request. The coordinating node would then fail the request if _all_ shards listed the same field name as unmapped.
This would be largely backwards compatible, supporting those users querying across heterogeneous indices while also supplying the missing validation for those users with homogeneous indices and bad queries. It would require no extra configuration but would add complexity to the internal search logic.
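To make the policy concrete (index and field names are made up): suppose `index-a` maps a `status` field and `index-b` does not. A request such as:

```json
GET /index-a,index-b/_search
{
  "query": {
    "term": { "status": "active" }
  }
}
```

would still be accepted because at least one shard maps `status`, whereas a query on a field mapped in neither index would have every shard report it as unmapped and so would be rejected.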
I suspect it will be hard to do any better than saying:
`invalid reference to field [X] somewhere in this request`
There would be a lot of context required to pinpoint that field X in the subclause of nested aggregation Y was actually the source of the error.
I started implementing Igor's suggestion but then came across an existing "ignore unmapped" setting, which is an (undocumented?) index-level setting that controls whether unmapped fields are allowed in queries.
This is checked in `QueryShardContext`.
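I believe this is the `index.query.parse.allow_unmapped_fields` setting (name from memory, so treat it as an assumption on my part); if so, it can be supplied when an index is created, e.g.:

```json
PUT /my-index
{
  "settings": {
    "index.query.parse.allow_unmapped_fields": false
  }
}
```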
This is arguably not as useful as a policy of failing queries that contain field names that are unmapped across _all_ indices accessed. The flag to disable this "unanimous" policy should be a request-level setting.
One remaining question is how the `strict=true` setting should be interpreted in the context of the validate query API. The API looks to allow multiple indices in the URL, but the reference docs claim it arbitrarily picks only one of the shards to test the query against.
The validate query API picks one shard per index, so if you have multiple indices in your query it will send one request per index and the answer contains an entry for each index. There is also a new parameter since 5.4 called `all_shards` that sends the query to all shards rather than picking a random one, though this is not relevant for your use case since all shards in an index have the same mapping, so the default should be OK to also validate when `strict=true`?
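For example (index names made up), validating against multiple indices gives one answer per index:

```json
GET /index-a,index-b/_validate/query?explain=true
{
  "query": {
    "term": { "status": "active" }
  }
}
```

and `all_shards=true` sends the query to every shard of each index instead of a randomly chosen one.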
Thanks for the clarification @jimczi , that makes things simpler.
cc @elastic/es-search-aggs
Closing this due to the complexity of implementing a reliable solution. We had a working PR but chose not to take it forward.
The requirement is that a field name used in the query could be missing from 99% of the indices being queried, but if at least one index understands the field name then that is considered a valid query. This might be a common use case, as users query across all indices in a cluster.
In practice this is difficult to implement. All node search responses would need to include a list of field names in the query that they didn't understand. Only if ALL indices failed to understand a field name would it be considered an error. However, some query optimisations like `can_match` mean that certain shards may not exercise the full query, in which case we can't fail the query. If an index doesn't run a query (e.g. a time-based index skipped by the query's date range) it cannot report back a list of unknown field names used in the query, and we therefore cannot determine whether a field name is universally missing. We fail to fail in these scenarios.
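As a made-up example of that failure mode, consider time-based indices where the date range excludes all but the newest index:

```json
GET /logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1d" } } },
        { "term": { "mesage": "connection_reset" } }
      ]
    }
  }
}
```

If only today's index actually executes the query, the skipped indices never get the chance to report `mesage` as unknown, so the coordinating node cannot decide whether the field is universally unmapped.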