Elasticsearch: Explore disabling the _all field by default

Created on 3 Aug 2016 · 30 Comments · Source: elastic/elasticsearch

We should explore the idea of disabling the _all field by default. For
background, the _all field contains the contents of each of a document's
fields. During indexing, these contents are copied to the _all field, analyzed
with a specific analyzer, and then indexed. At query time, some queries such as
the query_string and simple_query_string queries search the _all field if
no fields are specified.
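
As a rough mental model of the mechanism described above, here is a toy sketch in Python. It is not how Lucene actually builds the field: `build_all_field` and `toy_analyze` are hypothetical helpers, and the analyzer is a simple lowercase-and-split stand-in for whatever analyzer _all is configured with.

```python
import re


def build_all_field(doc):
    """Concatenate every leaf value of a document into one string,
    roughly what the _all field receives before analysis."""
    values = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif node is not None:
            values.append(str(node))

    walk(doc)
    return " ".join(values)


def toy_analyze(text):
    """Stand-in analyzer (an assumption): lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


doc = {"title": "Quick brown fox", "author": {"name": "Jane Doe"}, "year": 2016}
all_tokens = toy_analyze(build_all_field(doc))
```

Note that the numeric `year` value ends up in the token list next to the text, which is the duplication (and the mixing of field types) that the points below complain about.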

There are a number of issues with the _all field:

  • It uses a fair amount of disk space, due to duplicating the values from all
    other fields to an additional field
  • _all has its own analyzer, which is confusing when expecting to query
    text analyzed a certain way (or with synonyms, for example) and discovering
    that it does not match due to an analysis difference
  • Additional indexing overhead caused by data duplication
  • Since the _all field is not retrievable or part of _source, its contents
    cannot easily be inspected for debugging purposes. Some users do not know it
    even exists, which causes confusion at query time
  • Better alternatives exist, such as the copy_to value on mappings which can
    be used to create custom _all fields
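
To make the copy_to alternative concrete, here is a minimal sketch that builds a mapping body as a Python dict. It assumes 5.x-style mappings, where `"_all": {"enabled": false}` sits at the mapping-type level; the field names (`title`, `body`, `full_text`) and the helper name are hypothetical.

```python
import json


def mapping_with_custom_catch_all(text_fields, catch_all="full_text"):
    """Build the body for one mapping type: _all is disabled and each listed
    text field is copied into a user-defined catch-all field instead."""
    # The catch-all is an ordinary text field, so it can get its own analyzer.
    properties = {catch_all: {"type": "text"}}
    for name in text_fields:
        properties[name] = {"type": "text", "copy_to": catch_all}
    return {"_all": {"enabled": False}, "properties": properties}


mapping = mapping_with_custom_catch_all(["title", "body"])
mapping_json = json.dumps(mapping, indent=2)
```

Unlike _all, the custom field is visible in the mapping and only receives the fields that were explicitly opted in.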

If we were to change the _all field to be disabled by default, some queries
would have to be handled differently. The query_string and
simple_query_string queries would have to know a default field or fields to
query. Perhaps we would be able to change these queries to send a
"fields": ["*"] parameter if no fields are specified in the JSON query, so that
all field values can still be queried.
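
A sketch of what that fallback could look like on the client side, assuming the query_string `fields` parameter with a wildcard pattern. The helper name is hypothetical, and this only illustrates the brainstormed idea, not an agreed design.

```python
def query_string_body(query, fields=None):
    """Build a query_string search body; when the caller names no fields,
    fall back to the wildcard so every field is still searched."""
    return {
        "query": {
            "query_string": {
                "query": query,
                "fields": list(fields) if fields else ["*"],
            }
        }
    }


implicit = query_string_body("error AND timeout")
explicit = query_string_body("error AND timeout", fields=["message", "host"])
```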

Labels: :Search/Mapping, discuss, v6.0.0-alpha1

All 30 comments

I am +1 on this. I'd like to add some other benefits:

  • By disabling by default, it makes it a conscious decision to enable this and hence engineers will only enable it if they need it. We advise on this as a best practice in various discussions anyway.
  • Tools like Logstash specifically provide a default mapping in which they could keep compatibility if need be

+1 to this change.
I'd totally remove this feature in favor of copy_to.

Agreed with David: if we decide to disable it by default then I'd vote to remove it entirely. Since enabling the feature will require taking action anyway, I'd rather users set up a template that automatically adds a copy_to option to string fields than enable the _all field.
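
One way such a template could look, as a sketch: a dynamic template that maps every newly detected string field as text with a copy_to. The template name `strings_to_catch_all` and the target field `catch_all` are hypothetical, and the dict below is only the mapping portion that an index template would wrap.

```python
def copy_to_dynamic_template(catch_all="catch_all"):
    """Mapping fragment: any dynamically detected string field is mapped as
    text and copied into an explicitly defined catch-all field."""
    return {
        "dynamic_templates": [
            {
                "strings_to_catch_all": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "text", "copy_to": catch_all},
                }
            }
        ],
        # The target is declared explicitly, so the dynamic rule never applies to it.
        "properties": {catch_all: {"type": "text"}},
    }


template_mapping = copy_to_dynamic_template()
```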

Perhaps we would be able to change these queries to send a "fields": ["*"] parameter if no fields are specified in the JSON query, so that all field values can still be queried.

This part worries me a bit: this would be very slow if there are many fields in mappings, which would be trappy. I would rather make field/fields a required parameter of the match/query_string/... queries.

I am +1 here too. I think having catch-all fields is good practice in many situations, but we should try to cover it with documentation rather than punishing all the users who don't need it at all.

We might come up with better ideas, like defining default search fields for an index that are used if no fields are specified, so that users can build their own _all field instead.

This part worries me a bit: this would be very slow if there are many fields in mappings, which would be trappy?

That's a really good point! The fields: ["*"] idea was a random brainstorm, I think we can do it in a more efficient manner as you mentioned.

I'm concerned by the quality of results if we steer people away from _all towards multi-field indexing and multi_match type queries _across_ fields.
The default ranking heuristics that work well on a single _all type field don't apply very well in a multi-field scenario (e.g. favouring any field _other_ than colour for the search term red [1]). There are some options in multi_match to address this, but it's confusing and doesn't work if you use query_string etc.
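
A small worked example of why the rarer field wins. Term statistics are kept per field, so the same term gets a different inverse document frequency in each field. The sketch below uses the BM25 form of IDF; the document counts are made-up numbers purely for illustration.

```python
import math


def idf(doc_count, doc_freq):
    """Inverse document frequency in its BM25 form."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


DOCS = 10_000
# Hypothetical document frequencies for the term "red" in two fields:
# common in colour, rare in description.
df_red = {"colour": 2_000, "description": 50}

per_field_idf = {field: idf(DOCS, df) for field, df in df_red.items()}
best_field = max(per_field_idf, key=per_field_idf.get)
```

With these numbers a match on `description` is weighted far above a match on `colour`, even though `colour` is where a user searching for red most likely wants the hit.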

For my money wholly unstructured queries are better off serviced by a wholly unstructured _all type field. Bad matching bad.

Of course the ideal scenario is matching well-written queries against content indexed in a way that retains the original structure.
By well-written queries I mean the intent is clear and things like field names are provided for each term. It does feel like there is scope for more tooling to rewrite user queries (field spotting, AND vs OR vs Phrase auto-detect, percolator spotting queries needing special attention) but that's probably for another issue.

Aside from the relevance ranking issues of multi-field searches I think it is useful to retain some notion of a "default search field" that generic tools like Kibana or MoreLikeThis/query_string queries will target in an index.

[1] https://discuss.elastic.co/t/nested-multi-match/56370/5

I think we should push people towards explicit copy_to or some kind of improved multi-field support that doesn't suffer from the issues @markharwood mentions. That way they understand the limitations of what they are doing because they actively made a choice. copy_to doesn't support weighting the copied fields differently without copying to the same field over and over again which is kind of nasty, but at least it is less magic. And less magic is good, especially for magic like the _all field which can have a fairly significant disk space cost.

I think it makes more sense to target 6.0 rather than 5.0 for this because it'll take a while for the ecosystem to absorb the change. Maybe it is actually better to start at the edges, converting Kibana and Logstash to copy_to or multi-field searches.

I think it makes more sense to target 6.0 rather than 5.0 for this because it'll take a while for the ecosystem to absorb the change.

Why does 6.0 allow the ecosystem to "absorb" the change better or faster? The sooner we get it out there, the sooner new users are not bogged down by this and must make an explicit choice (setting up their own all field).

I've opened issues on the Beats, Logstash, and Kibana repos to kick off discussions about moving away from the _all field for index templates and querying.

I'd totally remove this feature in favor of copy_to.

I'm strongly against this. I wouldn't even want to change the default, but even if we do that let's keep _all within arm's reach until we know the change is accepted by users.

I don't think any harm is done by having all enabled by default. It's great in the beginning, and once you move past small data sizes and/or into production, you should take a look at your mappings anyway and disabling all if not needed is just one of the things you have to do, and it's something many, many tutorials, blog posts, docs, etc. out there mention.

To me the _all field is part of what makes the Elasticsearch experience so "magical" for first-time users. Full-text search OOTB.

Just a few days ago, I've had a newbie user refer to anything other than full-text query-string search as the "complicated queries" [sic]. They were using _search?q=test and never thinking twice about mappings, _all fields or any of this and I think it's great they don't have to initially.

I'd rather see a feature that allows removing _all (and other fields) from an existing index without reindexing every document.

_all has its own analyzer, which is confusing when expecting to query text analyzed a certain way (or with synonyms, for example) and discovering that it does not match due to an analysis difference
Since the _all field is not retrievable or part of _source, its contents cannot easily be inspected for debugging purposes. Some users do not know it even exists, which causes confusion at query time

I agree it can be confusing, and I'd argue for including _all in the outputs of /_mapping, /_search, etc.

I agree it's a cool OOTB feature. So when you start discovering Elasticsearch it's nice to have.
But when you move to production, I think very few users need this exact feature.

Instead, they can use a template which copies all fields to an _all field and they will get basically the same feature as we had previously.

Maybe we could have some default templates documented so people can easily apply them?

There are a number of sub-optimal things we do to improve the OOB experience, eg indexing all strings as both text (for search) and keyword (for sorting and aggregations). In fact, these OOB settings might be sufficient for some users who will never need to change anything.

For the user wanting smaller disk usage or better performance, we have guides available to help them make good choices.

While I would seldom use the _all field myself, I recognise that it makes it incredibly easy for a user to just get going with a simple search case or with logstash/beats and kibana.

Disabling _all now would severely hurt the user's experience in Kibana today. This is not something we can just rush in at the last minute. We first need to have a cross-stack discussion to come up with a plan for maintaining the same ease of use we have today.

Something else I'd be interested in getting opinions about would be to make the way _all works more explicit, eg. by disabling the copying magic and having the default mappings for strings look like this:

{
  "type": "text",
  "copy_to": "_all",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

So _all would remain, but only as a convention for the default catch-all field and without magic.

Disabling _all now would severely hurt the user's experience in Kibana today.

I'd add that I don't think the Kibana user is in any way special. The typical end user's preference is to search via simple keywords entered into a single edit box. That is the primary "search" use case and one that typically behaves badly on multi-field indexes.

+1 on Adrien's proposal or using a default template which can be overloaded.

#13214 will need to be addressed first, otherwise a query of * doesn't always return all documents with a custom _all

I've been thinking about this for the last few days. I think one of the root
issues with the discussion is that there are two different levels of
"user-friendly" and "out-of-box experience" that we're talking about.

On one hand, enabling the _all field is friendly to the out-of-the-box
experience because it allows a user to use the query_string or
simple_query_string query without knowing anything about the data.

On the other hand, enabling the _all field is damaging to the out-of-the-box
experience when it comes to users experiencing large amounts of disk space,
slower indexing, and operational confusion when a user doesn't know what the
queries operate on or how it works. People expect us to be competitive when it
comes to disk and ingestion speeds for increasingly larger and larger amounts of
data. Enabling _all by default is not as nice an experience in that way.

I've noticed this trend (again, not a bad trend, just a trend toward different
poles) in a number of places in the ES codebase: do we optimize for
out-of-box experience meaning 'ease of use', or do we optimize for out-of-box
experience meaning 'scalability and performance'? We ran into the same
problem with the bootstrap checks and I think this is a similar situation.

Very true.

I think the conclusion on bootstrap checks was that we are trying to make OOTB decisions for production.
Although I loved this _all feature when I discovered Elasticsearch, I'm totally OK with removing it now.

Ideally, if people have defined an _all field with the copy_to feature (or a manually controlled _all field), I think we can try to be smart and auto-detect it when we run query string queries, or mark a field as "default" in the mapping so that when no field is defined in a query, we use this default one?
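
For what it's worth, the existing `index.query.default_field` index setting already comes close to the "mark a field as default" idea. The sketch below shows the settings body plus a toy resolver for the lookup order; `resolve_query_fields` and the field name `catch_all` are hypothetical and only illustrate the idea, not how the server resolves fields internally.

```python
def index_settings_with_default_field(field):
    """Index-creation settings pointing the default query field at a
    user-defined catch-all (the field name is an assumption)."""
    return {"settings": {"index.query.default_field": field}}


def resolve_query_fields(requested_fields, index_body, fallback="_all"):
    """Toy lookup order: explicit fields win, then the index default,
    then the historical _all fallback."""
    if requested_fields:
        return list(requested_fields)
    default = index_body.get("settings", {}).get("index.query.default_field")
    return [default or fallback]


index_body = index_settings_with_default_field("catch_all")
```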

On the other hand, enabling the _all field is damaging to the out-of-the-box
experience when it comes to users experiencing large amounts of disk space,
slower indexing, and operational confusion when a user doesn't know what the
queries operate on or how it works. People expect us to be competitive when it
comes to disk and ingestion speeds for increasingly larger and larger amounts of
data. Enabling _all by default is not as nice an experience in that way.

I would argue that ingesting large amounts of data at top speed is already way beyond the initial out-of-the-box experience.

From what I've seen people that are looking to ingest many GB or TB per day understand that they can't just do it straight out-of-the-box, but will have to put some effort into understanding how the system works and they know and expect that there'll be a few knobs to turn to improve the results.

However, I think there's many users that expect being able to start an application out-of-the-box or use a search engine to do full-text search over some documents out-of-the-box.

I think there is a misconception that search really works correctly with _all, it doesn't!

Highlighting is totally fubar'ed here, numeric stuff gets indexed into it... in a nearly useless manner, etc.

I'm not trying to use these as an argument that _all should be removed, but at the very least it should be fixed. By that I mean, look at better defaults, such as not tossing numeric stuff into it.

If we say _all is for full text search, then let it be only the concatenation of text fields, not other stuff that will just waste disk and destroy ranking :)

I think something like that is a lot more reasonable.

Also, I do feel that disabling the stopwords that Lucene has by default on all its analyzers makes the situation worse too: it makes text fields far more costly than they should be, and I think that should be considered. We try to use a minimal list that is not as invasive as other full-text search systems already, and I think it's a good default. It's been shown to be a good default in relevance experiments too.

Just another idea to make a better out of box compromise.

Just another idea to make a better out of box compromise.

++ I think we should start with dropping non-textual fields either way, no matter if we disable it by default or not. They should have include_in_all: false by default. That would be numerics, dates, ip etc.?
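
Until such a default exists, the same effect can be approximated per field with the `include_in_all` mapping parameter. A minimal sketch, assuming a flat `properties` dict; the helper name, the exact set of types treated as non-textual, and the sample fields are assumptions.

```python
# Types treated as non-textual here: numerics, dates and IPs, per the comment above.
NON_TEXT_TYPES = {"long", "integer", "short", "byte", "double", "float", "date", "ip"}


def exclude_non_text_from_all(properties):
    """Return a copy of the properties where non-textual fields opt out of
    _all, unless the mapping already sets include_in_all explicitly."""
    updated = {}
    for name, field in properties.items():
        field = dict(field)  # copy so the input mapping is left untouched
        if field.get("type") in NON_TEXT_TYPES:
            field.setdefault("include_in_all", False)
        updated[name] = field
    return updated


props = {
    "message": {"type": "text"},
    "bytes": {"type": "long"},
    "timestamp": {"type": "date"},
    "client": {"type": "ip", "include_in_all": True},
}
cleaned = exclude_non_text_from_all(props)
```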

I think we should start with dropping non-textual fields either way no matter if we disable it by default or not. they should have include_in_all: false by default.

I opened a PR to do/discuss this - https://github.com/elastic/elasticsearch/pull/20085

I really like @rmuir 's idea of dropping non-textual fields by default. I also agree that for new users and Kibana, _all is very important.

I also really like @cwurm's idea of discarding _all's analyzer and instead using the mappings that have already been defined on the string fields in an index as the basis for what gets shoved into _all. There are actually plenty of cases where I can make good use of _all -- but I could make even better use of it if it employed the same tokenizers and analyzers as the fields I'm shoving into it.

Another thing to think about: while suboptimal, storage is cheap relative to developer time and reindexing. I have often seen _all used as a workaround because of an index mapping that was incorrect or poorly thought out. This is especially true with new users. A common example I've seen several times is someone correctly marking a field they intend to bucket or sort on as non-analyzed but not realizing that they would be unable to perform wildcard-esque searches on that field unless they also maintained an analyzed copy (e.g. the string and string.raw convention employed by Logstash and others). As an ES admin, it's definitely been very helpful for me at times to tell users, "that's okay, you can still query _all and we'll fix it moving forward" as opposed to "sorry, that sucks, we have to reindex all your data".
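
For readers who haven't seen the string / string.raw convention, a sketch of the multi-field mapping it refers to, written in the older 2.x-style syntax (`string` with `not_analyzed`). The helper name and the field name `hostname` are hypothetical.

```python
def analyzed_with_raw_subfield():
    """2.x-style multi-field: the main field is analyzed for search, while
    the .raw sub-field keeps the exact value for sorting and bucketing."""
    return {
        "type": "string",
        "fields": {
            "raw": {"type": "string", "index": "not_analyzed"},
        },
    }


properties = {"hostname": analyzed_with_raw_subfield()}
# Search against "hostname"; sort or aggregate on "hostname.raw".
sort_field = "hostname.raw"
```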

All of which to say, to the extent that _all is disabled or discouraged, I think it might be worth investing in the _reindex API -- both from an engineering perspective (more knobs to turn, more testing, better throttling, recovery of reindexing tasks that fail, etc) and from a documentation/tutorial perspective.
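
For reference, the basic shape of a _reindex request body, as a sketch. The index names are hypothetical, and the throttling and recovery knobs mentioned above are deliberately left out.

```python
def reindex_body(source_index, dest_index):
    """Body for POST _reindex: copy documents from one index into another,
    typically one created with the corrected mapping."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }


body = reindex_body("logs-v1", "logs-v2")
```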

One question to add to the discussion above: is it worth a shot to disable the _all field and reindex all the data? We are seeing huge latencies while indexing/updating documents. Could this improve the performance?

We have been brainstorming how to handle the _all field for the last few
weeks. This has led us to revert the change that elides numerics from _all in
favor of the new ideas.

The results of the discussion are this:

  • Favor query-side handling this versus mapping-side changes. It would be better
    if we could handle an _all-like query entirely on the query side rather than
    requiring mapping changes for old and new indices to get the desired behavior.
  • Keep all queries constant score of 1 so it's clear to users it's not used
    for relevancy. Currently _all has some concerns about relevancy due to the
    hidden nature of how it's actually calculated. So any new query we introduce
    should have a constant score of 1 to indicate it is not meant for relevancy
    matching.
  • Introduce a new "all" query that internally rewrites into a multi-match query
    on the backend for all different fields
  • Revert the change for excluding numerics from _all by default (already done)
  • The new "all" query needs to handle parsing failures due to string != number
    exceptions
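
To illustrate the last three points together, here is a toy sketch of how an "all" query might be rewritten on the query side. This is not the implementation from the linked PR: `rewrite_all_query` is a hypothetical helper, it emits one match clause per field rather than a real multi_match, and it sidesteps string != number failures by skipping numeric fields when the text does not parse as a number.

```python
NUMERIC_TYPES = {"long", "integer", "short", "byte", "double", "float"}


def parses_as_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def rewrite_all_query(text, field_types):
    """Expand one query string into per-field match clauses wrapped in a
    constant_score query, so every hit gets the same score of 1.0."""
    clauses = []
    for field, field_type in sorted(field_types.items()):
        if field_type in NUMERIC_TYPES and not parses_as_number(text):
            continue  # skip instead of failing on a string != number mismatch
        clauses.append({"match": {field: text}})
    return {
        "constant_score": {
            "filter": {"bool": {"should": clauses, "minimum_should_match": 1}},
            "boost": 1.0,
        }
    }


fields = {"colour": "text", "price": "double"}
text_query = rewrite_all_query("red", fields)
number_query = rewrite_all_query("42", fields)
```

A real implementation could probably lean on the `lenient` option that multi_match and query_string already have for format-based failures instead of inspecting field types up front.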

Work

With that discussion in mind, the following steps have been identified:

  • [x] Revert removing numerics from _all field (#20085)
  • [ ] Add new query type called "all" query that rewrites into multi_match (https://github.com/elastic/elasticsearch/pull/20925)

It should also be a constant_score query (ie, return a _score of 1.0) to
indicate that the query is not good at relevancy.

  • [ ] Switch regular query_string query and/or kibana to use new "all" query (https://github.com/elastic/kibana/issues/8007)

Depending on how it works, Kibana (and other tools) may be able to switch to it
instead of query_string. It remains to be seen how much functionality the new
query will have.

  • [ ] Switch regular simple_query_string query to use new "all" query

Again, optional depending on how well the query works.

  • [ ] Once queries are working, remove _all or disable-by-default

Once we have a true replacement that doesn't use the _all field, we can remove
it entirely, or stick with disabling it by default.

I like the idea of an all query. I think it might want to calculate relevancy scores though. As rightly noted here this might not be perfect, but is sufficient for a number of uses, e.g. when indexing mostly text fields.

And in an all query we might be able to account for the most egregious shortcomings, e.g. not take non-text fields into account for the score.

@dakrone in our use case we do not query the _all field, so disabling it completely in the mapping won't affect our searches; constant score is what we mostly use. Since _all is still indexed, I assume disabling it completely would save some time and reduce latency. Can someone please advise?

@abhishekanand100 I think your question is independent of the topic under discussion here, you can already disable the _all field today. You might want to post in another forum, e.g. https://discuss.elastic.co/c/elasticsearch

@cwurm thanks for the info, will post a new thread on the link
