Elasticsearch: Explore disabling the _all field by default

Created on 3 Aug 2016 · 30 Comments · Source: elastic/elasticsearch

We should explore the idea of disabling the _all field by default. For
background, the _all field contains the contents of each of a document's
fields. During indexing, these contents are copied to the _all field, analyzed
with a specific analyzer, and then indexed. At query time, some queries such as
the query_string and simple_query_string queries search the _all field if
no fields are specified.

There are a number of issues with the _all field:

- It uses a fair amount of disk space, due to duplicating the values from all
  other fields to an additional field
- _all has its own analyzer, which is confusing when expecting to query
  text analyzed a certain way (or with synonyms, for example) and discovering
  that it does not match due to an analysis difference
- Additional indexing overhead caused by data duplication
- Since the _all field is not retrievable or part of _source, its contents
  cannot easily be inspected for debugging purposes. Some users do not know it
  even exists, which causes confusion at query time
- Better alternatives exist, such as the copy_to parameter on mappings, which can
  be used to create custom _all fields
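
As a rough sketch of the copy_to alternative, the snippet below builds an index-creation body that disables _all and copies selected text fields into a custom catch-all field. It is written as a Python dict so it can be serialized to JSON. The type name `doc` and the field names `title`, `body`, and `catch_all` are hypothetical, and the nesting under a type name assumes the 2.x/5.x mapping layout that was current when this issue was filed.

```python
import json


def catch_all_mapping(text_fields, catch_all="catch_all", type_name="doc"):
    """Build an index-creation body that disables _all and uses copy_to instead."""
    properties = {
        # Each listed field is indexed normally and also copied into the catch-all.
        name: {"type": "text", "copy_to": catch_all}
        for name in text_fields
    }
    # The catch-all is an ordinary text field, so its analyzer can be set explicitly.
    properties[catch_all] = {"type": "text"}
    return {
        "mappings": {
            type_name: {
                "_all": {"enabled": False},
                "properties": properties,
            }
        }
    }


body = catch_all_mapping(["title", "body"])
print(json.dumps(body, indent=2))
```

Unlike _all, the catch-all here is visible in the mapping and only contains the fields that were explicitly routed to it.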

If we were to change the _all field to be disabled by default, some queries
would have to be handled differently. The query_string and
simple_query_string queries would have to know a default field or fields to
query. Perhaps we would be able to change these queries to send a "fields": ["*"] parameter if no fields are specified in the JSON query, so that all field
values can still be queried.
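
To make the idea concrete, here is a minimal sketch of a helper that builds a query_string body and falls back to `"fields": ["*"]` when the caller names no fields. The helper name `query_string_body` and the example field names are hypothetical, and the wildcard fallback is only the behavior floated in this issue, not something Elasticsearch is assumed to do on its own.

```python
def query_string_body(query, fields=None):
    """Build a query_string search body, targeting every field when none are given."""
    return {
        "query": {
            "query_string": {
                "query": query,
                # ["*"] is the fallback proposed in this issue (an assumption here).
                "fields": list(fields) if fields else ["*"],
            }
        }
    }


explicit = query_string_body("quick fox", fields=["title", "body"])
fallback = query_string_body("quick fox")
```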

:Search/Mapping · discuss · v6.0.0-alpha1

dakrone

👍13 ❤4

Most helpful comment

> I'd totally remove this feature in favor of copy_to.

I'm strongly against this. I wouldn't even want to change the default, but even if we do that let's keep _all within arm's reach until we know the change is accepted by users.

I don't think any harm is done by having all enabled by default. It's great in the beginning, and once you move past small data sizes and/or into production, you should take a look at your mappings anyway and disabling all if not needed is just one of the things you have to do, and it's something many, many tutorials, blog posts, docs, etc. out there mention.

To me the _all field is part of what makes the Elasticsearch experience so "magical" for first-time users. Full-text search OOTB.

Just a few days ago, I had a newbie user refer to anything other than full-text query-string search as the "complicated queries" [sic]. They were using _search?q=test and never thinking twice about mappings, _all fields or any of this, and I think it's great they don't have to initially.

I'd rather see a feature that allows removing _all (and other fields) from an existing index without reindexing every document.

> _all has its own analyzer, which is confusing when expecting to query text analyzed a certain way (or with synonyms, for example) and discovering that it does not match due to an analysis difference
> Since the _all field is not retrievable or part of _source, its contents cannot easily be inspected for debugging purposes. Some users do not know it even exists, which causes confusion at query time

I agree it can be confusing, and I'd argue for including _all in the outputs of /_mapping, /_search, etc.

cwurm on 17 Aug 2016

👍17

All 30 comments

I am +1 on this. I would like to add some other benefits:

- By disabling it by default, we make enabling it a conscious decision, so engineers will only enable it if they need it. We advise this as a best practice in various discussions anyway.
- Tools like Logstash already provide a default mapping, in which they could keep compatibility if need be

djschny on 3 Aug 2016

+1 to this change.
I'd totally remove this feature in favor of copy_to.

dadoonet on 3 Aug 2016

👍3 👎2

Agreed with David: if we decide to disable it by default then I'd vote to remove it entirely. Since enabling the feature will require taking action anyway, I'd rather users set up a template that automatically adds a copy_to option to string fields than enable the _all field?

> Perhaps we would be able to change these queries to send a "fields": ["*"] parameter if no fields are specified in the JSON query, so that all field values can still be queried.

This part worries me a bit: this would be very slow if there are many fields in mappings, which would be trappy? I would rather make field/fields a required parameter of the match/query_string/... queries.

jpountz on 5 Aug 2016

👍2 👎1
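
The template idea above could look roughly like the following sketch, which builds a type-level mapping whose dynamic template adds copy_to to every newly detected string field. The function name `copy_to_template`, the template name `strings_to_catch_all`, and the field name `catch_all` are hypothetical; where this fragment is placed (for example under a type name in an index template) depends on the Elasticsearch version and is left out here.

```python
def copy_to_template(catch_all="catch_all"):
    """Build a mapping fragment that copies every dynamically mapped string field."""
    return {
        "dynamic_templates": [
            {
                "strings_to_catch_all": {
                    # Applies to any new field whose JSON value is detected as a string.
                    "match_mapping_type": "string",
                    "mapping": {"type": "text", "copy_to": catch_all},
                }
            }
        ],
        # The catch-all is mapped explicitly, so the template above does not touch it.
        "properties": {catch_all: {"type": "text"}},
    }


fragment = copy_to_template()
```

Searches would then name `catch_all` explicitly, which fits the suggestion of making the field parameter required.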

I am +1 here too. I think having catch-all fields is good practice in many situations, but we should try to cover it with documentation rather than punishing all the users that don't need that at all.

We might come up with better ideas, like defining default search fields for an index that are used if no fields are specified, such that users can build their own _all field instead?

s1monw on 5 Aug 2016

> This part worries me a bit: this would be very slow if there are many fields in mappings, which would be trappy?

That's a really good point! The fields: ["*"] idea was a random brainstorm, I think we can do it in a more efficient manner as you mentioned.

dakrone on 5 Aug 2016

I'm concerned by the quality of results if we steer people away from _all towards multi-field indexing and multi_match type queries _across_ fields.
The default ranking heuristics that work well on a single _all type field don't apply very well in a multi-field scenario (e.g. favouring any field _other_ than colour for the search term red [1]). There are some options in multi_match to address this, but it's confusing and doesn't work if you use query_string etc.

For my money wholly unstructured queries are better off serviced by a wholly unstructured _all type field. Bad matching bad.

Of course the ideal scenario is matching well-written queries against content indexed in a way that retains the original structure.
By well-written queries I mean the intent is clear and things like field names are provided for each term. It does feel like there is scope for more tooling to rewrite user queries (field spotting, AND vs OR vs Phrase auto-detect, percolator spotting queries needing special attention) but that's probably for another issue.

Aside from the relevance ranking issues of multi-field searches I think it is useful to retain some notion of a "default search field" that generic tools like Kibana or MoreLikeThis/query_string queries will target in an index.

[1] https://discuss.elastic.co/t/nested-multi-match/56370/5

markharwood on 5 Aug 2016

I think we should push people towards explicit copy_to or some kind of improved multi-field support that doesn't suffer from the issues @markharwood mentions. That way they understand the limitations of what they are doing because they actively made a choice. copy_to doesn't support weighting the copied fields differently without copying to the same field over and over again which is kind of nasty, but at least it is less magic. And less magic is good, especially for magic like the _all field which can have a fairly significant disk space cost.

I think it makes more sense to target 6.0 rather than 5.0 for this because it'll take a while for the ecosystem to absorb the change. Maybe it is actually more right to start at the edges, converting Kibana and Logstash to copy_to or multi-field searches.

nik9000 on 16 Aug 2016

👍1

> I think it makes more sense to target 6.0 rather than 5.0 for this because it'll take a while for the ecosystem to absorb the change.

Why does 6.0 allow the ecosystem to "absorb" the change better or faster? The sooner we get it out there, the sooner new users are not bogged down by this and must make an explicit choice (setting up their own all field).

rjernst on 16 Aug 2016

I've opened issues on the Beats, Logstash, and Kibana repos to kick off discussions about moving away from the _all field for index templates and querying.

dakrone on 16 Aug 2016

> I'd totally remove this feature in favor of copy_to.

I'm strongly against this. I wouldn't even want to change the default, but even if we do that let's keep _all within arm's reach until we know the change is accepted by users.

To me the _all field is part of what makes the Elasticsearch experience so "magical" for first-time users. Full-text search OOTB.

I'd rather see a feature that allows removing _all (and other fields) from an existing index without reindexing every document.

> _all has its own analyzer, which is confusing when expecting to query text analyzed a certain way (or with synonyms, for example) and discovering that it does not match due to an analysis difference
> Since the _all field is not retrievable or part of _source, its contents cannot easily be inspected for debugging purposes. Some users do not know it even exists, which causes confusion at query time

I agree it can be confusing, and I'd argue for including _all in the outputs of /_mapping, /_search, etc.

cwurm on 17 Aug 2016

👍17

I agree it's a cool OOTB feature. When you start discovering Elasticsearch, it's nice to have.
But when you move to production, I think very few users need this exact feature.

Instead, they can use a template which copies all fields to an _all field, and they will get basically the same feature as we had previously.

Maybe we could have some default templates documented so people can easily apply them?

dadoonet on 17 Aug 2016

👍1

There are a number of sub-optimal things we do to improve the OOB experience, eg indexing all strings as both text (for search) and keyword (for sorting and aggregations). In fact, these OOB settings might be sufficient for some users who will never need to change anything.

For the user wanting smaller disk usage or better performance, we have guides available to help them make good choices.

While I would seldom use the _all field myself, I recognise that it makes it incredibly easy for a user to just get going with a simple search case or with logstash/beats and kibana.

Disabling _all now would severely hurt the user's experience in Kibana today. This is not something we can just rush in at the last minute. We first need to have a cross-stack discussion to come up with a plan for maintaining the same ease of use we have today.

clintongormley on 17 Aug 2016

👍4

Something else I'd be interested in getting opinions about would be to make the way _all works more explicit, eg. by disabling the copying magic and having the default mappings for strings look like this:

{
  "type": "text",
  "copy_to": "_all",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

So _all would remain, but only as a convention for the default catch-all field and without magic.

jpountz on 17 Aug 2016

👍1

> Disabling _all now would severely hurt the user's experience in Kibana today.

I'd add that I don't think the Kibana user is in any way special. The typical end user's preference is to search via simple keywords entered into a single edit box. That is the primary "search" use case and one that typically behaves badly on multi-field indexes.

markharwood on 17 Aug 2016

+1 on Adrien's proposal or using a default template which can be overloaded.

dadoonet on 17 Aug 2016

#13214 will need to be addressed first, otherwise a query of * doesn't always return all documents with custom_all

jimmyjones2 on 18 Aug 2016

I've been thinking about this for the last few days. I think one of the root
issues with the discussion is that there are two different levels of
"user-friendly" and "out-of-box experience" that we're talking about.

On one hand, enabling the _all field is friendly to the out-of-the-box
experience because it allows a user to use the query_string or
simple_query_string query without knowing anything about the data.

On the other hand, enabling the _all field is damaging to the out-of-the-box
experience when it comes to users experiencing heavy disk usage,
slower indexing, and operational confusion when a user doesn't know what the
queries operate on or how it works. People expect us to be competitive when it
comes to disk and ingestion speeds for increasingly larger and larger amounts of
data. Enabling _all by default is not as nice an experience in that way.

I've noticed this trend (again, not a bad trend, just a trend toward different
poles) in a number of places in the ES codebase - do we optimize for
out-of-box experience meaning 'ease of use', or do we optimize for out-of-box
experience meaning 'scalability and performance'? We ran into the same
problem with the bootstrap checks and I think this is a similar situation.

dakrone on 18 Aug 2016

Very true.

I think the conclusion on bootstrap checks was that we are trying to make OOTB decisions for production.
Although I loved this _all feature when I discovered Elasticsearch, I'm totally OK with removing it now.

Ideally, if people have defined an _all field with the copy_to feature (or a manually controlled _all field), I think we can try to be smart and auto-detect it when we run query string queries, or mark a field as "default" in the mapping so that when no field is defined in a query, we use this default one?

dadoonet on 18 Aug 2016

> On the other hand, enabling the _all field is damaging to the out-of-the-box
> experience when it comes to users experiencing heavy disk usage,
> slower indexing, and operational confusion when a user doesn't know what the
> queries operate on or how it works. People expect us to be competitive when it
> comes to disk and ingestion speeds for increasingly larger and larger amounts of
> data. Enabling _all by default is not as nice an experience in that way.

I would argue that ingesting large amounts of data at top speed is already way beyond the initial out-of-the-box experience.

From what I've seen, people that are looking to ingest many GB or TB per day understand that they can't just do it straight out-of-the-box, but will have to put some effort into understanding how the system works, and they know and expect that there'll be a few knobs to turn to improve the results.

However, I think there are many users that expect to be able to start an application out-of-the-box or use a search engine to do full-text search over some documents out-of-the-box.

cwurm on 19 Aug 2016

👍3

I think there is a misconception that search really works correctly with _all, it doesn't!

highlighting is totally fubar'ed here, numeric stuff gets indexed into it... in a nearly useless manner, etc.

I'm not trying to use these as an argument that _all should be removed, but at the very least it should be fixed. By that I mean, look at better defaults, such as not tossing numeric stuff into it.

If we say _all is for full text search, then let it be only the concatenation of text fields, not other stuff that will just waste disk and destroy ranking :)

I think something like that is a lot more reasonable.

rmuir on 19 Aug 2016

👍2

Also, I do feel that disabling the stopwords that Lucene has by default on all its analyzers makes the situation worse too: it makes text fields far more costly than they should be. I think that should be considered. We try to use a minimal list that is not as invasive as other full-text search systems already, and I think it's a good default; it's been shown to be a good default in relevance experiments too.

Just another idea to make a better out of box compromise.

rmuir on 19 Aug 2016

> Just another idea to make a better out of box compromise.

++ I think we should start with dropping non-textual fields either way, no matter if we disable it by default or not. They should have include_in_all: false by default. That would be numerics, dates, ip etc?

s1monw on 19 Aug 2016

👍1
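
For illustration, the proposed default can be approximated today by setting include_in_all explicitly on non-text fields. The sketch below post-processes a properties mapping in that way. The helper name, the `NON_TEXT_TYPES` set (a guess at what "numerics, dates, ip" covers), and the example field names are all assumptions, and include_in_all is the mapping parameter as it existed in the 2.x/5.x line.

```python
# Field types treated as non-textual here; the exact list is an assumption.
NON_TEXT_TYPES = {"byte", "short", "integer", "long", "float", "double", "date", "ip"}


def exclude_non_text_from_all(properties):
    """Return a copy of a properties mapping with include_in_all: false on non-text fields."""
    updated = {}
    for name, field in properties.items():
        field = dict(field)  # shallow copy so the caller's mapping is not mutated
        if field.get("type") in NON_TEXT_TYPES:
            # setdefault keeps any value the user has already chosen explicitly.
            field.setdefault("include_in_all", False)
        updated[name] = field
    return updated


props = {
    "message": {"type": "text"},
    "status": {"type": "integer"},
    "client": {"type": "ip"},
}
result = exclude_non_text_from_all(props)
```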

> I think we should start with dropping non-textual fields either way no matter if we disable it by default or not. they should have include_in_all: false by default.

I opened a PR to do/discuss this - https://github.com/elastic/elasticsearch/pull/20085

dakrone on 19 Aug 2016

I really like @rmuir's idea of dropping non-textual fields by default. I also agree that for new users and Kibana, _all is very important.

I also really like @cwurm's idea of discarding _all's analyzer and instead using the mappings that have already been defined on the string fields in an index as the basis for what gets shoved into _all. There are actually plenty of cases where I can make good use of _all -- but I could make even better use of it if it employed the same tokenizers and analyzers as the fields I'm shoving into it.

Another thing to think about: while suboptimal, storage is cheap relative to developer time and reindexing. I have often seen _all used as a workaround for an index mapping that was incorrect or poorly thought out. This is especially true with new users. A common example I've seen several times is someone correctly marking a field they intend to bucket or sort on as not analyzed, but not realizing that they would be unable to perform wildcard-esque searches on that field unless they also maintained an analyzed copy (e.g. the string and string.raw convention employed by Logstash and others). As an ES admin, it's definitely been very helpful for me at times to tell users, "that's okay, you can still query _all and we'll fix it moving forward" as opposed to "sorry, that sucks, we have to reindex all your data".

All of which is to say, to the extent that _all is disabled or discouraged, I think it might be worth investing in the _reindex API, both from an engineering perspective (more knobs to turn, more testing, better throttling, recovery of reindexing tasks that fail, etc.) and from a documentation/tutorial perspective.

neuroticnetworks on 25 Aug 2016

One question to add to the discussion above: is it worth a shot to disable the _all field and reindex all the data? We are seeing huge latencies while indexing/updating documents. Could this improve the performance?

abhishekanand100 on 3 Oct 2016

We have been brainstorming how to handle the _all field for the last few
weeks. This has led us to revert the change that elides numerics from _all in
favor of the new ideas.

The results of the discussion are this:

- Favor handling this on the query side over mapping-side changes. It would be better
  if we could handle an _all-like query entirely on the query side rather than
  requiring mapping changes for old and new indices to get the desired behavior.
- Keep all queries at a constant score of 1 so it's clear to users that it's not used
  for relevancy. Currently _all has some concerns about relevancy due to the
  hidden nature of how it's actually calculated. So any new query we introduce
  should have a constant score of 1 to indicate it is not meant for relevancy
  matching.
- Introduce a new "all" query that internally rewrites into a multi-match query
  on the backend for all different fields
- Revert the change for excluding numerics from _all by default (already done)
- The new "all" query needs to handle parsing failures due to string != number
  exceptions
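
The points above can be approximated with queries that already exist. The sketch below is not the proposed "all" query itself: it is a hypothetical client-side helper, `all_query`, that wraps a multi_match in constant_score so every hit scores 1.0, and sets `lenient` so that text sent to a numeric field is ignored instead of failing the request. The field names in the example are made up.

```python
def all_query(text, fields):
    """Approximate the proposed "all" query with constant_score over multi_match."""
    return {
        "constant_score": {
            "filter": {
                "multi_match": {
                    "query": text,
                    "fields": list(fields),
                    # Ignore format errors such as text sent to a numeric field.
                    "lenient": True,
                }
            },
            # Every matching document gets the same score, signalling "not for relevancy".
            "boost": 1.0,
        }
    }


q = all_query("error 500", ["message", "status"])
```

The caller still has to supply the field list here, whereas the proposed query would presumably expand it on the server from the index mappings.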

Work

With that discussion in mind, the following steps have been identified:

[x] Revert removing numerics from _all field (#20085)
[ ] Add new query type called "all" query that rewrites into multi_match (https://github.com/elastic/elasticsearch/pull/20925)

It should also be a constant_score query (ie, return a _score of 1.0) to
indicate that the query is not good at relevancy.

[ ] Switch regular query_string query and/or kibana to use new "all" query (https://github.com/elastic/kibana/issues/8007)

Depending on how it works, Kibana (and other tools) may be able to switch to it
instead of query_string. It remains to be seen how much functionality the new
query will have.

[ ] Switch regular simple_query_string query to use new "all" query

Again, optional depending on how well the query works.

[ ] Once queries are working, remove _all or disable-by-default

Once we have a true replacement that doesn't use the _all field, we can remove
it entirely, or stick with disabling it by default.

dakrone on 5 Oct 2016

👍2

I like the idea of an all query. I think it might want to calculate relevancy scores though. As rightly noted here this might not be perfect, but is sufficient for a number of uses, e.g. when indexing mostly text fields.

And in an all query we might be able to account for the most egregious shortcomings, e.g. not take non-text fields into account for the score.

cwurm on 6 Oct 2016

@dakrone In our use case we do not query the _all field, so disabling it completely in the mapping won't affect our searches; constant score is what we mostly use. Since _all is still indexed, I assume disabling it completely would save some time and reduce latency. Can someone please advise?

abhishekanand100 on 6 Oct 2016

@abhishekanand100 I think your question is independent of the topic under discussion here; you can already disable the _all field today. You might want to post in another forum, e.g. https://discuss.elastic.co/c/elasticsearch

cwurm on 6 Oct 2016

@cwurm thanks for the info, will post a new thread on the link

abhishekanand100 on 6 Oct 2016
