Elasticsearch: A high level way of retrieving values for certain fields

Created on 13 Nov 2019 · 10 Comments · Source: elastic/elasticsearch

Describe the feature:

More and more use cases treat Elasticsearch as a data store. Yet retrieving fields today is complex and requires expertise in several areas: one needs to understand mappings, doc_values, and stored fields. There are further complications, such as becoming aware of the max doc_value field limit and then working around it by detecting that a user requested more fields than the limit allows and fetching them from _source instead.
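To make the workaround concrete, here is a minimal sketch of the fallback logic a client ends up writing. The function name, its inputs, and the simple all-or-nothing fallback are assumptions for illustration only; the default limit of 100 is taken from the index.max_docvalue_fields_search error message quoted in this thread. It ignores multi-fields and aliases, which cannot simply be read from _source.

```python
# Assumed default of index.max_docvalue_fields_search.
MAX_DOCVALUE_FIELDS = 100


def build_search_body(requested_fields, docvalue_capable, limit=MAX_DOCVALUE_FIELDS):
    """Split requested fields between docvalue_fields and _source.

    requested_fields: list of field names the caller wants back.
    docvalue_capable: set of field names known to have doc_values
                      (hypothetical input, derived from the mappings by the caller).
    """
    from_doc_values = [f for f in requested_fields if f in docvalue_capable]
    from_source = [f for f in requested_fields if f not in docvalue_capable]

    # Requesting more doc_value fields than the limit is rejected,
    # so fall back to fetching everything from _source.
    if len(from_doc_values) > limit:
        from_source = list(requested_fields)
        from_doc_values = []

    # "_source": False disables source fetching when nothing is needed from it.
    body = {"query": {"match_all": {}}, "_source": from_source or False}
    if from_doc_values:
        body["docvalue_fields"] = from_doc_values
    return body
```

Even this small helper needs to know which fields have doc_values, which is exactly the kind of mapping expertise described above.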

Then, of course, there are multi-fields. Which variant should I pick? How do I even detect that a field has multi-fields in order to avoid retrieving the same field multiple times? There is an answer to this, of course (check that there is a parent field that is not an object), but hopefully this illustrates how complex it is.
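The parent check can be sketched as follows. This assumes a simplified, hypothetical input: a flat dict of dotted field name to mapping type, such as one might build from field_caps output, where object and nested parents are reported with their own type.

```python
def find_multi_fields(field_types):
    """Return field names whose dotted parent exists and is not an object.

    field_types: dict of dotted field name -> mapping type
                 (hypothetical, simplified shape).
    """
    multi_fields = set()
    for name in field_types:
        parent, dot, _leaf = name.rpartition(".")
        if not dot:
            continue  # top-level field, so there is no parent to inspect
        parent_type = field_types.get(parent)
        # A parent that is a concrete field (text, keyword, ...) rather than
        # an object means this name is a multi-field variant of it.
        if parent_type is not None and parent_type not in ("object", "nested"):
            multi_fields.add(name)
    return multi_fields


example = {
    "machine": "object",
    "machine.os": "text",
    "machine.os.keyword": "keyword",
    "host": "text",
    "host.keyword": "keyword",
}
print(find_multi_fields(example))
```

Here machine.os is kept as a regular field because its parent machine is an object, while machine.os.keyword and host.keyword are flagged as multi-fields.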

Having written code to do this for ML, I have multiple stories about the complexities that arise. I think other users must have gone through a similar process.

I propose a new API that simply retrieves values given a list of fields. The API does not intend to do this in the most performant way. Rather, it intends to do it in the most user-friendly way. It targets users who do not know the inner workings of Elasticsearch and who have not yet hit a performance issue that would start them on an optimization journey (see "is it faster to retrieve from _source or doc_values" types of questions).

:Search/Search >feature

dimitris-athanasiou

👍4

Most helpful comment

We have run into this problem in Kibana, where we are primarily asking users to interact with dotted field names like system.cpu.user.pct or url.keyword in building their visualizations.
Because the dotted names are what we train users to see, we keep a cache of the dotted names from the field_caps API (the index pattern object), and use this when asking users to build queries or visualizations. Why don't the _search APIs construct dotted paths for us?
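For context, this is roughly the kind of client-side flattening that is needed today to turn a nested _source into dotted paths. It is a minimal sketch with a hypothetical helper name. It does not descend into arrays of objects, and it cannot produce multi-fields such as url.keyword, since those never appear in _source.

```python
def flatten_source(source, prefix=""):
    """Flatten a nested _source dict into {dotted.path: value}."""
    flat = {}
    for key, value in source.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into object fields, extending the dotted path.
            flat.update(flatten_source(value, path))
        else:
            flat[path] = value
    return flat


hit_source = {"system": {"cpu": {"user": {"pct": 0.5}}}, "url": "/index.html"}
print(flatten_source(hit_source))
```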

Proposal: Add a new parameter fields to the _search API which implements the high-level retrieval described here, combining the behavior of _source and docvalue_fields. It is important for use in Kibana to support unlimited wildcards. It is important for us to be able to display the entire document using a query like fields: '*' or fields: ['system.cpu.*'].

The Kibana sample data contains both text and keyword mappings, and is a good illustration of the response shape that I would expect:

POST kibana_sample_data_logs/_search
{
  "query": { "match_all": {} },
  "_source": "",
  "fields": [{ "field": "*" }],
  "size": 10
}

"fields": {
  "bytes": [ 8679 ],
  "extension": "",
  "extension.keyword" : [ "" ],
  "geo.coordinates" : [ "32.69899999257177, -94.94886112399399" ],
  "geo.src" : [ "CN" ],
  "geo.dest" : [ "IT" ],
  "geo.srcdest" : [ "CN:IT" ],
  "host": "www.elastic.co",
  "host.keyword" : [ "www.elastic.co" ],
  "machine.os" : "win xp",
  "machine.os.keyword" : [ "win xp" ],
  "machine.ram" : 11811160064,
  "response": 200,
  "response.keyword": ["200"],
  "tags": ["success","info"],
  "tags.keyword": ["info", "success", "info", "success"]
}

The example request is easy to write for any user of Elasticsearch, and the response contains information that is from both doc_values and _source. This is a simple, high-level API that we could work with. Unfortunately, this isn't possible by combining any of the APIs that exist today for a few reasons.

Limitations of current APIs

I have been testing with ECS-based schemas like Metricbeat, which on my cluster contains 3904 named paths in the mapping. Not all of these fields are actively used, but because the mapping is so large it causes problems. Here are the limitations I've found:

- _source: "*" does not include multi-mapped or alias fields.
- Making a _source request with a list of 3904 paths like _source: [...] causes the error:
{ "type" : "too_complex_to_determinize_exception", "reason" : "Determinizing automaton with 235539 states and 239442 transitions would result in more than 10000 states." }
- It's not possible to get all docvalues with a wildcard on small indices. The query docvalue_fields: [{ field: "*" }] throws an error if there are any text fields at all:
> Fielddata is disabled on text fields by default. Set fielddata=true on [request] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
- It's not possible to get all docvalues on a large mapping like Metricbeat. The request docvalue_fields: [{ field: "*" }] causes the error:
> Trying to retrieve too many docvalue_fields. Must be less than or equal to: [100] but was [2588]. This limit can be set by changing the [index.max_docvalue_fields_search] index level setting.
- Listing too many paths in the request for docvalue_fields also causes the same error:
> Trying to retrieve too many docvalue_fields. Must be less than or equal to: [100] but was [3900]. This limit can be set by changing the [index.max_docvalue_fields_search] index level setting.

All of these limitations make it hard to avoid using _source.
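One partial workaround for the last three points is sketched below, assuming a field_caps response of the usual shape (fields, then field name, then type, then capabilities). It keeps only fields that are aggregatable and not text, then splits them into batches that stay under the limit. The helper name is hypothetical, and using aggregatable as a stand-in for "has doc_values" is an assumption rather than a guarantee. Each batch would also need its own search request, so this is not a real substitute for the proposed API.

```python
def docvalue_batches(field_caps_response, limit=100):
    """Pick likely doc_values fields and split them into request-sized batches."""
    names = sorted(
        name
        for name, by_type in field_caps_response["fields"].items()
        if not name.startswith("_")  # skip metadata fields such as _id
        and all(
            caps.get("aggregatable") and caps.get("type") != "text"
            for caps in by_type.values()
        )
    )
    # Each batch is small enough for one docvalue_fields request.
    return [names[i:i + limit] for i in range(0, len(names), limit)]


caps = {
    "fields": {
        "bytes": {"long": {"type": "long", "searchable": True, "aggregatable": True}},
        "host": {"text": {"type": "text", "searchable": True, "aggregatable": False}},
        "host.keyword": {"keyword": {"type": "keyword", "searchable": True, "aggregatable": True}},
    }
}
print(docvalue_batches(caps, limit=1))
```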

wylieconlon on 20 Feb 2020

👍3

All 10 comments

Pinging @elastic/es-search (:Search/Search)

elasticmachine on 13 Nov 2019

We discussed this issue in our search meeting and we've spotted two enhancements that could help to retrieve values more easily:

The field_caps API should expose the source path of the field if it's not present in the _source (alias, multi-fields, ...): https://github.com/elastic/elasticsearch/issues/49264
The format of values when retrieving the _source should be customizable, so that a date, for instance, can be returned as a timestamp since epoch rather than a string. This feature would be equivalent to the format option of docvalue_fields, but it would be applied to the original source directly.
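Until something like that exists, a client reading dates from _source has to convert them itself. Below is a minimal sketch of that conversion, assuming the source holds ISO 8601 strings. The helper name is hypothetical, and treating values without an offset as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_millis(value):
    """Convert an ISO 8601 date string from _source to epoch milliseconds."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    # Integer division of timedeltas avoids floating point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


print(iso_to_epoch_millis("1970-01-01T00:00:01.500Z"))
```

With docvalue_fields the same result can be requested from the server through the format option (for example epoch_millis), which is the behavior the second enhancement would bring to _source.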

jimczi on 18 Nov 2019

Discussed in the meeting today, adding team-discuss to clarify the remaining scope once @jimczi is back (are we okay with the current plan, or do we need a higher-level API to handle the retrieval?).

costin on 10 Feb 2020

I can imagine this as being necessary as well for feature extraction for our planned LTR work, both at training and inference time to extract document only features (i.e. features that are not query/context dependent).
/cc @davidkyle @jtibshirani

joshdevins on 10 Feb 2020

I caught up with @jimczi offline to clarify our earlier discussion. Instead of immediately pushing ahead with the source_path (#49264) and formatters changes, we'd like to step back and consider the problem in a more end-to-end way. That way, we can consider a coordinated API change that addresses the use case in a more direct and user-friendly way.

We can continue the discussion about field retrieval on this issue, building on @wylieconlon's helpful analysis. I'll remove 'team discuss' for now, but we can add it back if there's a particular item we'd like to discuss in person.