Describe the feature:
More and more use cases treat Elasticsearch as a data store, yet retrieving fields today is complex and requires expertise in many different areas. One needs to understand mappings, doc_values, and stored fields. One also runs into details such as the limit on the number of doc_value fields per request, which has to be worked around by detecting that a user requested more fields than the limit allows and fetching the extra ones from _source instead.
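To make that workaround concrete, here is a minimal Python sketch. It only builds a search request body and assumes the caller already knows which fields have doc_values. The helper name `build_search_body` and the `docvalue_capable` parameter are hypothetical; the limit of 100 is the default of `index.max_docvalue_fields_search`.

```python
# Hypothetical helper for illustration; not part of any Elasticsearch client.
DEFAULT_MAX_DOCVALUE_FIELDS = 100  # default of index.max_docvalue_fields_search


def build_search_body(requested, docvalue_capable, limit=DEFAULT_MAX_DOCVALUE_FIELDS):
    """Split requested fields between docvalue_fields and _source.

    requested: ordered list of field names the user asked for.
    docvalue_capable: set of field names assumed to have doc_values.
    """
    # Take doc_values for as many fields as the limit allows.
    from_doc_values = [f for f in requested if f in docvalue_capable][:limit]
    chosen = set(from_doc_values)
    # Everything else (no doc_values, or over the limit) falls back to _source.
    from_source = [f for f in requested if f not in chosen]

    body = {"query": {"match_all": {}}}
    if from_doc_values:
        body["docvalue_fields"] = [{"field": f} for f in from_doc_values]
    # "_source": false disables source fetching when nothing is needed from it.
    body["_source"] = from_source if from_source else False
    return body
```

Even this small sketch leaves out multi-fields, aliases, and value formatting, which is part of why a dedicated API seems worthwhile.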
Then, of course, there are multi-fields. Which variant should I pick? How do I even detect that a field has multi-fields in order to avoid retrieving the same field multiple times? There is an answer to this, of course (check whether there is a parent field that is not an object), but hopefully this illustrates how complex the situation is.
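The parent check can be sketched roughly as follows. This assumes the input is the "properties" section of an index mapping, where object fields keep their sub-fields under "properties" and multi-fields sit under "fields" of a non-object parent. The function name and the sample mapping are hypothetical.

```python
def find_multi_fields(properties, prefix=""):
    """Map each multi-field's dotted path to the dotted path of its parent field."""
    found = {}
    for name, definition in properties.items():
        path = f"{prefix}{name}"
        # Object fields hold their sub-fields under "properties": recurse into them.
        found.update(find_multi_fields(definition.get("properties", {}), f"{path}."))
        # A non-object parent holds its variants under "fields": these are multi-fields.
        for sub_name in definition.get("fields", {}):
            found[f"{path}.{sub_name}"] = path
    return found


# Hypothetical mapping, shaped like the "properties" section of an index mapping.
properties = {
    "host": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "geo": {"properties": {"src": {"type": "keyword"}}},
    "machine": {
        "properties": {
            "os": {"type": "text", "fields": {"keyword": {"type": "keyword"}}}
        }
    },
}
multi_fields = find_multi_fields(properties)

# Drop a multi-field only when its parent was requested too.
requested = ["host", "host.keyword", "geo.src", "machine.os.keyword"]
deduplicated = [f for f in requested if multi_fields.get(f) not in requested]
```

Here `geo.src` is kept because its parent `geo` is an object, while `host.keyword` is dropped because `host` itself was requested.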
Having written code to do this for ML, I have multiple stories about the complexities that arise. I think other users must have gone through a similar process.
I propose a new API that simply retrieves values given a list of fields. The API does not aim to do this in the most performant way; rather, it aims to do it in the most user-friendly way. It targets users who do not know the inner workings of Elasticsearch and who have not yet hit a performance issue that would start them on an optimization journey (see the "is it faster to retrieve from _source or doc_values" type of questions).
Pinging @elastic/es-search (:Search/Search)
We discussed this issue in our search meeting and we've spotted two enhancements that could help to retrieve values more easily:
- field_caps API should expose the source path of the field if it's not present in the _source (alias, multi-fields, ...): https://github.com/elastic/elasticsearch/issues/49264
- _source should be customizable in order to allow a date, for instance, to be returned as a timestamp since epoch rather than a string. This feature would be equivalent to the format option of docvalue_fields, but it would be applied to the original source directly.
Discussed in the meeting today, adding team-discuss to clarify the remaining scope once @jimczi is back (are we okay with the current plan or do we need to do a higher level api to handle the retrieval).
I can imagine this being necessary as well for feature extraction in our planned LTR work, both at training and inference time, to extract document-only features (i.e. features that are not query/context dependent).
/cc @davidkyle @jtibshirani
We have run into this problem in Kibana, where we are primarily asking users to interact with dotted field names like system.cpu.user.pct or url.keyword in building their visualizations.
Because the dotted names are what we train users to see, we keep a cache of the dotted names from the field_caps API (the index pattern object), and use this when asking users to build queries or visualizations. Why don't the _search APIs construct dotted paths for us?
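For reference, the client-side flattening this implies looks roughly like the Python sketch below. It assumes a hit's `_source` is a nested dict and collects every leaf value under its dotted path; the function name and sample document are hypothetical. Note that it can only see what is physically in `_source`, so multi-fields such as url.keyword and alias fields never show up.

```python
def flatten_source(source, prefix=""):
    """Flatten a nested _source dict into {dotted_path: [values]}."""
    flat = {}
    for key, value in source.items():
        path = f"{prefix}{key}"
        # Treat single values and arrays uniformly.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into objects and merge values found under the same path.
                for sub_path, sub_values in flatten_source(item, f"{path}.").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


# Hypothetical document for illustration.
doc = {"system": {"cpu": {"user": {"pct": 0.25}}}, "tags": ["success", "info"]}
flat = flatten_source(doc)
```

Every client that wants dotted names has to carry some version of this, which is the motivation for the question above.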
Proposal: Add a new parameter fields to the _search API which implements the high-level retrieval described here, combining the behavior of _source and docvalue_fields. It is important for use in Kibana to support unlimited wildcards. It is important for us to be able to display the entire document using a query like fields: '*' or fields: ['system.cpu.*'].
The Kibana sample data contains both text and keyword mappings, and is a good illustration of the response shape that I would expect:
POST kibana_sample_data_logs/_search
{
  "query": { "match_all": {} },
  "_source": "",
  "fields": [{ "field": "*" }],
  "size": 10
}

"fields": {
  "bytes": [ 8679 ],
  "extension": "",
  "extension.keyword" : [ "" ],
  "geo.coordinates" : [ "32.69899999257177, -94.94886112399399" ],
  "geo.src" : [ "CN" ],
  "geo.dest" : [ "IT" ],
  "geo.srcdest" : [ "CN:IT" ],
  "host": "www.elastic.co",
  "host.keyword" : [ "www.elastic.co" ],
  "machine.os" : "win xp",
  "machine.os.keyword" : [ "win xp" ],
  "machine.ram" : 11811160064,
  "response": 200,
  "response.keyword": ["200"],
  "tags": ["success","info"],
  "tags.keyword": ["info", "success", "info", "success"]
}
The example request is easy to write for any user of Elasticsearch, and the response contains information that is from both doc_values and _source. This is a simple, high-level API that we could work with. Unfortunately, this isn't possible by combining any of the APIs that exist today for a few reasons.
I have been testing with ECS-based schemas like Metricbeat, which on my cluster contains 3904 named paths in the mapping. Not all of these fields are actively used, but because the mapping is so large it causes problems. Here are the limitations I've found:
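- _source: "*" does not include multi-mapped or alias fields
- _source request with a list of 3904 paths like _source: [...] causes the error:
{
  "type" : "too_complex_to_determinize_exception",
  "reason" : "Determinizing automaton with 235539 states and 239442 transitions would result in more than 10000 states."
}
- docvalue_fields: [{ field: "*" }] throws an error if there are any text fields at all:
> Fielddata is disabled on text fields by default. Set fielddata=true on [request] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
- docvalue_fields: [{ field: "*" }] causes the error:
> Trying to retrieve too many docvalue_fields. Must be less than or equal to: [100] but was [2588]. This limit can be set by changing the [index.max_docvalue_fields_search] index level setting.
- docvalue_fields also causes the same error:
> Trying to retrieve too many docvalue_fields. Must be less than or equal to: [100] but was [3900]. This limit can be set by changing the [index.max_docvalue_fields_search] index level setting.
All of these limitations make it hard to avoid using _source.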
I caught up with @jimczi offline to clarify our earlier discussion. Instead of immediately pushing ahead with the source_path (#49264) and formatters changes, we'd like to step back and consider the problem in a more end-to-end way. That way, we can consider a coordinated API change that addresses the use case in a more direct + user-friendly way.
We can continue the discussion about field retrieval on this issue, building on @wylieconlon's helpful analysis. I'll remove 'team discuss' for now, but we can add it back if there's a particular item we'd like to discuss in person.
+1 to move forward with something along the lines of @wylieconlon 's above proposal.
Great, I've assigned this to myself and am working on a design doc. Once the design is more settled I'll post it here or open a new meta-issue.
I opened a meta-issue to track implementation details: https://github.com/elastic/elasticsearch/issues/55363.
Closing, since the feature branch was merged in #60100.