Elasticsearch: Elasticsearch should support returning doc values in get and update APIs too

Created on 14 Nov 2017 · 13 Comments · Source: elastic/elasticsearch

Search API supports returning fields' doc values:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-docvalue-fields.html

Because doc values offer a more compact representation of the given fields than the JSON in _source does, I think it would be beneficial to support returning them from more APIs. This would even make it possible to exclude all fields from _source and serve them only from doc values.

Prominent examples that come to mind are:

  • get API (similar to search and current stored fields, it could support this: GET twitter/tweet/2?routing=user1&docvalue_fields=tags,counter)
  • update API (for example, after a scripted update on a field excluded from _source, it would be useful to return the field's doc value)
:DistributeCRUD >enhancement team-discuss

All 13 comments

If compactness is the goal, I think source filtering achieves that? E.g. the get API has source filtering to retrieve just the fields you're interested in: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#get-source-filtering

There's also response filtering via filter_path, which applies to any request: https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#common-options-response-filtering and allows arbitrary filtering of the response.
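As a rough illustration of the two options mentioned above, the following sketch builds a get URL that combines source filtering (`_source=...`) with response filtering (`filter_path=...`). The helper name `get_url` and the `twitter/tweet/2` document are assumptions used only for illustration; the sketch only builds a string and does not contact a cluster.

```python
from urllib.parse import urlencode


def get_url(index, doc_type, doc_id, source_fields=None, filter_path=None):
    """Build a get-API URL with optional source and response filtering.

    Hypothetical helper: it only assembles the URL string.
    """
    params = {}
    if source_fields:
        # Source filtering: return only these fields from _source.
        params["_source"] = ",".join(source_fields)
    if filter_path:
        # Response filtering: keep only these paths of the JSON response.
        params["filter_path"] = ",".join(filter_path)
    path = f"/{index}/{doc_type}/{doc_id}"
    # Keep commas, dots and wildcards readable in the query string.
    query = urlencode(params, safe=",.*")
    return f"{path}?{query}" if query else path


print(get_url("twitter", "tweet", "2", ["tags", "counter"], ["_source"]))
```

Note that both options still read from _source, so they reduce the response size but not what is stored on disk.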

Now, the case can be made for doc values because they are post-analysis whereas the source is pre-analysis. So that's a potentially useful thing about returning the doc values instead of filtered source.

I'm requesting this feature because I want to store data as efficiently as I can in Elasticsearch.
This can be achieved if I don't store anything at all in the source, only in doc values (in native format). But the infrastructure for this is missing, hence the issue.

BTW, just to show how much you can gain from leaving fields out of the source:

Stored type | Number of docs | Stored data in field | idx size (SRC1,DV0,IDX0) | idx size (SRC1,DV1,IDX0) | idx size (SRC0,DV1,IDX0)
----------- | ------------ | ------------ | ------------ | ------------ | ------------
binary | 1000 | 1024B random[1] | 1,392,330 | 2,419,600 | 1,042,213
long | 1000 | 1000 random longs[2] | 7,222,369 | 10,222,447 | 2,214,039

SRC0|1 means: exclude or include the test field in source
DV0|1 means: disable doc values on the field or not
IDX0|1 means: disable or enable indexing on field
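To make the three flags concrete, here is a minimal sketch of how such a mapping could be assembled. It assumes a `long` test field named `test_field` (a hypothetical name) and uses the standard `doc_values`, `index` and `_source.excludes` mapping options; where this body nests in the create-index request depends on the Elasticsearch version, and the exact mappings used for the measurements are not shown in this thread.

```python
def build_mapping(field, field_type="long", src=True, dv=True, idx=False):
    """Return a mapping body for one test field.

    src: keep the field in _source (SRC1) or exclude it (SRC0)
    dv:  enable (DV1) or disable (DV0) doc values on the field
    idx: enable (IDX1) or disable (IDX0) indexing on the field
    """
    mapping = {
        "properties": {
            field: {"type": field_type, "doc_values": dv, "index": idx}
        }
    }
    if not src:
        # Exclude only the test field from the stored _source.
        mapping["_source"] = {"excludes": [field]}
    return mapping


# The SRC0,DV1,IDX0 column of the table: doc values only.
print(build_mapping("test_field", src=False, dv=True, idx=False))
```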

As you can see, storing (and serving) data from doc value fields can yield quite significant gains.
And I guess it would be fairly low-hanging fruit to serve doc values the same way stored fields and the like are served today.

Index sizes above are after a flush and a max_num_segments=1 force merge command, reported by Elasticsearch in bytes.

BTW, it's nice to see how efficient Elasticsearch is at storing (binary) data this way. In the above example, storing 1 kB in each of 1000 docs (1,024,000 bytes net) produces a 1,042,213-byte index (according to Elasticsearch), which is only about 1.78% larger. Checking the index size on the file system gives 1,051,648 bytes (about 2.7% larger), which is still pretty nice.
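The arithmetic behind these figures can be checked with a short sketch. The numbers are copied from the table and the paragraph above; the helper name `percent_change` is only illustrative.

```python
# Index sizes in bytes, copied from the table above.
SIZES = {
    "binary": {"SRC1,DV0": 1_392_330, "SRC1,DV1": 2_419_600, "SRC0,DV1": 1_042_213},
    "long": {"SRC1,DV0": 7_222_369, "SRC1,DV1": 10_222_447, "SRC0,DV1": 2_214_039},
}

RAW_BINARY_BYTES = 1000 * 1024  # 1000 docs with 1024 bytes each


def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


for name, row in SIZES.items():
    # Doc values only versus source only.
    saving = -percent_change(row["SRC0,DV1"], row["SRC1,DV0"])
    print(f"{name}: doc-values-only index is {saving:.1f}% smaller than source-only")

overhead_reported = percent_change(1_042_213, RAW_BINARY_BYTES)
overhead_on_disk = percent_change(1_051_648, RAW_BINARY_BYTES)
print(f"overhead over raw data: {overhead_reported:.2f}% reported, {overhead_on_disk:.2f}% on disk")
```

With these inputs the doc-values-only index comes out roughly 25% smaller for the binary field and roughly 69% smaller for the long field.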

Discussed in FixIt Friday. The get API is very similar to a routed _search where the query is a simple term query on the _id field, so we discussed the option of using _search internally for get requests. This would make any search feature available in the get case, so docvalue_fields would be accessible in a get request.
The main issue we have right now with docvalue_fields is that it returns values as they are stored in doc values, and there is no way to change the format in the response. There is an open issue for this: https://github.com/elastic/elasticsearch/issues/26948
With that fixed, you could use the value returned in docvalue_fields directly as input to ingest another document.
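The equivalence described here can be sketched from the client side: a get for one document expressed as a routed search with a term query on _id, asking for doc values instead of the source. The helper name `get_as_search` is hypothetical, and this is not how Elasticsearch implements get internally; it only builds the request path and body.

```python
from urllib.parse import urlencode


def get_as_search(index, doc_id, routing=None, docvalue_fields=()):
    """Express a get-by-id as a routed _search request (path, body)."""
    path = f"/{index}/_search"
    if routing is not None:
        # Routing restricts the search to the shard that holds the document.
        path += "?" + urlencode({"routing": routing})
    body = {
        "size": 1,
        "query": {"term": {"_id": doc_id}},
        "_source": False,  # skip the JSON source entirely
        "docvalue_fields": list(docvalue_fields),
    }
    return path, body


path, body = get_as_search("twitter", "2", routing="user1",
                           docvalue_fields=["tags", "counter"])
print(path)
print(body)
```

One difference a client-side emulation cannot hide is that search only sees refreshed documents, whereas get is real-time.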

I guess "use _search internally for get" will preserve the current behaviour of get, so it will do a refresh and only if it's needed (there are inflight changes)?
If so, that would be great.
I'm currently fine with what _search returns in docvalue_fields. If these changes won't affect the efficiency of doc values, I'm fine with #26948 too. :)

BTW, I think the other issue doesn't really affect this if you convert get to _search. If that issue gets fixed, the fix will apply to get too.
So what's the conclusion, do you plan to do this?

BTW, I think the other issue doesn't really affect this if you convert get to _search. If that issue gets fixed, the fix will apply to get too.

Yes

So what's the conclusion, do you plan to do this?

We agreed that we should use _search internally for get requests.
This would save some lines of code and let you use the feature that you want.
I'll mark this issue as adoptme, so it's not a high priority, but we are interested in it ;).

I guess "use _search internally for get" will preserve the current behaviour of get, so it will do a refresh and only if it's needed (there are inflight changes)?
If so, that would be great.

And yes, it should preserve the real-time refresh behaviour.

I understand there is an open issue around formatting, but would it make sense to always store source as doc values given the space savings? If that is not feasible right now (other than formatting), would that be a good long-term goal? To be clear, I am talking about an internal change not an API change.

As a user with a ~1.5 TB index, I find the dramatic space savings shown in the table above very appealing.

@rpedela: doc values won't preserve the original structure. In the simplest case, if you store a list in _source, its order is preserved; in doc values it is not.
Not to mention more complicated structures.
So storing data in doc values is mainly useful if you have "flat" data, meaning pure text (keywords), numbers/bools, or unordered lists of them.
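A toy model can show why list order is lost. It assumes, as I understand Lucene's multi-valued doc values, that keyword values are kept as a sorted, de-duplicated set and numeric values as a sorted list; the function names are hypothetical and nothing here calls Elasticsearch.

```python
def keyword_doc_values(values):
    """Toy model of keyword doc values: sorted and de-duplicated."""
    return sorted(set(values))


def numeric_doc_values(values):
    """Toy model of numeric doc values: sorted, duplicates kept."""
    return sorted(values)


source_tags = ["zebra", "apple", "zebra", "mango"]
source_counts = [7, 3, 7, 1]

print(source_tags, "->", keyword_doc_values(source_tags))
print(source_counts, "->", numeric_doc_values(source_counts))
```

In both cases the original positions in the _source array cannot be recovered from the stored values.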

Would #26948 solve that problem?

would it make sense to always store source as doc values given the space savings?

No, this would be a bad trade-off. Doc values are columnar storage: fetching the values of X different fields requires performing a seek in each column, which is going to be slow. On the other hand, _source guarantees at most one disk seek, regardless of the number of fields to fetch.
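The trade-off can be put into a deliberately simplified cost model: one seek per requested field for columnar doc values, versus at most one seek for the row-oriented _source. The function names are hypothetical, and caching, compression and sequential reads are ignored.

```python
def seeks_from_doc_values(num_fields):
    """Columnar storage: one seek per column (field) that is fetched."""
    return num_fields


def seeks_from_source(num_fields):
    """Row storage: the whole document sits in one place, so at most one seek."""
    return 1 if num_fields > 0 else 0


for n in (1, 10, 100):
    print(f"{n:>3} fields: doc values {seeks_from_doc_values(n):>3} seeks, "
          f"_source {seeks_from_source(n)} seek")
```

Under this model the two approaches cost the same for a single field, and doc values fall behind linearly as more fields are fetched per document.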

I believe you already configured index.codec: best_compression?

ping @elastic/es-search-aggs. I think we may need to revisit this now that get can read docs from the translog again (what do you do then with a doc value fetch option?).

We discussed this internally and decided that we don't want to pursue this feature for the moment.
As @bleskes mentioned, the update API can read documents from the translog (https://github.com/elastic/elasticsearch/pull/29264), so doc value fields would not be available in this case. We also want our APIs to be consistent, so adding this feature to the get API only would introduce a discrepancy. For these reasons I am going to close this issue; we can revisit it in the future if updates are implemented differently.
