Example:
Using Twitter as an example: each user is a document, and each tweet is a document nested under the user. For active users, a document can end up with thousands of tweets, so a single document can be a few megabytes in size.
{
  "userId": "1",
  "tweets": [
    {
      "id": 1,
      "message": "tweet 1"
    },
    {
      "id": 2,
      "message": "tweet 2"
    },
    ...
  ]
}
Use Case:
We want to find users that have used a specific hashtag in their tweets and view only those tweets. We use source filtering and nested inner hit queries to get back just the users and matching tweets.
Problem:
Even though we are using source filtering, Elasticsearch loads the entire document into memory before applying the source filter. Since each document is so large, at any real throughput we see constant garbage collection in the logs.
Feature Request:
Can you load filtered source in a more memory-efficient manner, one where the entire source does not have to be loaded into memory first?
If the issue is about not loading the _source into memory, then this is high-hanging fruit. However, you mentioned that the issue is mostly about garbage collection in your case, which I think we could improve by avoiding the map-of-maps intermediate representation, which I suspect is the source of all that garbage.
Also relates to #9034.
Yeah, it seems related to #9034. In this case, since the large items are under a single nested field, it would also require each nested item to be stored separately.
We discussed this offline in our "Fix-it Friday" meeting and agreed that we could still reduce the garbage collection by filtering the document source in a streaming fashion instead of with the current in-memory map implementation. We could reuse the same mechanism that response filtering uses for filter_path.
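For illustration, filter_path already filters REST responses in this fashion; a hypothetical request against the users index from the example above:

GET /users/_search?filter_path=hits.total,hits.hits._source.userId

Only the listed paths are kept in the response body; everything else is dropped as the response is written.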
I'll give it a try in the next few weeks and update this issue.
So I finally looked at this. I created https://github.com/tlrx/elasticsearch/commit/3362a50bc8c0280a39607d661a23d4b2cdaeb5f0, which uses a stream-based implementation of source filtering that replaces the current in-memory maps implementation. The results are similar to what I saw more than a year ago when I looked at this optimization, but at that time we didn't have Rally, so the tests were hard to reproduce.
tl;dr
Benchmarks show that both implementations have _almost_ the same performance, because most of the time is spent loading and parsing the source, and these steps are always executed regardless of how the filtering is done. Differences appear only for edge cases like the one described in this issue (i.e. a document with a lot of fields where most of them are filtered out). The stream-based implementation puts less pressure on memory since it creates far fewer objects, so I think it is a good long-term solution. Sadly, the filtering methods do not behave exactly the same, which makes the change non-trivial, and other features like inner hits, highlighting, or scripted fields require the source to be parsed as a map anyway. So I think we should investigate #9034 instead of optimizing for edge cases like this issue.
A new XContentHelper.filter(BytesReference, XContentType, String[], String[]) has been added in https://github.com/tlrx/elasticsearch/tree/use-streamed-based-source-filtering. It uses Jackson's streaming filtering under the hood, and the implementation is quite straightforward. This method is used in the FetchSourceSubPhase to filter the source.
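To make the behaviour concrete, here is a hypothetical example of what the filter computes, reusing the document from this issue and includes = ["userId"]:

Input source:
{
  "userId": "1",
  "tweets": [ ... ]
}

Filtered source:
{
  "userId": "1"
}

The intended result is the same as with the map-based implementation; the difference is that the stream-based filter skips the excluded tokens while parsing, instead of first building the whole document as maps and lists and then pruning them.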
The implementation was tested with Rally, with JFR telemetry and memory profiling enabled on our default benchmarks. Note that the JFR options have been changed to use a custom profile, and I created a new challenge containing only search operations with source filtering.
It has been tested with multiple benchmarks, but pmc gave the most telling results because it contains a large body field that can be filtered out:
Rally results for map-based filtering indicate a median throughput of 187.097 ops/s and a 99th percentile latency of 470.179 ms, compared to 195.384 ops/s and 196.684 ms for streaming-based filtering.
Memory overview
Looking at the JFR records is interesting and shows less memory usage with streaming-based filtering:
[JFR memory screenshots: map-based filtering vs. streaming-based filtering]
Garbage collections
And fewer GCs with streaming-based filtering, which is expected.
[JFR garbage collection screenshots: map-based filtering vs. streaming-based filtering]
Allocations
The allocation statistics also show far fewer allocations for streaming-based filtering (4,949 allocations for 3.5 GB in TLAB, 7,567 for 501 MB outside TLAB) compared to map-based filtering (9,676 allocations for 6.5 GB in TLAB, 32,428 for 2.96 GB outside TLAB).
[JFR allocation screenshots: map-based filtering vs. streaming-based filtering]
Other considerations
While investigating the change I noticed that our filtering methods do not behave the same, so I created #25491 so that all of them share the same set of tests. But there are still some differences: map-based filtering prints out empty objects (#4715) while the streaming-based implementation excludes them. Also, map-based filtering handles dots in field names as sub-objects (#20736), and the streaming-based implementation does not work exactly like this and would require some non-trivial changes.
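A hypothetical illustration of the empty-object difference: given a source like

{
  "user": {
    "name": "kim",
    "address": {
      "city": "Oslo"
    }
  }
}

filtered with excludes = ["user.address.*"], the map-based filter keeps an empty "address": {} object in the result, while the streaming-based implementation drops "address" entirely.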
Also, some features require the source to be parsed as a map in order to work (like highlighting or scripted fields). If combined with source filtering, we don't want to parse the source once as raw bytes for source filtering and a second time as a map for highlighting. Changing the way this works is not easy, and I think we should instead investigate other solutions like #9034 rather than optimizing source filtering for edge cases like the one described in this issue.
I'd be happy to hear any thoughts or comments on this! I might have missed something...
It is a pity that we managed to come up with different semantics for filtering values in a document. I'd be keen to switch to stream-based filtering even if that implies minor backwards-compatibility breaks.
I'd be keen to switch to stream-based filtering even if that implies minor backwards-compatibility breaks.
++ Is there a chance we can stick with object-based parsing, based on the index-created version or some setting, and remove it in 7.0?
Any updates on whether this may be included in 7.0?
Pinging @elastic/es-search-aggs
@osman No updates for now.
Pinging @elastic/es-core-infra (:Core/Infra/Scripting)
We use source filtering and nested inner hit queries to get back just the users and matching tweets.
I noticed that when using source filtering in inner_hits, we were reloading and reparsing the _source for each nested document. So we recently merged https://github.com/elastic/elasticsearch/pull/60494 to load and parse the _source only once per root document. This doesn't address the memory consumption of source filtering itself, but it could help here (if I'm understanding the use case right).