The current `find` API on the `SavedObjectsRepository` cannot page through large data sets due to the `index.max_result_window` setting in Elasticsearch, which defaults to 10,000 objects. This is starting to limit what plugins can build, and we now have several saved object types that may accumulate large numbers of objects (SIEM's exception lists come to mind).
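For illustration, here is a minimal sketch of the failure, assuming a v8-style `@elastic/elasticsearch` client and a hypothetical type filter; any request where `from + size` exceeds the window is rejected outright:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function fetchDeepPage() {
  try {
    // from + size = 10_100 > index.max_result_window (10_000) → rejected
    await client.search({
      index: '.kibana',
      from: 10_000,
      size: 100,
      query: { term: { type: 'exception-list' } }, // hypothetical type filter
    });
  } catch (e) {
    // Elasticsearch answers with: "Result window is too large, from + size
    // must be less than or equal to: [10000] ..."
    console.error(e);
  }
}
```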
To alleviate this, we could add scrolling support to SavedObjects; however, there is one significant caveat: scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default. Clients would need to be aware of this and handle it properly, and the failure mode may not be easy to spot during development.
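As a rough sketch of what scrolling looks like (same client assumptions as above), note how every call has to renew the TTL; a client that pauses longer than the `scroll` duration between calls loses its cursor:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Each request keeps the search context alive for the duration passed in
// `scroll`; waiting longer than that between calls invalidates the cursor.
async function* scrollAll(index: string) {
  let res = await client.search({ index, scroll: '1m', size: 1000, query: { match_all: {} } });
  while (res.hits.hits.length > 0) {
    yield res.hits.hits;
    res = await client.scroll({ scroll_id: res._scroll_id!, scroll: '1m' });
  }
  await client.clearScroll({ scroll_id: res._scroll_id! });
}
```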
Another option could be the `_async_search` APIs, but those are not available in OSS distributions.
This issue definitely needs further investigation, but I wanted to open it to start collecting use cases where it would be useful.
related: https://github.com/elastic/kibana/issues/22636
https://github.com/elastic/kibana/issues/64715
Pinging @elastic/kibana-platform (Team:Platform)
Thanks for filing this issue, @joshdover.
I've just encountered this limitation in the scope of https://github.com/elastic/kibana/pull/72420 as well. In a nutshell, when an admin wants to bulk-rotate the SO encryption key, we need to fetch and process all SOs (ideally in batches with a configurable size to balance the load), from all spaces, for all SO types that may have encrypted attributes. And these days we may have quite a lot of them (alerts, actions, Fleet-related SOs).
The fact that we also update _some_ of the fetched results makes paging with only `perPage`/`page` even more complex, even when we have fewer than 10k SOs.
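To make the pitfall concrete, here is a hedged sketch (the `needs_rotation` attribute and KQL filter are hypothetical, not the actual encryption metadata): once processed objects drop out of the query, advancing `page` would skip every other batch, so the loop re-fetches page 1 until nothing matches.

```ts
import type { SavedObjectsClientContract } from 'src/core/server';

async function rotateAll(soClient: SavedObjectsClientContract, type: string) {
  while (true) {
    const { saved_objects: batch } = await soClient.find<Record<string, unknown>>({
      type,
      page: 1, // always page 1: processed objects no longer match the filter
      perPage: 100,
      filter: `${type}.attributes.needs_rotation: true`, // hypothetical attribute
    });
    if (batch.length === 0) break;
    for (const so of batch) {
      // re-encrypt, then flip the hypothetical flag so the object drops out
      await soClient.update(type, so.id, { ...so.attributes, needs_rotation: false });
    }
  }
}
```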
Do you happen to have any recommended workarounds for SO use cases like this? If not, is there anything we can help with to boost priority for this enhancement?
cc @elastic/kibana-security
> To alleviate this, we could add scrolling support to SavedObjects, however, there is one significant caveat: scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default.
This is obviously a very stupid option in terms of memory usage, but still asking: could using the `scroll` API internally in the SO repository be an option? When `page + perPage > index.max_result_window`, we could use the scroll API under the hood, fetch everything, and return the aggregated results?
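A sketch of what that fallback might look like (v8-style client assumed; `findDeep` is a hypothetical name). The memory cost grows with `from + perPage`, which is exactly the caveat:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function findDeep(index: string, page: number, perPage: number) {
  const from = (page - 1) * perPage;
  const collected: unknown[] = [];
  let res = await client.search({ index, scroll: '30s', size: 1000, query: { match_all: {} } });
  // Scroll until we have enough hits to cover the requested page, then slice.
  while (res.hits.hits.length > 0 && collected.length < from + perPage) {
    collected.push(...res.hits.hits);
    res = await client.scroll({ scroll_id: res._scroll_id!, scroll: '30s' });
  }
  await client.clearScroll({ scroll_id: res._scroll_id! });
  return collected.slice(from, from + perPage);
}
```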
> scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default
This is configurable via the `scroll` option though, so if we expose a new `scroll` API on the SOR/SOC, we could just expose it. But yeah, the TTL handling would have to be done by the consumer anyway; there is not much we could do when `_search/scroll` is called with an expired TTL.
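For reference, the consumer-side failure mode looks roughly like this (v8-style client assumed): an expired context comes back as a 404, and the only recovery is restarting the scroll from scratch:

```ts
import { Client, errors } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function nextBatch(scrollId: string) {
  try {
    return await client.scroll({ scroll_id: scrollId, scroll: '1m' });
  } catch (e) {
    if (e instanceof errors.ResponseError && e.statusCode === 404) {
      // Search context expired; the caller has to re-issue the initial search.
      throw new Error('scroll cursor expired, restart the scroll');
    }
    throw e;
  }
}
```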
Hi Josh, as we discussed last week, the current limitation impacts the scalability of the Fleet effort. Every agent that connects to Fleet is stored as a saved object that can be managed in the UI. The limitation is currently not too bad for us: we are still actively working on improving performance to handle a large number of agents, so the number of users who will hit this limit is small. But we will soon want to get to a point where we can handle >10k agents smoothly, and the number of large-scale users will then increase. #78520 describes how the current SO client limits us and our current UI workaround.
cc @ph for awareness & prioritization
In the 7.12 release, the team is going to investigate the basic architecture.
@pgayvallet, we did try to use the `scroll` API of Elasticsearch at the beginning of SIEM, but we had a problem with this approach because at the time no EUI table worked with the scroll API. Of course, we thought about using simple `< >` pagination, but user feedback showed they did not like it: it did not give them the feeling that we had their data in hand. So we refactored our query to use plain `from` and `size`, and we remove the last page when there are more than 10,000 rows; by doing that we were able to get back on our feet.
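A small sketch of that last-page capping logic (names are illustrative, not the actual SIEM code):

```ts
const MAX_RESULT_WINDOW = 10_000;

// Cap the page count so `from + size` never crosses the result window;
// rows past the cap are simply not reachable from the table.
function pageCount(total: number, perPage: number): number {
  if (total <= MAX_RESULT_WINDOW) return Math.ceil(total / perPage);
  // more than 10k rows: drop the last page so it stays in-window
  return Math.floor(MAX_RESULT_WINDOW / perPage);
}
```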
I am sharing this because using the `scroll` API would break most of our tables: if you click on page 3, you only know the cursor of page 1, not the cursor of page 3, or you would have to run three queries to get the cursor of page 3. Anyway, I would love to hear about your approach here, since I think every solution is dealing with the same kind of problem.
I've been looking at `async_search`, `scroll` and `search_after`, and I came to the conclusion that none of these options would totally address the problem we are facing here.
For scenarios where we are doing bulk processing of a very large number of objects on the server side, all these solutions would work, as they would all allow us to 'scroll' through all the results of a query that exceeds the `index.max_result_window` value.
However, as @XavierM mentioned in his comment, one of the most common scenarios where we face this limitation is when displaying saved objects in a paginated table in the UI.
To demonstrate, take the saved objects management table, where we display all the visible saved objects, as an example:
In this table, we paginate the results in pages of, say, 100 items. The pagination buttons allow the user to navigate forward, back, to the first or to the last page. Currently, when accessing the page `PAGE` that displays `PER_PAGE` results, we call `_search` with `from: PER_PAGE * (PAGE - 1), size: PER_PAGE`. These parameters are deterministic and can be computed independently of the page the user is currently on.
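In other words, the mapping from a page to its search parameters is a pure function, so any page is addressable directly:

```ts
// Deterministic page → from/size mapping: no cursor from a previous
// request is needed to jump to an arbitrary page.
function toSearchParams(page: number, perPage: number) {
  return { from: perPage * (page - 1), size: perPage };
}

toSearchParams(3, 100); // → { from: 200, size: 100 }
```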
As the user is able to navigate from any page to any other page, backward or forward, none of the suggested solutions addresses this `index.max_result_window` limitation:
- `async_search` would just allow us to retrieve partial results while the search is running (which achieves nothing here), or to return the full list of results. This full list is of little help for displaying a specific page, and would force us to 'cache' the full result list to perform pagination on our side, either on the client (not really an option) or on the server (which introduces quite a lot of complexity regarding cache invalidation / entry removal on large cached data).
- `search_after` only allows us to fetch the results following the last performed request, meaning that when we are displaying page `X`, we can only display page `X+1` next. This doesn't answer the use case of navigating from any arbitrary page to another one (see the sketch after this list).
- `scroll` is even worse in this case: as the search context has a TTL, it just doesn't work with user-initiated requests. It's even stated in the official ES doc: "The Scroll api is recommended for efficient deep scrolling but scroll contexts are costly and it is not recommended to use it for real time user requests. The search_after parameter circumvents this problem by providing a live cursor."
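To illustrate the `search_after` point (v8-style client; `tie_breaker_id` is a hypothetical unique field): each request must carry the sort values of the previous page's last hit, so pages can only be reached in order:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function nextPage(index: string, afterSort?: Array<string | number>) {
  return client.search({
    index,
    size: 100,
    // the tiebreaker keeps the ordering total so search_after is unambiguous
    sort: [{ updated_at: 'desc' }, { tie_breaker_id: 'asc' }],
    ...(afterSort ? { search_after: afterSort } : {}),
  });
}

// const p1 = await nextPage('.kibana');
// const last = p1.hits.hits[p1.hits.hits.length - 1];
// const p2 = await nextPage('.kibana', last.sort); // page 2 requires page 1's cursor
```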
Which is why I'm wondering: as our indices are now system indices, could we just ask the ES team to change the default value of `index.max_result_window` on these system indices to a higher value? I mean, it wouldn't solve the problem per se, but by setting this value to 100k by default instead of 10k, we could probably work around the limit for most, if not all, of our (current) volumetry without any change to the codebase. Of course, we should also confirm with ES that this would be alright performance-wise.
> As our indices are now system indices, could we just ask the ES team to change the default value of `index.max_result_window` on these system indices to a higher value?
I don't think we'd need to wait for system indices for this? We should be able to set this setting directly on the index during the migration process.
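Changing the setting ourselves would be a one-liner against the index (v8-style client shown; the exact value would still need ES sign-off):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Raise the result window on the saved objects index, e.g. during a migration.
async function raiseResultWindow() {
  await client.indices.putSettings({
    index: '.kibana',
    settings: { 'index.max_result_window': 100_000 },
  });
}
```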
> Of course, we should also confirm with ES that this would be alright performance-wise.
This is my primary question. I'm curious if the Elasticsearch performance issues scale with the number of documents or the size (as in bytes) of the results. Since we're primarily paginating large numbers of really small documents, I'm hoping it's the latter.
Would anyone from @elastic/es-perf be able to shed light on this? Specifically, what is the reason for the `index.max_result_window` setting defaulting to 10k, and what types of problems typically surface when increasing this limit? We're trying to paginate through 100k+ small documents in the UI, and increasing this limit on the `.kibana` index would be the 'easiest' solution from our perspective.
I think changing `index.max_result_window` might be sufficient for saved objects management because it's relatively rare that users use the UI or export objects. But the performance penalty is probably too high for regular searches from plugins that need to page through more than 10k results. So I think this will eventually bite us.
I think we'll have to do something similar to what @XavierM mentioned, where the UI works around the problem. I'm not sure what page size we currently use, but more than 100 results probably don't fit on a screen, which means 10k results is at least 100 pages. I don't think users will ever need more pages than that; they should rather narrow down their search. So the UI could use `from` and `size` and display a message like "Your search results were too large to display, only showing the first 10 000 results". We could also use a pattern similar to Gmail's "Select all conversations that match this search" to allow users to export all the saved objects that match a search even if there are more than 10k results. That would require the export API to accept a query (and maybe a KQL filter too).
We would have to add a tiebreaker field to all saved objects to allow the find API to support > 10k search results using `search_after`.
I overall agree that implementing `search_after` for the SO `find` API seems the most straightforward.
> We would have to add a tiebreaker field to all saved objects
The good old `{type}|{id}` is already used in a lot of places, at least on the client side.
Note that one notable constraint/limitation of such an approach is that we would only be allowed to sort by this `tie_breaker_id` in the successive `_find` requests, as `search_after` requires the value of ALL the sorting fields. So `SavedObjectsFindOptions.sortField` would be blanked in that case. This seems acceptable though.
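A sketch of what a `search_after`-driven `_find` query could look like under that constraint (field and function names are illustrative): the sort is pinned to the tiebreaker, and each request carries the previous page's last value:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// `tie_breaker_id` would hold the unique `{type}|{id}` value per document.
async function findAfter(index: string, after?: string) {
  return client.search({
    index,
    size: 100,
    sort: [{ tie_breaker_id: 'asc' }], // sortField is pinned to the tiebreaker
    ...(after ? { search_after: [after] } : {}),
  });
}
```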
Also, this would require migrating all SOs during a migration to populate this field, which is kind of unsupported at the moment (even if adding 'internal' migrations to core that impact all types should be doable).