Describe the feature:
When you fire off a search request to Elasticsearch, you're stuck waiting until the result comes back. Normally, that's very, very fast. But occasionally an egregious search or dataset can take a while to get through, so we added the ability to kill such searches through the task manager. That's great, but it's difficult to use from the UI that executes the search.
Consider:
How does the UI match up the search that was executed with the list of tasks that are in the system? The UI could try to match the search descriptions in the task manager to the original request, but that matching would have to rely on heuristics, and those heuristics become very difficult given that Elasticsearch will have rewritten the query and that multiple searches could match the same heuristic.
It'd be nice if you could associate some ID with a request at search time and have that ID show up in the task manager. That way, when the UI executes the request, it could specify an ID it could reference later if it needs to kill the request.
I assume you are talking about something like generating a unique request ID along the lines of https://blog.ryandlane.com/2014/12/11/using-lua-in-nginx-for-unique-request-ids-and-millisecond-times-in-logs/. If so, I am hugely in favor of this idea, especially if the search ID were carried through to the slow query logs. If it were, that would be extremely helpful for the efforts around improving slow query logging (e.g. https://github.com/elastic/elasticsearch/issues/9172 and https://github.com/elastic/elasticsearch/issues/12187#issuecomment-249218547). It could also potentially lend itself to @PhaedrusTheGreek's idea of breaking down API response time (https://github.com/elastic/elasticsearch/issues/21073) or even logging it outside of just the Profile API.
We've talked about this on and off for a while.
If we do this I think it'd be easier if this were a thing for tasks in general rather than just searches. It might work like task status. It is a general thing but each request has to "opt in" to it. There should be a "standard" way to opt into it.
I think it'd be hard if we wanted to force these IDs to be unique because we don't have a good place for that.
I'm thinking of a task metadata URL parameter which could be searched using the list tasks API. Or something like that. @imotov, what do you think?
@nik9000 maybe we can somehow expose a whitelisted subset of headers from ThreadContext at the moment of the task creation. This way it would be possible to add stuff on the rest layer in a general way to all requests. Otherwise, each request you would want to "opt in" will have to add a place to "stash" the information you want to expose via task manager api.
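To make the whitelisting idea concrete, here is a minimal sketch, written in Python for brevity rather than in Elasticsearch's own Java. The function name and the header names are hypothetical; nothing here is an existing Elasticsearch API. The point is only that the task would copy the allowed headers once, at creation time, and ignore everything else.

```python
def whitelisted_headers(request_headers, whitelist):
    """Return only the headers whose names appear on the whitelist.

    Header names are compared case-insensitively. Both arguments are
    illustrative: a dict of request headers and an iterable of allowed names.
    """
    allowed = {name.lower() for name in whitelist}
    return {
        name: value
        for name, value in request_headers.items()
        if name.lower() in allowed
    }
```

With this shape, a sensitive header such as `Authorization` would never reach the task listing unless someone explicitly whitelisted it.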
Maybe! If we can get it at the rest layer that'd be cool.
This is the single most important feature for our environment. We have users that will run multiple searches in a row, and some are quite large. The ability to tag their searches and cancel specific searches prior to the latest that is still running would be an incredible benefit.
+1! Ability to abort specific prior searches on demand would be huge. I would gladly manage UUIDs on our end and pass them up to just be appended to the task at search time if that means I could do it through a single REST call. The inability to abort ES tasks has been a problem we have had since 0.90.
👍 This is critical for our application because we have long-running analytic reports that sometimes are canceled by users, but the ES cluster keeps going until it's done - and takes down the cluster in the process because users might immediately queue up different reports now that they've "canceled" the previous one.
Allowing us to assign an ID _(not necessarily a unique one)_ to a search and to cancel it when needed is crucial for heavy-usage scenarios. 💡 Not having it causes queuing and leads to search rejections; it literally ties our hands and becomes a bottleneck in our operation. 😞 Please make this issue a priority. Thanks in advance.
It would be really useful to have some sort of control over canceling queries for my use cases too. Thanks for considering it.
+1
@jrubensteinsp, @lusid, @daedalus28, @cilerler, @dshishkov, @Akrion it seems that you all work for the same company. We are trying to make sure that this feature covers a variety of use cases, and it would be helpful for us to understand whether you have multiple use cases for this feature at your company or whether all these comments are essentially about the same application. If you have multiple use cases, it would really help us if you could describe what they are and how they differ from each other.
@imotov you are right, we are all from the same company, but we are accessing Elastic from different applications and we realized that we are all suffering from the same issue: not having the capability to cancel a query.
Put simply, the implementation we would all like is assigning a key on our end _(no round trip)_ and being able to cancel related queries based on that key.
Thank you for your time and attention!
I would echo the need for being able to cancel long running queries. In MySQL land, a lot of times you'll have a daemon running pt-kill on the server https://www.percona.com/doc/percona-toolkit/2.1/pt-kill.html. One of the nicest features of pt-kill is it can kill queries that match a certain pattern while leaving others alone.
If you could associate a task ID (especially one you have some control over assigning, or at least prefixing) with a search, it would be very straightforward to write a similar tool for ES -- one that looks for long running queries and, assuming the IDs associated with them are identified as killable, kills them.
On hot-warm deployments of ES especially, this would be pretty helpful for us. Our warm nodes tend to have much more data on them than our hot ones, and Kibana queries against those warm nodes occasionally knock the warm nodes offline (which in turn puts pressure on the masters in the form of recovery tasks, which in turn slows the whole cluster down). If we could stalk the running search tasks and kill anything that's taking too long and doesn't have a prefix on its ID to mark it as non-killable, that'd really do a lot for our overall cluster stability. Of the last 10 times ES has required manual intervention, all of them had to do with inefficient queries running against our warms and knocking them offline. I suspect stalking and killing problem queries would have prevented all or most of those issues from happening.
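A rough sketch of the selection step such a pt-kill-style tool might use is below. It works on a parsed response from the list tasks API (nodes, then tasks keyed by `node:id`, with `cancellable` and `running_time_in_nanos` per task). The `client_id` field and the `keep-` prefix are assumptions standing in for the feature requested here; they do not exist today.

```python
def killable_tasks(tasks_response, max_seconds, protected_prefix="keep-"):
    """Pick task keys that have run too long and are not marked as protected.

    `client_id` is a hypothetical per-task field carrying the ID supplied at
    search time; `protected_prefix` is a made-up convention for "do not kill".
    """
    threshold_nanos = max_seconds * 1_000_000_000
    victims = []
    for node in tasks_response.get("nodes", {}).values():
        for task_key, task in node.get("tasks", {}).items():
            client_id = task.get("client_id") or ""
            if not task.get("cancellable", False):
                continue  # the task manager cannot cancel this one
            if client_id.startswith(protected_prefix):
                continue  # explicitly marked as non-killable
            if task.get("running_time_in_nanos", 0) > threshold_nanos:
                victims.append(task_key)
    return sorted(victims)
```

A daemon could run this against the task list every few seconds and cancel each returned task, leaving protected and short-lived searches alone.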
@imotov I'm not from that company but I also have a use case for this. :)
We are rendering images from Elasticsearch data and show them in the browser. Rendering an image, depending on the query parameters, can involve quite a few aggregations and it can take half a minute or so to gather the data.
When the user resizes the window or navigates to a different page, the browser cancels the HTTP request by closing the connection. We already propagate this through the load balancers to the render services, which will then cancel their work. However, currently it seems Elasticsearch will continue to work on those canceled queries.
It seems we could use the Task API to cancel the running search/aggregation, but only if we can find its corresponding task ID.
In our code we already know which request we just canceled. If we could add an ID (could be a UUID, or in our case we might simply use a sequence number because only one application is using the cluster) and search for it using the Task API, then we could cancel the search by making a Task API request and save valuable CPU seconds/IOPs.
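To illustrate, here is a minimal sketch of the lookup we have in mind. It assumes the list tasks response would carry our ID in a per-task `client_id` field (hypothetical, that is exactly the feature being requested) and turns each match into the path of the existing task cancel endpoint.

```python
def cancel_paths_for(tasks_response, client_id):
    """Build cancel endpoint paths for every task tagged with our ID.

    `client_id` is a hypothetical field holding the ID we sent with the
    search. IDs need not be unique, so several paths may come back.
    """
    paths = []
    for node in tasks_response.get("nodes", {}).values():
        for task_key, task in node.get("tasks", {}).items():
            if task.get("client_id") == client_id:
                paths.append(f"/_tasks/{task_key}/_cancel")
    return sorted(paths)
```

Our render service would then issue a POST to each returned path as soon as the browser drops the connection.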