Updated the description to something more explicit of what is needed from the task manager.

mikecote on 18 Sep 2019

Pinging @elastic/kibana-stack-services (Team:Stack Services)

elasticmachine on 9 Oct 2019

Making this a Discuss Issue (for a while at least) as I feel this requires some debate.

Current State

So currently, if I'm understanding thing correctly, when an Alert is created/enabled we schedule a task without specifying a runAt field or interval field to Task Manager.
What this means is the the Alert is always run _"immediately"_ after creation/enablement.

As part of the implementation of Alerting we provide a TaskTypeDefinition for TaskManager which gets called whenever an Alert needs to be evaluated.
In the TaskTypeDefinition we use the ability of tasks to override the runAt field as part of their run to use the interval (which is specified on the Alert) to calculate when the next run should take place.

The issue we're hitting is that as the Alert is run immediately after creation/enablement, resulting in the next runAt being calculated based on the interval as it was at the that time, it doesn't get a chance to recalculate the runAt until it gets run again. This means that if we update the alert at any time, including changing the interval, the changes won't take affect until it is rerun by Task Manager at some later point.

Issues To Think About

There are a couple of things that jump out at me here

This means that the use of interval in Alerting is different than the use of interval in regular tasks in Task Manager, but I can't see a good reason for this distinction. It means we have two places to maintain and keep in mind when thinking about intervals, and also means behaviour can diverge over time. For example, if we implement updating as part of this issue, then the semantics between updating a task vs. an alert could easily diverge.
In fact, as the interval is stored differently in alerts, an Alerting task doesn't actually appear to have an interval at all from Task Manager's perspective, which harms maintainability already.
In relation to the semantic around updating an interval, there is a nuance we need to discuss: if an interval on a task/alert is updated, _do we recalculate the runAt as (scheduledAt / lastRun) + interval or updatedTime + interval? My instinct is that the interval should be calculated relative to the last run (or in lieu of a previous run, the scheduledAt time), but want o hear thoughts.

Suggested Plan Of Action

Even though it isn't actually addressing the cause for this Issue, we begin by adding a reschedule api in Task Manager which is used to reschedule an already scheduled task, which will support updating the interval and will calculate the correct runAt based on the semantics that we'll agree after discussing _issue 2_ above.
Then we move Alerting to use Task Manager's built in intervals (as per https://github.com/elastic/kibana/issues/46001) as this will reduce the dangers in _issue 1_ above.
Change Alerting's update api to use Task Manager's new reschedule api.

Note: Steps 2 & 3 will likely be done together in a single PR, as separating them might break existing update functionality in Alerting... but not 100% yet.

Any thoughts?

gmmorris on 14 Nov 2019

Another issue to think about that we haven't tackled yet: TM should really support cron type scheduling as well as the somewhat vague "interval" it already supports. Watcher itself is kinda uber-flexible here, I think I'd be happy with just an interval or cron specification allowed, for now - https://www.elastic.co/guide/en/elasticsearch/reference/current/trigger-schedule.html .

So, just something to keep in the back of our heads, nothing to do about it now.

pmuellr on 14 Nov 2019

👍1

Then we move Alerting to use Task Manager's built in intervals (as per #46001) as this will reduce the dangers in issue 1 above.

It could be that this was done to handle the case where an action actually knows WHEN it should retry (eg, Slack 429 rate limiting responses, that indicate the time when you can try again). Since this already doesn't work quite right with our current slack action, I don't think this should hold us back from implementing as you suggest (have alerting use TM scheduling). Oh, actually that would be in actions, not alerts, but maybe actions are also doing their own scheduling? Will need to look closer at that, I think.

In any case, TM clearly has to deal with scheduling, so as much as possible it's probably best to have it be the source of truth and single way to deal with it.

pmuellr on 14 Nov 2019

Not clear why we need a separate reschedule API, since presumably an update can also essentially do a reschedule. I'm always a little leery of having "more than one way to do it". :-)

pmuellr on 14 Nov 2019

Not clear why we need a separate reschedule API, since presumably an update can also essentially do a reschedule. I'm always a little leery of having "more than one way to do it". :-)

This isn't quite in addition, as Task Manager doesn't have an update api at the moment, only a schedule.

That said, the Store has schedule and update and I'm now considering the addition of reschedule because they are actually quite different.

schedule - task a TaskTypeDefinition, instanciate a Task and schedule it to run immediately (or in the future if a runAt is specified).
reschedule - take an existing task and recalculate its existing runAt field, taking into account its current state, previous interval if it has one. Ideally we'll end up with a normal Task reshceduling using this API, instead of it happening inside of TaskRunner.
update - Take an existing Task instance and update whichever fields I tell you to, no questions asked. Very much an internal thing used to manage state.

gmmorris on 14 Nov 2019

Oh, actually that would be in actions, not alerts, but maybe actions are also doing their own scheduling? Will need to look closer at that, I think.

I think Actions return a boolean retry which then Alerting acts upon.

gmmorris on 14 Nov 2019

So, playing around with the code today I came to the conclusion that there's actually a difference between update and reschedule in the TaskStore:

update is a naive update as it takes whatever attributes you give it, such as runAt etc. and it used by the TaskManager, TaskRunner etc. while manipulating an existing Task
reschedule on the other hand takes whatever attributes of a Task and updates the existing task with these fields (like update), but it also needs to apply some scheduling semantics, such as resetting the runAt, and presumably wiping some scheduling state, such as retryAt and status fields. So, it is actually a mix between schedule and update, so reschedule does feel right to me.

Any thoughts @pmuellr ?

There is another, related, question that comes to mind- should rescheduling update the scheduledAt field? I think @mikecote will have thoughts on this...

gmmorris on 14 Nov 2019

argh, just realised the unit tests validating dates in TaskStore are unreliable as we mock the time and so actually ignore the time returned by the repository… if the wrong date is present - we reset it to the mock, so it doesn’t catch breakages. :grimacing:
I don’t think anything is broken, but mocking time is dangerous… I’ll look into cleaning that up while I’m in there.

gmmorris on 15 Nov 2019

argh, just realised the unit tests validating dates in TaskStore are unreliable as we mock the time and so actually ignore the time returned by the repository… if the wrong date is present - we reset it to the mock, so it doesn’t catch breakages. 😬
I don’t think anything is broken, but mocking time is dangerous… I’ll look into cleaning that up while I’m in there.

yup, I can confirm nothing is broken, it just obfuscated a potential bug. This is now fixed on the branch for this issue.

gmmorris on 15 Nov 2019

Here's another issue to iron out:

When we _schedule_ a Task with an interval we run the task and the interval comes into effect when the scheduling the next run.
In the case where we're using the reschedule api with a new interval, the default behaviour you might expect is for the rescheduled task to be run immediately (like when calling schedule) and then for the new interval to apply to the next scheduled run.
My instinct, though, is not to run the task immediately, but rather just adjust the interval based on when the task was last ran - wanted to verify we agree on this behaviour.

gmmorris on 15 Nov 2019

Further investigation has unearthed a few more moving parts to take into account:

This work is essentially introducing a second place where a task get rescheduled, as this also happens as part of the TaskRunner's work when a recurring task completes. I'll see if we can merge these into a single place for the logic to reside, which will help make TaskRunner smaller, which was already a concern I had as it does many different things.
There are several states a task might be in when reschedule is called which all need to be addressed:
A non recurring Task whose RunAt is in the future (we don't currently create these as far as I'm aware, but the API allows it)
A recurring task which has been scheduled to run in the future at some point
A task (recurring or otherwise) which is mid run and we need to decide whether to reschedule in such a state of flux
A task (recurring or otherwise) which has failed and is . about to be retried in the near future, does rescheduling reset the runAt same as it would with an idle task?

Still contemplating these cases and the repercussion of rescheduling them.

It turns out we don't know when a recurring task was last run , only when it was scheduled (startedAt gets wiped when a task completes and gets rescheduled) so calculating the runAt when an interval is changed is tricky as scheduledAt + interval is very likely to be in the past, resulting in an immediate run of the task. I'm looking into using the existing interval to subtract from the current runAt to get the previous run time and then add the new interval to it, but that makes out scheduling logic even more messy, so I'm trying to simplify the moving parts while I'm in there.

gmmorris on 15 Nov 2019

The issue we're hitting is that as the Alert is run immediately after creation/enablement

By design, I believe we're comfortable with this feature. Though it doesn't get a chance to recalculate the runAt until it gets run again is definitely part of the problem. A user can update an alert anytime (not just after initially creating the alert) and the issue outlined exists.

This means that the use of interval in Alerting is different than the use of interval in regular tasks in Task Manager

I believe they behave the same? but are coded differently. We changed the task manager's interval feature to match what we're doing in Alerting and what we think tasks should do in general.

In relation to the semantic around updating an interval, there is a nuance we need to discuss: if an interval on a task/alert is updated, _do we recalculate the runAt as (scheduledAt / lastRun) + interval or updatedTime + interval?

I think now + interval would be the easiest unless we feel we should go down the path of calculating the last time the task ran by doing (runAt - previous interval) + new interval

@peterschretlen any thoughts on how we expect alerts to re-schedule when changing the interval?

Change Alerting's update api to use Task Manager's new reschedule api.

I'm not sure we would need a reschedule API for alerting. We would need to update the task's interval parameter which I'm assuming would cause a re-calculation of runAt. This could bring a few more complications like "how do I know if the interval changed" and "did I previously have an interval" but I think making the update API a bit smarter would be my pick.

My instinct, though, is not to run the task immediately, but rather just adjust the interval based on when the task was last ran - wanted to verify we agree on this behaviour.

++ for the reschedule API if we develop it.

There are several states a task might be in when reschedule is called which all need to be addressed

Yeah, I think we'll have to add some logic to the update API.

@pmuellr

Another issue to think about that we haven't tackled yet: TM should really support cron type scheduling as well as the somewhat vague "interval" it already supports.

Agreed, I found this issue https://github.com/elastic/kibana/issues/50272

mikecote on 18 Nov 2019

I believe they behave the same? but are coded differently. We changed the task manager's interval feature to match what we're doing in Alerting and what we think tasks should do in general.

They're mildly different actually, but not for any reason I can identify.
Alerting defaults to Date.now() if the startedAt + interval are in the past (TM doesn't do this), and the interval is stored in params instead of the interval field.
In anycase - the logic is duplicated between the two, and mildly different - merging these seems the right direction.

I think now + interval would be the easiest unless we feel we should go down the path of calculating the last time the task ran by doing (runAt - previous interval) + new interval

I've gone with (runAt - previous interval) + new interval, it wasn't that messy, and it feels more _correct_, but can easily strip this out.

I'm not sure we would need a reschedule API for alerting. We would need to update the task's interval parameter which I'm assuming would cause a re-calculation of runAt. This could bring a few more complications like "how do I know if the interval changed" and "did I previously have an interval" but I think making the update API a bit smarter would be my pick.

TaskManager doesn't currently have an update api, only a schedule api, so either way we're exposing a new api.
I don't want to expose the update api as it feels internal and naive and lets you change lots of internal fields (anything on ConcreteTaskInstance that I don't think we should let any PlugIn update blindly as it can easily cause a broken state in a task...

Instead I prefer to expose a reschedule api that only lets you change fields that were available in the original schedule.

There are several states a task might be in when reschedule is called which all need to be addressed

Yeah, I think we'll have to add some logic to the update API.

Hence, better to separate update from reschedule by keeping update internal, and exposing a new api.

gmmorris on 18 Nov 2019

I'm working on a draft of this over here: https://github.com/elastic/kibana/pull/50718

Some of the decisions I've taken in this PR (all open to debate and change obviously, this is just as I iron out the moving parts):

Keeping update as an internal API, rather than an open one, and instead expose a reschedule api that takes the same potential arguments as schedule, but none are required except for id which we use to find the current values of the unspecified fields.
If an interval is updated, we reset the existing runAt and recalculate it. If an interval was already set we use it to calculate the runAt by calculating the previous run and adding the new interval to it. If the task didn't have an interval before, we calculate the new runAt as scheduledAt + interval which _might_ be in the past at this point, but this state should only happen if the task was originally scheduled with a custom runAt and no interval, and I'm not aware of any situation where we do that ATM.
3.I currently only allow updating an idle task, until we discuss how we'd like to tackle the case of trying to reschedule a running or failed task.

I'd love to hear early thoughts on this branch, but definitely treat this as a DRAFT, no need to go into well thought out reviewing as I'll likely change things around as I try to reduce duplication of logic. :)

gmmorris on 18 Nov 2019

@peterschretlen any thoughts on how we expect alerts to re-schedule when changing the interval?

I think either is fine but with (runAt - previous interval) + new interval, what happens when your new interval is shorter than your old interval - you end up with a runAt time in the past, what happens to the task in that case? Does it run right away?

Say it is 5 pm right now. I update my task which has: runAt is 6 pm, previous interval is 6 hours and new interval is 1 minute then won't my new runAt be in the past - 12:01pm / 4h 59min ago? What happens to the task when it is scheduled in the past? Maybe now + interval is simpler...

In general I think interval on a task is problematic because we have no fixed reference point and leads to problems like this - maybe we should start thinking about #50272 and schedules? The draft PR looks good, but the more we build around the idea of tasks having an interval (and others build on top of it) the harder it will be to migrate to more explicit scheduling.

peterschretlen on 18 Nov 2019

what happens when your new interval is shorter than your old interval

In this scenario, I was thinking we would schedule the task to run now instead.

In general I think interval on a task is problematic because we have no fixed reference point and leads to problems like this

We are at a good stage right now to prevent this technical debt on the interval feature. I don't think anyone uses the feature at this time (including alerting) and we could remove the the feature entirely.

To solve the alerting issue, we can either work on a solution until #50272 is implemented or focus on implementing cron like syntax right now. For guidance on this, I am not sure of the efforts it would take to implement #50272 but we can easily work on an alternative solution in the meantime.

mikecote on 18 Nov 2019

In general we use retryAt and runAt as our sorting order, so if an interval adjustment makes your task 4 hours in the past, it'll be picked up before a task of now.

We could set now as the lower bound 🤷‍♂️to be honest I'm not sure what would be _correct_ here.... so good either way.

I haven't taken a close enough look at #50272 yet, so unsure about the work needed... are we sure no one is using intervals as they are? 🤔

gmmorris on 19 Nov 2019

Had a quick catchup with @mikecote & @peterschretlen and we decided the following:

We'll move ahead with the interval solution in order to resolve the problem Alerting currently experiences when intervals are updated.
We'll prioritise the #50272 as next after the New Platform migration of Task Manager, as we feel it's high enough priority relative to other things.
In order to keep things simpler in Task Manager, reschedule will work like schedule in that calling it immediately runs the task and when it completes it will recalculate the runAt using the new interval.
The status of the task shouldn't affect the ability to reschedule so calling the api when a task is running or failed should be allowed.

A follow up conversation with @mikecote suggested we weren't sure whether schedule should support all fields available in schedule or just the fields needed to actually cause a new schedule to be set.
As neither of us felt too strongly either way, I've decided to limit the fields to runAt and interval at the moment, as adding the other fields later is easy, but reducing fields later will be harder if it gets used.

gmmorris on 19 Nov 2019

👍1

Sorry for the late reply - I didn't realize we didn't have a public update() method, and I was worried about having two ways of updating the schedule, given the new reschedule() api. With no public update() method, I'm obviously happy with having a new reschedule() api :-).

I guess if we DO add an update() API of some sort, we'll need to figure out what the semantics of that are - does it automagically reschedule, or just update relevant things without neccessarily rescheduling, or maybe takes an additional option indicating whether rescheduling should happen. Kinda thing.

pmuellr on 20 Nov 2019

Haha no worries, yeah, my thoughts exactly but hopefully we won't need an
additional API. 🤞

On Wed, Nov 20, 2019, 14:57 Patrick Mueller notifications@github.com
wrote:

Sorry for the late reply - I didn't realize we didn't have a public
update() method, and I was worried about having two ways of updating the
schedule, given the new reschedule() api. With no public update() method,
I'm obviously happy with having a new reschedule() api :-).

I guess if we DO add an update() API of some sort, we'll need to figure
out what the semantics of that are - does it automagically reschedule, or
just update relevant things without neccessarily rescheduling, or maybe
takes an additional option indicating whether rescheduling should happen.
Kinda thing.

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/elastic/kibana/issues/45152?email_source=notifications&email_token=AAC6JIBZIGA66JBURKFXMKLQUVF4JA5CNFSM4IU4SPLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEESHYVI#issuecomment-556039253,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAC6JIC2WXHJYQZSJURMO5TQUVF4JANCNFSM4IU4SPLA
.

gmmorris on 20 Nov 2019

Going back on forth on this with @mikecote I've come to the conclusion that this current model is a little dangerous, especially with the expected growth of the use of Task Manager in Alerting.
The current model relies heavily on the versioning of documents and naively tries to update documents in the hope it hasn't been touched.

This is due to the fact that we use a single document to manage both configuration and running-state of a task. This model doesn't scale, especially in a distributed system, and I think we should hold off on merging this change as it introduces a greater likelihood of clashes that in some cases might be hard to recover from in a manner that can be reasoned about.

I'm prioritising looking into how we might be able to separate the configuration of a task from it's running-state, in such a way that configuration updates can't clash with state updates.
The sooner we tackle this the better.

gmmorris on 22 Nov 2019

Since this request is being driven by alerting, perhaps we should revisit the assumption that a reschedule API (or any kind of API for configuration update) is needed.

Rather than fighting the issues with optimistic concurrency control, what if we consider task configuration immutable?

To update the configuration, you have to tear down and re-create the task. That pushes responsibility on the client. I don’t know if that would just introduce other problems, but if we’re seeing lots of collisions like this maybe we need to revisit the assumption we need a reschedule api on task manager.

At some point I think it makes sense to try and solve the underlying problem of separating updates to state and configuration. As-is though that problem is by design, and it may not be worth revisiting the design as part of this issue.

peterschretlen on 22 Nov 2019

👍1

Some thoughts from @mikecote on potential issues with immutable config:

There could be some challenges if the task state doesn’t transfer (resets throttling, loses alert instance state, etc) but maybe another use case to move alert instance logic out or we copy it over.

peterschretlen on 22 Nov 2019

Agreed, as the original need for reschedule (as opposed to remove + schedule) was that we would have to handle failure mid way and recovery, but we're now doing that anyway in TM and due to the generic nature of TM, it's a more complicated task to achieve there.

Some thoughts from @mikecote on potential issues with immutable config:

There could be some challenges if the task state doesn’t transfer (resets throttling, loses alert instance state, etc) but maybe another use case to move alert instance logic out or we copy it over.

If the immutable task approach were to be handled in TM we would have to ensure 100% safety in carrying over state, so this would still require some kind of reconciliation across states.
If, knowing the specific needs of Alerting, we can live with some state loss then it's probably preferable to handle this at Alerting level rather than Task Manager accepting that kind of data loss in general.

gmmorris on 22 Nov 2019

If we take a step back and look at the original problem, it basically comes down the alert not picking up the updated interval right away. We could solve this problem by just running the alert after update ("run now" feature) and let the post execution runAt calculation handle the rest. This would sort of mean we still don't use the interval feature of task manager but maybe because of OCC we shouldn't.

This will at least guarantee the alert won't lose state but at the same time not guarantee the updated interval will take effect immediately. But it will eventually which I think is more important than potentially corrupting a task.

This then moves the remaining problems to the "run now" API for things like "what if it's already running now" (which could be an error or pretend it in fact did run immediately). We can also add some UX to handle these corner cases like a note to say "it may take longer to update the alert's check interval".

This idea is still very fresh and probably has its downsides I didn't get a chance to think of, but food for thought.

mikecote on 22 Nov 2019

I'm not sure I follow the idea.
Wouldn't that still require setting the runAt of the task and encountering the same issue?
I might be missing something.

Also... sorry, but what's OCC?

gmmorris on 22 Nov 2019

Wouldn't that still require setting the runAt of the task and encountering the same issue?

The scenario here is that we would / could swallow the error and let the next execution return a new runAt with the updated interval.

what's OCC?

Optimistic Concurrency Control. The cause of the lovely 409 errors.

mikecote on 22 Nov 2019

ahhh 🤦‍♂ ofcourse.

gmmorris on 22 Nov 2019

Wouldn't that still require setting the runAt of the task and encountering the same issue?

The scenario here is that we would / could swallow the error and let the next execution return a new runAt with the updated interval.

Ah, you mean Alerting would handle that?
Because TM can't afford to ignore the error if it happens in the lifecycle.

gmmorris on 22 Nov 2019

Ah, you mean Alerting would handle that?

Yeah I think we just do that for now, it would check the box for alerting and we can look at maybe letting TM do so later. We could just focus on a run now API which basically calls the claim process for a specific task and skips it if it's not idle state.

I think we learned a lot with this exercise about what updating a task means and how it can go wrong in a few ways. I'm not opposed to still explore the delete + schedule new one either but figured I would put this as an option.

Curious what @peterschretlen and @pmuellr think?

mikecote on 22 Nov 2019

Yeah the "run now" approach sounds promising. And clean - I suspect it would have fewer edge cases than a "delete + schedule" approach.

peterschretlen on 22 Nov 2019

Cool, I'll look into that api, that should be fine.
I'll also amend / add issues and pick them up.

gmmorris on 25 Nov 2019

I've began looking into the work needed for run now.
I believe that if we can plug it into the normal lifecycle of Task Manager, essentially triggering an early polling phase with the requested task being handled first (perhaps even above Worker capacity, though not sure we want to do that), then it shouldn't add too much complexity to Task Manager.

One thought though is that we need to give some information about the Task run back to the API caller (I presume), so I was wondering if it's just about giving back _success_ vs. _failure_ (w/ error message) information? Or do we need to give more information like updated state?

Either way there's work in TM to do, as we don't currently expose anything about success/failure other than internally in the runner, but this seems like a good opportunity to figure out how to do it as we said we'd want that for diagnosis anyway.

gmmorris on 25 Nov 2019

Yeah, we should develop this #50214 in preparation for this #50215. We'll be able to re-use this run now API from task manager when creating an API to run alerts immediately.

I think basically setting a runAt to now is good enough and maybe calling a function to search for tasks based on availability of workers. If you want to go one step further, maybe the minimum between now and the existing runAt in the scenario the task manager has a backlog of tasks to run.

We can worry about bypassing available workers and such at a later time if necessary. We can make the function communicate when the task isn't idle, got updated during attempt to set runAt (409) or queued / success.

This will be mostly for server logging purposes for the alert update API to communicate if the task didn't get queued. That way we've done our best to make the new interval take effect but worst case it'll happen at the next run. Also with a server log this will make easier to debug in the future.

mikecote on 26 Nov 2019

hmm, I wasn't going to update the runAt but rather extend the claim so that on the next poll it'll pick that task up no matter what.
The thinking is that this would mean we avoid the issue of multiple Kibana updating a task in parallel, so it feels safer, and it also means we can reply to the API call with meaningful data (such as task has been run successfully / failed to run with the following error).

If there's no need to return such a response (and we rely on the log instead) then I'd still prefer to utilise the claim step rather than update the field, but then we can simply reply with whether the task has been successfully queued or not.

The one downside to this approach is that whichever Kibana gets the request will handle the execution (rather than allowing any Kibana to pick it up) but this feels much safer than naively updating the field (as it's not just 409s on update I'm worried about, but also 409s due to synchronisation issues).

gmmorris on 26 Nov 2019

You're right, this is probably the approach we should go. Thinking of it, the alert run now API wouldn't respond until execution is complete.

mikecote on 26 Nov 2019

👍1

runNow has been merged into master and I'm working on using it to rerun the alert the moment it's interval is changed.
At the moment, the update method on the AlertsClient returns the updated alert when it .is updated successfully and throws an error if it fails.

We are now going to have two "successful" outcomes to running update:

We've updated the alert, and the interval has successfully been rerun so the interval is up to date.
We've updated the alert, but running it has failed ,so it will update the interval on the next run.

Any ideas how we should notify the user about these things?
The first case can simply return the updated alert. The second, on the other hand, is tricky, as we want the user to understand that the alert has actually been updated successfully, but we also want them to understand that the alert won't run as per the updated interval, but rather the previous value.

Any ideas?

gmmorris on 16 Dec 2019

We discussed the above issue on slack and decided the following:
We won't wait for the runNow to complete, so return value remains the same - but we will log if it fails.

gmmorris on 16 Dec 2019

Kibana: [DISCUSS] Task manager update API to allow changing a task's interval

All 40 comments

Current State

Issues To Think About

Suggested Plan Of Action

Related issues