Cylc-flow: Post-SoD task hold semantics

Created on 5 Aug 2020 · 25Comments · Source: cylc/cylc-flow

“Hold” currently means: prevent waiting task proxies from submitting their jobs. Post-SoD we can still hold a suite, but holding individual tasks is considerably less feasible because the only waiting tasks that exist now as task proxies are those with multiple parents and partially unsatisfied prerequisites. Further, even those waiting tasks might be hidden in future: proposal and #3822. And besides, users should not have to understand the difference between abstract (graph) and concrete (task proxy in the pool) waiting tasks.

Do we really need the concept of individual task hold?

If we do, options include:

(maybe) holding an active task causes its spawned children to be held?
(probably) allow users to hold abstract tasks, which will require the scheduler to check a list of tasks-to-hold at spawn time

sod-follow-up

Source

hjoliver

All 25 comments

I think having the ability to hold an individual task makes sense and is useful.

If it's difficult to hold tasks before they exist in the pool we could, potentially, consider an xtrigger implementation similar to the proposed retry approach.

oliver-sanders on 5 Aug 2020

Yeah, or a broadcast style implementation?

n=1, would make it easier to select the non-existent also.

dwsutherland on 6 Aug 2020

n=1, would make it easier to select the non-existent also.

Selection is easy, since we plan to allow n=1,2,3,... up to "nmax" which is "all tasks in the cycle point".

The question is just how to record what's been held when the corresponding task proxies don't exist yet in the scheduler. (c.f. SoS where you could only hold tasks that already have task proxies in the pool).

Note that holding only matters once a task proxy exists, so we can still "action" the hold in the old way (essentially, set a held flag in the task proxy, which tells it not to submit its job even if prereqs all satisfied).

Regarding xtrigger- or broadcast-style implementations - what's wrong with this simple approach?:

scheduler maintains a list of task IDs that are held
when a task is spawned, check if is in that list; if it is, hold it (i.e. spawn it, but set its held flag)
when a held task is released, remove it from that list, and if an actual held task proxy exists, unset its held flag.

hjoliver on 6 Aug 2020

Regarding xtrigger- or broadcast-style implementations - what's wrong with this simple approach?:

Would be fine, though because the value is not stored on a per-task basis it would introduce a lot of lookups.

That list will need to go the to DB and we will need to re-work the is_held functionality as you will have to ask the task_pool whether a task is held rather than the task_proxy itself.

oliver-sanders on 7 Aug 2020

(BTW I don't see this as a super-high priority because suite-hold still works fine post-SoD and is OK for now as a proxy for task-hold - after all, suite-hold doesn't affect tasks that are already active, and you can still trigger individual tasks if you want to when the suite is held ... so 8.0.0 milestone will do here).

hjoliver on 8 Aug 2020

Coming back to this I'm starting to lean towards making "is_held" an XTrigger.

If we go down the XTrigger route with the retry states (#3423) and the queued state then it would make some sense to do the same with the old held state too. After all "held" is a trigger and is very much external in nature.

We could generalise the hold/release mutation interface into addXtrigger/satisfyXtrigger or something like that.
Rather than display a paused badge in the task state icon we could display an XTrigger badge if there are any unsatisfied XTriggers.
We could display the held xtrigger along with the others in the task summary in the UI (above the jobs).
We could satisfy a held xtrigger the same way as a clock trigger, retry timer or queue in the UI.

Unrelated note which is it:

XTrigger
Xtrigger
xTrigger
xtrigger

oliver-sanders on 16 Sep 2020

Good idea :+1:

Unrelated note which is it:

I prefer (and have always used) "xtrigger". It's just a name that means "external trigger". IMO use of capital "X" makes it look like an acronym (e.g. short for "Xylophone Trigger") or elevates its importance unnecessarily (we don't capitalise every other concept/feature of Cylc). And xTrigger? - I hate camel case.

hjoliver on 17 Sep 2020

Looks like we'll go for option 2 in the description

allow users to hold abstract tasks, which will require the scheduler to check a list of tasks-to-hold at spawn time

From @oliver-sanders on this PR:

n=1 matching + a new list in the task_pool caching hold requests for proxies that aren't in the pool yet backed up by a new DB table should do the trick

hjoliver on 24 Mar 2021

To quickly summarise:

Add a set of "tasks to hold" in the TaskPool.
When a user "holds" a task, if it isn't in the pool add it to the list.
When the user "releases" a task, if it isn't in the pool remove it from the list.
Check this set when spawning new tasks.
Synchronise this set with a new table in the DB.
On restart load the set from this DB table.

This new table (tasks_to_hold?) should work in a similar way to the task pool table.

oliver-sanders on 27 Apr 2021

I see @hjoliver is assigned, shall I replace you?

MetRonnie on 27 Apr 2021

Yes please! I'll switch the assignment over now...

hjoliver on 28 Apr 2021

To note down what was discussed today and add a bit more:

Globs (e.g. cylc hold <flow> '*', cylc hold <flow> 'foo.*', cylc hold <flow> '*.5') are a problem because either you can just store the glob in the tasks_to_hold set (but then releasing tasks becomes a problem) or you can try to expand the glob (but there isn't really a way to do this for tasks outside the n = 0 window).

However, it should be possible to handle globs and normal task identifiers separately, so normal task identifiers get put in the tasks_to_hold set while globs only get expanded in the n = 0 window.

(For the scenario of trying to do cylc hold <flow> '*.5' outside the n = 0 window, there's always cylc hold <flow> --after=5)

MetRonnie on 4 May 2021

👍1

Agreed.

Plus, it would be hard to determine when the glob is "finished".

However, this does make globs pretty much useless in this context (n=0 tasks are already active!)

hjoliver on 4 May 2021

Another thing to note: cylc hold <flow> foo is the same as cylc hold <flow> 'foo.*'.

I guess we should document the fact that holding active tasks is unlikely to do anything as they have probably already submitted their jobs.

MetRonnie on 5 May 2021

I wasn't aware of this behaviour, it doesn't appear to be documented and I don't think it's helpful so happy to drop.

The documented example is:

cylc hold my_flow 'mytask.*'

Which makes the glob explicit and is much clearer.

oliver-sanders on 5 May 2021

It isn't specific to hold, it's in the TaskPool.filter_task_proxies() method
https://github.com/cylc/cylc-flow/blob/f5802012d1bdd03af52afae205fed2a1b19f8597/cylc/flow/task_pool.py#L1375-L1377

MetRonnie on 5 May 2021

Another issue is the status part of cylc hold <flow> 'name[.point][:status]'. Are we really going to store the name, point and status in the tasks_to_hold set? Is it actually necessary to be able to hold based on the status?

MetRonnie on 5 May 2021

Status is really a part of the globbing system as it is used to filter a larger number of matching tasks by status.

When explicitly specifying tasks by name / cycle it serves no purpose so I wouldn't expect it to be included in this task_to_hold set.

oliver-sanders on 5 May 2021

I'm beginning to wonder if we should instead not accept globs for cylc hold? If their n-window behaviour is inconsistent with explicitly specifying the task name + point, and if they are unlikely to do anything as the tasks have probably already submitted their jobs..?

Or, we could only accept globs for the task names - this shouldn't be a problem as we know what the full set of task names are regardless of n-window.

MetRonnie on 5 May 2021

There is one valid use case for using globs with cylc hold:

# prevent retries from taking effect
cylc hold <flow> '*:failed'  # due to execution issues
cylc hold <flow> '*:submit-failed'  # due to submission issues

This is a pretty handy one for operations.

I think:

cylc hold can continue to share globing behaviour with the other task pool operations.
All task pool globs should only expand over the n=0 window.
Selectors (i.e. :<state>) should be included with globs (i.e. only applied over the n=0 window).
The "tasks-to-hold" list should only accept/contain explicit task name / cycle point pairs (i.e. no globs or selectors).

oliver-sanders on 6 May 2021

👍1

Do we want to allow holding future task families e.g. cylc hold myflow FAM.4? (It will be the tasks that are members of the family that get added to the tasks_to_hold set, not the family itself)

MetRonnie on 12 May 2021

Yes, for a consistent CLI and because it's easy to do (i.e. to expand to family members).

hjoliver on 12 May 2021

Would be good, however, I'm not sure how we would remove these families from the hold list, it would also infer we could do cylc hold myflow root.4 which is effectively globing.

oliver-sanders on 13 May 2021

Just store the child tasks in the hold list, not the family, right?

hjoliver on 13 May 2021

Could end up with some family members that don't get run left in the set forever but there shouldn't be enough of them to cause a problem. Can use recurrence aware family expansion to reduce this.

oliver-sanders on 13 May 2021

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Scheduler keeps running after CTRL + C (Py3.8)

kinow · 4Comments

Replace scheduler.py main loop by asyncio loop

kinow · 4Comments

make main loop more asynchronous

oliver-sanders · 4Comments

cylc-cat-log doesn't respect shebang python

kinow · 3Comments

Traceback with cylc play and Cylc 7 workflows

oliver-sanders · 3Comments