Cylc-flow: custom xtriggers: (suicide-like?) "dead" function clean up

Created on 23 Jul 2019 · 6Comments · Source: cylc/cylc-flow

This relates somewhat to (aspects of) other Issues RE xtriggers (#3221, #2712) but is not covered by these. (Note it is not a case of the now-closed #3054, as it regards runtime behaviour for valid recognised trigger functions.)

Custom external trigger functions that reach a stage where they cannot interface with Cylc's suite server program to specify a satisfied trigger (I'll refer to them as "dead" here), will result in upstream dependencies persisting in the waiting state, & accordingly can stall the suite.

For standard task triggering this is of course dealt with by means of suicide triggers. But with xtriggers, a "failure" case can't be specified in the graph (by means of a qualifier as for tasks, or otherwise) so I believe there is no way to handle such "dead" custom external functions so that dependent tasks can be removed from the scheduling & the suite can move on, save of course from a manual reset out of waiting, but that seems unwieldy for a permanent solution.

Reasons a custom xtrigger might become "dead" (a term I am coining for convenience here & to avoid using any terms that already have meaning in Cylc) are that (see matching examples below):

[Ex1] it logically always returns the signature for trigger unsatisfied;
it does not return the on-interface signature of (False, {}) or (True, results) for some dict results, because:
1. [Ex2] it hits a parsing error or an exception which terminates the logic so it never gets to the valid return statements in the function;
2. [Ex3] the wrong return signature is produced by the xtrigger (& given the opaqueness of the current state of xtriggers, where only a --debug start-up option reveals any traceback from the xtrigger function itself, & it requires trawling through the suite log, this isn't a contrived case) either because the wrong signature was unknowingly set by the user, or results ended up not being a valid dict somehow.

Release version(s) and/or repository branch(es) affected?

master branch i.e. 8.0a0 + earlier versions.

Steps to reproduce the bug

Run any of these examples, noting that the upstream tasks (e.g. foo in Ex1) hang in the waiting state. Ex2 can be toggled between "dead" & not "dead", so that is the best one to try out.

Example 1: always returns fail signature

Just run the echo in-built example within the suite as outlined in the docs (here) & also in the xcylc-xtriggers directory (here).

Example 2: hits error or exception before return

Representative xtrigger breakable_script.py, which contains a Python error, here inside a block so it does not encounter the error unless specified via the argument break_everything.

def breakable_script(break_everything):
    if break_everything:
        asdlkjs  # gibberish, not valid Python (easiest way to hit an error!)
    else:
        return True, {"AmIbroken": "no"}

Minimal working example suite to use to apply the above xtrigger:

[scheduling]
    initial cycle point = now
    [[xtriggers]]
        breakable_xtrigger = breakable_script(break_everything=True)
    [[dependencies]]
        [[[PT1M]]]
            graph = """
                    @breakable_xtrigger => wait_around
                    to_get_some_cycles_going =>  wait_around
                    """

If break_everything=False is instead set, the xtrigger will be satisfied & dependencies can run, but as-is the syntax error that the xtrigger terminates on will cause the suite to stall waiting for wait_around over all 'max active cycle points'.

This particular function error could easily be picked up pre-run by some form of validation, but in realistic use cases there could be exceptions that are thrown which are scenario- (input etc.) dependent can't be picked up before the suite starts.

Example 3: Invalid return signature

def wrong_interface_script():
    return {False: {}}  # not a two-tuple of Boolean and dict

& run this in a minimal suite very similar to that in Example 2.

Expected behavior

There is some means to remove tasks with a dependency on a custom external trigger, to handle "dead" xtigger functions.

bug?

Source

sadielbartholomew

All 6 comments

I'm not convinced on the validity of this issue @sadielbartholomew.

An external trigger represents some external condition that is either True (satisfied) or False (not satisfied yet). Tasks that depend on them wait for the condition to become True. This is quite different to depending on a task status - where succeeded, failed, started, submitted etc. are all final states.

[Ex1] it logically always returns the signature for trigger unsatisfied;

If that's the case, a downstream task should never trigger (unless manually triggered) because the dependence is on the trigger condition being satisfied.

[Ex2 and 3] it does not return the on-interface signature of (False, {}) or (True, results) for some dict results ...

Then the trigger function is buggy and needs to be fixed! Trigger functions, by definition, are supposed to always return (True/False, dict) although it is arguably reasonable to treat a raised exception as unsatisfied (so try again and hope the exception doesn't happen again).

hjoliver on 25 Jul 2019

👍1

However, it may be reasonable/useful to:

allow tasks to trigger off some kind of trigger function expiry, where a trigger function failed to return True up to some user-specified limit (e.g. try 100 times then expire)
allow more nuanced trigger function return semantics, e.g.:
a. (satisfied, results) - external condition satisfied
b. (unsatisifed) - external condition not satisfied yet
c. (abort) - external condition will never be satisfied, so no point in continuing to try

(2.a. and 2.b. are the status quo; The new bits, 1. and 2.c. are essentially the same, except that the abort/expire status is decided by the suite in the first case, and by the function itself in the second case).

But I don't think there's a use case for automatically reacting (in the workflow) to trigger functions that fail with an internal error or return an illegal result. Those are bugs, somewhat analogous to a buggy forecast task that never actually runs the underlying atmos model - it just has to be fixed or else your workflow is fundamentally broken.

hjoliver on 25 Jul 2019

👍1

@sadielbartholomew - actually I really like the idea of an xtrigger function being able to report back "I will never be satisfied so you might as well quit trying", and tasks being able to trigger off of that (including suiciding the tasks that are waiting on the xtrigger to be satisfied).

If we can call that a "dead" xtrigger then I think you have raised a very good point here. (I just don't agree that an internal error or invalid return value should be considered "dead")

Although before agreeing to implement, we should probably come up with a seemingly-valid use case. Maybe something like this: don't keep checking endlessly for a particularly DB entry if the DB can't even be reached, or appears to be hopelessly corrupted, or has some other indicator entry that says the requested data had to be skipped today?

hjoliver on 25 Jul 2019

👍1

Hi @hjoliver, so you know I have had a skim read of your comments (on the two xtrigger Issues) & will get back to you fully but I am a bit short of time right now so will have to do so later on, sorry. But, in short, I understand your points & reasoning, & I'm happy to close one or more of these Issues if there aren't really "issues" as such (I guess what I was looking for is some clarity on xtriggers & how acceptable the symptoms of some edge cases are).

sadielbartholomew on 25 Jul 2019

@sadielbartholomew - that's fine :+1: ... if you (after reading my argument in detail) agree that there is no case for automatic handling (in the workflow) of xtrigger function internal errors or illegal return values, then I think we could close this issue, but put up another one to support "I will never be satisfied so you might as well quit asking" return semantics - which could be automatically handled in the workflow, and which is really just a tweak of what you proposed here (and which I had not thought of prior to your proposal).

hjoliver on 26 Jul 2019

👍1

@hjoliver, again sorry for the delay in responding. Having considered your points:

if you (after reading my argument in detail) agree that there is no case for automatic handling (in the workflow) of xtrigger function internal errors or illegal return values

Yes, after reconsideration & considering especially your analogy to a buggy standard task, that seems correct. So, & I've just made the cases I outlined in my opening comment numbered to aid referencing: for case 2ii, it is just a case of indicating to the user via warnings or errors that the xtrigger has the wrong interface, & #3237 should cover that one for the more realistic cases there, where the dict is illegal.

Then, I believe, ensuring that the right interface (i.e. that for either satisfied or unsatisfied) is returned to account for any possible logical & context-based flow through the function is ultimately a suite design problem, which is actually information already requested by someone in cylc/cylc-doc#10. On that topic, It seems to me the essence of designing a good xtrigger is to make sure it always returns the false signature, unless the exact conditions needed (accounting for any aspects which could go reasonably go wrong, for example for an answer you gave on the forum regarding xtrigger resets that a required "externally-generated file is complete and valid, not just present" hence not an incomplete or corrupted file) are met, in which case return the true signature.

For cases 1 & 2i, the motivation of this Issue is really to be able to "clean up" for cases where, as you put it:

"I will never be satisfied so you might as well quit asking"

but also I think to account for another case of "This has not satisfied for after so many tries it is no longer relevant". The former seems like it would usually have to be caught on the function-level (e.g. to pick up on a corrupted or inaccessible external file or DB) but the latter would need to be determined at suite-level (based on the context of the attained status other tasks, like for a clock-expire task). So while I like your idea of aborting/expiring, I am torn as to which of your suggestions along that line:

The new bits, 1. and 2.c. are essentially the same, except that the abort/expire status is decided by the suite in the first case, and by the function itself in the second case

would be best.

I'll continue think on this one, & get back to you shortly!

Although before agreeing to implement, we should probably come up with a seemingly-valid use case. Maybe something like this: don't keep checking endlessly for a particularly DB entry if the DB can't even be reached, or appears to be hopelessly corrupted, or has some other indicator entry that says the requested data had to be skipped today?

Those sound like valid, & not overly-contrived, user cases to me. Others I can add are on the same lines, but not just for the DB case:

missing external files: data files that were not generated for some reason, e.g. observational data that was not taken for a given case;
inaccessible locations: the function is unable to connect to message broker servers, or hosts holding required external files...

sadielbartholomew on 11 Aug 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Cylc Review WSGI script not a callable?

kinow · 4Comments

Scheduler keeps running after CTRL + C (Py3.8)

kinow · 4Comments

empy: Python3 and the future of cylc support

oliver-sanders · 5Comments

Replace scheduler.py main loop by asyncio loop

kinow · 4Comments

03-clock-triggered-non-utc-mode.t failing in NZ time

kinow · 4Comments