Cylc-flow: Exit status specific triggers for highly-flexible scheduling

Created on 25 Jul 2019 · 7 Comments · Source: cylc/cylc-flow

As well as the current failure task state resulting from any non-zero exit status from that task's script, we could support triggering off of specific exit statuses. For example (using a syntax with parentheses for illustration, though I am aware that syntax may not be viable):

    graph = """
        foo:fail => bar     # Standard failure: captures all non-zero codes
        bar:fail(1) => pub  # New: trigger pub if bar fails with exit status 1...
        bar:fail(2) => wop  # ...but trigger wop if it instead fails with exit status 2.
                            # Any other bar exit status does not trigger anything.
    """

This graph would distinguish between, and take a different scheduling course for, bar failing with exit code 1, with exit code 2, or with any other non-zero code.
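To make the intended semantics concrete, here is a minimal Python sketch of how exit-status-qualified triggers could be resolved. The `TRIGGERS` layout and the `downstream_for` helper are hypothetical illustrations of the proposal, not anything that exists in cylc-flow.

```python
# Hypothetical model of the graph above: each (task, qualifier) pair maps to
# the tasks it triggers. A qualifier of None stands for a plain ":fail".
TRIGGERS = {
    ("foo", None): ["bar"],  # foo:fail => bar
    ("bar", 1): ["pub"],     # bar:fail(1) => pub
    ("bar", 2): ["wop"],     # bar:fail(2) => wop
}


def downstream_for(task, exit_status):
    """Return the tasks triggered when `task` exits with `exit_status`."""
    if exit_status == 0:
        return []  # success: no failure trigger fires in this sketch
    triggered = list(TRIGGERS.get((task, None), []))    # generic :fail
    triggered += TRIGGERS.get((task, exit_status), [])  # specific :fail(N)
    return triggered


if __name__ == "__main__":
    for task, status in [("foo", 7), ("bar", 1), ("bar", 2), ("bar", 3)]:
        print(task, status, downstream_for(task, status))
```

In this sketch the generic `:fail` trigger and a specific `:fail(N)` trigger can both fire for the same task; whether that is the desired behaviour is one of the design questions the proposal leaves open.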

While users are perhaps unlikely to need to differentiate between exit cases set directly by a script, I raise this because with this feature exit codes would essentially become parameters, allowing greatly extended control in scheduling. Instead of only having the standard task "final" states of succeeded, failed & submit-failed (& in a sense expired, which I understand is a final state of sorts), there would be essentially unlimited (in practice, 255) possible endpoints that users could set in their scripts and trigger off, covering the many possible cases arising in them. It would, though, be a separate specification (e.g. the parentheses syntax); I am not suggesting the standard failure (& success) cases should go, as users would often not need this advanced flexibility.

Illustrative example

As a superficial example, note how various end cases of interest can be used to branch the scheduling in the below. Naturally, in a real case, the code would be much more involved; imagine the sys.exit(N) calls are placed at points of interest in the script control flow each with some chosen N = 0, ..., 255.

suite.rc snippet:

[runtime]
    [[my_task]]
        script = "failure-mode-demo.py"

Python script bin/failure-mode-demo.py:

import sys

this = None  # default; the elided code below may set it
# ...
# ...
# ... More involved code here! 'this' variable may get set.
# ...
# ...
if not this:
    sys.exit(1)  # endpoint 1: exit code 1, failure mode
try:
    import my_module
    my_module.some_operation(this)  # say this logically can hit a TypeError
except ImportError:
    sys.exit(2)  # endpoint 2: exit code 2, different failure mode
except TypeError:
    sys.exit(3)  # endpoint 3: exit code 3, different failure mode
# endpoint 4: exit code 0, success
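
The value passed to `sys.exit(N)` is what the calling process observes as the exit status, which is what would make it usable as a trigger. A small self-contained check of that, using inline stand-ins for the endpoints rather than the real `failure-mode-demo.py` (the `exit_status_of` helper is just for this illustration):

```python
import subprocess
import sys


def exit_status_of(code_snippet):
    """Run a Python snippet in a child process and return its exit status."""
    result = subprocess.run([sys.executable, "-c", code_snippet])
    return result.returncode


if __name__ == "__main__":
    # Stand-ins for the four endpoints in the demo script.
    for n in (0, 1, 2, 3):
        status = exit_status_of(f"import sys; sys.exit({n})")
        print(f"requested {n}, observed {status}")
```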

All 7 comments

I'll just note that we can already achieve the same thing with custom task messages - by translating (in job scripting) application return codes into meaningful messages, and triggering tasks off of those. However, for applications that do have well-defined return codes for specific error conditions, this is a good proposal (as it reduces effort - no need to use custom task messages).
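
A rough sketch of that workaround, in Python for consistency with the example script. The `MESSAGES` mapping and the `send` callable are hypothetical; in a real job the message would be sent with the `cylc message` command (its exact invocation depends on the Cylc version), and each message would need a matching custom output defined for the task.

```python
import subprocess
import sys

# Hypothetical mapping from application return codes to custom task messages.
MESSAGES = {
    1: "input missing",
    2: "module unavailable",
    3: "bad input type",
}


def run_and_report(command, send=print):
    """Run `command`, send a message for a recognised return code, and
    return the return code so the caller can still act on it."""
    returncode = subprocess.run(command).returncode
    message = MESSAGES.get(returncode)
    if message is not None:
        send(message)  # stand-in for sending a custom task message
    return returncode


if __name__ == "__main__":
    # Stand-in application that exits with status 2.
    app = [sys.executable, "-c", "import sys; sys.exit(2)"]
    print("return code:", run_and_report(app))
```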

Ah, nice, that's a good point! Thanks @hjoliver. I guess the crux of this issue then becomes making it simpler & more explicit to set up exit-code-specific triggering, via the suite.rc instead of individual custom task messages.

It's a speculative one, perhaps for the future, so I don't think there is too much more to say right now!

we can already achieve the same thing with custom task messages

Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever:

foo:succeed & foo:msg1 => bar
foo:succeed & foo:msg2 => baz
bar | baz => pub

This would definitely be a nice feature, I think we may have talked about it in a June meeting a couple of years back? I remember a discussion about the awkwardness of doing this nicely at the moment as script might not be set to a single executable but could be an inline bash-script. There could also be pre-script, init-script, env-script etc, any of which could have produced the non-zero return code.

  • One of the main positive uses I can imagine would be handling XCPU events.
  • One of the main negative uses I can imagine is using non-zero exit codes to communicate different success outcomes (as a proxy to task messages).

we can already achieve the same thing with custom task messages

Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever:

Indeed. Custom messages allow dependent tasks to be kicked off midway through execution of the triggering task, which is sometimes really useful (e.g. a polling task waiting for forecasts at successive lead times and kicking off their processing as they become available).

In the current setup, the main issues are:

  • We have a job script that can fail with a different return code at any point.
  • The value of *script is not a single command, but a script fragment that can run multiple commands.

What can we do?

  • The job script can be written in such a way that only the return value of the final statement of script gets used to determine the return code. In https://github.com/cylc/cylc-flow/blob/db8872086857fd8d4ad5dff5b6765bb9c770dcb2/cylc/flow/etc/job.sh#L137-L139 we would capture the return code only when running the script part.
  • An expected non-zero return code of the above will simply be recorded and the job script will continue to run to completion. On completion, the succeeded message will include the return code.
  • The task message API will be updated to understand the return code in a succeeded message.
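
Those three steps could be sketched roughly as follows. This is Python rather than the actual bash job script, each section is simplified to a single command, and the `run_job` function, its `expected_codes` parameter and the message format are all assumptions made for illustration.

```python
import subprocess
import sys


def run_job(pre_script, script, post_script, expected_codes=frozenset()):
    """Run three job sections; only `script` may end with an expected
    non-zero return code without failing the job.

    Each section is an argument list for subprocess.run.
    Returns the final task message as a string.
    """
    # Any non-zero code from pre-script is an ordinary failure.
    if subprocess.run(pre_script).returncode != 0:
        return "failed"

    # Capture the return code of the script section only.
    script_rc = subprocess.run(script).returncode
    if script_rc != 0 and script_rc not in expected_codes:
        return "failed"

    # An expected non-zero code is recorded and the job carries on.
    if subprocess.run(post_script).returncode != 0:
        return "failed"

    # The succeeded message carries the recorded return code.
    return f"succeeded ({script_rc})"


if __name__ == "__main__":
    ok = [sys.executable, "-c", "pass"]
    exit_2 = [sys.executable, "-c", "import sys; sys.exit(2)"]
    print(run_job(ok, exit_2, ok, expected_codes={2}))
    print(run_job(ok, exit_2, ok))
```

The point of the design is that a code listed as expected no longer aborts the job, so the scheduler can treat it as a distinct successful outcome rather than a failure.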

Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever

Kinda, but also kinda not, but also more kinda than kinda not. As I suggested you would detect the underlying exit status in the script then send the custom message before exiting (immediately or later, do what you need). So for this use case the custom message is more or less as good as a task exit status, and you don't need to worry about using the actual task exit status in the graph as well.

That's not to deny that proper exit statuses would be better, however! (Just saying it's easy enough to work around with current custom messages.)

@matthewrmshin's suggestion may be good.

(#3440 should allow the exit code from user scripts to be captured in a consistent manner.)
