Cylc-flow: xtriggers: only first broadcast results recorded

Created on 6 Aug 2019  路  8Comments  路  Source: cylc/cylc-flow

Describe the bug

Only the first (by ordering) key-value item from the dictionary of results in the success return signature of an xtrigger gets broadcast actually gets recorded (logs, DB) as being passed on to dependent tasks, when all items are meant to be recorded as passed on. For example, if an xtrigger call returns (True, {"ONE": "one", "TWO": "two", "THREE": "three"}), only "ONE": "one" is logged as passed via an environment variable <xtrigger_name>_ONE=one.

See below for a concrete example.

NB: I have checked & observed that the problem is not present for regular (non-xtrigger) broadcasts, just for broadcasts made by any xtrigger.

Release version(s) and/or repository branch(es) affected?
This bug had been tested for, & found to present, for both 8.0a0 & 7.8.3.

Steps to reproduce the bug

  1. Set up an xtrigger which always succeeds to return a results dict with multiple items, e.g:
def succeed(*args):
    return (True, {"ONE": "one", "TWO": "two", "THREE": "three"})
  1. Apply it in a simple dummy suite, e.g the one in the docs here, with echo_1 = echo(...) replaced by succeeding = succeed(*args), where succeed.py is under <suite-dir>/lib/python.

  2. Observe the suite log to only show that the environment variable for the first key, "ONE", has been broadcast:

2019-08-06T19:43:10+01:00 INFO - [foo.20190806T1943+01] -health check settings: submission timeout=None
2019-08-06T19:43:11+01:00 INFO - Broadcast set:
    + [foo.20190806T2043+01] [environment]succeeding_ONE=one
2019-08-06T19:43:11+01:00 INFO - [foo.20190806T2043+01] -submit-num=01, owner@host= ...
  1. Switch the ordering of the results dict & observe that the suite log display changes accordingly, e.g. after moving "TWO:"two" to the front (to be the zeroth item):
2019-08-06T19:43:57+01:00 INFO - [foo.20190806T1943+01] -health check settings: submission timeout=None
2019-08-06T19:43:58+01:00 INFO - Broadcast set:
    + [foo.20190806T2043+01] [environment]succeeding_TWO=two
2019-08-06T19:43:58+01:00 INFO - [foo.20190806T2043+01] -submit-num=01, owner@host=...

Expected behavior
All defined key-value items specified in a (valid) results dict get logged & stored in the DB when they are passed to tasks depending on the xtrigger.

Screenshots

The 'broadcasts list' page on Cylc Review also supports this, e.g. at step 3 above it shows:

broadcasts_scr

bug

All 8 comments

Ouch, that's bad. I'm pretty sure existing xtrigger tests should be testing this...

Yep, tests/xtriggers/01-suite_state.t checks the job log of an xtrigger-dependent job for the results of a multi-item broadcast. So, this is mysterious...

Good news, it is only the logging that is truncated, not the broadcast itself. Running your example as described above (but with x as the xtrigger label), the job log of foo.1 shows:

$ (venv) [oliverh@drugs-and-money]$ cylc cat-log -f j sadie foo.1
...
cylc__job__inst__user_env() {
    # TASK RUNTIME ENVIRONMENT:
    export x_three x_two x_one
    x_three="THREE"
    x_two="TWO"
    x_one="ONE"
}
...

So, all of the xtrigger return values do get broadcast to the task. But the suite log does indeed only show the first.

The problem seems to be here in cylc/flow/broadcast_report.py:

98     for broadcast_change in get_broadcast_change_iter(
99             modified_settings, is_cancel):
100         msg += CHANGE_FMT % broadcast_change

Iteration stops at the first item.

But I'll leave this to you @sadielbartholomew, as you've assigned yourself and might have worked on it already!

Doh, somehow I missed that you had already put up a PR for this! (Was going through my GitHub notification emails sequentially). :grimacing:

And you seem to have come to a different conclusion to me. Moving over to #3276 ...

Thanks for looking into it @hjoliver, it seems like it is a subtle bug.

Ouch, that's bad. I'm pretty sure existing xtrigger tests should be testing this...

Good news, it is only the logging that is truncated, not the broadcast itself

Sorry for the partial misdiagnosis, I had to stop work before I got onto look at the testing aspect in #3276, at which point I will have realised it was just the logging, though to defend my conclusions before I called it a day:

  • I inspected some non-xtrigger broadcasts via the suite logs of basic suites with broadcasts, & the broadcasts there were all fine, notably I definitely saw all expected cases having >1 broadcast detailed under a INFO - Broadcast set: ... line, suggesting there was nothing wrong with the logging (hence it is puzzling if it is a problem in cylc/flow/broadcast_report.py as you suggested, but I'll look into it).
  • While investigating I had found a way (as put up in #3276) to fix the problem in xtrigger- rather than broadcast- related code, which to me confirmed it was on the former not the latter side.

Anyway, I want to ensure I'm fixing the buggy code rather than inadvertently bodging perfectly fine code to produce the right behaviour, as based on your investigations & reporting comments seems I have likely one in #3276! I'll look into it further & talk to Matt about the logging & then make amendments there accordingly.

though to defend my conclusions before I called it a day

:+1: - yes I can absolutely see how you would think, from the evidence, that this must be an xtrigger broadcast bug (not just logging)!

Don't (blindly) trust the logs is probably the main take-away lesson for me here! :innocent: -> skeptical :tophat:

Was this page helpful?
0 / 5 - 0 ratings