Describe the bug
For the past months functional/broadcast/00-simple.t has been failing on GH actions, for the bash docker env with bash 5.
sort couldn't find a log file2020-11-10T19:25:03.0578052Z ok 15001 ms ( 0.01 usr 0.01 sys + 5.42 cusr 1.05 csys = 6.49 CPU)
2020-11-10T19:25:09.4406890Z sort: cannot read: /root/cylc-run/cylctb-20201110T192357Z/functional/broadcast/00-simple/share/broadcast.log: No such file or directory
2020-11-10T19:25:09.4460942Z --- broadcast.log.sorted 2020-11-10 19:25:09.438984194 +0000
2020-11-10T19:25:09.4461690Z +++ broadcast.ref 2020-11-10 19:24:57.614978310 +0000
2020-11-10T19:25:09.4462063Z @@ -0,0 +1,34 @@
The line of the functional test where the error happens is
sort "${SUITE_RUN_DIR}/share/broadcast.log" >'broadcast.log.sorted'
Release version(s) and/or repository branch(es) affected?
Every release version :grimacing:
Steps to reproduce the bug
Create a PR and wait to see what happens with the GH action.
Expected behavior
Screenshots
Additional context
Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix
yourself (please read CONTRIBUTING.md before starting any work though).
My best guess now, is that the disks in GH action environment might not be super fast, and maybe when the command runs in the container, the Cylc workflow command finishes, and then the sort command runs before broadcast.log is created.
Not sure if that's a possible scenario. But the only possible fix I can think of for now is
diff --git a/tests/functional/broadcast/00-simple.t b/tests/functional/broadcast/00-simple.t
index 1a275b662..f9c4d7093 100755
--- a/tests/functional/broadcast/00-simple.t
+++ b/tests/functional/broadcast/00-simple.t
@@ -25,7 +25,9 @@ run_ok "${TEST_NAME_BASE}-validate" cylc validate "${SUITE_NAME}"
# Debug mode not needed here (and set-x w/ XTRACEFD broken some bash versions)
suite_run_ok "${TEST_NAME_BASE}-run" \
cylc run --no-detach --reference-test "${SUITE_NAME}"
-sort "${SUITE_RUN_DIR}/share/broadcast.log" >'broadcast.log.sorted'
+LOG_FILE="${SUITE_RUN_DIR}/share/broadcast.log"
+while [ ! -f "${LOG_FILE}" ]; do sleep 1; done
+sort "${LOG_FILE}" >'broadcast.log.sorted'
cmp_ok 'broadcast.ref' 'broadcast.log.sorted'
DB_FILE="${SUITE_RUN_DIR}/log/db"
Latest on this:
diff on broadcast client output doesn't get the expected resultIn all failures (5 so far) I've seen exactly the same result, in prep's job.out:
2020-11-12T22:59:59Z INFO - started
--- /tmp/tmp.VWeyM9guE5/cylctb-20201112T225956Z/functional/broadcast/00-simple/expected-prep.out 2020-11-12 22:59:57.353960914 +0000
+++ prep.out 2020-11-12 23:00:07.665916159 +0000
@@ -42,9 +42,6 @@
|-m8
| `-environment
| `-BCAST M8
- |-m9
- | `-environment
- | `-BCAST M9
`-root
`-environment
`-BCAST ROOT
The broadcast to m9 is the last one before cylc broadcast --display is called to generate this output.
The scheduler log shows that the m9 broadcast was received, but the broadcast display command immediately after does not show it. :confused:
From prep.out in the test:
Expected:
Broadcast set:
+ [m9.*] [environment]BCAST=M9
*
|-ENS
| `-environment
| `-BCAST ENS
|-ENS1
| `-environment
| `-BCAST ENS1
|-bar
| `-environment
| `-BCAST BAR
|-m7
| `-environment
| `-BCAST M7
|-m8
| `-environment
| `-BCAST M8
|-m9
| `-environment
| `-BCAST M9
`-root
`-environment
`-BCAST ROOT
Fail case:
Broadcast set:
+ [m9.*] [environment]BCAST=M9
*
|-ENS
| `-environment
| `-BCAST ENS
|-ENS1
| `-environment
| `-BCAST ENS1
|-bar
| `-environment
| `-BCAST BAR
|-m7
| `-environment
| `-BCAST M7
|-m8
| `-environment
| `-BCAST M8
`-root
`-environment
`-BCAST ROOT
This shows that the m9 broadcast was received (also confirmed by the scheduler log), but the subsequent cylc broadcast --display does not show it.
Can't get it to fail locally, been running this for a while:
while ./etc/bin/run-functional-tests tests/f/broadcast/00-simple.t; do echo OK; done
Order of commands in the prep task:
# reset broadcast to m9 of ENS1 at all cycles
cylc broadcast -n m9 -s "[environment]BCAST = M9" $CYLC_SUITE_NAME
# clear broadcast for ENS3 (will not work):
! cylc broadcast -n ENS3 --clear $CYLC_SUITE_NAME
# list the result to prep task stdout:
cylc bcast --display $CYLC_SUITE_NAME
The only thing in between broadcast to m9 and display is the --clear one, which the log confirms fails as expected:
2020-11-13T03:38:06Z ERROR - No broadcast to cancel/clear for these options:
--namespaces=ENS3
What would happen if we added a sleep between the commands? Would it succeed perhaps?
What would happen if we added a sleep between the commands? Would it succeed perhaps?
If we're right about the cause (see Element chat with @dwsutherland ) then a sleep before cylc broadcast --display would fix it (time for the datastore to get updated). But sleeps beget flakiness (and slow tests). Instead I've tried putting the --display test in a dependent task, which should delay it by at least one main loop iteration.