Cylc-flow: bash test f/broadcast/simple-00.t intermitently fails

Created on 11 Nov 2020  路  8Comments  路  Source: cylc/cylc-flow

Describe the bug

For the past months functional/broadcast/00-simple.t has been failing on GH actions, for the bash docker env with bash 5.

  • It does not always fail
  • I was never able to reproduce it running the containers locally
  • When that happens, there's one line that suggests sort couldn't find a log file
2020-11-10T19:25:03.0578052Z ok    15001 ms ( 0.01 usr  0.01 sys +  5.42 cusr  1.05 csys =  6.49 CPU)
2020-11-10T19:25:09.4406890Z sort: cannot read: /root/cylc-run/cylctb-20201110T192357Z/functional/broadcast/00-simple/share/broadcast.log: No such file or directory
2020-11-10T19:25:09.4460942Z --- broadcast.log.sorted   2020-11-10 19:25:09.438984194 +0000
2020-11-10T19:25:09.4461690Z +++ broadcast.ref  2020-11-10 19:24:57.614978310 +0000
2020-11-10T19:25:09.4462063Z @@ -0,0 +1,34 @@

The line of the functional test where the error happens is

sort "${SUITE_RUN_DIR}/share/broadcast.log" >'broadcast.log.sorted'

Release version(s) and/or repository branch(es) affected?

Every release version :grimacing:

Steps to reproduce the bug

Create a PR and wait to see what happens with the GH action.

Expected behavior

Screenshots

Additional context

Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix
yourself (please read CONTRIBUTING.md before starting any work though).

bug

All 8 comments

My best guess now, is that the disks in GH action environment might not be super fast, and maybe when the command runs in the container, the Cylc workflow command finishes, and then the sort command runs before broadcast.log is created.

Not sure if that's a possible scenario. But the only possible fix I can think of for now is

diff --git a/tests/functional/broadcast/00-simple.t b/tests/functional/broadcast/00-simple.t
index 1a275b662..f9c4d7093 100755
--- a/tests/functional/broadcast/00-simple.t
+++ b/tests/functional/broadcast/00-simple.t
@@ -25,7 +25,9 @@ run_ok "${TEST_NAME_BASE}-validate" cylc validate "${SUITE_NAME}"
 # Debug mode not needed here (and set-x w/ XTRACEFD broken some bash versions)
 suite_run_ok "${TEST_NAME_BASE}-run" \
     cylc run --no-detach --reference-test "${SUITE_NAME}"
-sort "${SUITE_RUN_DIR}/share/broadcast.log" >'broadcast.log.sorted'
+LOG_FILE="${SUITE_RUN_DIR}/share/broadcast.log"
+while [ ! -f "${LOG_FILE}" ]; do sleep 1; done
+sort "${LOG_FILE}" >'broadcast.log.sorted'
 cmp_ok 'broadcast.ref' 'broadcast.log.sorted'

 DB_FILE="${SUITE_RUN_DIR}/log/db"

Latest on this:

  • it can fail with any bash version
  • prep task aborts when its diff on broadcast client output doesn't get the expected result

In all failures (5 so far) I've seen exactly the same result, in prep's job.out:

2020-11-12T22:59:59Z INFO - started
--- /tmp/tmp.VWeyM9guE5/cylctb-20201112T225956Z/functional/broadcast/00-simple/expected-prep.out   2020-11-12 22:59:57.353960914 +0000
+++ prep.out   2020-11-12 23:00:07.665916159 +0000
@@ -42,9 +42,6 @@
  |-m8    
  | `-environment  
  |   `-BCAST      M8  
- |-m9    
- | `-environment  
- |   `-BCAST      M9  
  `-root    
    `-environment  
      `-BCAST      ROOT

The broadcast to m9 is the last one before cylc broadcast --display is called to generate this output.

The scheduler log shows that the m9 broadcast was received, but the broadcast display command immediately after does not show it. :confused:

From prep.out in the test:

Expected:

Broadcast set:
+ [m9.*] [environment]BCAST=M9
*
 |-ENS
 | `-environment
 |   `-BCAST      ENS
 |-ENS1
 | `-environment
 |   `-BCAST      ENS1
 |-bar
 | `-environment
 |   `-BCAST      BAR
 |-m7
 | `-environment
 |   `-BCAST      M7
 |-m8
 | `-environment
 |   `-BCAST      M8
 |-m9
 | `-environment
 |   `-BCAST      M9
 `-root
   `-environment
     `-BCAST      ROOT

Fail case:

Broadcast set:
+ [m9.*] [environment]BCAST=M9
*    
 |-ENS    
 | `-environment  
 |   `-BCAST      ENS 
 |-ENS1    
 | `-environment  
 |   `-BCAST      ENS1
 |-bar    
 | `-environment  
 |   `-BCAST      BAR 
 |-m7    
 | `-environment  
 |   `-BCAST      M7  
 |-m8    
 | `-environment  
 |   `-BCAST      M8  
 `-root    
   `-environment  
     `-BCAST      ROOT

This shows that the m9 broadcast was received (also confirmed by the scheduler log), but the subsequent cylc broadcast --display does not show it.

Can't get it to fail locally, been running this for a while:

while ./etc/bin/run-functional-tests tests/f/broadcast/00-simple.t; do echo OK; done

Order of commands in the prep task:

    # reset broadcast to m9 of ENS1 at all cycles
        cylc broadcast -n m9 -s "[environment]BCAST = M9" $CYLC_SUITE_NAME
    # clear broadcast for ENS3 (will not work):
        ! cylc broadcast -n ENS3 --clear $CYLC_SUITE_NAME
    # list the result to prep task stdout:
         cylc bcast --display $CYLC_SUITE_NAME

The only thing in between broadcast to m9 and display is the --clear one, which the log confirms fails as expected:

2020-11-13T03:38:06Z ERROR - No broadcast to cancel/clear for these options:
      --namespaces=ENS3

What would happen if we added a sleep between the commands? Would it succeed perhaps?

What would happen if we added a sleep between the commands? Would it succeed perhaps?

If we're right about the cause (see Element chat with @dwsutherland ) then a sleep before cylc broadcast --display would fix it (time for the datastore to get updated). But sleeps beget flakiness (and slow tests). Instead I've tried putting the --display test in a dependent task, which should delay it by at least one main loop iteration.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

oliver-sanders picture oliver-sanders  路  5Comments

hjoliver picture hjoliver  路  5Comments

dwsutherland picture dwsutherland  路  3Comments

oliver-sanders picture oliver-sanders  路  4Comments

kinow picture kinow  路  3Comments