Cylc-flow: repeatable flaky test

Created on 28 Mar 2019 · 13 Comments · Source: cylc/cylc-flow

On current master, in my environment, tests/shutdown/18-client-on-dead-suite.t seems to always pass on its own:

$ cylc test-b -v ./tests/shutdown/18-client-on-dead-suite.t

ok 1 - 18-client-on-dead-suite-validate
ok 2 - 18-client-on-dead-suite-1
ok 3 - 18-client-on-dead-suite-1.stderr-contains-ok
ok 4 - 18-client-on-dead-suite-2
ok 5 - 18-client-on-dead-suite-2.stderr-contains-ok
ok
All tests successful.
Files=1, Tests=5, 12 wallclock secs ( 0.03 usr  0.00 sys +  3.75 cusr  0.49 csys =  4.27 CPU)
Result: PASS

But if I run it with another test, it seems to always fail, like this:

$ export CYLC_TEST_DEBUG=true
$ cylc test-b -v ./tests/special/04-clock-triggered.t \
   ./tests/shutdown/18-client-on-dead-suite.t
===(       4;6  2/5  2/4 )==============================================
18-client-on-dead-suite 18-client-on-dead-suite-1.stderr-contains-ok
Missing lines:
Request returned error: Suite "cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite" already stopped

18-client-on-dead-suite 18-client-on-dead-suite-2.stderr-contains-ok
Missing lines:
Contact info not found for suite "cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite", suite not running?

    stdout and stderr stored in: /tmp/oliverh/cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite
Failed 2/5 subtests 
./tests/special/04-clock-triggered.t ........ 
ok 3 - 04-clock-triggered-run-past
ok 4 - 04-clock-triggered-run-later
ok

Test Summary Report
-------------------
./tests/shutdown/18-client-on-dead-suite.t (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  3, 5
Files=2, Tests=9, 29 wallclock secs ( 0.02 usr  0.01 sys +  8.33 cusr  0.98 csys =  9.34 CPU)
Result: FAIL

hjoliver

Most helpful comment

I know what's going on here...

Is the cylc ping client somehow connecting to the wrong suite?

Yes, reliably every time!

In tests/shutdown/18-client-on-dead-suite.t the suite is killed, leaving behind its contact file. So when cylc ping attempts to connect to the suite later in the test, there is always a risk that a new suite will have started up on that port, causing the test to fail with:

Request returned error: Could not decrypt response. Has the passphrase changed?

So this test was, by design, always going to be flaky.

In the new ZMQ implementation the port is not chosen at random: ZMQ picks the lowest available port in the range, so the test has gone from being slightly flaky to reliably flaky.
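
To illustrate the difference between the two strategies, here is a minimal Python sketch. It is a toy model, not cylc code: the port range and the names `pick_lowest`, `pick_random`, `next_suite_reuses_stale_port` and `collision_rate` are all hypothetical. It assumes a killed suite releases its port while its contact file still names it, and that exactly one new suite starts before the ping.

```python
import random

# Hypothetical port range, for illustration only.
PORT_RANGE = range(43001, 43101)


def pick_lowest(free_ports, rng):
    """Sequential strategy: always take the lowest free port."""
    return min(free_ports)


def pick_random(free_ports, rng):
    """Random strategy: take any free port with equal probability."""
    return rng.choice(sorted(free_ports))


def next_suite_reuses_stale_port(pick, rng):
    """Return True if a new suite lands on the port named in a stale contact file."""
    free_ports = set(PORT_RANGE)
    # Port the first suite bound before it was killed.
    stale_port = pick(free_ports, rng)
    # The kill releases the port, so it is still in the free pool,
    # but the leftover contact file keeps pointing at it.
    new_port = pick(free_ports, rng)
    return new_port == stale_port


def collision_rate(pick, trials=2000, seed=0):
    """Fraction of trials in which the new suite reuses the stale port."""
    rng = random.Random(seed)
    hits = sum(next_suite_reuses_stale_port(pick, rng) for _ in range(trials))
    return hits / trials
```

Under these assumptions `collision_rate(pick_lowest)` is exactly 1.0, because the lowest free port is the one the killed suite just released. `collision_rate(pick_random)` should come out close to one over the number of ports in the range.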

There is no real reason for picking the port this way; it was just slightly nicer during the debug phase. I think there is a TODO in there somewhere, and there might be a nice way of doing random selection in ZMQ itself.
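
For reference, random selection could look roughly like the sketch below. It uses plain standard-library sockets rather than ZMQ, so it is a stand-in for the idea and not the cylc implementation; the function name and the port range in the usage note are hypothetical. As far as I know, pyzmq offers a comparable `Socket.bind_to_random_port` method that takes `min_port`, `max_port` and `max_tries` arguments.

```python
import random
import socket


def bind_to_random_port_in_range(min_port, max_port, max_tries=100,
                                 host="127.0.0.1"):
    """Bind a TCP socket to a random free port in [min_port, max_port).

    Returns the bound socket and the chosen port. Raises RuntimeError if
    no port could be bound within max_tries attempts.
    """
    for _ in range(max_tries):
        port = random.randrange(min_port, max_port)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port already in use (or otherwise unavailable): try another.
            sock.close()
            continue
        return sock, port
    raise RuntimeError("could not bind to a random port in the given range")
```

A call such as `sock, port = bind_to_random_port_in_range(43001, 43101)` returns a socket bound somewhere in that (hypothetical) range. The caller is responsible for closing it.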

oliver-sanders on 28 Mar 2019

All 13 comments

At first glance (and maybe second glance) I can't see how this test could fail. Tests 3 and 5 simply cylc ping an already-killed suite, and the ping client should print out the expected lines.

hjoliver on 28 Mar 2019

(Occasionally 1/5 tests fail when run alone, instead of 0/5; and occasionally 1/5 fail when run with the other test, instead of 2/5 ... so it is "flaky".)

hjoliver on 28 Mar 2019

Ah, in failing cases, cylc ping returns this (in the ping test stderr file):

Request returned error: Could not decrypt response. Has the passphrase changed?

hjoliver on 28 Mar 2019

Is the cylc ping client somehow connecting to the wrong suite?

hjoliver on 28 Mar 2019

Mentioning issue #2894 here so we have a reference on GitHub, in case it is helpful later 👍

kinow on 28 Mar 2019

And I have confirmed the exact same behaviour in my environment with the master branch.

$ uname -a
Linux ranma 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ python --version
Python 3.7.2

kinow on 28 Mar 2019

I guess this is a case where auto-rerunning failed tests isn't always the most helpful thing to do.

oliver-sanders on 28 Mar 2019

Ah, brilliant, it all makes sense. That's a relief, thanks @oliver-sanders 🍺

hjoliver on 28 Mar 2019

(I had forgotten you'd switched to sequential port acquisition).

hjoliver on 28 Mar 2019

It was just a stopgap I never got rid of.

oliver-sanders on 29 Mar 2019

3004 will reduce the flakiness of this test: the chance of failure becomes proportional to the number of suites divided by the number of ports. Not good, but much better. Is this enough to close the issue for now?
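
As a rough back-of-envelope check of that ratio, here is a small Python sketch. It assumes the other running suites hold distinct ports chosen uniformly at random from the range, which is a simplification; the function names and the numbers in the usage note are hypothetical.

```python
import random


def stale_port_taken_probability(n_suites, n_ports):
    """Chance that one of n_suites distinct random ports is the stale port."""
    if not 0 <= n_suites <= n_ports:
        raise ValueError("need 0 <= n_suites <= n_ports")
    return n_suites / n_ports


def simulate(n_suites, n_ports, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    stale_port = 0  # which port is stale does not matter, by symmetry
    hits = 0
    for _ in range(trials):
        # Ports held by the other suites: distinct and uniformly random.
        taken = rng.sample(range(n_ports), n_suites)
        hits += stale_port in taken
    return hits / trials
```

For example, with 5 other suites sharing a range of 100 ports, `stale_port_taken_probability(5, 100)` gives 0.05, and `simulate(5, 100)` should land close to that.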

oliver-sanders on 29 Mar 2019

I think that's good enough, with a comment in the test to indicate exactly why it might occasionally fail.

hjoliver on 31 Mar 2019
