On current master, in my environment, tests/shutdown/18-client-on-dead-suite.t seems to always pass on its own:
$ cylc test-b -v ./tests/shutdown/18-client-on-dead-suite.t
ok 1 - 18-client-on-dead-suite-validate
ok 2 - 18-client-on-dead-suite-1
ok 3 - 18-client-on-dead-suite-1.stderr-contains-ok
ok 4 - 18-client-on-dead-suite-2
ok 5 - 18-client-on-dead-suite-2.stderr-contains-ok
ok
All tests successful.
Files=1, Tests=5, 12 wallclock secs ( 0.03 usr 0.00 sys + 3.75 cusr 0.49 csys = 4.27 CPU)
Result: PASS
But if I run it with another test, it seems to always fail, like this:
$ export CYLC_TEST_DEBUG=true
$ cylc test-b -v ./tests/special/04-clock-triggered.t \
./tests/shutdown/18-client-on-dead-suite.t
===( 4;6 2/5 2/4 )==============================================
18-client-on-dead-suite 18-client-on-dead-suite-1.stderr-contains-ok
Missing lines:
Request returned error: Suite "cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite" already stopped
18-client-on-dead-suite 18-client-on-dead-suite-2.stderr-contains-ok
Missing lines:
Contact info not found for suite "cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite", suite not running?
stdout and stderr stored in: /tmp/oliverh/cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite
Failed 2/5 subtests
./tests/special/04-clock-triggered.t ........
ok 3 - 04-clock-triggered-run-past
ok 4 - 04-clock-triggered-run-later
ok
Test Summary Report
-------------------
./tests/shutdown/18-client-on-dead-suite.t (Wstat: 0 Tests: 5 Failed: 2)
Failed tests: 3, 5
Files=2, Tests=9, 29 wallclock secs ( 0.02 usr 0.01 sys + 8.33 cusr 0.98 csys = 9.34 CPU)
Result: FAIL
At first glance (and maybe second glance) I can't see how this test could fail. Tests 3 and 5 simply cylc ping an already-killed suite, and the ping client should print out the expected lines.
(Occasionally 1/5 tests fail when run alone, instead of 0/5, and occasionally 1/5 fail when run with the other test, instead of 2/5, so it is "flaky".)
Ah, in failing cases, cylc ping returns this (in the ping test stderr file):
Request returned error: Could not decrypt response. Has the passphrase changed?
Is the cylc ping client somehow connecting to the wrong suite?
Mentioning issue #2894 here so we have a reference in GitHub, just in case it may be helpful later :+1:
And I have confirmed the exact same behaviour in my environment with the master branch.
$ uname -a
Linux ranma 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ python --version
Python 3.7.2
I know what's going on here...
Is the cylc ping client somehow connecting to the wrong suite?
Yes, reliably every time!
In tests/shutdown/18-client-on-dead-suite.t the suite is killed, leaving behind the contact file. So when cylc ping later attempts to connect to the suite, it uses the port recorded in that stale contact file, and there is always a risk that a new suite has started up on that port in the meantime, causing the test to fail with:
Request returned error: Could not decrypt response. Has the passphrase changed?
So this test was, by design, always going to be flaky.
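To make the stale contact file problem concrete, here is a minimal sketch using plain TCP sockets. It is not Cylc code: the JSON contact file format and the names `start_server`, `serve_once`, `ping` and `demo_stale_contact` are hypothetical, and a plain-text reply stands in for the encrypted response that real suites exchange. It only shows how a client that trusts a leftover contact file ends up talking to whichever process now owns the port.

```python
import json
import os
import socket
import tempfile
import threading


def start_server(port=0):
    """Listen on localhost; port 0 asks the OS for any free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    srv.settimeout(5)  # avoid hanging forever if no client connects
    return srv


def serve_once(srv, name):
    """Accept one connection and reply with this server's name."""
    conn, _ = srv.accept()
    with conn:
        conn.sendall(name.encode())


def ping(contact_path):
    """Connect to the port named in the contact file and read the reply."""
    with open(contact_path) as handle:
        port = json.load(handle)["port"]
    with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
        return conn.recv(1024).decode()


def demo_stale_contact():
    """Return (suite the client meant to reach, suite that answered)."""
    with tempfile.TemporaryDirectory() as tmp:
        contact_path = os.path.join(tmp, "contact")

        # "suite-a" starts and records its port in a contact file.
        suite_a = start_server()
        port = suite_a.getsockname()[1]
        with open(contact_path, "w") as handle:
            json.dump({"suite": "suite-a", "port": port}, handle)
        suite_a.close()  # killed: the contact file is left behind

        # "suite-b" starts later and happens to be given the same port.
        suite_b = start_server(port)
        worker = threading.Thread(target=serve_once, args=(suite_b, "suite-b"))
        worker.start()
        try:
            with open(contact_path) as handle:
                intended = json.load(handle)["suite"]
            replied = ping(contact_path)
        finally:
            worker.join()
            suite_b.close()
        return intended, replied
```

Calling `demo_stale_contact()` returns the suite named in the contact file alongside the suite that actually answered. In the real test the mismatch shows up as a decryption error rather than a wrong name, presumably because the two suites do not share a passphrase.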
In the new ZMQ implementation the port is not chosen at random: ZMQ picks the lowest available port in the range, so the port freed by the killed suite is the first one the next suite takes. The test has gone from being slightly flaky to reliably flaky.
There is no real reason for picking the port this way; it was just slightly nicer during the debug phase. I think there is a TODO in there somewhere, and I think there might be a nice way of doing random selection in ZMQ itself.
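The difference between the two selection strategies can be sketched with the standard library alone. This is an illustration, not the Cylc or ZMQ implementation: `PORT_RANGE`, `port_is_free`, `first_free_port` and `random_free_port` are hypothetical names, and the range is an arbitrary example. For the ZMQ-native route, pyzmq does provide `Socket.bind_to_random_port`, which may be the kind of thing meant above, though that is a guess.

```python
import random
import socket

# Hypothetical port range, chosen only for illustration.
PORT_RANGE = range(43001, 43101)


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(ports):
    """Sequential strategy: take the first free port in the given order."""
    for port in ports:
        if port_is_free(port):
            return port
    raise RuntimeError("no free port in the given range")


def random_free_port(ports, rng=random):
    """Random strategy: try the same ports, but in shuffled order."""
    candidates = list(ports)
    rng.shuffle(candidates)
    return first_free_port(candidates)
```

With the sequential strategy, two suites started one after the other (the first having exited) get the same port, so a stale contact file points straight at the newcomer. With the random strategy a collision is still possible, but only with probability of roughly one in the number of free ports.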
I guess this is a case where auto-rerunning failed tests isn't always the most helpful thing to do.
Ah, brilliant - it all makes sense. That's a relief, thanks @oliver-sanders 🍺
(I had forgotten you'd switched to sequential port acquisition).
It was just a stopgap I never got rid of.
I think that's good enough, with a comment in the test to indicate exactly why it might occasionally fail.