Cylc-flow: test battery: flaky tests

Created on 27 Nov 2018 · 34Comments · Source: cylc/cylc-flow

Just to record that that the test battery can fail randomly or in certain scenarios (under a time zone, when an environment variable is present, etc). This ticket is just to check if it'd be possible to make it more stable, perhaps avoid re-running it, and, of course, sporadically failures.

[Edited by @matthewrmshin ] Since #3224, sensitive tests have gained their own hierarchy in the source tree under flakytests/ and run under Travis CI in serial. New flaky tests should be moved to the hierarchy, with the view that it may be possible to move some back to live with the other non-flaky tests. (I have commented out the original list - as the source tree is now the master list.)

List (post #3224 and #3286):

./tests/cylc-reset/03-output-2.t - failed in #3311

2019-08-25T21:54:01Z DEBUG - zmq:send {'data': (True, 'Messages queued: 1')}
2019-08-25T21:54:02Z CRITICAL - [t2.1] status=running: (received)failed/XCPU at 2019-08-25T21:54:00Z for job(01)
2019-08-25T21:54:02Z DEBUG - [t2.1] -running => failed
2019-08-25T21:54:02Z CRITICAL - [t2.1] -job(01) failed
2019-08-25T21:54:02Z ERROR - AUTOMATIC(ON-TASK-FAILURE)
    Traceback (most recent call last):
      File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 252, in start
        self.run()
      File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 1505, in run
        self.suite_shutdown()
      File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 1211, in suite_shutdown
        raise SchedulerError(self.stop_mode.value)
    cylc.flow.scheduler.SchedulerError: AUTOMATIC(ON-TASK-FAILURE)
2019-08-25T21:54:02Z ERROR - Suite shutting down - AUTOMATIC(ON-TASK-FAILURE)

./tests/registration/02-on-the-fly.t - failed in #3311
cylctb-20190825T214903Z/02-on-the-fly localhost 43027 TIMEOUT ==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-stop.stderr ==== ClientTimeout: Timeout waiting for server response. ==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-stop.stdout ==== ==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-2-validate.stdout-contains-ok.stderr ==== Missing lines: REGISTERED cylctb-20190825T214903Z/02-on-the-fly -> /home/travis/cylc-run/cylctb-20190825T214903Z/02-on-the-fly

(When adding new ones, please bear in mind that the list is in alphabetical order.)

bug

Source

kinow

Most helpful comment

how to hammer flaky tests on Travis CI

Activate Travis CI in your own clone.
Make a new branch with a single commit that changes .travis/cover.py to run just the flaky test(s)
push the the branch to your clone ... observe build results, re-trigger at will

hjoliver on 15 May 2019

👍3 🎉1

All 34 comments

Leaving unassigned, good issue to any other newcomer to the project too 🎉

kinow on 27 Nov 2018

Sorry hijacked the issue to record other flaky tests as well.

matthewrmshin on 16 Jan 2019

🎉1

Thanks @matthewrmshin !!! With that list, working on this issue will be thousand times easier. Thanks!!!

kinow on 16 Jan 2019

Yes, I also notice that tests/execution-time-limit/04-poll.t fails randomly at times. I'll take a look at these ones and the one currently being discussed in #2929.

matthewrmshin on 17 Jan 2019

Another vote for ./tests/jobscript/19-exit-script.t.

hjoliver on 17 Jan 2019

Could we (eventually) actually automate a procedure that can quantify 'flakiness' to highlight individual test issues?

Notably, we could create a script/workflow that will run (say, overnight) the test battery &/or Travis CI a number of times & compare failures on each set to pick up on any bad tests. Then we could send out emails prompting us to get any issues fixed.

There is actually some interesting literature about, such as this 2014 study which I just skim read, if anyone is looking for some bed-time reading!

sadielbartholomew on 17 Jan 2019

👍1

Travis CI already "knows" which tests are flaky (in its environment, at least) because it sometimes has to run those ones twice to get the test battery to pass. Maybe we can just get it to notify us of which tests it has to run twice - shouldn't be too hard to arrange?

hjoliver on 17 Jan 2019

it sometimes has to run those ones twice to get the test battery to pass

I am aware of that, but I think two is statistically not the greatest sample size for indication of test reliability. Though your comment made me wonder if there is a Travis CI configuration setting for the number of re-runs? If we up that, then only the relatively small number of possibly flaky tests will need to be re-run, instead of manually having to restart whole jobs in test battery runs whenever flaky tests crop up.

Maybe we can just get it to notify us of which tests it has to run twice - shouldn't be too hard to arrange?

That would certainly be useful & hopefully simple to configure.

I'll have a look through the Travis CI docs to see what is possible.

sadielbartholomew on 17 Jan 2019

if there is a Travis CI configuration setting for the number of re-runs?

I believe we manage the rerunning of failed tests ourselves, in our Travis CI config.

hjoliver on 17 Jan 2019

👍1

(By "knows" I meant our T-CI build script knows, so maybe we could just add a line to that to send a notification out somehow).

hjoliver on 17 Jan 2019

two is statistically not the greatest sample size for indication of test reliability

True! But we would gradually identify all flaky tests as they randomly fail over many separate T-CI runs.

hjoliver on 17 Jan 2019

👍1

The best outcome for this issue is that all tests are fixed and we'll no longer have to re-run them as we do now.

matthewrmshin on 17 Jan 2019

👍1

The best outcome for this issue

Sure. But thinking about the future, flaky tests do cause issues here & then, so having some procedures for monitoring & fixing unreliable ones could be nice.

sadielbartholomew on 17 Jan 2019

A general comment - we should provide clear comments in our functional tests on exactly how they are supposed to work, because some of them do very strange things in order to set up certain test conditions that are difficult to reproduce "naturally". Case in point: https://github.com/cylc/cylc/issues/2929#issuecomment-455145411 - git blame fingers me, but I didn't explain in the code and just had to waste time figuring out how my own test worked all over again!! :grimacing:

hjoliver on 17 Jan 2019

👍2 😄1

@hjoliver Don't worry. I did manage to work out what was going on - and it was (sort of) my fault.

matthewrmshin on 17 Jan 2019

👍1

Note on ulimit's in the Travis distribution used in our builds:

0.00s$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 29790
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 30000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 29790
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Comparing with my local environment, it has the same settings or higher number for memory/files/etc. Except max locked memory (kbytes, -l) 64. In my environment is says max locked memory (kbytes, -l) 16384.

max locked memory (kbytes, -l)

The maximum size that may be locked into memory. Memory locking ensures the memory is always in RAM and never moved to the swap disk

But I do not believe this could cause random failures... I guess.

kinow on 28 Mar 2019

Seeing this error a lot lately:

[14:11:23] ./tests/cylc-poll/16-execution-time-limit.t ..... 2/4 Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/isodatetime/parsers.py", line 237, in parse
    date_info, time_info, parsed_expr = self.get_info(timepoint_string)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/isodatetime/parsers.py", line 430, in get_info
    keys, date_info = self.get_date_info(date)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/isodatetime/parsers.py", line 390, in get_date_info
    raise ISO8601SyntaxError("date", date_string)
isodatetime.parsers.ISO8601SyntaxError: Invalid ISO 8601 date representation: 
[14:11:23] ./tests/cylc-poll/16-execution-time-limit.t ..... 4/4 
    stdout and stderr stored in: /tmp/travis/cylctb-20190512T141038Z/cylc-poll/16-execution-time-limit
    suite logs can be found under: /home/travis/cylc-run/cylctb-20190512T141038Z/cylc-poll/16-execution-time-limit/log/suite
[14:11:23] ./tests/cylc-poll/16-execution-time-limit.t ..... Failed 3/4 subtests 
[14:11:56]
Test Summary Report
-------------------
./tests/cylc-poll/16-execution-time-limit.t   (Wstat: 0 Tests: 4 Failed: 3)
  Failed tests:  2-4
Files=4, Tests=24, 78 wallclock secs ( 0.06 usr  0.01 sys + 29.87 cusr  3.90 csys = 33.84 CPU)
Result: FAIL

hjoliver on 12 May 2019

Hmmm, From isoadatetime:

    BAD_TIME_INPUT = "Invalid ISO 8601 {0} representation: {1}"

So:

isodatetime.parsers.ISO8601SyntaxError: Invalid ISO 8601 date representation:

Means isodatetime is being passed an empty string?

Should probably change to:

    BAD_TIME_INPUT = 'Invalid ISO 8601 {0} representation "{1}"'

oliver-sanders on 13 May 2019

Means isodatetime is being passed an empty string?

Yes, in tests/cylc-poll/16-execution-time-limit.t either PREDICTED_POLL_TIME or ACTUAL_POLL_TIME must be getting set to an empty string.

hjoliver on 14 May 2019

I've just run this test many times locally (by itself and in parallel with 10 others) and cannot get it to fail. Is the suite log captured in the Travis output for the failures you have been seeing?

oliver-sanders on 14 May 2019

Chiming in; most recent builds in cylc/cylc-flow are normally re-triggered, overwriting the previous runs. But luckily @hjoliver has Travis-CI activated for his fork.

Found this one in his list of his builds for cylc-flow, where it looks like 16-execution-time-limit.tfailed.

Hope that helps!

kinow on 14 May 2019

OK, the test is failing as the "health check settings" log line is different:

2019-05-14T10:40:39Z INFO - [foo.1] -health check settings: submission timeout=None, polling intervals=PT2S,...

Running locally I see

2019-05-14T11:26:44+01:00 INFO - [foo.1] -health check settings: execution timeout=PT10S, polling intervals=PT10S,...

oliver-sanders on 14 May 2019

how to hammer flaky tests on Travis CI

Activate Travis CI in your own clone.
Make a new branch with a single commit that changes .travis/cover.py to run just the flaky test(s)
push the the branch to your clone ... observe build results, re-trigger at will

hjoliver on 15 May 2019

👍3 🎉1

(Having done that with cylc-poll/16-execution-time-limit.t, it always passes when run alone! :grimacing: )

hjoliver on 15 May 2019

Bit off topic I think, but an interesting post about a team that collected reasons for their flaky tests and produced a nice summary of each issue, post mortem, etc.

https://samsaffron.com/archive/2019/05/15/tests-that-sometimes-fail

I like the author's optimism (I think he's co-founder of discourse BTW)

I would like to disagree a bit with Martin. Sometimes I find flaky tests are useful at finding underlying flaws in our application. In some cases when fixing a flaky test, the fix is in the app, not in the test.

kinow on 29 May 2019

👍1

Haha, the first paragraphs describe our situation exactly :grimacing:

hjoliver on 29 May 2019

😄2

A few Travis CI runs on the chunk 3/4 have failed on tests/job-submission/18-check-chunking.t (which possibly influences subsequent tests if still run) with some shared errors including a repeated report of Cannot allocate memory: see for example this log & this one, which after the other error messages includes a traceback as follows:

Traceback (most recent call last):
  File ".travis/cover.py", line 38, in <module>
    main()
  File ".travis/cover.py", line 33, in main
    call(fn_tests + ' --state=failed -j 5', shell=True)  # nosec
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 317, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 769, in __init__
    restore_signals, start_new_session)
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 1447, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

I've added this to the "flaky test" list above.

sadielbartholomew on 11 Jul 2019

👍1

Log for a failure in Travis where a test failed due to "Contact info not found for suite" https://gist.github.com/kinow/1429a911fae2e7cbbb7dcec63e64c653

Looks like when this error occurs, some tests that are not in the list of flaky tests fail. Which could indicate resource unintentionally shared by tests?

kinow on 24 Jul 2019

May have a fix for tests/restart/21-task-elapsed.t

oliver-sanders on 26 Jul 2019

🚀1 🎉1

@matthewrmshin I've added two tests that failed today in a trivial PR, but it was just so I don't forget about them.

Should I, instead, create a PR for each test that fails under Travis - and is not related to the change, and pass locally in the same branch - to move that test to the flaky tests folder?

kinow on 26 Aug 2019

Should I, instead, create a PR for each test that fails under Travis - and is not related to the change, and pass locally in the same branch - to move that test to the flaky tests folder?

:+1:

hjoliver on 26 Aug 2019

Then we can close this issue? - until someone dreams up a better way to handle flaky tests (which can be a new issue).

hjoliver on 26 Aug 2019

👍1

Then we can close this issue? - until someone dreams up a better way to handle flaky tests (which can be a new issue).

+1 !

kinow on 26 Aug 2019

Will raise a PR for the flaky tests in a few minutes. Closing this one for now 🎉

kinow on 26 Aug 2019

🎉1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Generalise [runtime] metadata items

hjoliver · 5Comments

GTK/Gobject unnecessary for cylc graph --reference (7.8.x)

sadielbartholomew · 5Comments

cylc review: Server 404 error when trying Display Options for suites contain special characters

kinow · 3Comments

03-clock-triggered-non-utc-mode.t failing in NZ time

kinow · 4Comments

Should xrandom be defensive against ZeroDivisionError?

kinow · 4Comments