Cylc-flow: test battery: flaky tests

Created on 27 Nov 2018  路  34Comments  路  Source: cylc/cylc-flow

Just to record that that the test battery can fail randomly or in certain scenarios (under a time zone, when an environment variable is present, etc). This ticket is just to check if it'd be possible to make it more stable, perhaps avoid re-running it, and, of course, sporadically failures.

[Edited by @matthewrmshin ] Since #3224, sensitive tests have gained their own hierarchy in the source tree under flakytests/ and run under Travis CI in serial. New flaky tests should be moved to the hierarchy, with the view that it may be possible to move some back to live with the other non-flaky tests. (I have commented out the original list - as the source tree is now the master list.)

List (post #3224 and #3286):

  • ./tests/cylc-reset/03-output-2.t - failed in #3311
2019-08-25T21:54:01Z DEBUG - zmq:send {'data': (True, 'Messages queued: 1')}
2019-08-25T21:54:02Z CRITICAL - [t2.1] status=running: (received)failed/XCPU at 2019-08-25T21:54:00Z for job(01)
2019-08-25T21:54:02Z DEBUG - [t2.1] -running => failed
2019-08-25T21:54:02Z CRITICAL - [t2.1] -job(01) failed
2019-08-25T21:54:02Z ERROR - AUTOMATIC(ON-TASK-FAILURE)
    Traceback (most recent call last):
      File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 252, in start
        self.run()
      File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 1505, in run
        self.suite_shutdown()
      File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 1211, in suite_shutdown
        raise SchedulerError(self.stop_mode.value)
    cylc.flow.scheduler.SchedulerError: AUTOMATIC(ON-TASK-FAILURE)
2019-08-25T21:54:02Z ERROR - Suite shutting down - AUTOMATIC(ON-TASK-FAILURE)
  • ./tests/registration/02-on-the-fly.t - failed in #3311
    cylctb-20190825T214903Z/02-on-the-fly localhost 43027 TIMEOUT ==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-stop.stderr ==== ClientTimeout: Timeout waiting for server response. ==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-stop.stdout ==== ==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-2-validate.stdout-contains-ok.stderr ==== Missing lines: REGISTERED cylctb-20190825T214903Z/02-on-the-fly -> /home/travis/cylc-run/cylctb-20190825T214903Z/02-on-the-fly

(When adding new ones, please bear in mind that the list is in alphabetical order.)

bug

Most helpful comment

how to hammer flaky tests on Travis CI

  1. Activate Travis CI in your own clone.
  2. Make a new branch with a single commit that changes .travis/cover.py to run just the flaky test(s)
  3. push the the branch to your clone ... observe build results, re-trigger at will

All 34 comments

Leaving unassigned, good issue to any other newcomer to the project too 馃帀

Sorry hijacked the issue to record other flaky tests as well.

Thanks @matthewrmshin !!! With that list, working on this issue will be thousand times easier. Thanks!!!

Yes, I also notice that tests/execution-time-limit/04-poll.t fails randomly at times. I'll take a look at these ones and the one currently being discussed in #2929.

Another vote for ./tests/jobscript/19-exit-script.t.

Could we (eventually) actually automate a procedure that can quantify 'flakiness' to highlight individual test issues?

Notably, we could create a script/workflow that will run (say, overnight) the test battery &/or Travis CI a number of times & compare failures on each set to pick up on any bad tests. Then we could send out emails prompting us to get any issues fixed.

There is actually some interesting literature about, such as this 2014 study which I just skim read, if anyone is looking for some bed-time reading!

Travis CI already "knows" which tests are flaky (in its environment, at least) because it sometimes has to run those ones twice to get the test battery to pass. Maybe we can just get it to notify us of which tests it has to run twice - shouldn't be too hard to arrange?

it sometimes has to run those ones twice to get the test battery to pass

I am aware of that, but I think two is statistically not the greatest sample size for indication of test reliability. Though your comment made me wonder if there is a Travis CI configuration setting for the number of re-runs? If we up that, then only the relatively small number of possibly flaky tests will need to be re-run, instead of manually having to restart whole jobs in test battery runs whenever flaky tests crop up.

Maybe we can just get it to notify us of which tests it has to run twice - shouldn't be too hard to arrange?

That would certainly be useful & hopefully simple to configure.

I'll have a look through the Travis CI docs to see what is possible.

if there is a Travis CI configuration setting for the number of re-runs?

I believe we manage the rerunning of failed tests ourselves, in our Travis CI config.

(By "knows" I meant our T-CI build script knows, so maybe we could just add a line to that to send a notification out somehow).

two is statistically not the greatest sample size for indication of test reliability

True! But we would gradually identify all flaky tests as they randomly fail over many separate T-CI runs.

The best outcome for this issue is that all tests are fixed and we'll no longer have to re-run them as we do now.

The best outcome for this issue

Sure. But thinking about the future, flaky tests do cause issues here & then, so having some procedures for monitoring & fixing unreliable ones could be nice.

A general comment - we should provide clear comments in our functional tests on exactly how they are supposed to work, because some of them do very strange things in order to set up certain test conditions that are difficult to reproduce "naturally". Case in point: https://github.com/cylc/cylc/issues/2929#issuecomment-455145411 - git blame fingers me, but I didn't explain in the code and just had to waste time figuring out how my own test worked all over again!! :grimacing:

@hjoliver Don't worry. I did manage to work out what was going on - and it was (sort of) my fault.

Note on ulimit's in the Travis distribution used in our builds:

0.00s$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 29790
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 30000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 29790
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Comparing with my local environment, it has the same settings or higher number for memory/files/etc. Except max locked memory (kbytes, -l) 64. In my environment is says max locked memory (kbytes, -l) 16384.

max locked memory (kbytes, -l)

The maximum size that may be locked into memory. Memory locking ensures the memory is always in RAM and never moved to the swap disk

But I do not believe this could cause random failures... I guess.

Seeing this error a lot lately:

[14:11:23] ./tests/cylc-poll/16-execution-time-limit.t ..... 2/4 Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/isodatetime/parsers.py", line 237, in parse
    date_info, time_info, parsed_expr = self.get_info(timepoint_string)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/isodatetime/parsers.py", line 430, in get_info
    keys, date_info = self.get_date_info(date)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/isodatetime/parsers.py", line 390, in get_date_info
    raise ISO8601SyntaxError("date", date_string)
isodatetime.parsers.ISO8601SyntaxError: Invalid ISO 8601 date representation: 
[14:11:23] ./tests/cylc-poll/16-execution-time-limit.t ..... 4/4 
    stdout and stderr stored in: /tmp/travis/cylctb-20190512T141038Z/cylc-poll/16-execution-time-limit
    suite logs can be found under: /home/travis/cylc-run/cylctb-20190512T141038Z/cylc-poll/16-execution-time-limit/log/suite
[14:11:23] ./tests/cylc-poll/16-execution-time-limit.t ..... Failed 3/4 subtests 
[14:11:56]
Test Summary Report
-------------------
./tests/cylc-poll/16-execution-time-limit.t   (Wstat: 0 Tests: 4 Failed: 3)
  Failed tests:  2-4
Files=4, Tests=24, 78 wallclock secs ( 0.06 usr  0.01 sys + 29.87 cusr  3.90 csys = 33.84 CPU)
Result: FAIL

Hmmm, From isoadatetime:

    BAD_TIME_INPUT = "Invalid ISO 8601 {0} representation: {1}"

So:

isodatetime.parsers.ISO8601SyntaxError: Invalid ISO 8601 date representation: 

Means isodatetime is being passed an empty string?

Should probably change to:

    BAD_TIME_INPUT = 'Invalid ISO 8601 {0} representation "{1}"'

Means isodatetime is being passed an empty string?

Yes, in tests/cylc-poll/16-execution-time-limit.t either PREDICTED_POLL_TIME or ACTUAL_POLL_TIME must be getting set to an empty string.

I've just run this test many times locally (by itself and in parallel with 10 others) and cannot get it to fail. Is the suite log captured in the Travis output for the failures you have been seeing?

Chiming in; most recent builds in cylc/cylc-flow are normally re-triggered, overwriting the previous runs. But luckily @hjoliver has Travis-CI activated for his fork.

Found this one in his list of his builds for cylc-flow, where it looks like 16-execution-time-limit.tfailed.

Hope that helps!

OK, the test is failing as the "health check settings" log line is different:

2019-05-14T10:40:39Z INFO - [foo.1] -health check settings: submission timeout=None, polling intervals=PT2S,...

Running locally I see

2019-05-14T11:26:44+01:00 INFO - [foo.1] -health check settings: execution timeout=PT10S, polling intervals=PT10S,...

how to hammer flaky tests on Travis CI

  1. Activate Travis CI in your own clone.
  2. Make a new branch with a single commit that changes .travis/cover.py to run just the flaky test(s)
  3. push the the branch to your clone ... observe build results, re-trigger at will

(Having done that with cylc-poll/16-execution-time-limit.t, it always passes when run alone! :grimacing: )

Bit off topic I think, but an interesting post about a team that collected reasons for their flaky tests and produced a nice summary of each issue, post mortem, etc.

https://samsaffron.com/archive/2019/05/15/tests-that-sometimes-fail

I like the author's optimism (I think he's co-founder of discourse BTW)

I would like to disagree a bit with Martin. Sometimes I find flaky tests are useful at finding underlying flaws in our application. In some cases when fixing a flaky test, the fix is in the app, not in the test.

Haha, the first paragraphs describe our situation exactly :grimacing:

A few Travis CI runs on the chunk 3/4 have failed on tests/job-submission/18-check-chunking.t (which possibly influences subsequent tests if still run) with some shared errors including a repeated report of Cannot allocate memory: see for example this log & this one, which after the other error messages includes a traceback as follows:

Traceback (most recent call last):
  File ".travis/cover.py", line 38, in <module>
    main()
  File ".travis/cover.py", line 33, in main
    call(fn_tests + ' --state=failed -j 5', shell=True)  # nosec
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 317, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 769, in __init__
    restore_signals, start_new_session)
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 1447, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

I've added this to the "flaky test" list above.

Log for a failure in Travis where a test failed due to "Contact info not found for suite" https://gist.github.com/kinow/1429a911fae2e7cbbb7dcec63e64c653

Looks like when this error occurs, some tests that are not in the list of flaky tests fail. Which could indicate resource unintentionally shared by tests?

May have a fix for tests/restart/21-task-elapsed.t

@matthewrmshin I've added two tests that failed today in a trivial PR, but it was just so I don't forget about them.

Should I, instead, create a PR for each test that fails under Travis - and is not related to the change, and pass locally in the same branch - to move that test to the flaky tests folder?

Should I, instead, create a PR for each test that fails under Travis - and is not related to the change, and pass locally in the same branch - to move that test to the flaky tests folder?

:+1:

Then we can close this issue? - until someone dreams up a better way to handle flaky tests (which can be a new issue).

Then we can close this issue? - until someone dreams up a better way to handle flaky tests (which can be a new issue).

+1 !

Will raise a PR for the flaky tests in a few minutes. Closing this one for now 馃帀

Was this page helpful?
0 / 5 - 0 ratings