CouchDB: Fix flaky tests on Jenkins

Created on 5 Feb 2019 · 17 Comments · Source: apache/couchdb

Summary

Now that the Elixir test suite has been added to Travis, it has proved rather unreliable (for me it fails 4 times out of 5, on random tests). This is a problem because it notably delays merging _any_ pull request and, in general, does not inspire much confidence that the Elixir tests indicate anything useful.

Desired Behaviour

Ideally we'd want the Elixir tests to fail only for legitimate reasons, but a rare and insignificant number of random failures is permissible.

Possible Solution

We can tag all flaky tests as flaky as an initial stage of #1885 and have make elixir skip them until Travis starts passing reliably (or at least more reliably). This will give us a complete list of the tests to improve, and as we go through them one by one we'll remove the flaky tag, re-enabling them.

enhancement testsuite

Source

eiri

All 17 comments

@wohali @garrensmith I'd be interested to hear your opinion.

eiri on 5 Feb 2019

@eiri I'm happy to disable specific flaky tests. But I don't want to disable the full test suite.

garrensmith on 5 Feb 2019

+1 to this.

garrensmith on 5 Feb 2019

@eiri See https://github.com/apache/couchdb/pull/1883 - you absolutely have my "permission" :)

Once #1885 is merged you can then change make elixir and merge the fix for this.

wohali on 5 Feb 2019

Also see #1735 .

wohali on 5 Feb 2019

I have also found that in many cases it's possible to de-flake a test simply by wrapping the flaky parts in a retry like so:

retry_until(fn ->
  flaky_code
end)

Of course, you have to include the request as well as the failing assertion in the fn, but as long as the condition is met within the timeout (default 5 seconds), it should help.
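For anyone unfamiliar with the pattern, here is a rough sketch of what a retry helper of this kind might look like. It is not the helper from the CouchDB test suite: the module name RetrySketch, the argument names, and the poll interval are all made up for illustration, and only the 5 second default comes from the comment above. It treats a falsy result or a failed ExUnit assertion as "not yet" and tries again until the deadline passes.

```elixir
defmodule RetrySketch do
  # Hypothetical stand-in for a retry helper; NOT the CouchDB implementation.
  # Calls `fun` until it returns a truthy value, sleeping `sleep_ms` between
  # attempts, and raises once `timeout_ms` has elapsed.
  def retry_until(fun, sleep_ms \\ 100, timeout_ms \\ 5_000) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    loop(fun, sleep_ms, deadline)
  end

  defp loop(fun, sleep_ms, deadline) do
    result =
      try do
        fun.()
      rescue
        # A failed assertion counts as "not yet", the same as a falsy result.
        ExUnit.AssertionError -> nil
      end

    cond do
      result ->
        result

      System.monotonic_time(:millisecond) >= deadline ->
        raise "retry_until timed out"

      true ->
        Process.sleep(sleep_ms)
        loop(fun, sleep_ms, deadline)
    end
  end
end

# Usage: the condition only becomes true on the third attempt.
# The Agent-based counter stands in for a request whose result settles late.
{:ok, counter} = Agent.start_link(fn -> 0 end)

result =
  RetrySketch.retry_until(
    fn ->
      attempt = Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end)
      attempt >= 3 && {:ok, attempt}
    end,
    10
  )

IO.inspect(result)
```

In a real test, the fn would contain both the request and the assertion on its response, so that each retry re-issues the request rather than re-checking a stale result.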

jaydoane on 6 Feb 2019

How about we run the tests, but don't fail the build if they fail until we've worked out the kinks?

janl on 6 Feb 2019

👍1

@janl That'd work too. The only thing is that with the tagging we'll have a list of the things to fix and the ability to turn them back on one by one, whereas with your approach it's all-or-nothing. But it's much faster and simpler to do, and that's important :)

@jaydoane I guess that'd work, but this is a bit different from the topic: it's not about how to fix the tests, but how to make progress on other work while they are not fixed.

eiri on 7 Feb 2019

An obvious concern with both approaches, tagging and turning off or ignoring all the failures, is that it's way too easy to do that and then just forget and never come back to fixing the failing tests 🙈

I don't have a solution for this apart from somebody taking this on and driving the work from start to finish.

eiri on 7 Feb 2019

@eiri Fully agree that this needs to be a process. Not failing the build when tests fail is a stop-gap that lets the process gradually get there.

janl on 7 Feb 2019

👍1

@eiri I mentioned the retry_until hack since trying that is about as time-consuming as tagging, and if it fixes the issue, the tag and subsequent fixing are no longer necessary.

jaydoane on 8 Feb 2019

I'm working on the tests for Jenkins here https://builds.apache.org/blue/organizations/jenkins/CouchDB/detail/jenkins-fix-elixir/7/pipeline/47/

Currently I've added another tag, :skip_on_jenkins. Some tests just don't play nicely with Jenkins but pass fine on Travis.

garrensmith on 18 Feb 2019

👍2

I've changed this issue's topic because the tests seem pretty stable on Travis, so we just need to focus on Jenkins. I've spent the week working on fixing the tests in Jenkins. It's not quite perfect, and there are still some tests that seem to be failing. At this point I could do with help from the rest of the community. Fixing tests is quite easy; it just takes really long, as you have to wait for builds to run and sometimes Jenkins just stops running a branch.

The Jenkins build server is here: https://builds.apache.org/blue/organizations/jenkins/CouchDB/activity

Here are my steps to fix tests:

1. Look at the tests failing in a build to find tests to fix
2. Create a branch that starts with jenkins-; Jenkins will pick those branches up and run them
3. Try adding a retry_until (lots of examples here)
4. If a test is still failing after adding retry_until, then add a @tag :skip_on_jenkins
5. Rinse and repeat
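
As a rough illustration of the tagging step, the sketch below shows one way a :skip_on_jenkins tag could be excluded with ExUnit's exclude option. It is not how the CouchDB suite is actually wired up: detecting Jenkins through the JENKINS_URL environment variable is an assumption, and the test module and test names are hypothetical.

```elixir
# Assumption: the Jenkins build can be recognised by an environment variable.
# JENKINS_URL is used here for illustration; the real suite may detect it differently.
on_jenkins? = System.get_env("JENKINS_URL") != nil
exclude = if on_jenkins?, do: [:skip_on_jenkins], else: []

# autorun: false lets this script trigger the run explicitly at the end.
ExUnit.start(exclude: exclude, autorun: false)

defmodule TaggedSketchTest do
  # Hypothetical test module, only here to show the tag in use.
  use ExUnit.Case

  @tag :skip_on_jenkins
  test "a test that misbehaves on Jenkins" do
    assert 1 + 1 == 2
  end

  test "a test that runs everywhere" do
    assert true
  end
end

# Runs the loaded test modules and returns a summary map.
summary = ExUnit.run()
```

With this setup the tagged test is reported as excluded on Jenkins and runs normally everywhere else, and grepping for the tag gives the list of tests that still need fixing.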

garrensmith on 22 Feb 2019

👍1

@garrensmith Experiments with these kinds of steps can only be done by people with push access to the repo, right?

dottorblaster on 2 Mar 2019

@dottorblaster Ah, that is frustrating. I think it's better then to work on converting across the JS tests, and then I can run them on Jenkins and look for flaky tests.

garrensmith on 4 Mar 2019

@garrensmith copy that 👍

dottorblaster on 4 Mar 2019

Jenkins is pretty reliable now...going to close this.

wohali on 25 Jun 2020
