Caseflow: Improve CI

Created on 9 Jun 2017 · 13 Comments · Source: department-of-veterans-affairs/caseflow

This came out of retro on 6/2/17.

CI is painful:

Tests are flakey.
The job takes too long to run.
(minor) It's annoying that restarting a build on Travis wipes away the previous build, instead of creating a new build like on Jenkins.

Look at how to improve this.

NickHeiner

Most helpful comment

@NickHeiner fwiw we also cache the bundler gems with our Travis setup: https://github.com/department-of-veterans-affairs/caseflow/blob/master/.travis.yml#L48-L50. It similarly shaves off a few minutes (though perhaps CircleCI is somehow more efficient).

If it's easy to switch to CircleCI, just being able to ssh into the instance alone feels worth the switch. It is so difficult to debug problems on Travis.

joofsh on 13 Jun 2017

👍2

All 13 comments

Storing logs for false failures here: https://github.com/department-of-veterans-affairs/caseflow/tree/travis-logs/travis

NickHeiner on 9 Jun 2017

A matrix build in Travis is kicked off by defining multiple values for the same variable.

We can get parallelism by sharding the rake targets, and running them in parallel on different VMs, something like this:

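env:
  # Each entry below becomes its own job in the build matrix.
  - rake_target=spec
  - rake_target=ci:other

script:
  # Each job runs only the rake target it was assigned.
  - bundle exec rake $rake_target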

but we would go further and split the targets into roughly equal-sized units of work.

Running parallel tests in the same VM can also gain speed, but in my experience that seems to introduce flakiness as well. The more parallel you can make your tests, the better, but with Travis it looks like the only parallel mechanism is predefined static parallelism (rather than dynamically running one test on each shard).
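
One way to get "roughly equal-sized units of work" is to assign spec files to shards by their measured runtime. The following is a minimal sketch of a greedy, slowest-file-first assignment; the file names and timings are made up for illustration and the `shard_by_runtime` helper is hypothetical, not something in the Caseflow repo.

```ruby
# Greedy "slowest first" sharding: give each spec file to whichever shard
# currently has the least total runtime.
# NOTE: file names and timings are invented for illustration.
def shard_by_runtime(timings, shard_count)
  shards = Array.new(shard_count) { { files: [], total: 0.0 } }
  # Place the slowest files first so the small ones can even things out.
  timings.sort_by { |file, seconds| [-seconds, file] }.each do |file, seconds|
    lightest = shards.min_by { |shard| shard[:total] }
    lightest[:files] << file
    lightest[:total] += seconds
  end
  shards
end

timings = {
  "spec/feature/dispatch_spec.rb" => 240.0,
  "spec/feature/intake_spec.rb"   => 180.0,
  "spec/models/task_spec.rb"      => 60.0,
  "spec/models/user_spec.rb"      => 45.0,
  "spec/lint_spec.rb"             => 30.0
}

shard_by_runtime(timings, 2).each_with_index do |shard, index|
  puts "shard #{index}: #{shard[:total]}s #{shard[:files].join(' ')}"
end
```

With static matrix parallelism, each job could then pick its own shard from an environment variable set in the matrix (an assumption about how this would be wired up, not an existing setting).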

ToddStumpf on 12 Jun 2017

Another idea: move fast-failing tasks like lint to the beginning of the process, so if they're going to fail a build, they can do it quickly.
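
A rough sketch of that ordering, assuming each CI step has an estimated duration: run the cheapest steps first and stop at the first failure. The step names and durations are hypothetical placeholders, and the lambdas stand in for real commands (in a real build each one might shell out to a rake task).

```ruby
# Run CI steps cheapest-first and stop at the first failure.
# NOTE: step names and durations are hypothetical; each `run` lambda is a
# stand-in for a real command such as a rake task.
Step = Struct.new(:name, :estimated_seconds, :run)

def run_fail_fast(steps)
  executed = []
  steps.sort_by(&:estimated_seconds).each do |step|
    executed << step.name
    unless step.run.call
      return { passed: false, failed_step: step.name, executed: executed }
    end
  end
  { passed: true, failed_step: nil, executed: executed }
end

steps = [
  Step.new("rspec", 900, -> { true }),
  Step.new("lint", 30, -> { false }), # simulate a lint failure
  Step.new("js tests", 120, -> { true })
]

result = run_fail_fast(steps)
puts "passed=#{result[:passed]} failed_step=#{result[:failed_step]}"
puts "executed=#{result[:executed].inspect}"
```

In this simulated run the lint failure is reported after about 30 seconds of estimated work, instead of after the full rspec run.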

NickHeiner on 12 Jun 2017

This effort is helped by #2305.

NickHeiner on 12 Jun 2017

Link to flakey dispatch tests here: https://github.com/department-of-veterans-affairs/caseflow/issues/2187

They are being worked on next by the Tango team.

joofsh on 13 Jun 2017

@ToddStumpf For some more context, we tested parallelizing our test suite on Travis a few months back (that is, running them all in parallel on the same VM) and it only gained ~15 seconds. Travis seems mostly capped to 1 processor/VM.

The challenge with splitting the tests up across VMs is that we can quickly gain a lot of speed, but it immediately breaks our code coverage check (simplecov). As soon as you split the tests up, each VM sees only its own share of the tests, so simplecov reports that the sections of the application exercised by the other VMs are below the required code coverage. Some immediate options I see going forward:

1) Disable simplecov (temporarily or permanently). While it does provide a small forcing function, you could still write poor tests that meet the requirement. We've done a good job as a team of requiring good quality tests and high test coverage, and I believe this would continue without simplecov.

2) Figure out a way to have the tests run on VMs that all share the same file storage. Maybe there's a way to have a final process run after all the parallel VMs have finished that combines the individual results into a single simplecov result.

3) Move to Jenkins & use beefy boxes. Using larger instances, we could use the existing parallel test setup that allows the tests to run in parallel within the same VM and accurately merge the test results for code coverage.
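
To illustrate why option 2 would fix the coverage problem, here is a minimal sketch of merging per-shard line coverage in plain Ruby. It does not use simplecov's own API; it assumes each shard reports hit counts per line in the same shape as Ruby's built-in Coverage module (an array per file, with nil for non-executable lines), and the file names and numbers are invented.

```ruby
# Merge per-shard line coverage and compute the combined percentage.
# Each shard reports { file => [hits per line] }, with nil for lines that
# are not executable. The data below is invented for illustration.
def merge_lines(existing, incoming)
  return incoming.dup if existing.nil?
  # Assumes both arrays describe the same version of the file.
  existing.zip(incoming).map do |a, b|
    a.nil? && b.nil? ? nil : a.to_i + b.to_i
  end
end

def merge_coverage(shard_results)
  shard_results.each_with_object({}) do |result, merged|
    result.each do |file, lines|
      merged[file] = merge_lines(merged[file], lines)
    end
  end
end

def percent_covered(coverage)
  relevant = coverage.values.flatten.compact
  return 100.0 if relevant.empty?
  (100.0 * relevant.count { |hits| hits > 0 } / relevant.size).round(1)
end

shard_a = { "app/models/user.rb" => [nil, 1, 0, 0], "app/models/task.rb" => [1, 0] }
shard_b = { "app/models/user.rb" => [nil, 0, 2, 0], "app/models/task.rb" => [0, 3] }

puts "shard A alone: #{percent_covered(shard_a)}%"
puts "shard B alone: #{percent_covered(shard_b)}%"
puts "merged:        #{percent_covered(merge_coverage([shard_a, shard_b]))}%"
```

In this made-up data each shard alone covers 40% of the relevant lines, while the merged result covers 80%, which is why a coverage threshold has to be checked after the merge rather than on each VM.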

cc @NickHeiner

joofsh on 13 Jun 2017

👍1

Notes on CircleCI:

Better than Travis

Easy access to ssh into boxes to see what's going wrong
Preserves old builds instead of wiping them on restart
More insights & stats (median build time, success rate, etc) than Travis
Logging is a bit nicer to parse, because it is split into sections
Bundler caching saved 2m18s in a single anecdotal test
- Before, after
We can use a modern NodeJS. I've been told we use an old NodeJS because Travis forces us to with its Ruby image, but it also looks like we're installing our own NodeJS, so I'm not sure what to make of that. :smile:
We can use our own Docker container, which could save us initial setup time. (Note that the bulk of our time is spent in rspec tests, so this probably won't be too helpful for us.)

Worse than Travis

NickHeiner on 13 Jun 2017

also @NickHeiner we _were_ using an old NodeJS because Travis's base image "forced" us to. Then Alan helped remind me how small the Node install is, so now we manually install the latest Node LTS.

joofsh on 13 Jun 2017

👍1

We are a little behind the latest LTS for NodeJS. But thanks for that clarification!

NickHeiner on 21 Jun 2017

Of our ~18 min total build time right now, here are some steps we could take to shave time off:

< 1 min - Cache as much of the environment setup as we can (ruby, npm, gems, etc.). We already do this to a large degree on Travis.
3 min - Run the steps that are easy to parallelize in parallel (lint, js tests, rspec tests, security).
5 min - Parallelize the rspec tests (unit, feature 1, feature 2, feature X). As @joofsh pointed out, the problem here is code coverage, but I think there is a way to make this work.
?? - Use Jenkins & beefy boxes (per @jd). We lose public access to our builds with this option, which I think is pretty important.
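
As a back-of-the-envelope check on those numbers, here is a small sketch that adds up the estimated savings. It assumes the savings are independent and simply additive (which is optimistic), treats "< 1 min" as an upper bound of 1 minute, and leaves out the Jenkins option because its saving is unknown.

```ruby
# Rough estimate of the build time after the proposed savings.
# ASSUMPTION: savings are independent and additive; "< 1 min" is counted
# as 1 minute; the Jenkins option is excluded because its saving is unknown.
def estimated_build_minutes(current_minutes, savings)
  current_minutes - savings.values.sum
end

savings = {
  "cache environment setup" => 1,
  "run easy steps in parallel" => 3,
  "parallelize rspec" => 5
}

remaining = estimated_build_minutes(18, savings)
puts "estimated build time: ~#{remaining} min (down from ~18 min)"
```

Under these assumptions the build would drop from roughly 18 minutes to roughly 9.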