Caseflow: Improve CI

Created on 9 Jun 2017 · 13 Comments · Source: department-of-veterans-affairs/caseflow

This came out of retro on 6/2/17.

CI is painful:

  1. Tests are flaky.
  2. The job takes too long to run.
  3. (minor) It's annoying that restarting a build on Travis wipes away the previous build, instead of creating a new build like on Jenkins.

Look at how to improve this.

All 13 comments

A matrix build in Travis is kicked off by defining multiple values for the same variable.

We can get parallelism by sharding the rake targets, and running them in parallel on different VMs, something like this:

env:
  - rake_target=spec
  - rake_target=ci:other

script:
  - bundle exec rake $rake_target

In practice we would go deeper than this and split the targets into roughly equally sized units of work.

Running tests in parallel within the same VM can also gain speed, but in my experience that tends to introduce flakiness as well. The more parallel you can make your tests, the better, but with Travis it looks like the only parallel mechanism is predefined static parallelism (rather than dynamically assigning tests to shards).
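To make the "roughly equally sized units of work" idea concrete, here is a minimal sketch of static sharding by recorded runtime: sort the spec files from slowest to fastest and always hand the next file to the shard with the smallest total so far. `balance_shards` is a hypothetical helper, and the file names and timings are made up for illustration; nothing here comes from the Caseflow repo.

```ruby
# Greedy "longest first" partitioning: sort files by runtime (descending),
# then always add the next file to the shard with the smallest total so far.
# timings: hash of spec file path => seconds (hypothetical numbers below).
def balance_shards(timings, shard_count)
  shards = Array.new(shard_count) { { files: [], total: 0.0 } }
  timings.sort_by { |_file, seconds| -seconds }.each do |file, seconds|
    lightest = shards.min_by { |shard| shard[:total] }
    lightest[:files] << file
    lightest[:total] += seconds
  end
  shards
end

# Illustrative timings only; real values would come from a previous CI run.
EXAMPLE_TIMINGS = {
  "spec/feature/a_spec.rb" => 300.0,
  "spec/feature/b_spec.rb" => 240.0,
  "spec/feature/c_spec.rb" => 120.0,
  "spec/models/d_spec.rb" => 60.0,
  "spec/models/e_spec.rb" => 60.0
}.freeze

balance_shards(EXAMPLE_TIMINGS, 2).each_with_index do |shard, index|
  puts "shard #{index}: #{shard[:total]}s #{shard[:files].join(' ')}"
end
```

Each shard's file list could then be passed to `bundle exec rspec` in its own matrix entry. The split is still static, so it only stays balanced as long as the recorded timings stay roughly accurate.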

Another idea: move fast-failing tasks like lint to the beginning of the process, so if they're going to fail a build, they can do it quickly.
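A rough sketch of that ordering, assuming each step can be wrapped in a callable that returns true on success. The step names and time estimates are hypothetical placeholders, and `run_fail_fast` is an illustrative helper rather than anything in the existing build:

```ruby
# Run the cheapest steps first and stop at the first failure, so a lint error
# fails the build in seconds instead of after the full rspec run.
# Step names and time estimates are hypothetical placeholders.
Step = Struct.new(:name, :estimated_seconds, :action)

def run_fail_fast(steps)
  steps.sort_by(&:estimated_seconds).each do |step|
    puts "running #{step.name}"
    return step.name unless step.action.call # name of the failing step
  end
  nil # every step passed
end

steps = [
  Step.new("rspec", 900, -> { true }),
  Step.new("lint", 30, -> { false }), # simulate a lint failure
  Step.new("security", 60, -> { true })
]

failed = run_fail_fast(steps)
puts failed ? "build failed at #{failed}" : "build passed"
```

In a real build each action would shell out instead of returning a constant, for example with `Kernel#system`, which returns true when the command exits successfully.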

This effort is helped by #2305.

Link to flaky dispatch tests here: https://github.com/department-of-veterans-affairs/caseflow/issues/2187

They are being worked on next by the Tango team.

@ToddStumpf For some more context, we tested parallelizing our test suite on Travis a few months back (that is, running the tests in parallel on the same VM) and it only gained ~15 seconds. Travis seems mostly capped at 1 processor per VM.

The challenge with splitting the tests up across VMs is that we can quickly gain a lot of speed, but it immediately breaks our code coverage check (simplecov). As soon as you split the tests up, each VM only sees the coverage from its own subset of tests, so simplecov reports that the rest of the application is below the required code coverage. Some immediate options I see going forward:

1) Disable simplecov (temporarily or permanently). While it does provide a small forcing function, you could still write poor tests that meet the requirement. We've done a good job as a team of requiring good quality tests and high test coverage, and I believe this would continue without simplecov.

2) Figure out a way to have the tests run on VMs that all share the same file storage. Maybe there's a way to have a final process run after all the parallel VMs have finished that combines the individual results into a single simplecov result.

3) Move to Jenkins & use beefy boxes. Using larger instances, we could use the existing parallel test setup that allows the tests to run in parallel within the same VM and accurately merge the test results for code coverage.
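For option 2, the combining step boils down to summing per-line hit counts across shards before checking the threshold. The sketch below illustrates only that idea with plain hashes in the shape Ruby's `Coverage` module reports (file path to an array of per-line hit counts, `nil` for non-executable lines). It is not simplecov's own merge code, and the file names, numbers, and helper names are hypothetical.

```ruby
# Merge per-shard coverage so the threshold is checked once, on the combined
# result, rather than on each shard's partial view.
# Data shape: file path => array of per-line hit counts (nil = not executable).
def merge_coverage(results)
  results.each_with_object({}) do |result, merged|
    result.each do |file, lines|
      previous = merged[file] || Array.new(lines.size)
      merged[file] = lines.each_with_index.map { |hits, i|
        hits.nil? && previous[i].nil? ? nil : hits.to_i + previous[i].to_i
      }
    end
  end
end

# Percentage of executable lines that were hit at least once.
def percent_covered(coverage)
  lines = coverage.values.flatten.compact
  return 100.0 if lines.empty?
  (100.0 * lines.count(&:positive?) / lines.size).round(1)
end

# Hypothetical results from two shards that exercise different code.
shard_a = { "app/models/appeal.rb" => [1, 2, nil, 0], "app/models/task.rb" => [0, 0, nil] }
shard_b = { "app/models/appeal.rb" => [0, 0, nil, 3], "app/models/task.rb" => [1, 0, nil] }

puts percent_covered(shard_a)
puts percent_covered(merge_coverage([shard_a, shard_b]))
```

With these made-up numbers each shard on its own looks poorly covered, while the merged result is much higher, which is the effect described above: a per-shard threshold check fails even though the suite as a whole meets it.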

cc @NickHeiner

Notes on CircleCI:

Better than Travis

  • Easy access to ssh into boxes to see what's going wrong
  • Preserves old builds instead of wiping them on restart
  • More insights & stats (median build time, success rate, etc) than Travis
  • Logging is a bit nicer to parse, because it is split into sections
  • Bundler caching saved 2m18s in a single anecdotal test
  • We can use a modern NodeJS. I've been told we use an old NodeJS because Travis forces us to with its Ruby image, but it also looks like we're installing our own NodeJS, so I'm not sure what to make of that. :smile:
  • We can use our own Docker container, which could save us initial setup time. (Note that the bulk of our time is spent in rspec tests, so this probably won't be too helpful for us.)

Worse than Travis

@NickHeiner fwiw we also cache the bundler gems with our Travis setup https://github.com/department-of-veterans-affairs/caseflow/blob/master/.travis.yml#L48-L50 . It similarly shaves off a few minutes (though perhaps CircleCI is somehow more efficient).

If it's easy to switch to CircleCI, just being able to ssh into the instance feels worth the switch on its own. It is so difficult to debug problems on Travis.

Also @NickHeiner, we _were_ using an old NodeJS because Travis' base image "forced" us to. Then Alan reminded me how small the node install is, so now we manually install the latest node LTS.

We are a little behind the latest LTS for NodeJS. But thanks for that clarification!

Our total build time is ~18 min right now. These are some steps we could take to shave time off of that, with a rough estimate of the savings for each:

  • < 1 min - Cache as much of the environment setup as we can (ruby, npm, gems, etc.). We already do this to a large degree on Travis.
  • 3 min - Run the steps that are easy to parallelize in parallel (lint, js tests, rspec tests, security).
  • 5 min - Parallelize the rspec tests (unit, feature 1, feature 2, feature X). As @joofsh pointed out, the problem here is code coverage, but I think there is a way to make this work.
  • ?? - Use Jenkins & beefy boxes (per @jd). We lose public access to our builds with this option, which I think is pretty important.

?? - Use Jenkins & beefy boxes (per @jd). We lose public access to our builds with this option, which I think is pretty important.

It's been said elsewhere but just for anyone following along here: we can have a public Jenkins setup.

This is being taken care of by @askldjd @mogrenAtWork @CyberKoz @kierachell.
