This came out of retro on 6/2/17.
CI is painful:
Look at how to improve this.
Storing log for false failures here: https://github.com/department-of-veterans-affairs/caseflow/tree/travis-logs/travis
So a matrix build in travis is kicked off by defining multiple values for the same variable.
We can get parallelism by sharding the rake targets, and running them in parallel on different VMs, something like this:
    env:
      - rake_target=spec
      - rake_target=ci:other
    script:
      - bundle exec rake $rake_target
The next step would be to go deeper and split the targets into roughly equal-sized units of work.
Running tests in parallel within the same VM can also gain speed, but in my experience that introduces flakiness as well. The more parallel you can make your tests, the better, but with Travis it looks like the only parallel mechanism is predefined static parallelism (rather than dynamically assigning tests to shards one at a time).
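One way to get "roughly equally sized units of work" is to pack spec files into shards by how long they took on previous runs. The following is only a sketch of that idea: the file names and timings in `TIMINGS` are made up for illustration, and `shard_by_duration` is a hypothetical helper, not something that exists in the repo.

```ruby
# Hypothetical per-file timings in seconds. Real numbers would have to come
# from earlier CI runs; these paths and values are invented for the example.
TIMINGS = {
  "spec/feature/a_spec.rb" => 300,
  "spec/feature/b_spec.rb" => 200,
  "spec/models/c_spec.rb"  => 150,
  "spec/models/d_spec.rb"  => 100,
  "spec/lib/e_spec.rb"     => 50
}.freeze

# Greedy "longest first" packing: sort files by duration (slowest first),
# then always give the next file to the shard with the smallest total so far.
def shard_by_duration(timings, shard_count)
  shards = Array.new(shard_count) { { files: [], total: 0 } }
  timings.sort_by { |file, secs| [-secs, file] }.each do |file, secs|
    lightest = shards.min_by { |shard| shard[:total] }
    lightest[:files] << file
    lightest[:total] += secs
  end
  shards
end
```

Each shard's `:files` list could then be handed to one matrix entry, so the slowest VM finishes at about the same time as the others. The split is still static, so it would need refreshing as timings drift.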
Another idea: move fast-failing tasks like lint to the beginning of the process, so if they're going to fail a build, they can do it quickly.
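A rough sketch of the fail-fast ordering, assuming each CI step can be described by a name, an estimated duration, and a callable that returns true on success. `run_fail_fast` and `CI_STEPS` are hypothetical, and the `lint` task name and the timings are guesses, not taken from our Rakefile.

```ruby
# Run steps cheapest-first and stop at the first failure.
# Each step is [name, estimated_seconds, callable returning true on success].
# Returns the name of the failing step, or nil if everything passed.
def run_fail_fast(steps)
  steps.sort_by { |_name, secs, _run| secs }.each do |name, _secs, run|
    return name unless run.call
  end
  nil
end

# Illustrative wiring only: the task names and estimates are assumptions.
# The lambdas are not invoked until run_fail_fast is called with this list.
CI_STEPS = [
  ["spec", 900, -> { system("bundle exec rake spec") }],
  ["lint", 30,  -> { system("bundle exec rake lint") }]
].freeze
```

With this ordering a lint failure would surface after seconds instead of after the full spec run.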
This effort is helped by #2305.
Link to flaky dispatch tests here: https://github.com/department-of-veterans-affairs/caseflow/issues/2187
They are being worked on next by the Tango team.
@ToddStumpf For some more context, we tested parallelizing our test suite on Travis a few months back (that is, running the tests all in parallel on the same VM) and it only gained ~15 seconds. Travis seems mostly capped to 1 processor/VM.
The challenge with splitting the tests up across VMs is that we can quickly gain a lot of speed, but it immediately breaks our code coverage check (simplecov). As soon as you split the tests up, each VM only sees the coverage from its own share of the tests, so simplecov reports that the sections of the application exercised by the other VMs are below the required code coverage. Some immediate options I see going forward:
1) Disable simplecov (temporary or permanent). While it does provide a small forcing function, you could still write poor tests that meet the requirement. We've done a good job as a team of requiring good quality tests and high test coverage, and I believe this would continue without simplecov.
2) Figure out a way to have the tests run on VMs that all share the same file storage. Maybe there's a way to have a final process run after all the parallel VMs have finished that combines the individual results into a single set of simplecov results.
3) Move to jenkins & use beefy boxes. Using larger instances, we could use the existing parallel test setup that allows the tests to run in parallel within the same VM and accurately merge the test results for code coverage.
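To make option 2 a bit more concrete, here is a sketch of what the "last process" would do, assuming each VM can export its coverage as a hash of file path to per-line hit counts (an Integer per executable line, nil for lines that cannot run, which is the shape Ruby's `Coverage` module reports). The helper names and sample data are hypothetical, and this is not simplecov's own merge code.

```ruby
# Merge per-line hit counts from several shards into one report.
def merge_coverage(reports)
  reports.each_with_object({}) do |report, merged|
    report.each do |file, lines|
      merged[file] = merge_lines(merged[file], lines)
    end
  end
end

# Add hit counts line by line; a line stays nil only if it is nil in both.
def merge_lines(a, b)
  return b.dup if a.nil?
  length = [a.length, b.length].max
  Array.new(length) do |i|
    x = a[i]
    y = b[i]
    x.nil? && y.nil? ? nil : x.to_i + y.to_i
  end
end

# Percentage of executable lines hit at least once.
def percent_covered(report)
  relevant = report.values.flatten.compact
  return 100.0 if relevant.empty?
  (100.0 * relevant.count { |hits| hits > 0 } / relevant.size).round(1)
end

# Invented sample data: each shard alone looks badly covered.
SHARD_A = { "app/a.rb" => [1, 0, nil, 0] }.freeze
SHARD_B = { "app/a.rb" => [0, 2, nil, 0], "app/b.rb" => [1, 1] }.freeze
```

In this sample, `SHARD_A` alone covers only a third of its executable lines, while the merged report covers most of them. That gap is the false failure described above, and it is why the coverage threshold should be checked only after merging.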
cc @NickHeiner
Notes on CircleCI:
2m18s in a single anecdotal testrspec tests, so this probably won't be too helpful for us.)
@NickHeiner fwiw we also cache the bundler gems with our travis setup https://github.com/department-of-veterans-affairs/caseflow/blob/master/.travis.yml#L48-L50 . It similarly shaves off a few minutes (though perhaps circleCI is somehow more efficient).
If it's easy to switch to CircleCI, just being able to ssh into the instance alone feels worth the switch. It is so difficult to debug problems on travis.
Also @NickHeiner, we _were_ using an old NodeJS because travis' base image "forced" us to. Then Alan reminded me how small the node install is, so now we manually install the latest node LTS.
We are a little behind the latest LTS for NodeJS. But thanks for that clarification!
Our total build time is ~18 min right now. These are some steps we could take to shave some time off of that:
?? - Use Jenkins & beefy boxes (per @jd). We lose public access to our builds with this option, and I think that access is pretty important.
It's been said elsewhere but just for anyone following along here: we can have a public Jenkins setup.
This is being taken care of by @askldjd @mogrenAtWork @CyberKoz @kierachell.