Cats: Evaluate moving to Circle CI

Created on 9 Jul 2018 · 30 Comments · Source: typelevel/cats

Travis' memory issue is a bit too much and our build there now takes more than 3 hours.

All 30 comments

Might also look at BuildKite if we can get people to donate hardware. It's easy enough for me to set up, which means it's easy. An agent running on @larsrh's 3874629847653-core machine would be 🔥

@tpolecat That won't work, unfortunately. That machine is university property.

@kailuowang I'm a bit out of the loop, I think. The build takes 3 hours? How long does it take locally? What is it doing? Thanks to working at SlamData, I have a remarkably vast swath of experience debugging slow Travis builds. I'd be happy to take a look if you want.

The 2-3 hours is the combined total of the builds; each job typically takes 20-30 minutes: https://travis-ci.org/typelevel/cats/builds/415542808

And a lot of that time is coverage testing, tut testing, doc testing, site building and so on.

Ok taking a quick look at things, literally the first things that occur to me:

  • Oh god, you're using a separated build script… I hate that convention.
  • Why is sudo: required? I'm relatively certain those VMs are slower. Is it just for codecov? See below.
  • The travis-publish.sh script goes to great lengths to push things all into a single SBT instance. In my experience, this is exactly the opposite of what you want to do when you have a slow build. Separate SBT processes, sequentially invoked, gives you better memory characteristics and is better understood by Travis (especially if you don't split the build script out of .travis.yml).
  • .jvmopts uses -Xmx6g. This is problematic because Travis doesn't have that much memory! You should strongly consider dropping that option altogether and allowing it to be the default (ditto with -Xms), which will be scaled off of the reported system memory.
  • We should have a discussion about whether or not code coverage is actually worth anything. Frankly, I've never seen it provide any value whatsoever, and it doubles the duration of the JVM build.
  • Why is the Ivy cache not being sanitized prior to publication? This is resulting in re-caching quite often.
  • Random best-practice: consider commenting on each of the secure variables so we know which one is which.

I didn't look at SBT itself. Looks like a lot of the logic is in tasks, so that may also contribute.

The build script actually does invoke sbt multiple times, but for the JVM we could split it up even more, as the JS build does. That said, the JVM issue normally happens relatively early in the build.

As for sudo: it gives a slower startup, but you get 7.5 GB of memory. We could try a lower setting. Ref: https://docs.travis-ci.com/user/reference/overview/

Why is sudo: required? I'm relatively certain those VMs are slower.

sudo: required gets 7.5 GB as opposed to 4GB. http4s adopted it because the IO was untenable on the container builds, but that should be far less a factor in cats.

We should have a discussion about whether or not code coverage is actually worth anything. Frankly, I've never seen it provide any value whatsoever, and it doubles the duration of the JVM build.

:+1:

My main concern _before_ moving would be to ensure that it really is not our build at fault! One simple option is to add parallelExecution := false to the JVM settings; it is already set for JS.
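
A minimal sketch of that setting in a build.sbt, assuming a recent sbt 1.x. Where exactly it would go among the JVM settings of the cats build is not shown in this thread, so the placement is an assumption.

```scala
// build.sbt (sketch): run test suites one at a time instead of in parallel,
// trading speed for a lower peak memory footprint.
// Older sbt releases spell this `parallelExecution in Test := false`.
Test / parallelExecution := false
```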

Re scoverage times... be careful here. The scoverage tests _also_ run the scalacheck tests, but with larger parameters than JS. And after a successful coverage run, the code is just rebuilt, not tested.

So whilst coverage will always be slower, I doubt it's causing any issues. What we might want to do is try running scoverage with very low parameters (just to get coverage) and then run the full scalachecks with no scoverage.
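
One way to picture that split is a switch that picks the number of generated cases per property depending on the kind of run. The sketch below is purely illustrative: the property name cats.coverageRun and the numbers 5 and 100 are made up, and how the value would be fed into the test configuration is left out.

```scala
object CheckBudget {
  // Hypothetical property name; it would be passed as -Dcats.coverageRun=true
  // on the coverage job only.
  val CoverageProp = "cats.coverageRun"

  // Illustrative numbers only, not the values the cats build uses.
  def minSuccessful(coverageRun: Boolean): Int =
    if (coverageRun) 5 else 100

  // True only when the property is present and set to "true".
  def isCoverageRun(props: collection.Map[String, String]): Boolean =
    props.get(CoverageProp).contains("true")

  def main(args: Array[String]): Unit =
    println(minSuccessful(isCoverageRun(sys.props)))
}
```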

IMHO, keeping/ditching coverage is best discussed as a separate issue

@djspiewak thanks so much for helping. And @BennyHill thanks for answering some of the questions.

To answer your questions above.

  • I'm not a fan of the separated build script either. Maybe we can replace it, but it didn't bother me enough to spend time on that either.
  • That is, errr, a way to tell Travis to use a different VM (see @BennyHill's answer above). I don't believe sudo is actually needed for the build to run. We added it at least a year ago, the last time we had memory issues with Travis. It might be worth trying to remove it if we can squeeze
  • +1 on dropping -Xmx6g especially if we can use a different VM
  • Code coverage combined with the codecov Chrome extension made it very easy to identify uncovered code in PRs. I agree that the overall coverage number for a PR isn't that critical. We can probably improve the build by limiting code coverage to a single Scala 2.12 JVM build job. Right now it's performed on both the Scala 2.11 and 2.12 JVM jobs.
  • no idea. worth a try.
  • also +1 on adopting that best practice. I think the two we have are sonatype credentials.

re the parallelExecution := false idea, this came up the other day on the scala native channel - https://gitter.im/scala-native/scala-native?at=5b6d631fa6af14730b170260

Finally, re the "separated build scripts": this was originally done as per the CI docs.

But of course, that was a while back, so perhaps we can revisit that.

And finally, finally... one small advantage of a separate build script is that it's far easier to "run" from the command line without having a local Travis. See https://github.com/typelevel/cats/blob/master/scripts/travis-publish.sh#L17-L18

If you drop sudo: required it would be a good idea to add -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap to ensure that the heap size is set according to the container's memory limits
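
For tests that run in a forked JVM, the same flags could be passed through sbt. This is a sketch under assumptions: the thread does not say whether the cats build forks its tests, and sbt only applies javaOptions to forked JVMs, so the fork line is part of the assumption. The flags are the experimental ones named in the comment above; newer JDKs may not accept them.

```scala
// build.sbt (sketch): run tests in a separate JVM and size its heap
// from the container's cgroup memory limit rather than the host's memory.
Test / fork := true
Test / javaOptions ++= Seq(
  "-XX:+UnlockExperimentalVMOptions",
  "-XX:+UseCGroupMemoryLimitForHeap"
)
```

For the JVM that runs sbt itself, the flags would go in .jvmopts instead.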

If a decision is made to move to circleci let me know as I would gladly help. Have used circleci last few years exclusively.

Are there any other alternatives being considered?

I have also heard really good things about Semaphore and BuildKite, although BuildKite requires its own infrastructure (I can highly recommend packet.com for that) and Semaphore's OSS policy seems to have mysteriously become a "Please email us if you are an OSS project" policy

I had a look at this a few days ago and my overwhelming impression was that it's hard to define a build matrix in a nice way in every one of the hosted services other than Travis. It's possible in Circle CI but relies on YAML dictionary operations rather than being a construct in its own right.

Whether that's a problem or not depends on how much faster (if at all) the builds run on those services IMO 😀

Thanks guys. We haven't seriously looked at any of the alternatives yet, but we probably should soon, given the elevated uncertainty about Travis' future and its suboptimal reliability lately. An easier migration from Travis is a nice-to-have: if we have to switch yet again, we are slightly more likely to find another service that somewhat conforms to the Travis way. How easy is it to set up a trial on Circle CI?

I'd be glad to give a few different services a go and report back @kailuowang?

@DavidGregory084 that would be amazing. Thanks!

There's something interesting [image] about testing new CI systems [image] that brings out all the weird bugs 😄

Guys, I've opened a few PRs which demonstrate the config required to use different services.

I evaluated CircleCI too but I found that the container memory limit of 4GB was just not enough to run cats builds reliably. I found the configuration to be quite verbose, and I also had issues where the config validation in the CircleCI CLI disagreed with the service itself and my build didn't run after passing validation locally.

These services do experience intermittent failures with builds, but they all seem to be caused by a single flaky test (ApplicativeTests.monoid.combineAll).

I think we should focus on fixing that whatever we decide to do about CI in the future.

So far my instinct is that Drone.io is probably the best option as it is free for open source, easy to configure and super fast.

Semaphore has a very unclear open source policy and although Buildkite is very nice, I think that managing hardware in addition to the build itself could become a bit of a chore.

Thanks, @DavidGregory084, that's a lot of work. I will check out the configs in your PRs and take a stab at ApplicativeTests.monoid.combineAll.

@DavidGregory084 Out of curiosity, what specific memory-related issues did you hit with CircleCI? Were you leveraging any of Circle's parallel processing features?

@softinio you can see the config I used here. I tried using the cgroup memory limit detection (-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap), which didn't work correctly on CircleCI and resulted in the JVM allocating way too much memory. I also tried reducing the JVM memory allocation to 3.5G but I was still getting multiple jobs on each build killed by the CircleCI infra (Exited with code 137). You can see some example runs here.
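
For context on "Exited with code 137": by shell convention an exit status above 128 means the process was terminated by a signal, with the signal number being the status minus 128. 137 therefore corresponds to signal 9 (SIGKILL), which fits a job being killed from outside for exceeding its memory limit. A tiny sketch of that arithmetic, with ExitCodes and signalFromExit as hypothetical names:

```scala
object ExitCodes {
  // Shell convention: a status above 128 means "terminated by signal (status - 128)".
  def signalFromExit(status: Int): Option[Int] =
    if (status > 128 && status < 256) Some(status - 128) else None

  def main(args: Array[String]): Unit =
    println(signalFromExit(137)) // Some(9), i.e. SIGKILL
}
```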

@softinio it seems like exceeding 4GB of available memory requires using a paid plan; as an open source project we could probably use the resource_class: large if we contacted CircleCI support.

Update on this: Semaphore would like to donate 8 bare metal performance agents for Cats CI. In my tests, this cuts Cats' build time in half. I think we should consider migrating to Semaphore. The main reason is that we have so many TL projects on Travis all sharing 6 slow agents, so it would be nice to have some more powerful CI resources.

Since nobody has worked on this for quite a while, I'm closing all old CI-related PRs.
