Julia: Rebuild aging buildbot infrastructure with something more modern

Created on 2 Sep 2020 · 28 comments · Source: JuliaLang/julia

I would like to discuss moving the Julia buildbots to something more maintainable than buildbot. Buildbot has worked well for us for many years, but the siloing of configuration into a separate repository, combined with its relatively slow pace of development compared to many competitors (and the amount of cruft we've built up to get it as far as it is today), means it's time to move to something new.

Anything we use must have the following features:

  • Multi-platform; it _must_ support runners on all platforms Julia itself is built on. This encompasses:

    • Linux: x86_64 (glibc and musl), i686, armv7l, aarch64, ppc64le
    • Windows: x86_64, i686
    • MacOS: x86_64, aarch64
    • FreeBSD: x86_64
  • I would like the build configuration to live in an easily-modifiable format, such as a .yml file in the Julia repository. It would be nice if we could test out different configurations just by making a PR against a repository somewhere (a rough sketch of the kind of thing I mean follows this list).
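
Purely for illustration (this isn't tied to any particular provider, and the file name, agent tags, and commands are all made up), I'm imagining something along these lines, checked into the repo and editable via PR:

# .ci/pipeline.yml -- illustrative sketch only
steps:
  - label: "linux x86_64"
    agents:
      os: linux
      arch: x86_64
    command: "make -j$(nproc) && make binary-dist"
  - label: "macos aarch64"
    agents:
      os: macos
      arch: aarch64
    command: "make -j$(sysctl -n hw.ncpu) && make binary-dist"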

Possible options include:

I unfortunately likely won't have time to do this all by myself, but if we can get 1-2 community members who are interested in learning more about CI/CD and want to have a hand in bringing the Julia CI story into a better age, I'll be happy to work alongside them.

Labels: build, help wanted

Most helpful comment

Systems I have worked with in the past have allowed different images/runners to be used in different stages, so that means the build stage can have an image including all the build tools and the test stage could have an image that is more stripped down. Then this becomes a multi-stage pipeline configuration where the stages are:

  1. Clone and build Julia - package the tarballs as artifacts
  2. Test Julia
  3. Deploy to AWS

Yes! This is exactly what we have right now, only instead of "multi-stage pipelines" we have separate worker processes, each running in their own docker container/VM, and triggers that invoke one when the other is finished.

It would be good to look into building this out using Docker images for the CI as much as possible.

Yes, I'm a big fan of Docker, and all the Linux workers use Docker images (although the Dockerfile generation is a bit of a mess; again, it was written 3+ years ago), but the lack of Docker guests for MacOS and FreeBSD makes it more painful than I'd like. Previously, I tried using Docker for Windows as well, but I think I was too early and things hadn't matured enough to the point where it was a good experience. Nowadays we may have better luck, so I'd be happy to convert the Windows provisioning script to a Dockerfile and deploy those containers on our Windows buildbots.

Will our own computers/instances be used as CI runners, or are we relying solely on the CI provider?

No single service supports the wide range of platforms we need to run on, so it's a mishmash of cloud-run machines and things running on-premises in various locales. To give an exhaustive list:

  • Linux x86_64/i686 runs on the nanosoldier hardware, hosted at MIT.
  • Windows x86_64/i686 runs on AWS EC2 instances.
  • MacOS x86_64 runs on Mac Minis hosted at MIT/Keno's Datacenter/Elliot's Datacenter.
  • MacOS aarch64 runs on Mac Minis hosted at Keno's Datacenter/Elliot's Datacenter.
  • FreeBSD x86_64 runs on KVM VMs, running on OpenStack at MIT.
  • armv7l Linux is a bunch of single-board Linux computers in Elliot's Datacenter.
  • aarch64 Linux is a cloud instance on Packet.net.
  • ppc64le Linux is on the IBM cloud somewhere.

Note: in some locales, "datacenter" is pronounced "garage".

All 28 comments

As much as I love the power of GitLab CI (I have been configuring it for another project I work with), it has a major drawback when being used as an external CI provider for repositories hosted on GitHub: they say they won't run pipelines for PRs made from forks of a project (https://gitlab.com/gitlab-org/gitlab/-/issues/5667). That basically means that pipelines won't run for anyone who doesn't have permission to make a branch inside the main Julia repo, which would make evaluating new-contributor PRs impractical.

I'm not sure if @ChrisRackauckas or @maleadt have ways they work around this for the GPU infrastructure, but that would appear to be a blocking issue for the use of GitLab CI.

JuliaGPU CI uses bors for this. We can just manually say bors try and kick off a job.

JuliaGPU CI uses bors for this. We can just manually say bors try and kick off a job.

That seems like it would not be an ideal workflow for the main Julia repo, though, since it would require the tests to be kicked off manually (unless bors can be run automatically in some way).

There's SourceHut's CI from @ddevault, which has GitHub integration and a wide range of OS support.
The biggest problem is missing Windows and Mac support, and I'm not sure how you'd feel about running 2 CIs (one for the unixy systems and another for the proprietary ones).

We tried using bors a long time ago in Pkg, and we had a pretty negative experience with it. Getting it to actually test what we wanted was challenging: it seemed to get confused if you pushed new commits while the previous commit was still testing, and it would then not test the new commit but simply re-use the old one. We had a hard time canceling old tests and re-running the latest tip of the branch, and looking up the status of a branch wasn't easy, since the bors tests would actually happen on a separate branch.

I would personally prefer something that doesn't require a workaround like bors. If GitLab CI can't run on PRs from forks, that's probably a deal-breaker. If there were a simple adapter that worked around this, I'd consider it, but without it, I think we'll have to look elsewhere.

The biggest problem is missing Windows and Mac support

That is a big problem, I'd call that a dealbreaker. :)

It looks like Buildkite is the only system that meets all of our requirements?

Regarding the GitLab CI PR-from-fork and bors stuff, one thing you can do is set up some automation to pull in forked PR branches into the JuliaLang repo so that CI will run normally, and then forward the CI status on those branches to the PR (I think that bit is possible, at least).

I might have a go at a POC to see if it's practical, though it might be too much complexity/maintenance burden.

I think Buildkite is definitely worth exploring. Without knowing much about it, the only con I can see is that it's another new system that not many of us have used before (seeing as there are already Julia ecosystems using GitLab CI, Azure Pipelines, GHA, etc.).

Does Jenkins meet all of our requirements?

one thing you can do is set up some automation to pull in forked PR branches into the JuliaLang repo so that CI will run normally

I'd be a little careful about this. For security reasons, we want to treat forked PRs differently from non-forked PRs. For example:

  1. Forked PRs should not be allowed to modify the CI configuration files.
  2. Forked PRs should not have access to any secrets, such as API access tokens, signing keys, etc.

There may be other security considerations as well.

Does Jenkins meet all of our requirements?

Jenkins should run anywhere that Java runs, which is a lot of places. So it seems promising (although I must admit Jenkins "feels" old-school to me).

edit: I think maybe powerpc would be an issue here too
edit2: Actually maybe the agents don't need to run Jenkins software so they could be on any architecture. Not entirely sure, still reading docs.

I'd be a little careful about this [...]

Yeah, this is true, which is why it would probably be a lot of work/maintenance to do properly when there are likely existing tools that already do what we need 🙂

edit: I think maybe powerpc would be an issue here too
edit2: Actually maybe the agents don't need to run Jenkins software so they could be on any architecture. Not entirely sure, still reading docs.

I _think_ (but I'm not 100% sure) that Jenkins agents can run on ppc64le.

(although I must admit Jenkins "feels" old-school to me).

I agree. I think Buildkite feels like a more modern, "slicker" solution.

Also, Buildkite has explicit support for ppc64le, so no ambiguity there.

I think Buildkite is definitely worth exploring. Without knowing much about it, the only con I can see is that it's another new system that not many of us have used before (seeing as there are already Julia ecosystems using GitLab CI, Azure Pipelines, GHA, etc.).

I agree this is the only con.

https://github.com/JuliaCI/julia-buildbot is what we're trying to replicate/replace, right?

Does Jenkins meet all of our requirements?

Man, Jenkins was old 8 years ago when Julia CI consisted of bash scripts I wrote that received a webhook, built Julia, and published the results. :P

@christopher-dG if you're interested in prototyping any of this stuff out, let me know if I can help you in any way. The buildbot provisioning steps are currently a mishmash of Docker images (for Linux), PowerShell AWS provisioning scripts (for Windows), a hand-baked KVM image for FreeBSD, and just a series of copy-pasted commands for MacOS. Much of the pain we go through on Linux is due to the fact that we want to use very old glibcs for compatibility, so we use very old distributions, which themselves don't have things like recent GCCs (we try to use GCC 7 for most things these days). So we have to build a lot of that from scratch. Now that I know more about GCC than I did... three years ago (my goodness, time flies), we can probably streamline this a lot by building GCC in a similar way to how we do it in BinaryBuilder.

Does Jenkins meet all of our requirements?

Man, Jenkins was old 8 years ago when Julia CI consisted of bash scripts I wrote that received a webhook, built Julia, and published the results. :P

Buildkite it is 😂

Man, Jenkins was old 8 years ago [...]

I'm glad that my feelings about Jenkins are not uncommon 😅

@staticfloat I'll get back to you sometime next week, in the meantime I'll have a look through the buildkite docs.

JuliaCI/julia-buildbot is what we're trying to replicate/replace, right?

Yes, that's right. In brief, the majority of what's going on there is just receiving PRs from GitHub and kicking off various jobs:

  • Packagers

    • Check out and build Julia

    • If the build is successful, upload tarballs/installers to AWS

    • If the build is successful, trigger download and testing on a separate machine that doesn't have GCC installed

    • If testing is successful, "promote" the binary by moving it from one folder on AWS to another so that it gets picked up by our ecosystem CI as a new nightly

  • doc builder
  • whitespace checker
  • static analysis passes

The majority of the logic is doing things like working around buildbot idiosyncrasies (it's difficult to do things like run a command on the builder, get the output, then use that in a future step... ironically, it would be much easier to just use a big bash script), figuring out which URLs to download/upload Julia from/to, and figuring out which steps need to run (e.g. PR builds don't get promoted even if they pass tests, release-* branches shouldn't conflict with absolute-latest master builds, etc.).

Sketch of work to be done

I see there being multiple possible areas of improvement:

  • CI configuration hackability: Move to a tool with CI configuration in a more hackable format, so that others can help to maintain the CI infrastructure without having a graduate degree in buildbotology.
  • CI environment maintainability: I haven't touched the provisioning scripts for our Linux buildbots in years now. If we needed to change something in them, I have 20% confidence that I could do it in less than an hour, and only 70% confidence that I could do it in less than a day. It'd be nice to have a provisioning tool that is more unified across platforms, perhaps an Ansible playbook that gets run locally (a rough sketch follows this list). I've used many different provisioning tools in the past and Ansible is not bad, but it can be a bit of a pain to get everything set up. It's possible we can use BB to generate build environments in one fell swoop, and our provisioning becomes as simple as extracting a tarball. I'll have to think more about that.
  • CI integration: If we can get a good tool integrated well with GitHub, I can see people like Dilum running away with it and building tools to do things like improve the debugging experience for Base PRs, allow (trusted) users to run commands on the buildbots, etc., to really cut down on the frustration of devs being unable to test things on foreign archs.
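
To make the Ansible idea above a little more concrete, here's a rough sketch of what a unified agent-provisioning playbook could look like; everything in it (inventory group, package names, paths, the toolchain tarball) is hypothetical and just for illustration:

# provision_ci_agent.yml -- hypothetical sketch, not our actual provisioning
- hosts: ci_workers              # hypothetical inventory group
  become: true
  tasks:
    - name: Install base build dependencies (package names are illustrative)
      ansible.builtin.package:
        name:
          - git
          - make
          - curl
        state: present

    - name: Create an unprivileged user for the CI agent
      ansible.builtin.user:
        name: ci-agent
        shell: /bin/bash

    - name: Ensure the toolchain directory exists
      ansible.builtin.file:
        path: /opt/toolchain
        state: directory

    - name: Unpack a prebuilt toolchain tarball (e.g. one generated with BinaryBuilder)
      ansible.builtin.unarchive:
        src: files/toolchain-x86_64-linux-gnu.tar.gz   # hypothetical artifact
        dest: /opt/toolchain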

IMO let's chip away at this list one item at a time. Let's find a good tool, then we can fill it with beautiful new bits, and then we can make the bits dance for our developing pleasure.

While I haven't used Buildkite before, it looks like a nice system that is similar in feel to GitLab CI and Travis.

Packagers

  • Check out and build Julia
  • If the build is successful, upload tarballs/installers to AWS
  • If the build is successful, trigger download and testing on a separate machine that doesn't have GCC installed
  • If testing is successful, "promote" the binary by moving it from one folder on AWS to another so that it gets picked up by our ecosystem CI as a new nightly

This intermediate upload-to-AWS step should be easy to factor out in the newer systems (like Buildkite) by treating the tarballs/installers as artifacts that are passed from one pipeline stage to another. That way the actual CI system handles the upload/download for you instead of having to do it manually. Systems I have worked with in the past have allowed different images/runners to be used in different stages, so that means the build stage can have an image including all the build tools and the test stage could have an image that is more stripped down. Then this becomes a multi-stage pipeline configuration where the stages are (a sketch follows the list below):

  1. Clone and build Julia - package the tarballs as artifacts
  2. Test Julia
  3. Deploy to AWS
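
To illustrate (this assumes a Buildkite-style pipeline; the queue tags, commands, artifact names, and the S3 bucket below are all made up for the sake of the example):

steps:
  - label: "build"
    agents:
      queue: build                  # image with the full build toolchain
    command: "make binary-dist"
    artifact_paths: "julia-*.tar.gz"

  - wait

  - label: "test"
    agents:
      queue: test                   # stripped-down image without gcc
    commands:
      - "buildkite-agent artifact download 'julia-*.tar.gz' ."
      - "tar xzf julia-*.tar.gz && ./julia-*/bin/julia -e 'Base.runtests()'"

  - wait

  - label: "deploy to AWS"
    commands:
      - "buildkite-agent artifact download 'julia-*.tar.gz' ."
      # the exact upload invocation and bucket name are illustrative
      - "aws s3 sync . s3://julialang-nightlies/ --exclude '*' --include 'julia-*.tar.gz'"

Here the CI system's artifact store carries the tarball from the build step to the test step, and only the final step touches S3.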

Now for an important question: Will our own computers/instances be used as CI runners, or are we relying solely on the CI provider?

It would be good to look into building this out using Docker images for the CI as much as possible. That should be simple for all the Linux-based builds and testing images, and I think Windows is available in a Dockerized form (but I have never actually tried it yet).

Now for an important question: Will our own computers/instances be used as CI runners, or are we relying solely on the CI provider?

I believe that we will be self-hosting all of the runners on our own hardware.

EDIT: This is not quite correct; see Elliot's comment below for the full answer.

Systems I have worked with in the past have allowed different images/runners to be used in different stages, so that means the build stage can have an image including all the build tools and the test stage could have an image that is more stripped down. Then this becomes a multi-stage pipeline configuration where the stages are:

  1. Clone and build Julia - package the tarballs as artifacts
  2. Test Julia
  3. Deploy to AWS

Yes! This is exactly what we have right now, only instead of "multi-stage pipelines" we have separate worker processes, each running in their own docker container/VM, and triggers that invoke one when the other is finished.

It would be good to look into building this out using Docker images for the CI as much as possible.

Yes, I'm a big fan of Docker, and all the Linux workers use Docker images (although the Dockerfile generation is a bit of a mess; again, it was written 3+ years ago), but the lack of Docker guests for MacOS and FreeBSD makes it more painful than I'd like. Previously, I tried using Docker for Windows as well, but I think I was too early and things hadn't matured enough to the point where it was a good experience. Nowadays we may have better luck, so I'd be happy to convert the Windows provisioning script to a Dockerfile and deploy those containers on our Windows buildbots.

Will our own computers/instances be used as CI runners, or are we relying solely on the CI provider?

No single service supports the wide range of platforms we need to run on, so it's a mishmash of cloud-run machines and things running on-premises in various locales. To give an exhaustive list:

  • Linux x86_64/i686 runs on the nanosoldier hardware, hosted at MIT.
  • Windows x86_64/i686 runs on AWS EC2 instances.
  • MacOS x86_64 runs on Mac Minis hosted at MIT/Keno's Datacenter/Elliot's Datacenter.
  • MacOS aarch64 runs on Mac Minis hosted at Keno's Datacenter/Elliot's Datacenter.
  • FreeBSD x86_64 runs on KVM VMs, running on OpenStack at MIT.
  • armv7l Linux is a bunch of single-board Linux computers in Elliot's Datacenter.
  • aarch64 Linux is a cloud instance on Packet.net.
  • ppc64le Linux is on the IBM cloud somewhere.

Note: in some locales, "datacenter" is pronounced "garage".

How does the existing Travis and AppVeyor stuff relate to all this (if at all)?

edit: looking at Travis, AppVeyor, and recent commit/PR checks, they both appear to be unused now.

Yeah, we don't use Travis or Appveyor on Julia base anymore.

Sorry for the inactivity; I got busy/sidetracked. I have been playing around with Buildkite a bit over the last couple of weeks, and it seems fine so far. The agent is really easy to set up and run. The only slightly awkward thing is that there's no support for "build matrices" like Travis/GHA have; you can either use YAML anchors to slightly reduce the duplication (see the sketch below) or run a script that generates your pipeline instead of a static config file. You can run jobs in Docker containers, so the existing Docker images that contain the required build tools can still be used, although we can probably make that stuff neater in the future like Elliot mentioned. I think it would be a great idea to try to get Windows onto Docker as well.
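
For reference, the YAML-anchor workaround looks roughly like this; the labels, commands, and agent tags are placeholders, not a proposal for our actual pipeline:

steps:
  - &linux_step                     # anchor the first step's full definition
    label: "linux x86_64"
    command: "make -j$(nproc) && make testall"
    timeout_in_minutes: 120
    agents:
      os: linux
      arch: x86_64
  - <<: *linux_step                 # merge the anchored step, then override what differs
    label: "linux aarch64"
    agents:
      os: linux
      arch: aarch64

It cuts down on the repetition a bit, but each variant still has to be written out by hand, which is why the generated-pipeline route might scale better for a big platform matrix.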

Thanks for the update, Chris! I'm a little surprised to hear it doesn't have matrix support, but taking a quick look, it seems like we can dynamically generate pipelines then trigger those pipelines if need be, which shouldn't be _too_ bad?

Yeah it shouldn't be terrible: https://buildkite.com/docs/pipelines/defining-steps#dynamic-pipelines

Yeah, the only big loss is readability; it's a lot easier to parse a pipeline that's written in YAML than a Python/Julia script that outputs YAML.

Although you could have a static pipeline "template" that looks like:

steps:
  - step1:
  - step2:
  - step3:
  - DUMMY_THAT_NEEDS_MATRIX_SUPPORT:
  - step5:

And then run a script that reads the template, then replaces the "dummy" step with N identical jobs that only differ in their OS tags and so on.
All in all, definitely not insurmountable.
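
For concreteness, the usual way to hook that up is a tiny static "bootstrap" pipeline whose only job is to run the generator and hand its output to the agent; the script name here is purely hypothetical:

steps:
  - label: ":pipeline: generate pipeline"
    command: |
      # Expand the template, replacing the dummy step with one job per OS/arch,
      # and upload the resulting YAML to the CI system. The script name is made up.
      julia .buildkite/expand_template.jl | buildkite-agent pipeline upload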
