I would like to discuss moving the Julia buildbots to something more maintainable than buildbot. It's worked well for us for many years, but the siloing of configuration into a separate repository, combined with the relatively slow pace of development as compared to many other competitors (and also the amount of cruft we've built up to get it as far as it is today), means it's time to move to something new.
Anything we use must have the following features:
Multi-platform; it _must_ support runners on all platforms Julia itself is built on. This encompasses:
- Linux: x86_64 (glibc and musl), i686, armv7l, aarch64, ppc64le
- Windows: x86_64, i686
- macOS: x86_64, aarch64
- FreeBSD: x86_64
I would like the build configuration to live in an easily-modifiable format, such as a .yml file in the Julia repository. It's nice if we can test out different configurations just by making a PR against a repository somewhere.
Possible options include:
Azure Pipelines (Note: incomplete runner support)
GitHub Actions (Note: incomplete runner support)
I unfortunately likely won't have time to do this all by myself, but if we could get 1-2 community members interested in learning more about CI/CD and who want to help have a hand in bringing the Julia CI story into a better age, I'll be happy to work alongside them.
As much as I love the power of GitLab CI (I have been configuring it for another project I work with), it has a major drawback when being used as an external CI provider for repositories hosted on GitHub: they say they won't run pipelines for PRs made from forks of a project (https://gitlab.com/gitlab-org/gitlab/-/issues/5667). That basically means that pipelines won't be run from anyone who doesn't have permissions to make a branch inside the main Julia repo, and will make evaluating new contributor PRs impractical.
I'm not sure if @ChrisRackauckas or @maleadt have ways they work around this for the GPU infrastructure, but that would appear to be a blocking issue for the use of GitLab CI.
JuliaGPU CI uses bors for this. We can just manually say bors try and kick off a job.
JuliaGPU CI uses bors for this. We can just manually say bors try and kick off a job.
That seems like it would not be an ideal workflow for the main Julia repo though, since it will require the tests to be kicked off manually (unless bors can be run automatically in some way).
There's SourceHut's CI from @ddevault, with GitHub integration and a wide range of OS support.
The biggest problem is missing Windows and Mac support, and I'm not sure how you'd view running 2 CIs (one for unixy systems and another for proprietary ones).
We tried using bors a long time ago in Pkg, and we had a pretty negative experience with it. Getting it to actually test what we wanted was challenging. It seemed to get confused if you pushed new commits while the previous commit was still testing, so that it would then not test the new commit and simply re-use the old one. We had a hard time canceling old tests and re-running the latest tip of the branch, and looking up the status of a branch wasn't easy since the bors tests would actually happen on a separate branch.
I would personally prefer something that doesn't require a workaround like bors. If GitLab CI can't run on PRs from forks, that's probably a deal-breaker. If there were a simple adapter that worked around this, I'd consider it, but without it, I think we'll have to look elsewhere.
The biggest problem is missing Windows and Mac support
That is a big problem, I'd call that a dealbreaker. :)
It looks like Buildkite is the only system that meets all of our requirements?
Regarding the GitLab CI PR-from-fork and bors stuff, one thing you can do is set up some automation to pull in forked PR branches into the JuliaLang repo so that CI will run normally, and then forward the CI status on those branches to the PR (I think that bit is possible, at least).
I might have a go at a POC to see if it's practical, it might be too much complexity/maintenance burden though.
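As a rough sketch of the "forward the CI status to the PR" half of that idea: GitHub's commit-status API (`POST /repos/{owner}/{repo}/statuses/{sha}`) is real, but everything else below (the state mapping, repo names, and helper) is illustrative, not an existing tool.

```python
# Hypothetical sketch: map a GitLab CI pipeline state onto a GitHub
# commit status. The GitHub endpoint is real; the mapping and names
# here are assumptions for illustration only.
import json

# GitHub only accepts error/failure/pending/success as states.
STATE_MAP = {
    "running": "pending",
    "pending": "pending",
    "success": "success",
    "failed": "failure",
    "canceled": "error",
}

def build_status_request(owner, repo, sha, gitlab_state, pipeline_url):
    """Return (url, json_body) for a GitHub commit-status POST."""
    url = f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
    payload = {
        "state": STATE_MAP.get(gitlab_state, "error"),
        "target_url": pipeline_url,
        "description": f"GitLab CI pipeline: {gitlab_state}",
        "context": "ci/gitlab",
    }
    return url, json.dumps(payload)

# Example: forward a successful pipeline onto a PR's head commit.
url, body = build_status_request(
    "JuliaLang", "julia", "0123abcd", "success",
    "https://gitlab.example.com/pipelines/1")
```

An actual POC would also need webhook plumbing and token handling, which is where the maintenance burden comes in.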
I think buildkite is definitely worth exploring, without knowing much about it the only con I can see is it's another new system that not many of us have used before (seeing as there are Julia ecosystems using GitLab CI, Azure Pipelines, GHA, etc. already).
Does Jenkins meet all of our requirements?
one thing you can do is set up some automation to pull in forked PR branches into the JuliaLang repo so that CI will run normally
I'd be a little careful about this. For security reasons, we want to treat forked PRs differently from non-forked PRs. For example:
There may be other security considerations as well.
Does Jenkins meet all of our requirements?
Jenkins should run anywhere that Java runs, which is a lot of places. So it seems promising (although I must admit Jenkins "feels" old-school to me).
edit: I think maybe powerpc would be an issue here too
edit2: Actually maybe the agents don't need to run Jenkins software so they could be on any architecture. Not entirely sure, still reading docs.
I'd be a little careful about this [...]
Yeah, this is true, which is why it would probably be a lot of work/maintenance to do properly when there are likely things that already exist which do what we need 🙂
edit: I think maybe powerpc would be an issue here too
edit2: Actually maybe the agents don't need to run Jenkins software so they could be on any architecture. Not entirely sure, still reading docs.
I _think_ (but I'm not 100% sure) that Jenkins agents can run on ppc64le.
(although I must admit Jenkins "feels" old-school to me).
I agree. I think Buildkite feels like a more modern, "slicker" solution.
Also, Buildkite has explicit support for ppc64le, so no ambiguity there.
I think buildkite is definitely worth exploring, without knowing much about it the only con I can see is it's another new system that not many of us have used before (seeing as there are Julia ecosystems using GitLab CI, Azure Pipelines, GHA, etc. already).
I agree this is the only con.
https://github.com/JuliaCI/julia-buildbot is what we're trying to replicate/replace. right?
Does Jenkins meet all of our requirements?
Man, Jenkins was old 8 years ago when Julia CI consisted of bash scripts I wrote that received a webhook, built Julia, and published the results. :P
@christopher-dG if you're interested in prototyping any of this stuff out, let me know if I can help you in any way. The buildbot provisioning steps are currently a mishmash of docker images (for Linux), PowerShell AWS provisioning scripts (for Windows), a hand-baked KVM image for FreeBSD, and just a series of copy-pasted commands for MacOS. Much of the pain we go through on Linux is due to the fact that we want to use very old glibcs for compatibility, and so we use very old distributions which themselves don't have things like recent GCCs (we try to use 7 for most things these days). So we have to build a lot of that from scratch. Now that I know more about GCC than I did... three years ago (my goodness, time flies), we can probably streamline this a lot by building GCC in a similar way to the way we do in BinaryBuilder.
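For the curious, the "old distro, new GCC" pattern looks roughly like the following Dockerfile. This is an illustrative sketch only (base image, GCC version, and paths are assumptions, not the actual julia-buildbot setup):

```dockerfile
# Illustrative: build on an old distro for glibc compatibility,
# but bootstrap a modern GCC from source since the distro's is too old.
FROM centos:7

RUN yum install -y gcc gcc-c++ make m4 wget perl bzip2 file which \
    && yum clean all

ARG GCC_VERSION=7.5.0
RUN wget -q https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.gz \
    && tar xf gcc-${GCC_VERSION}.tar.gz \
    && cd gcc-${GCC_VERSION} \
    && ./contrib/download_prerequisites \
    && mkdir build && cd build \
    && ../configure --prefix=/opt/gcc --disable-multilib \
                    --enable-languages=c,c++,fortran \
    && make -j"$(nproc)" && make install \
    && cd / && rm -rf gcc-${GCC_VERSION}*

# Binaries built with this GCC still link against the old system glibc.
ENV PATH=/opt/gcc/bin:$PATH
```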
Does Jenkins meet all of our requirements?
Man, Jenkins was old 8 years ago when Julia CI consisted of bash scripts I wrote that received a webhook, built Julia [...]
Buildkite it is 😂
Man, Jenkins was old 8 years ago [...]
I'm glad that my feelings about Jenkins are not uncommon 😅
@staticfloat I'll get back to you sometime next week, in the meantime I'll have a look through the buildkite docs.
JuliaCI/julia-buildbot is what we're trying to replicate/replace. right?
Yes, that's right. In brief, the majority of what's going on there is just receiving PRs from GitHub and kicking off various jobs:
- Check out and build Julia
- If the build is successful, upload tarballs/installers to AWS
- If the build is successful, trigger download and testing on a separate machine that doesn't have gcc installed
- If testing is successful, "promote" the binary by moving it from one folder on AWS to another so that it gets picked up by our ecosystem CI as a new nightly

The majority of the logic is doing things like working around buildbot idiosyncrasies (it's difficult to do things like run a command on the builder, get the output, then use that in a future step... ironically, it would be much easier to just use a big bash script), figuring out which URLs to download/upload Julia to, and figuring out which steps need to run (e.g. PR builds don't get promoted even if they pass tests, release-* branches shouldn't conflict with absolute-latest master builds, etc.).
I see there being multiple possible areas of improvement:
IMO let's chip away at this list one item at a time. Let's find a good tool, then we can fill it with beautiful new bits, and then we can make the bits dance for our developing pleasure.
While I haven't used buildkite before, it looks like a nice system that is similar in feel to Gitlab CI and Travis.
Packagers
- Check out and build Julia
- If build is successful, uploads tarballs/installers to AWS
- If build is successful, triggers download and testing on a separate machine that doesn't have gcc installed
- If testing is successful, "promote" the binary by moving it from one folder on AWS to another so that it gets picked up by our ecosystem CI as a new nightly
This intermediate upload-to-AWS step should be easy to factor out in the newer systems (like Buildkite) by treating the tarballs/installers as artifacts that are passed from one pipeline stage to another. That way the actual CI system handles the upload/download for you instead of you having to do it manually. Systems I have worked with in the past have allowed different images/runners to be used in different stages, so the build stage can have an image including all the build tools and the test stage can have an image that is more stripped down. Then this becomes a multi-stage pipeline configuration where the stages are:
- Clone and build Julia; package the tarballs as artifacts
- Test Julia
- Deploy to AWS
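In Buildkite terms, a pipeline along those lines might look roughly like this (the step labels, queue names, and `contrib/*.sh` scripts are made up for illustration; `artifact_paths` and `buildkite-agent artifact download` are the real mechanisms):

```yaml
steps:
  - label: "build"
    command: "contrib/build_julia.sh"          # hypothetical script
    artifact_paths: "artifacts/julia-*.tar.gz" # uploaded automatically
    agents: { queue: "build-x86_64-linux" }

  - wait

  - label: "test"
    # Pull the tarball built in the previous stage; this runner has no gcc.
    command: |
      buildkite-agent artifact download "artifacts/julia-*.tar.gz" .
      contrib/test_julia.sh                    # hypothetical script
    agents: { queue: "test-x86_64-linux" }

  - wait

  - label: "deploy"
    command: |
      buildkite-agent artifact download "artifacts/julia-*.tar.gz" .
      contrib/upload_to_s3.sh                  # hypothetical script
    branches: "master release-*"
```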
Now for an important question: Will computers/instances for CI runners be used, or are we relying solely on the CI provider?
It would be good to look into building this out using Docker images for the CI as much as possible. That should be simple for all the Linux-based builds and testing images, and I think Windows is available in a Dockerized form (but I have never actually tried it yet).
Now for an important question: Will computers/instances for CI runners be used, or are we relying solely on the CI provider?
I believe that we will be self-hosting all of the runners on our own hardware.
EDIT: This is not quite correct; see Elliot's comment below for the full answer.
Systems I have worked with in the past have allowed different images/runners to be used in different stages, so that means the build stage can have an image including all the build tools and the test stage could have an image that is more stripped down. Then this becomes a multi-stage pipeline configuration where the stages are:
- Clone and build Julia; package the tarballs as artifacts
- Test Julia
- Deploy to AWS
Yes! This is exactly what we have right now, only instead of "multi-stage pipelines" we have separate worker processes, each running in their own docker container/VM, and triggers that invoke one when the other is finished.
It would be good to look into building this out using Docker images for the CI as much as possible.
Yes, I'm a big fan of Docker, and all the Linux workers use docker images (although the Dockerfile generation is a bit of a mess; again, written 3+ years ago), but the lack of docker guests for MacOS and FreeBSD makes it more painful than I'd like. Previously, I tried using Docker for Windows as well, but I think I was too early and things hadn't matured enough to the point where it was a good experience. Nowadays we may have better luck, so I'd be happy to convert the Windows provisioning script to a Dockerfile, and deploy those dockerized containers on our Windows buildbots.
Will computers/instances for CI runners be used, or are we relying solely on the CI provider?
No single service supports the wide range of platforms we need to run on, so it's a mishmash of cloud-run machines and things running on-premises in various locales. To give an exhaustive list:
- nanosoldier hardware, hosted at MIT.

Note: in some locales, "datacenter" is pronounced "garage".
Yeah, we don't use Travis or Appveyor on Julia base anymore.
Sorry for the inactivity, I got busy/sidetracked. I have been playing around with Buildkite a bit over the last couple weeks, and it seems fine so far. The agent is really easy to set up and run. The only slightly awkward thing is that there's no support for "build matrices" like Travis/GHA have; you can either use YAML anchors to slightly reduce your duplication or run a script that generates your pipeline instead of a static config file. You can run jobs in Docker containers, so the existing Docker images that contain the required build tools can still be used, although we can probably make that stuff neater in the future like Elliot mentioned. I think it would be a great idea to try to get Windows onto Docker as well.
Thanks for the update, Chris! I'm a little surprised to hear it doesn't have matrix support, but taking a quick look, it seems like we can dynamically generate pipelines then trigger those pipelines if need be, which shouldn't be _too_ bad?
Yeah it shouldn't be terrible: https://buildkite.com/docs/pipelines/defining-steps#dynamic-pipelines
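Concretely, the dynamic-pipeline trick is just a one-step static pipeline whose command generates the real pipeline and pipes it to `buildkite-agent pipeline upload` (that CLI invocation is real; the script name below is hypothetical):

```yaml
steps:
  - label: "generate pipeline"
    # The script prints the full matrix-expanded pipeline as YAML on stdout.
    command: "julia .buildkite/generate_pipeline.jl | buildkite-agent pipeline upload"
```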
Yeah, the only big loss is readability, it's a lot easier to parse a pipeline that's written in YAML vs. a Python/Julia script that outputs YAML.
Although you could have a static pipeline "template" that looks like:
steps:
  - step1
  - step2
  - step3
  - DUMMY_THAT_NEEDS_MATRIX_SUPPORT
  - step5
And then run a script that reads the template, then replaces the "dummy" step with N identical jobs that only differ in their OS tags and so on.
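A toy version of that replacement script, using plain string templating (the template format and marker name are made up here; a real version would likely build the YAML with a proper library):

```python
# Toy sketch: expand a dummy marker in a pipeline template into one
# build step per OS tag. Marker name and step format are illustrative.
TEMPLATE = """\
steps:
  - step1
  - step2
  - step3
DUMMY_THAT_NEEDS_MATRIX_SUPPORT
  - step5
"""

def expand_matrix(template, oses):
    """Replace the dummy marker with one identical job per OS tag."""
    jobs = "\n".join(
        f'  - "build {os}"  # agents: {{os: "{os}"}}' for os in oses
    )
    return template.replace("DUMMY_THAT_NEEDS_MATRIX_SUPPORT", jobs)

pipeline = expand_matrix(TEMPLATE, ["linux", "macos", "windows", "freebsd"])
print(pipeline)
```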
All in all, definitely not insurmountable.