Openfoodnetwork: Increase capacity of our CI service

Created on 27 Jan 2021 · 13 Comments · Source: openfoodfoundation/openfoodnetwork

What we should change and why (this is tech debt)

Not only is our CI service very old, rather limited in terms of features, and soon to be discontinued (as already outlined in https://github.com/openfoodfoundation/openfoodnetwork/issues/4435), but we've also been experiencing serious resource contention issues lately. According to Semaphore itself:

developers have spent over 103 hours waiting for CI resources during the week of January 4, 2021.

I think the combination of Dependabot PRs and the need to retry jobs N times due to the test suite's flakiness is the root cause of this.
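To illustrate how retries inflate the load, here is a rough back-of-the-envelope sketch. It assumes every failed job is simply retried until it passes and that attempts are independent of each other. The 80% pass rate is a made-up figure, not a measurement from this build, and `expected_job_runs` is a hypothetical helper.

```python
def expected_job_runs(jobs_per_build, pass_probability):
    """Expected job executions per build when failed jobs are retried until green.

    Assumes independent attempts, so the number of runs per job follows a
    geometric distribution with mean 1 / pass_probability.
    """
    if not 0 < pass_probability <= 1:
        raise ValueError("pass_probability must be in (0, 1]")
    return jobs_per_build / pass_probability


# 4 jobs per build matches the thread; the 0.8 pass rate is an assumption.
print(expected_job_runs(4, 0.8))
```

Under these assumptions, a flaky suite adds roughly one extra job execution per build, and every Dependabot PR multiplies that again.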

We initially thought it was worth spending money on this (we're still running on Rob's free account), but given that we've got a free GH account, trying out Github Actions might solve it for free (I've personally adopted it successfully for other projects).

Alternatively, we could investigate migrating to Semaphore CI v2 (we have an account already) and pay if needed.

Context

This was briefly discussed in the last delivery train meeting, and there was a shared agreement that this has become a problem.

Impact and timeline

Having an up-to-date CI service should reduce the time it takes to get PRs to test ready, and enable new potential automation.

spike tech debt

Source

sauloperez

Most helpful comment

Re: https://github.com/openfoodfoundation/openfoodnetwork/pull/6902#issuecomment-795267879 - I just double checked the math (and added in the engines specs, which report their counts separately) and I got 4486 examples, 25 pending on both Github and Semaphore. I think we can confidently say we're running all specs on both.

Since we want to keep Semaphore going for the time being in order to do deploys, I propose that we strip the config down to just run, say, the javascript and model specs in a single job. Then we'll have up to four concurrent builds going and should never be waiting on Semaphore since Github will take longer to complete.

@openfoodfoundation/core-devs: If you're good with that proposal, add a 👍 to this comment. If everyone is good with it, we can go ahead and reconfigure Semaphore (and I think close this issue).

Counterproposals welcome, of course :)

andrewpbrett on 19 Mar 2021

👍4 🚀3
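A count check like the one described in this comment could be scripted roughly as follows. This is only a sketch: it assumes each job log contains RSpec's usual summary line (`N examples, M failures, K pending`), and the per-job numbers below are invented so that they add up to the totals quoted above. The real split between jobs is not given in the thread, and `total_counts` is a hypothetical helper.

```python
import re

# Matches RSpec summary lines such as "4486 examples, 0 failures, 25 pending".
# The pending part is optional because RSpec omits it when nothing is pending.
SUMMARY = re.compile(r"(\d+) examples?, (\d+) failures?(?:, (\d+) pending)?")


def total_counts(log_text):
    """Sum examples, failures and pending over every summary line in a log."""
    totals = {"examples": 0, "failures": 0, "pending": 0}
    for examples, failures, pending in SUMMARY.findall(log_text):
        totals["examples"] += int(examples)
        totals["failures"] += int(failures)
        totals["pending"] += int(pending or 0)  # empty string when omitted
    return totals


# Hypothetical per-job summaries; only the overall totals come from the thread.
github_log = """
3000 examples, 0 failures, 20 pending
1400 examples, 0 failures, 5 pending
86 examples, 0 failures
"""
semaphore_log = """
2486 examples, 0 failures, 10 pending
2000 examples, 0 failures, 15 pending
"""

print(total_counts(github_log) == total_counts(semaphore_log))
```

Comparing the summed totals rather than individual jobs keeps the check valid even when the two services split the suite differently.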

All 13 comments

I'm noticing an increasing amount of Net::ReadTimeout errors in the build, such as https://semaphoreci.com/openfoodfoundation/openfoodnetwork-2/branches/pull-request-6774/builds/2. This slows builds even more and the retries increase the queue even more.

sauloperez on 1 Feb 2021

We need to undo https://github.com/openfoodfoundation/openfoodnetwork/issues/6840 as soon as we get to a stable build again.

sauloperez on 9 Feb 2021

Insights

Insights so far after the initial configuration done in #6902

We managed to get all the different kinds of tests working without surprises. Capybara and JS tests run successfully.
Usage limits are more generous than our existing CI's. Our Github free plan includes up to 20 concurrent jobs (we currently run 4 in each build). See: https://docs.github.com/en/actions/reference/usage-limits-billing-and-administration#usage-limits
It's not only a CI server but a DevOps pipeline tool, which enables us to automate many parts of the delivery process, such as release drafting, translations, deployment, etc., by supporting more events and thus more complex workflows. See: https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#on. There's also a huge marketplace of actions we can use: https://github.com/marketplace?type=actions.
Code coverage reporting could easily be solved. See https://github.com/coopdevs/lazona_connector/pull/12 for reference.
We run on Ubuntu 18 out of the box.
The CI configuration is version controlled like all the code, with all the benefits that brings.
Machines the build runs on: 2-core CPU, 7 GB of RAM, 14 GB of SSD disk space. I couldn't get the specs of our existing ones.

Things yet to be solved

There are three things that are yet to be solved though.

Branch deployments to staging from the UI

This is fundamental for the delivery process as it allows anyone in the core team, not only devs, to stage branches for testing or product validation, so we need to keep it.

We can still use Semaphore CI to do only this while we figure out how deployments can be triggered from Github. We could implement the same solution we went with for Semaphore as a Github action, leveraging others' actions. I have also explored this same idea successfully before, with a similar workflow syntax.

Automatic parallelization of the build

We use the free version of Knapsack to parallelize the build execution, but that requires manual intervention to rebalance the jobs every now and then. Automating that is exactly what the pro version offers.

Given that we stopped doing this manual step (no surprise), we are not getting much value from Knapsack lately. I suggest we start without it, which will enable us to remove some code, review our assumptions (it was added ages ago), reevaluate, and see if automatic concurrent execution is something we need.

The flexibility the pipelines add allows us to run feature specs only after unit tests have successfully finished, for instance. I would give it a bit more thought and try out a more elaborate workflow.
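The gating idea above (feature specs only after unit tests pass) can be sketched in a few lines. In GitHub Actions itself this is expressed with the `needs` key on a job; the Python below only illustrates the fail-fast logic. The stage names and the `spec/models` and `spec/features` paths are assumptions for illustration, not the project's actual configuration, and `run_stages` is a hypothetical helper.

```python
import subprocess


def run_stages(stages, run=subprocess.run):
    """Run command stages in order and stop at the first one that fails.

    Returns the names of the stages that were actually executed.
    """
    executed = []
    for name, command in stages:
        executed.append(name)
        result = run(command)
        if result.returncode != 0:
            break  # later, slower stages are skipped
    return executed


# Hypothetical stages: cheap unit specs first, slow feature specs second.
STAGES = [
    ("unit", ["bundle", "exec", "rspec", "spec/models"]),
    ("features", ["bundle", "exec", "rspec", "spec/features"]),
]
```

Calling `run_stages(STAGES)` in a checkout with the bundle installed would only spend time on the feature specs when the unit stage exits with status 0.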

If then we see we still need something like Knapsack I would pay for the pro version. There are plenty of tutorials on how to integrate it with Github Actions.

Failed build

The first attempts pointed out that some specs are failing due to the test being clearly broken. That seems odd because it's not happening in Semaphore. There are no shortcuts for these though. They need to be fixed.

Final considerations

We need to be aware of the fact that with better CI more opportunities for improvement will pop up. We'll need to evaluate case by case whether these need to be addressed now.

For instance, I can already say that the sheer number of deprecation warnings now becomes even more painful. They make the log so long that the scroll bar shrinks to a tiny size and is very hard to grab, besides how much they slow down the build LOL 🙈

I also see that we'll suffer from some of the flakiest tests, and I'm afraid there won't be any shortcut there. I'm thinking about the ones dealing with concurrency, which I already saw failing.

Proposal

With all this, I propose we finish a first iteration of the workflow started in #6902, fixing any broken tests, and give it a week or two to see how Semaphore and Github Actions compare. If things go well, we remove Semaphore. If not, we reassess it.

sauloperez on 22 Feb 2021

We can run them both, right? So we could trigger these new Github CI builds alongside the current Semaphore builds (temporarily) with no problems, and then remove the Semaphore builds at a later date.

We've got nothing to lose, let's try it 👍

Matt-Yorkley on 22 Feb 2021

🚀1

yep, we're totally aligned!

sauloperez on 22 Feb 2021

I've been using Semaphore 2 lately and while it works much better than Semaphore Classic, it can be quite expensive. So if we get all this stuff for free with Github Actions, let's go for it. We can still use Semaphore for deployments while we are working on a better solution.

mkllnk on 23 Feb 2021

👍1

@andrewpbrett As per our delivery-train decision, I'm assigning you here as well so you can help Pau on this one :)

RachL on 1 Mar 2021

We got a first working version of a Github Actions build. Next up: https://github.com/openfoodfoundation/openfoodnetwork/pull/6902#issuecomment-795267879

sauloperez on 10 Mar 2021

One thing we could also do is rebalance the jobs; the admin-feature-folders (~6min) one could get shifted to another one (controllers probably) and the admin-features (~20min) split in two, for example. Or maybe set up knapsack?

andrewpbrett on 19 Mar 2021
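The rebalancing suggested in the comment above amounts to a small bin-packing problem. A minimal sketch, assuming a greedy longest-first strategy: only the ~6 and ~20 minute figures come from the thread; the equal split of admin-features, the other group names and durations, and the `balance` helper are all hypothetical.

```python
import heapq


def balance(durations, job_count):
    """Greedy longest-first assignment of spec groups to parallel jobs.

    Returns a list of (total_minutes, job_index, group_names) tuples.
    """
    # Min-heap keyed by total minutes; the unique job index breaks ties.
    jobs = [(0, index, []) for index in range(job_count)]
    heapq.heapify(jobs)
    for name, minutes in sorted(durations.items(), key=lambda item: -item[1]):
        total, index, groups = heapq.heappop(jobs)  # least loaded job so far
        heapq.heappush(jobs, (total + minutes, index, groups + [name]))
    return sorted(jobs, key=lambda job: job[1])


durations = {
    "admin-features-a": 10,      # half of the ~20 min job, equal split assumed
    "admin-features-b": 10,
    "admin-feature-folders": 6,  # ~6 min, from the thread
    "controllers": 9,            # assumed
    "models": 7,                 # assumed
    "consumer-features": 12,     # assumed
}

for total, index, groups in balance(durations, 4):
    print(index, total, groups)
```

The build's wall time is set by the slowest job, so the useful number to watch after any rebalance is the largest total, not the average.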

Or maybe set up knapsack?

I would personally wait until after rebalancing the jobs manually, as you suggest, and see if they get out of balance again. If so, we might need to consider Knapsack Pro so we don't add another manual operation we have to remember. But it may not happen that often, and then we can avoid that extra piece.

sauloperez on 19 Mar 2021

I propose that we strip the config down to just run, say, the javascript and model specs in a single job.

I think that we shouldn't run any specs twice. We could still run Javascript in Semaphore and not in Github, and we can remove all other specs from Semaphore. Let's not use resources unnecessarily.

mkllnk on 21 Mar 2021

👍1

I'm going to close this. I just updated the Semaphore config to only run javascript tests and created https://github.com/openfoodfoundation/openfoodnetwork/pull/7178 and https://github.com/openfoodfoundation/openfoodnetwork/pull/7179 which I think should be the last bits of cleanup here. I think we can say that our CI is much improved and can pause for now and revisit if it needs more attention.

andrewpbrett on 23 Mar 2021

👍2
