Not only is our CI service very old, rather limited in terms of features, and soon to be discontinued (as already outlined in https://github.com/openfoodfoundation/openfoodnetwork/issues/4435), but we've also been experiencing serious resource contention issues lately. According to Semaphore itself, developers spent over 103 hours waiting for CI resources during the week of January 4, 2021.
I think the combination of Dependabot PRs and the need to retry jobs N times due to the test suite's flakiness is the root cause of this.
We initially thought it was worth spending money on this (we're still running on Rob's free account), but given that we've got a free GH account, trying out Github Actions might solve it for free (I've personally adopted it successfully for other projects).
Alternatively, we could investigate migrating to Semaphore CI v2 (we have an account already) and pay if needed.
This was briefly discussed in the last delivery train meeting, and there was a shared agreement that this has become a problem.
Having an up-to-date CI service should reduce the time it takes to get PRs to test ready, and enable new potential automation.
I'm noticing an increasing number of Net::ReadTimeout errors in the build, such as https://semaphoreci.com/openfoodfoundation/openfoodnetwork-2/branches/pull-request-6774/builds/2. This slows builds down further, and the retries make the queue even longer.
We need to undo https://github.com/openfoodfoundation/openfoodnetwork/issues/6840 as soon as we get to a stable build again.
Insights so far after the initial configuration done in #6902
There are three things that are yet to be solved though.
This is fundamental for the delivery process as it allows anyone in the core team, not only devs, to stage branches for testing or product validation, so we need to keep it.
We can still use Semaphore CI to do only this while we figure out how deployments can be triggered from Github. We could implement the same solution we went with for Semaphore as a Github action, leveraging others' actions. I have also successfully explored this same idea with a similar workflow syntax before.
We use a free version of Knapsack to parallelize the build execution but that requires manual intervention to balance the jobs every now and then. That's exactly what the pro version offers.
Given we stopped doing this manual step (no surprise) we are not getting much value from Knapsack lately. I suggest we start without it which will enable us to remove some code, review our assumptions (it was added ages ago), reevaluate and see if automatic concurrent execution is something we need.
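To make the "balance the jobs" step concrete, here is a minimal sketch of what rebalancing amounts to: given a recorded duration per spec group, assign the slowest groups first to whichever job currently has the least work. This is only an illustration of the idea, not how Knapsack itself is implemented, and the group names and timings in the usage note are hypothetical.

```python
import heapq


def balance_jobs(timings, job_count):
    """Greedy "longest first" split of spec groups across CI jobs.

    timings: dict mapping a spec group name to its duration in minutes.
    Returns (groups_per_job, minutes_per_job).
    """
    # Min-heap of (current load, job index) so the least-loaded job pops first.
    heap = [(0.0, index) for index in range(job_count)]
    heapq.heapify(heap)
    groups_per_job = [[] for _ in range(job_count)]

    # Place the slowest groups first; small ones then fill the gaps.
    for name, minutes in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
        load, index = heapq.heappop(heap)
        groups_per_job[index].append(name)
        heapq.heappush(heap, (load + minutes, index))

    minutes_per_job = [0.0] * job_count
    for load, index in heap:
        minutes_per_job[index] = load
    return groups_per_job, minutes_per_job
```

For example, `balance_jobs({"features": 20, "models": 10, "controllers": 8, "lib": 6, "js": 4}, 2)` puts `features` and `js` in one job and the rest in the other. The catch is the same one described above: the timings drift as the suite changes, so someone has to re-record them and rerun the split.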
The flexibility the pipelines add allows us to run feature specs only after unit tests have successfully finished, for instance. I would give it a bit more thought and try out a more elaborate workflow.
If then we see we still need something like Knapsack I would pay for the pro version. There are plenty of tutorials on how to integrate it with Github Actions.
The first attempts pointed out that some specs are failing due to the test being clearly broken. That seems odd because it's not happening in Semaphore. There are no shortcuts for these though. They need to be fixed.
We need to be aware of the fact that with better CI more opportunities for improvement will pop up. We'll need to evaluate case by case whether these need to be addressed now.
For instance, I can already say that the sheer number of deprecation warnings now becomes even more painful. It makes it very hard to grab the scroll bar due to its tiny size besides how much they slow down the build LOL :see_no_evil:
I also see we'll suffer from some of the flakiest tests, and I'm afraid there won't be any shortcut there. I'm thinking about the ones dealing with concurrency, which I already saw failing.
With all this, I propose we finish a first iteration of the workflow started in #6902 fixing any broken tests, and give it a week or two to see how both Semaphore and Github Actions compare. If things go well, we remove Semaphore. If not, we reassess it.
We can run them both, right? So we could trigger these new Github CI builds alongside the current Semaphore builds (temporarily) with no problems, and then remove the Semaphore builds at a later date.
We've got nothing to lose, let's try it :+1:
yep, we're totally aligned!
I've been using Semaphore 2 lately and while it works much better than Semaphore Classic, it can be quite expensive. So if we get all this stuff for free with Github Actions, let's go for it. We can still use Semaphore for deployments while we are working on a better solution.
@andrewpbrett As per our delivery-train decision, I'm assigning you here as well so you can help Pau on this one :)
We got a first working version of a Github Actions build. Next up: https://github.com/openfoodfoundation/openfoodnetwork/pull/6902#issuecomment-795267879
Re: https://github.com/openfoodfoundation/openfoodnetwork/pull/6902#issuecomment-795267879 - I just double checked the math (and added in the engines specs, which report their counts separately) and I got 4486 examples, 25 pending on both Github and Semaphore. I think we can confidently say we're running all specs on both.
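In case anyone wants to repeat that check, it can be scripted roughly as below. This is a sketch that assumes each job's log ends with the usual RSpec summary line (`N examples, M failures, K pending`) and that the logs have already been saved as text; the function name and the sample numbers in the usage note are made up for illustration.

```python
import re

# Matches RSpec summary lines such as "120 examples, 0 failures, 3 pending".
# The ", K pending" part is absent when nothing is pending.
SUMMARY = re.compile(r"(\d+) examples?, (\d+) failures?(?:, (\d+) pending)?")


def total_counts(log_texts):
    """Sum examples, failures and pending over every summary line found."""
    examples = failures = pending = 0
    for text in log_texts:
        for match in SUMMARY.finditer(text):
            examples += int(match.group(1))
            failures += int(match.group(2))
            pending += int(match.group(3) or 0)
    return examples, failures, pending
```

Calling `total_counts(["120 examples, 0 failures, 3 pending", "1 example, 0 failures"])` gives `(121, 0, 3)`. Run it once over the Github job logs and once over the Semaphore ones, including the engines, and compare the two tuples.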
Since we want to keep Semaphore going for the time being in order to do deploys, I propose that we strip the config down to just run, say, the javascript and model specs in a single job. Then we'll have up to four concurrent builds going and should never be waiting on Semaphore since Github will take longer to complete.
@openfoodfoundation/core-devs: If you're good with that proposal, add a 👍 to this comment. If everyone is good with it, we can go ahead and reconfigure Semaphore (and I think close this issue).
Counterproposals welcome, of course :)
One thing we could also do is rebalance the jobs; the admin-feature-folders (~6min) one could get shifted to another one (controllers probably) and the admin-features (~20min) split in two, for example. Or maybe set up knapsack?
> Or maybe set up knapsack?
I would personally wait until after rebalancing the jobs manually as you suggest and see if they get out of balance. If so, we might need to consider Knapsack Pro so we don't add another manual operation that we have to remind ourselves of. But it may not happen that often and we can avoid that extra piece.
> I propose that we strip the config down to just run, say, the javascript and model specs in a single job.
I think that we shouldn't run any specs twice. We could still run Javascript in Semaphore and not in Github, and we can remove all other specs from Semaphore. Let's not use resources unnecessarily.
I'm going to close this. I just updated the Semaphore config to only run javascript tests and created https://github.com/openfoodfoundation/openfoodnetwork/pull/7178 and https://github.com/openfoodfoundation/openfoodnetwork/pull/7179 which I think should be the last bits of cleanup here. I think we can say that our CI is much improved and can pause for now and revisit if it needs more attention.