Describe the bug
Lack of aarch64 builders is blocking the nixos-unstable channel.
The aarch64 capacity has changed and can no longer keep up with the Linux/Darwin builders. The nixos-unstable channel has already been waiting 4 days for certain aarch64 builds. This matters because it further slows the rate at which we can distribute security updates.
Recently, Packet started provisioning and deprovisioning our c2.large.arm instances in a tight loop.
This left us with an average of one machine at a time, but that one machine was never up long enough to do any useful work.
(data: https://status.nixos.org/grafana/d/27bz8SAWk/packet-provisioning-churn?orgId=1&from=1572123600000&to=1572192154115)
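To illustrate why an average of one machine can still mean zero throughput, here is a minimal Python sketch. The lifetimes, the 30-minute build length, and all names in it are hypothetical and are not taken from the Packet data linked above.

```python
from dataclasses import dataclass


@dataclass
class Lifetime:
    """One provision/deprovision cycle, in minutes since the window start."""
    start: float
    end: float


def average_concurrency(lifetimes, window_minutes):
    """Mean number of machines up at any moment during the window."""
    return sum(l.end - l.start for l in lifetimes) / window_minutes


def useful_minutes(lifetimes, build_minutes):
    """Minutes spent on builds that could finish before deprovisioning.

    Assumes every build takes build_minutes and that a build interrupted
    by deprovisioning is lost entirely.
    """
    total = 0.0
    for l in lifetimes:
        uptime = l.end - l.start
        total += (uptime // build_minutes) * build_minutes
    return total


# Hypothetical churn: six machines, each alive for 10 minutes, back to back.
churn = [Lifetime(i * 10, i * 10 + 10) for i in range(6)]
# Hypothetical steady state: one machine alive for the whole hour.
steady = [Lifetime(0, 60)]

for name, lifetimes in (("churn", churn), ("steady", steady)):
    print(name,
          "avg machines:", average_concurrency(lifetimes, 60),
          "useful minutes:", useful_minutes(lifetimes, 30))
```

Both scenarios average one machine over the hour, but under these assumptions only the steady machine completes any 30-minute build.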
Once this problem was resolved, we were back to a more normal capacity of ~4 spot instances, which brought our capacity up a good bit.
At the same time, I realized the permanent ARM hardware had fallen off the map too, due to a failed boot a few days ago. This was compounded by new rebuilds of the ARM hardware images failing due to a bug in the tests for the JSON library for Nix. Because these builds were failing, no reboots were attempted, even though a reboot would have solved these problems.
I switched the builders from the 19.09-small channel to 19.09, as it has more tests for aarch64 and a working Nix build.
This update is currently working its way through the deploy process, having successfully built the images for the ARM machines: https://buildkite.com/grahamc/packet-nix-builder/builds/145
Already, capacity is up even more than after the spot market stabilized, and once the remaining builders are rebooted, capacity will rise further. (data: https://status.nixos.org/grafana/d/MJw9PcAiz/hydra-jobs?orgId=1&from=now-2d&to=now&fullscreen&panelId=15&refresh=30s)
Additionally, I canceled all queued non-current jobs in Hydra to reduce the queue and focus on getting the channel out.
I'm expecting and hoping this issue is solved, and that the remaining build tasks will be processed promptly.
@grahamc I appreciate you ;)
There seems to be enough capacity again so closing.