Describe the bug
The following build has been failing for 2 weeks on Hydra. https://hydra.nixos.org/build/98372621
@LnL7 and @grahamc were not able to reproduce it.
https://github.com/NixOS/nixpkgs/pull/66381#issuecomment-521007623
https://github.com/NixOS/nixpkgs/pull/66381#issuecomment-521015246
I've merged staging-next into master because the longer we wait the worse the merge conflicts get. But, this is going to block the channels from advancing.
We were able to reproduce it when running it in a loop... but only sometimes. Ugh.
On the Hydra machines it seemed to fail pretty reliably – I tried a few restarts to work around it for now, but without any success, I believe... but such things do happen with impure problems.
After some debugging I was able to reproduce it, and this issue seems to have been present ever since the stdenv started using llvm 7. The occurrence rate just seems much higher on Hydra: 9/10 failures on the Hydra hosts compared to 1/30 on my machines. The following can be used to trigger this locally.
with import ./. {};
darwin.CF.overrideAttrs (drv: {
  buildPhase = ''
    for i in {1..512}; do
      rm Build/CoreFoundation/Base.subproj/CFRuntime.c.o || true
      ninja -j$NIX_BUILD_CORES || exit 100
    done
  '';
})
One of the flags in here might be the culprit, but I don't have time to investigate further. CFRuntime-28db57.txt
What should happen in cases like this, and what "supported platforms" means for the status of staging merges etc., is probably something that should be formalized. Since this has an impact on unstable as well as on merge requests (ofborg), merging large staging failures shouldn't be done lightly IMHO.
I'm not sure how, but Hydra eventually did succeed with that particular build, and it's already caught up rebuilds of everything but haskellPackages (ghc always takes veeery long to build).
The problem apparently won't just go away by itself – on today's master (after a hash change) I had to restart the job several times to make it succeed.
Could this be an OOM error? We can try setting enableParallelBuilding = false for it to see if it goes away. For reference, this looks very similar to this bug report:
I doubt it, the snippet to reproduce only rebuilds one object repeatedly so the cores shouldn't matter.
What's the status of this? It won't block 19.09, since there's a separate darwin channel for that, right? I'm removing it from the milestone based on that assumption, but feel free to correct this.
Correct. Occasionally it appears when a stdenv rebuild happens – I triggered some Hydra restarts because of that a week or two ago.
Issue occurred again now. A third of darwin got built, so I am merging it (for the same reason as I put up before).
From the other ticket where I wanted to disable building of swift-corefoundation.
BUT, I don't think this PR helps as-is. The problem with the breakage is that so many packages depend on this package ATM.
CoreFoundation is part of the stdenv so this is kind of pointless.
Right. If it means that the stdenv, and thus all of the Darwin packages, no longer build on Hydra, then unless we're going to fix it, removing it from the Hydra job is exactly what should be done IMO. What's the point of attempting to build it if we know it is going to fail 9/10 times and in effect block channels from advancing?
The only workaround I can think of at the moment is retrying the build phase similar to my reproduction snippet, but that's pretty horrible.
Sounds indeed horrible, but unless another solution can be thought of, it seems to me to be the only way forward to support Darwin. We cannot continuously monitor and press restart while it is blocking staging-next and channels.
I was hoping somebody else would have come up with a better solution. I'll poke around a little more and make a PR with a workaround.
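For illustration, the "retry the build phase" idea discussed above could look roughly like the shell helper below. This is only a sketch under assumptions: the function name `retry` and the attempt limit are hypothetical and not taken from any existing nixpkgs code, and it is not necessarily what the eventual workaround PR does.

```shell
# Hypothetical helper: run a command up to MAX_ATTEMPTS times,
# stopping at the first success.
# Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
retry() {
  max_attempts=$1
  shift
  attempt=1
  # Re-run the command until it exits with status 0.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying: $*" >&2
    attempt=$((attempt + 1))
  done
  return 0
}
```

Inside a `buildPhase` this might be called as `retry 5 ninja -j$NIX_BUILD_CORES` (the limit of 5 is an arbitrary assumption). Since ninja only rebuilds targets that are still missing, each retry would just re-attempt the failed compile. Note that this only hides the intermittent failure; it does nothing about the underlying cause.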