Nixpkgs: Slow chromium build is blocking channels

Created on 25 Apr 2018 · 34 Comments · Source: NixOS/nixpkgs

Issue description

Recently the Chromium build has become so slow that it occasionally appears to block channel releases.

https://hydra.nixos.org/build/73257181

I think we need to do something about this soon, before we block e.g. security fixes.

All 34 comments

I think there are some tricks we could use to speed up chromium. Lots of the parts it bundles should at least be provided to it from outside, like libc++, icu, and maybe even v8.

Great, let's try that. Two different (less elegant) ideas if it doesn't work out:

  1. ($$) use dedicated high-performance Hydra builders for these few massive jobs, maybe disposable cloud instances created per job.

  2. Split the build into stages internally, each with its own derivation. Stage 1 starts from source, builds some sub-target and outputs a tarball of the build tree for Stage 2,... Not faster but prevents timeouts.

Unless anyone is actively working on this I suggest

  3. disable chromium in the release critical set

and maybe

  4. mark it as broken to avoid users suddenly trying to build it locally.

Surely 3. and 4. would get the channel moving but at the cost of frustrating all chromium users. Just imagine it's your main browser, and suddenly gone from the stable release. Gonna have a shitstorm... Not sure but we might even be better off with a stalled channel for now.

I agree that it's a very annoying situation. I don't think we should do it if we do believe we can have this fixed within a reasonable timeframe. If not, I still think it is worth considering, despite it being a huge nuisance to chromium users.

(Read: Let's not stall indefinitely without a plan to fix it.)

Perhaps an interim fix could be

  5. put it in a separate hydra jobset with increased timeout (not sure how that works though)

I've put out some feelers. The only way I know of is to manipulate the database directly, which is ick.

> Surely 3. and 4. would get the channel moving but at the cost of frustrating all chromium users. Just imagine it's your main browser, and suddenly gone from the stable release. Gonna have a shitstorm... Not sure but we might even be better off with a stalled channel for now.

OTOH NixOS 18.03 currently can't be installed on EFI hosts, but the fix is merged (https://github.com/NixOS/nixpkgs/pull/39342), just not released to the channel because of Chromium blocking the release. That's also frustrating.

Okay, it's been temporarily disabled in 1d0625499854b583c57267a744111ba8a1d0cfaf.

@matthewbauer, your commit is in master; we were talking about 18.03 (mostly, since that's where it's more urgent).

Strong :-1: on this. We don't want to make users build chromium themselves if it times out on hydra.

cc @fpletz @vcunat @domenkozar

I'd prefer reverting this.

Yeah, feel free to revert it! I just didn't think it was worth keeping a failing test that works only 50% of the time. Chromium is probably a special case though.

The often failing non-deterministic test is a separate issue. I am investigating it and believe it can be improved. But first we need chromium to build reliably...

Regarding 1.: we already have requiredSystemFeatures = [ "big-parallel" ]; there. Maybe the number of concurrent jobs on those machines should be lower (keeping the same --cores).

BTW, the -j parameter was intentionally halved for chromium, because it was frequently exhausting memory of the builders. Some other such tuning could be done, but I feel half-blind in this.

meta.timeout was suggested on IRC and is being tested; I expect to set it for chromium today (say, a 24h limit).
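For illustration, a minimal sketch of what such a limit could look like in a package expression. It assumes that Hydra reads meta.timeout as a maximum build time in seconds; the attribute set below is an illustrative fragment, not the actual chromium expression.

```nix
# Illustrative fragment only, not the real chromium derivation.
{
  # Mentioned in the thread as already present: route the job to large builders.
  requiredSystemFeatures = [ "big-parallel" ];

  meta = {
    # Assumption: Hydra treats this as the maximum build time in seconds.
    timeout = 24 * 60 * 60; # 86400 seconds = 24 hours
  };
}
```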

We should still remove nixos.tests.chromium from the tested job for now (I mean that the test should be run on hydra but not block the channel) - but let's make sure this doesn't exclude chromium from the hydra build.

That certainly won't exclude it. It's a separate package job as well. EDIT: the problem is that it's a commonly used package, so when people update the channel, they may start compiling chromium...
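A rough sketch of the distinction being discussed, assuming the channel is gated by an aggregate job built with releaseTools.aggregate. The `jobs` argument and the constituent names are hypothetical placeholders, not the real contents of the release expression.

```nix
# Hypothetical, simplified release expression.
{ pkgs, jobs }:

{
  # The channel only advances when every constituent succeeds.
  tested = pkgs.releaseTools.aggregate {
    name = "example-tested";
    constituents = [
      jobs.tests.someOtherTest
      # jobs.tests.chromium  # dropped here, so it no longer blocks the channel
    ];
  };

  # The test and the package stay ordinary jobs, so Hydra still builds them.
  inherit (jobs) tests;
}
```

With this shape, a slow or failing chromium job no longer holds back the channel, which is also why users could end up compiling chromium locally when the binary is not yet in the cache.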

Great, so @matthewbauer's change should be fine. If the timeout trick works, maybe we should add chromium to the tested job instead. It would be weird to have the stable channel with or without it at random...

I'm -1 on removing it. It's really the most used app on my laptop, and I'm sure on many others'.

I do believe there's room for improvement: https://github.com/NixOS/nixpkgs/issues/28822

I just want to clear up a few misconceptions in this thread.

First, I want to make sure that everyone understands that disabling chromium in the tested set does likely mean that every user would have to rebuild chromium, because if the build fails (which is what we are seeing) there is no longer anything holding back the channel bump.

Second, there is no one that disputes that this would be absolutely horrendous UX and bound to cause users to be quite angry -- with reason. The possibility of disabling it was mentioned because at the time, there was no likely solution being worked on by anyone, and if we don't find a fix, we can't just hold back 18.03 indefinitely. We just don't have the necessary kind of release granularity. If this doesn't work, and no one else has a solution in the works, we'll still have to discuss it. I'll reopen this if that is the case. Let's hope not. :)

Also thanks everyone for their thoughts on the matter!

This seems to have broken the meta typo check. I hope I have succeeded in adding the timeout field in pkgs/stdenv/generic/check-meta.nix now…
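To show why an unlisted field trips such a check, here is a simplified stand-in for a meta field whitelist. It is not the real code from check-meta.nix; the names `metaTypes` and `checkMeta` are used here only for illustration.

```nix
let
  # Simplified whitelist: each allowed meta field maps to a type predicate.
  metaTypes = {
    description = builtins.isString;
    timeout = builtins.isInt; # the newly allowed field
  };

  # A meta set passes only if every field is known and has the expected type.
  checkMeta = meta:
    builtins.all
      (name:
        builtins.hasAttr name metaTypes
        && (builtins.getAttr name metaTypes) (builtins.getAttr name meta))
      (builtins.attrNames meta);
in
  checkMeta { description = "A web browser"; timeout = 86400; }
```

In this sketch, removing the `timeout` entry from `metaTypes` makes the check reject the same meta set, which mirrors the breakage described above.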

Oh dear. My own test walked around that check quite handily. Shall I revert?

Please let the 5-minute ofborg eval finish first… I have pushed what I _hope_ is a fix to both 18.03 and master

Oops, sorry, I fail at git. And now there seems to be a queue.

It looks like my builder now gets jobs!

Chromium built on 18.03 now, though it was faster than 10h, so no confirmation on the workaround yet.

It's still not a proper fix (if it even works). You can't have builds taking that much time (while blocking the release pipeline) if you need to release time-critical patches. Something should still be done to speed up the build.

(sorry for being that annoying and critical guy :-) )

@adamtulinius I don't know if I agree. The acceptable build time seems like something we could bikeshed forever.

Besides, it's also a product of the build farm capacity. Maybe you can convince your boss to donate some hardware. :-)

> Besides, it's also a product of the build farm capacity. Maybe you can convince your boss to donate some hardware. :-)

That doesn't help if a single build is holding everything up for 10 hours. :-)

@adamtulinius The build takes (took) 10 hours because the farm is busy. If it weren't that build, something else would probably be taking the same time. You do have the freedom to follow a branch that requires less of the binary cache to be ready, say nixos-18.03-small, or even not care about the binary cache at all, say release-18.03. If we security-patch something else far down the dependency chain, we'd have to rebuild _everything_ anyway and 10 hours would seem like a breeze.

A general discussion on what limits are acceptable and for which kind of release set should probably go elsewhere.

> A general discussion on what limits are acceptable and for which kind of release set should probably go elsewhere.

Yes, probably, and maybe a general discussion of the build/distribution of updates in general would be more beneficial. Thanks for your work getting 18.03 unclogged.
