GHC bootstrap on aarch64 is broken. I can't see any relevant change (it was introduced somewhere within #97146), and I can reproduce the segfault on the shared box. Note that the newly branched-off 20.09 is also affected.
_I'm a bit sorry about letting such a big regression (in terms of package count) into master, but we also wanted to quickly fix the nixos-unstable channel and unblock the 20.09 process._
Maintainers: @MarcWeber, @kosmikus, @peti
Bisected to patchelf update: f38ed04f0
Maybe we can try reverting b930b2df8aa208e90e999751a4a2cc2980925e5f
According to a quick test, that would only delay the segfault to later during bootstrap (hscolour, happy).
That's quite sad, as one of the major reasons for a release was better aarch64 support.
cc @delroth for ideas
I'll look into it. For now can this be worked around by having ghc built with an older patchelf, or is this also breaking other binaries built with ghc?
Would be helpful if you could attach an ELF that works pre-patchelf, the patchelf invocation, and the resulting segfaulting binary (I imagine building ghc takes forever, and I only have a "weak" ARMv8 builder machine.)
Overriding patchelf just for the single build is not enough, unfortunately.
@delroth building the binary package should be really fast as it only downloads the binary and patchelfs it so it can run with Nix. See https://nix-cache.s3.amazonaws.com/log/wm7l4xaq4jk8w5r9kscb4h6cibjqccjz-ghc-8.6.5-binary.drv
Interestingly, the segfault doesn't reproduce on my machine with 64K pagesize. That's... very unexpected, I can't really think of how this would happen (the opposite is usually the problem, since alignment requirements are more restrictive). I'll spin up a VM on EC2 for repro I guess.
[vcunat@aarch64:~]$ getconf PAGE_SIZE
4096
EDIT: though I'm not sure whether that command really shows the number you need.
Heads up this segfault is reproducible in binfmt running on x86_64, cross-compiling to aarch64
I'm trying to compile a proper ghc compiler with f38ed04 reverted just to see what happens. I'll probably have a result in a couple of hours. My Raspberry Pi 4b is on it ...
I think it might need a patchelf override for all ghc-built packages (certainly for some others during ghc bootstrap). At least until the patchelf bug gets found and fixed. The problem is that I saw no way of doing such a wider override.
What if we only need to set the correct PAGE_SIZE in Nix builds?
The thing I don't get here is that the new patchelf behavior has no reason to produce binaries that can't run on 4K pages ARMv8 systems. Generating ELF files with 64K section alignment is what binutils does, what LLVM does, and how ARMv8 binaries are shipped in pretty much every single distro in the world. I'm still trying to find time to debug this.
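To make the alignment argument concrete: the kernel can map a loadable segment only if its file offset and its virtual address are congruent modulo the running page size. A layout that is congruent modulo 64K is automatically congruent modulo 4K, but not the other way round. The following is a minimal sketch in Python; the offsets and addresses are made up for illustration and are not taken from the ghc binary.

```python
def mappable(p_offset: int, p_vaddr: int, page_size: int) -> bool:
    """A PT_LOAD segment can be mmap'ed only if its file offset and virtual
    address are congruent modulo the page size of the running kernel."""
    return (p_vaddr - p_offset) % page_size == 0


# Hypothetical segments, chosen only for illustration.
seg_64k = {"p_offset": 0x021234, "p_vaddr": 0x431234}  # congruent mod 0x10000
seg_4k = {"p_offset": 0x001000, "p_vaddr": 0x403000}   # congruent mod 0x1000 only


def supported_page_sizes(seg):
    """List the common aarch64 page sizes on which the segment could be mapped."""
    return [ps for ps in (4096, 16384, 65536)
            if mappable(seg["p_offset"], seg["p_vaddr"], ps)]


print(supported_page_sizes(seg_64k))
print(supported_page_sizes(seg_4k))
```

Under these assumptions the 64K-aligned segment is mappable on every listed page size, while the 4K-aligned one is mappable only on 4K pages, which is why a failure that appears only on 4K systems is surprising.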
I'm managing to reproduce the crash by taking the ghc binary pre-patchelf and running patchelf --set-rpath twice on it. The crash doesn't even happen in ghc, it's ld-linux that can't process the rpath for some reason.
[nix-shell:~]$ ldd /nix/store/413gmmlyhiik8ckxbhm8wvk0fqc3nclh-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc
Segmentation fault (core dumped)
But that's also reproducible on patchelf all the way to 0.9...
[nix-shell:~/patchelf]$ git checkout 0.9
Previous HEAD position was e1e39f3 Update release.nix
HEAD is now at 44b7f95 Update README
[nix-shell:~/patchelf]$ autoreconf -is && ./configure && make
...
make[1]: Leaving directory '/home/ubuntu/patchelf'
[nix-shell:~/patchelf]$ cp /nix/store/413gmmlyhiik8ckxbhm8wvk0fqc3nclh-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc.orig /tmp/ghc
[nix-shell:~/patchelf]$ src/patchelf --set-rpath /foo/bar/lol /tmp/ghc
[nix-shell:~/patchelf]$ src/patchelf --set-rpath /foo/bar/lol:/foo/bar/lol:/foo/bar/lol:/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol /tmp/ghc
[nix-shell:~/patchelf]$ ldd /tmp/ghc
Segmentation fault (core dumped)
The symptoms are exactly the same though, so I suspect there's a latent bug that happened to surface with 0.12 for some reason.
I've now tried running exactly the patchelf command that happens during the ghc build, with patchelf 0.9, 0.10, 0.11 and 0.12.
In all 4 cases, the resulting binary segfaults with the exact same symptoms. I'm puzzled as to how this ever worked frankly. patchelf is broken, but this is not new brokenness.
I finally managed to get a test case which passes on 0.11 and fails on 0.12:
cp ../ghc-8.6.5/ghc/stage2/build/tmp/ghc-stage2 /tmp/ghc
src/patchelf --replace-needed libncursesw.so.5 libncurses.so --replace-needed libtinfo.so libtinfo.so.5 --interpreter /nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/ld-linux-aarch64.so.1 /tmp/ghc
strip /tmp/ghc
src/patchelf --set-rpath '/nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib:/nix/store/qsxgr8vk6y8m95r7jf3qxrkz648g8p91-gmp-6.2.0/lib:$ORIGIN/../haskeline-0.7.4.3:$ORIGIN/../stm-2.5.0.0:$ORIGIN/../ghc-8.6.5:$ORIGIN/../terminfo-0.4.1.2:$ORIGIN/../process-1.6.5.0:$ORIGIN/../hpc-0.6.0.3:$ORIGIN/../ghci-8.6.5:$ORIGIN/../transformers-0.5.6.2:$ORIGIN/../template-haskell-2.14.0.0:$ORIGIN/../pretty-1.1.3.6:$ORIGIN/../ghc-heap-8.6.5:$ORIGIN/../ghc-boot-8.6.5:$ORIGIN/../ghc-boot-th-8.6.5:$ORIGIN/../directory-1.3.3.0:$ORIGIN/../unix-2.7.2.2:$ORIGIN/../time-1.8.0.2:$ORIGIN/../filepath-1.4.2.1:$ORIGIN/../binary-0.8.6.0:$ORIGIN/../containers-0.6.0.1:$ORIGIN/../bytestring-0.10.8.2:$ORIGIN/../deepseq-1.4.4.0:$ORIGIN/../array-0.5.3.0:$ORIGIN/../base-4.12.0.0:$ORIGIN/../integer-gmp-1.0.2.0:$ORIGIN/../ghc-prim-0.5.3:$ORIGIN/../rts' /tmp/ghc
ldd /tmp/ghc
Using this I bisected the patchelf change to 0470d6921b5a3fe8e92e356c8e11d120dbbb06c0 which is indeed the 4K->64K alignment change on ARMv8. I still suspect this is a completely unrelated patchelf issue that was only narrowly avoided by luck before, given that the GHC stage2 binary is itself aligned to 64K originally.
Stripping was necessary to reproduce the failure in my test case, so maybe disabling stripping is all that's needed to luck into making this work again. I'm trying a build with dontStrip = true to see if that does indeed do something.
dontStrip = true in the ghc865Binary derivation seems to be a good enough workaround to produce a valid pandoc binary down the line, so I would suggest merging that workaround for now. This is what was in place before it was removed in b930b2d, and the comment in the old code suggests someone hit that exact same problem with patchelf+strip. I sent out #98265.
@vcunat you said that this workaround (disabling stripping) didn't work for you earlier in the bug. I'm kind of confused because it clearly does on my box. Could you reconfirm?
I still get segfaults with that patch:
builder for '/nix/store/h1rf84jdgm54cwgawahbfd7irhk3sw43-happy-1.19.12.drv' failed with exit code 139; last 10 log lines:
building
/nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg)
/nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/../terminfo-0.4.1.2/libHSterminfo-0.4.1.2-ghc8.6.5.so)
Preprocessing executable 'happy' for happy-1.19.12..
Building executable 'happy' for happy-1.19.12..
/nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/../haskeline-0.7.4.3/libHShaskeline-0.7.4.3-ghc8.6.5.so)
/nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/../ghc-8.6.5/libHSghc-8.6.5-ghc8.6.5.so)
/nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/../terminfo-0.4.1.2/libHSterminfo-0.4.1.2-ghc8.6.5.so)
[ 1 of 19] Compiling AbsSyn ( src/AbsSyn.lhs, dist/build/happy/happy-tmp/AbsSyn.o )
/nix/store/k832pghqg9z887j8py47ddhwzrn4yj1f-stdenv-linux/setup: line 1302: 249 Segmentation fault (core dumped) ./Setup build
The same derivation builds fine here:
~/nixpkgs$ nix-store -r /nix/store/h1rf84jdgm54cwgawahbfd7irhk3sw43-happy-1.19.12.drv
warning: you did not specify '--add-root'; the result might be removed by the garbage collector
/nix/store/88b13147iaaicc586a8421frv07c50d8-happy-1.19.12-data
/nix/store/9l9i919v6929i8drv07cc8nmn3f3hr17-happy-1.19.12
(On an EC2 Graviton2 instance, 4K page size.)
diff --git a/pkgs/development/compilers/ghc/8.6.5-binary.nix b/pkgs/development/compilers/ghc/8.6.5-binary.nix
index 41af279e83f..2bed9f017d3 100644
--- a/pkgs/development/compilers/ghc/8.6.5-binary.nix
+++ b/pkgs/development/compilers/ghc/8.6.5-binary.nix
@@ -55,6 +55,8 @@ stdenv.mkDerivation rec {
nativeBuildInputs = [ perl ];
propagatedBuildInputs = stdenv.lib.optionals useLLVM [ llvmPackages.llvm ];
+ dontStrip = true;
+
# Cannot patchelf beforehand due to relative RPATHs that anticipate
# the final install location/
${libEnvVar} = libPath;
For what it's worth, I reverted f38ed04f0cf62db01eb6a26f4804ecc12c5f4de6 and tried building a proper compiler, but the build fails while compiling hscolour with the (now successfully installed) binary distribution of ghc. So reverting to patchelf-0.11 is not a feasible workaround for this issue.
For reference since this hasn't been posted here yet, the backtrace I get from running this in qemu-user (and I remember pretty much the same stack trace when I was experimenting on my Graviton2 EC2 instance):
(gdb) target remote localhost:9000
Remote debugging using localhost:9000
warning: Loadable section ".dynsym" outside of ELF segments
warning: Loadable section ".dynstr" outside of ELF segments
warning: remote target does not support file transfer, attempting to access files from local filesystem.
Reading symbols from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1...
(No debugging symbols found in /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1)
0x00000055008020c0 in _start ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000005500808cf4 in decompose_rpath.isra ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
(gdb) bt
#0 0x0000005500808cf4 in decompose_rpath.isra ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
#1 0x000000550080901c in _dl_init_paths ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
#2 0x0000005500804898 in dl_main ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
#3 0x00000055008165b4 in _dl_sysdep_start ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
#4 0x00000055008029d4 in _dl_start_final ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
#5 0x0000005500802cf0 in _dl_start ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
#6 0x00000055008020c8 in _start ()
from /nix/store/yfa0b4pyywvnspwnlk2nw9id6h6f874x-glibc-2.31/lib/ld-linux-aarch64.so.1
Interestingly gdb knows something is wrong with the binary even before anything gets executed:
warning: Loadable section ".dynsym" outside of ELF segments
warning: Loadable section ".dynstr" outside of ELF segments
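The condition gdb complains about can be approximated without gdb: every section flagged SHF_ALLOC should lie inside the address range of some PT_LOAD segment. The sketch below is a rough Python approximation of that check, assuming a little-endian ELF64 file (as on aarch64-linux); gdb's own logic may differ in detail, and the function name is made up for this example.

```python
import struct
import sys

PT_LOAD = 1      # program header type of loadable segments
SHF_ALLOC = 0x2  # section occupies memory at run time
SHT_NULL = 0     # unused section header entry

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")          # ELF64 program header, 56 bytes
SHDR = struct.Struct("<IIQQQQIIQQ")        # ELF64 section header, 64 bytes


def alloc_sections_outside_load(data: bytes):
    """Return names of SHF_ALLOC sections not covered by any PT_LOAD segment."""
    eh = EHDR.unpack_from(data, 0)
    ident = eh[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    e_phoff, e_shoff = eh[5], eh[6]
    e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx = eh[9:14]

    # Collect the [start, end) virtual address range of each loadable segment.
    loads = []
    for i in range(e_phnum):
        ph = PHDR.unpack_from(data, e_phoff + i * e_phentsize)
        p_type, p_vaddr, p_memsz = ph[0], ph[3], ph[6]
        if p_type == PT_LOAD:
            loads.append((p_vaddr, p_vaddr + p_memsz))

    sections = [SHDR.unpack_from(data, e_shoff + i * e_shentsize)
                for i in range(e_shnum)]
    if not sections:
        return []
    strtab_off = sections[e_shstrndx][4]  # sh_offset of the section name table

    def name(idx: int) -> str:
        start = strtab_off + idx
        return data[start:data.index(b"\0", start)].decode()

    bad = []
    for sh in sections:
        sh_name, sh_type, sh_flags, sh_addr, _sh_offset, sh_size = sh[:6]
        if sh_type == SHT_NULL or not sh_flags & SHF_ALLOC:
            continue
        if not any(lo <= sh_addr and sh_addr + sh_size <= hi for lo, hi in loads):
            bad.append(name(sh_name))
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        print(alloc_sections_outside_load(f.read()))
```

Run against a patched binary, a non-empty result naming .dynsym or .dynstr would be consistent with the gdb warnings above; an empty list does not prove the file is sound, since this checks only one property.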
this is contributing to ~6100 failures on release-20.09.aarch64
| name | failures |
| ----------- | -------------- |
| aarch64-linux ghc-8.6.5-binary | 6139 |
@jonringer what kind of timeline are we looking at for 20.09? I don't think anyone has found a workaround for this issue which could be implemented in a short amount of time (@peti reported that reverting patchelf doesn't work, some people have reported that disabling stripping doesn't work, etc.). The only way forward I'm currently seeing is getting a patchelf fix in (and I've documented what I think could be done reasonably easily in nixos/patchelf#244), and that's assuming that's sufficient to solve the problem.
This is how we (@NixOS/nixos-release-managers ) are tracking blocking items. We will have a go/no-go meeting to review the blocking items and make a determination on whether we will move forward with the release. Related meeting post is here: https://discourse.nixos.org/t/go-no-go-meeting-nixos-20-09-nightingale/9169/3 . Given your input above, it's likely we will move forward and just make a note that aarch64 ghc is in a bad state right now.
A backport to correct the issue can always be done later.
cc @angerman any idea?
I was able to provision a raspberry pi 4 today off of unstable. This problem still exists, but I agree with @worldofpeace , this shouldn't be considered a blocker
AFAIK everything seems consistent with the patchelf issue having the description of the correct fix, but I think it is possible to work around the issue for ghc bootstrap by making sure patchelf's change to the section is neutral or shrinking:
diff --git a/pkgs/development/compilers/ghc/8.6.5-binary.nix b/pkgs/development/compilers/ghc/8.6.5-binary.nix
index 41af279e83f..2e0b5cd678b 100644
--- a/pkgs/development/compilers/ghc/8.6.5-binary.nix
+++ b/pkgs/development/compilers/ghc/8.6.5-binary.nix
@@ -120,10 +120,13 @@ stdenv.mkDerivation rec {
# On Linux, use patchelf to modify the executables so that they can
# find editline/gmp.
postFixup = stdenv.lib.optionalString stdenv.isLinux ''
+ for p in $(find "$out/lib" -type f -name "*\.so*"); do
+ (cd $out;ln -s $p `basename $p`)
+ done
for p in $(find "$out" -type f -executable); do
if isELF "$p"; then
echo "Patchelfing $p"
- patchelf --set-rpath "${libPath}:$(patchelf --print-rpath $p)" $p
+ patchelf --set-rpath "$out:${libPath}" $p
fi
done
'' + stdenv.lib.optionalString stdenv.isDarwin ''
On my rock64, this seems to fix all the executables in ghc-binary (and go on to running out of memory linking cabal, which hopefully just means I need to add swap).
Probably a workaround like this could be cleaner and would best be limited to aarch64, but might be a good interim option to unblock ghc on aarch64 until a new version of patchelf is ready?
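Whether a given rpath rewrite is "neutral or shrinking" can be checked before patchelf is run. The assumption here, not verified against every patchelf version, is that a new rpath no longer than the existing one is written over the old string in place, while a longer one forces the string table to grow and move. A minimal sketch in Python with hypothetical, shortened store paths:

```python
def rpath_change(old_rpath: str, new_rpath: str) -> str:
    """Classify an rpath rewrite by byte length (NUL terminator excluded)."""
    old_len = len(old_rpath.encode())
    new_len = len(new_rpath.encode())
    if new_len < old_len:
        return "shrinking"
    if new_len == old_len:
        return "neutral"
    return "growing"


# Hypothetical values: "<hash>" stands in for a real store hash.
old = ":".join(f"$ORIGIN/../{pkg}"
               for pkg in ("base-4.12.0.0", "ghc-prim-0.5.3", "rts"))
prepended = "/nix/store/<hash>-gmp/lib:" + old  # old approach: prepend libPath
replaced = "/nix/store/<hash>-ghc-binary"       # workaround: one short directory

print(rpath_change(old, prepended))
print(rpath_change(old, replaced))
```

Prepending to the existing rpath always grows it, whereas replacing the many $ORIGIN entries with a single directory of symlinks, as the diff above does, shrinks it.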
I love workarounds :+1:
Sorry if this will sound somewhat negative. But let's try to provide some insight:
(a) ghc is very sensitive to any form of stripping or binary changes due to its tables-next-to-code feature, which essentially puts blobs of meta information right in front of function entry points.
(b) ghc's rts on aarch64 currently has severe correctness issues due to the weak memory model. This will be rectified in HEAD soon'ish, and we may get around to backporting those fixes; my advice would be to not use ghc on aarch64 for anything critical, unless you constrain yourself to use at most one core at any time.
(c) we'll soon have a proper native code gen for aarch64, which will reduce closure size, and compilation time significantly at a minor runtime performance penalty.
For reference, here's the patchset for 8.6.5 that brings 8.6.5 to a somewhat acceptable state on aarch64: https://github.com/input-output-hk/ghc/compare/ghc-8.6...release/8.6.5-iohk
Hi @angerman, I was testing out an aarch64 machine with my desktop setup (specifically xmonad) around when this issue came up and noticed that I could not actually build locally up to xmonad despite it being in the cache. I am able to get from bootstrap -> xmonad itself using the workaround to prevent patchelf from moving the start of the section. Slight additions to xmonad like using xmonad-contrib or setting the configuration seem to hit 2 of the problems you describe (the first by an intentional test and the other by not finding opt in the path).
I think it's generally good to have something bootstrapping to be able to continue testing related things, but I am using a rock64 which is supposed to be experimental at this point anyway. I'm slowly testing a cleanup of that hack so the option to use it will exist, but I would naturally be thrilled if (c) native code gen is ready soon.
@lostnet could you make a PR?
@domenkozar I tested that a large rpath (~128k) doesn't break x86_64 and changed the workaround to apply only to aarch64. I also tested that the ghc8102-binary->ghc901 bootstrap has the same problem, so now I am running that bootstrap with the same fix. If that builds successfully I think I'll be ready to open the PR. It's a pretty slow machine, but maybe it will finish tonight, more likely tomorrow.
I can run builds for you on [email protected] (64-threaded).
Thanks @vcunat , that could definitely speed things up. But I think I should leave out the 8.10.2-Binary in the PR, since it seems to both need the rpath changes and then hit another bug with numactl that stops it from linking in its test compile:
--- /nix/store/246f3ck0xyqgqfvh40xldwkzyyy9nd3j-ghc-8.10.2-binary/lib/ghc-8.10.2/package.conf.d/rts.conf 2020-10-08 18:04:13.283704009 +0200
+++ ../package.conf.d/rts.conf 2020-10-08 20:04:58.236551311 +0200
@@ -8,6 +8,7 @@
exposed: True
library-dirs:
/nix/store/246f3ck0xyqgqfvh40xldwkzyyy9nd3j-ghc-8.10.2-binary/lib/ghc-8.10.2/rts
+ /nix/store/mbjw9pn84prb50md4cjqbclhi6m1i8c5-numactl-2.0.13/lib/
hs-libraries: HSrts Cffi
extra-libraries: m rt dl numa pthread
then it needs to rebuild the cache:
ghc-pkg update rts.conf
Since 8.6.5 is used for the default ghc's bootstrap, I don't think it's good to hold it up for that?
My attempt:
HEAD is now at ab87e397603 ghc865-binary: reduce rpath size on aarch64 (prevent SEGV #97407)
$ xargs env NIX_REMOTE='ssh-ng://[email protected]?compress=true' nix build --argstr system aarch64-linux -f . ghc
error: --- Error --- nix-daemon
builder for '/nix/store/af7an5c65km09ssrk9axhdg6mmrag1rk-hscolour-1.24.4.drv' failed with exit code 139; last 10 log lines:
building
/nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg)
/nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/libHSterminfo-0.4.1.2-ghc8.6.5.so)
Preprocessing library for hscolour-1.24.4..
Building library for hscolour-1.24.4..
/nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/libHShaskeline-0.7.4.3-ghc8.6.5.so)
/nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/libHSghc-8.6.5-ghc8.6.5.so)
/nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary/libHSterminfo-0.4.1.2-ghc8.6.5.so)
[ 1 of 16] Compiling Language.Haskell.HsColour.Classify ( Language/Haskell/HsColour/Classify.hs, dist/build/Language/Haskell/HsColour/Classify.o )
/nix/store/k832pghqg9z887j8py47ddhwzrn4yj1f-stdenv-linux/setup: line 1302: 249 Segmentation fault (core dumped) ./Setup build
error: --- Error --- nix-daemon
1 dependencies of derivation '/nix/store/3wqqb3qmb6cjb0ngqypc45pgmfr3yqzj-hscolour-1.24.4.drv' failed to build
[1 built (1 failed), 0.0 MiB DL]
error: error: --- Error --- nix-daemon
1 dependencies of derivation '/nix/store/v6zzafbdnpr2h45ydw80rin3agpwr1jz-ghc-8.8.4.drv' failed to build
Hi @vcunat I get a correct build for the same derivation on the rock64. I'm guessing @domenkozar used a large system to test @delroth 's patch when the core happened in ghc865Binary.happy? While it sounds like delroth tested on a small system like me and couldn't reproduce it.
It would be good to look at the core that was generated and compare binaries in the store with ones from the working machines to see if they were different, but I think that is difficult if the systems experiencing problems tend to be large and therefore shared remote-build systems?
A different option would be to try to fix the rts/numa in the 8.10.2Binary and hope the segfault is related to things that were fixed in 8.10.1, etc.
I wouldn't expect the cause to be different from the open issue, but I don't really know. Page size made some (lucky) difference, apparently.
@vcunat The rock64 has a 4k page size which I think is normal in nixos aarch64 kernels, so I think small machines of both page sizes have not hit the later runtime segfault.
Since set-rpath has always had the issue, I think something like a kernel change in linking/memory mapping for NUMA on aarch64 can be a single root cause and explain both finally seeing the rpath issue and other problems not caused by patchelf, i.e. things like @angerman brought up in his note (b).
I think the best thing to do at this point is to compare the actual binaries from the working and non-working systems with things like eu-readelf -S to see if these different systems somehow get different ghc binaries, helper binaries already compiled by ghc like Setup, or cannot run the same binary. I don't see a way for me to request nar archives from the community server without ssh access so I've requested access (I don't think there's much point in my creating a workaround if I don't test it on large systems similar to hydra anyway.)
OK, I suspect the machine maintainers might be very busy, so in the meantime let's try cachix (but note this installation hint). For now I uploaded the finished step:
/nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary
as all followup ones fail very fast and incomplete builds are probably not as easy to share (though I can try if you want them).
Hi @vcunat the only difference in ky2g.. seems to be the order things were added to the package.cache, which I hope doesn't matter.
Looking at the HsColour, I think it builds twice as a ghc865Binary and is exactly the same except the second time can call itself to print out the warnings in color, etc.. From the derivation in your output, I think it was building the second which might mean the problem is that the first copy was stripped.
Can you check that you have:
rmwamr0wx3hm2v9wi3si6bmxkjzc9gga-hscolour-1.24.4
but not:
vi594cv9f7gi3jl9pf4z3nqq7fzax8pw-hscolour-1.24.4
And if so add rmw..hscolour to cachix?
Thanks
The rmw path is what fails to build (af7*.drv above). I'm pushing the partial build directory as /nix/store/w35dra6854p4iyxnkqlnzn0c5xkcna58-nix-build-hscolour-1.24.4.drv-1 and the log as /nix/store/84l9qcdg2zrx42fbmk0wwbd70lcj99pc-af7an5c65km09ssrk9axhdg6mmrag1rk-hscolour-1.24.4.log.
Hi @vcunat I understand a bit more of the context now from those logs, yet I'm not really sure whether the stage it faults at makes damage to a library not used in the earlier compiles a likely cause or not. I pushed a new branch with smaller rpaths and some printing of the elf section layout to hopefully arrive at a determination. I also added ghc8.10.2Binary in a way that is running on the rock64 (currently in stage1 build of ghc901).
If you could try ae3b2eb fafc65e (forgot to bring the new branch up to date) out for both ghc and haskell.compiler.ghc901 on the community server I would appreciate it!
ghc looks the same to me at a quick glance: hscolour and happy segfault in ./Setup build
Hi @vcunat if you can add the 3 new logs(ghcbinary, happy, hscolour) I think there should be enough information for me to investigate what is unique/common at the points it segfaults from tracing good builds.
On ghc901/rock64 it bootstraps fine, going on to xmonad fails since setlocale is whitelisting base <8.15, whitelisting to the latest official might be a common blocking point in libraries for 9.0.X? Yet, if this build doesn't fail on the community server, then I think using the 8.10.2Binary to bootstrap 8.10.2 (and earlier?) might work and/or it might add to the comparisons that can be made to figure out the cause of the segvs.
Thanks!
/nix/store/d57r799wrbahbhimm5va3sxmyidhhy2s-73bb6b5kl0v4aqmaircfaba7635z2g4w-hscolour-1.24.4.log
/nix/store/mj1r2qjajxvdzm63yx04wbgmy28yf3gf-vhsm55j6inrpy6qcyif8yzzb0vayg2cw-ghc-8.6.5-binary.log
/nix/store/vk5yjmzspj09aiwg82gjnpdcsni4aalv-x84ly0ympwb59db12wipzinnmz139iv4-happy-1.19.12.log
BTW, ghc901 has passed those stages and it still keeps building the compiler.
haskell.compiler.ghc901 build succeeded.
I've started a build of ghc8102 bumped to be dependent on ghc8102Binary; since there is a patch it is technically imperfect, but it will probably work. Would it make sense to raise the default ghc to 8.10.2 now anyway? Then it could make sense to build ghc8.10.2 with itself on aarch64 and 8.6.5Binary on the others.
Hi @vcunat, 8.10.2Binary was able to bootstrap itself fine for me on aarch64 and x86_64, so I've put together an option for working around the problem by:
- Switching 8.10.2 to boot from its own binaries on all architectures.
- Making it the default ghc on aarch64.
If it looks like a reasonable option to you, please try out 2764755 for ghc and some ghc.* packages. I'm currently building with it to xmonad-with-packages on the rock64.
Thanks,
We can't switch the default GHC by a major version since it's following the latest LTS stackage release.
In the meantime, the machine finished haskellPackages.xmonad on 2764755.
I think the explanation for the different behavior on the rock64 (Cortex-A53) and the ~64 core community machine is <8.10.1 Out of Order Execution for the SEGV when using the compiler. (This seems consistent with the segfaults happening with high probability in the 1st build with parallelizable work and if not in the second such round.)
We could do various things to try to get 8.6/8.8 into the cache and I think they would work fine on small/old machines (though these should be becoming rare, i.e. the next in the rock64 line was the rock64Pro which has two A72 cores that could not be used.)
Hmm, now that I said the rock64 is probably not going to hit these, I finally got segfaults on the rock64, by building ghc865.happy or ghc865.hscolour with /nix/store/349hpr41jk4s2g1naw8mpbdsdhkd47z8-ghc-8.6.5/bin/ghc from the cache (not ghc865Binary). Some logs look exactly like the ones on community, but in the core I could get the timing seems to be different:
<no location info>: warning: [-Wmissing-home-modules]
These modules are needed for compilation but not listed in your .cabal file's other-modules:
...
Language.Haskell.HsColour.Options
Language.Haskell.HsColour.Output
Language.Haskell.HsColour.TTY
Segmentation fault
Reading symbols from Setup...
[New LWP 2743]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/libthread_db.so.1".
Core was generated by `./Setup build'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000ffffb88512e8 in kill ()
from /nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/libc.so.6
(gdb) backtrace [12/1423]
#0 0x0000ffffb88512e8 in kill ()
from /nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/libc.so.6
#1 0x0000000000405130 in exitBySignal (sig=sig@entry=11) at rts/RtsStartup.c:597
#2 0x000000000125dbe4 in shutdownHaskellAndSignal (sig=11, fastExit=<optimized out>)
at rts/RtsStartup.c:562
#3 0x00000000011c7fe0 in ?? ()
(gdb) info threads
Id Target Id Frame
* 1 Thread 0xffffb8b63010 (LWP 2743) 0x0000ffffb88512e8 in kill ()
from /nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/libc.so.6
Hi @vcunat if you can try out 4d79bc6 that would be great. It should boot ghc884 using 8.10.2Binary. On the rock64 it is building the stage2 now so I'm not sure what else might be wrong, but whether 8.8 gets a SEGV when building itself in stage2 is probably the most important factor.
Thanks!
haskellPackages.xmonad built without issues.