Nixpkgs: libiconv-osx is broken on macOS 10.12 Sierra

Created on 9 Sep 2016  ·  47 comments  ·  Source: NixOS/nixpkgs

› ghci
GHCi, version 8.0.1: http://www.haskell.org/ghc/  :? for help
<command line>: can't load .so/.DLL for: /nix/store/xc80lqlb3k0xfh0n5zrpc4i29cllc141-libiconv-osx-10.9.5/lib/libiconv.dylib (dlopen(/nix/store/xc80lqlb3k0xfh0n5zrpc4i29cllc141-libiconv-osx-10.9.5/lib/libiconv.dylib, 5): no suitable image found.  Did find:
        /nix/store/xc80lqlb3k0xfh0n5zrpc4i29cllc141-libiconv-osx-10.9.5/lib/libiconv.dylib: malformed mach-o image: symbol table underruns __LINKEDIT)

(I could not reproduce the error with other programs that depend on libiconv-osx.)

Looking at the sources of dyld-360.22, which ships with 10.11.6, I don’t see this kind of error message, so, I guess, it’s a new check added in Sierra. I’m trying to figure out what’s wrong with the dylib. This binary comes straight from Apple, right? Which makes things even more strange. Does anyone happen to know where to get the sources of new dyld?

/cc @copumpkin @shlevy

bug darwin

All 47 comments

No, of course, the binary does not come from Apple. It is being built from their source and, actually, then there is a strange manipulation happening.

I think I know what's going on there, and it's a similar error to what I ran into when I updated cctools. I think they've added an extra check of well-formedness of Mach-O files and some of ours violate it.

Yeah, I agree. The problem is that I looked really carefully at the offending dylib and I can’t see anything wrong with it. The symbol table is completely within __LINKEDIT, so I have no idea in which sense it “underruns” it. Is there any way to obtain the new sources of dyld or will we have to wait for the official release?

Although I wouldn't expect anyone in nixpkgs to show that message unless they're being impure. Perhaps ghci is reaching out to the system ld?

Wait a second, ld is a static linker; we are talking about dyld here, right? This message is present in my /usr/lib/dyld but not in the dyld sources at apple.com, which is why I’m looking for the newer ones.

› strings /usr/lib/dyld | grep 'symbol table underruns'
malformed mach-o image: symbol table underruns __LINKEDIT
malformed mach-o image: indirect symbol table underruns __LINKEDIT

Oh, I figured it was the obscene manual linking stuff that ghci was doing that was invoking ld on the fly for you. Yeah, we don't have the sources for the 10.12 dyld yet and I don't think they have anything like that in 10.11. I'm guessing they probably added the check to the 10.11 ld64 so that they gave a bit of warning for the 10.12 dyld? Anyway, we can probably figure out what it means based on the ld64 check and stop triggering it.

Hm, I’m afraid I’m not getting what error message in ld64 you are referring to. I see only two messages that have to do with symbol table or __LINKEDIT and they just check that the symbol tables are inside __LINKEDIT, which in our case is true—as I said, I checked this. (Actually, looking at the sources of dyld from 10.11 I don’t see this check there, so I suspect that checks in ld64 and in dyld are completely unrelated to each other.)

Oh sorry for being unclear, the error message is slightly different in ld64, but look for this block:

    if ( symtab != nullptr ) {
        if ( symtab->symoff() < _linkeditStartOffset )
            throwf("malformed mach-o, symbol table not in __LINKEDIT");
        if ( symtab->stroff() < _linkeditStartOffset )
            throwf("malformed mach-o, symbol table strings not in __LINKEDIT");
    }

That's a new block that was added pretty recently to the public ld64 sources in 10.11. See https://github.com/tpoechtrager/cctools-port/commit/05d1a0ab92a9b87cc6896092bbaa4f71d4067b01 for the diff that adds it to the cctools-port project we use in nixpkgs. The wording isn't identical, but we definitely have stuff in nixpkgs that triggers that error message today. I know because I tried upgrading a few weeks ago to the latest cctools-port and ran into that error message. I haven't had time to look into it yet, but I'm guessing the two are related. I know you checked it manually, but perhaps it's only certain binaries/libraries that fail the check? All I know is that when I upgraded to the latest cctools-port, I got a failure when bootstrapping the stdenv and decided to figure it out later.

Extremely weird! Look:

Load command 4
     cmd LC_SYMTAB
 cmdsize 24
  symoff 0
   nsyms 0
  stroff 4104
 strsize 8

As you can see, there are no symbols, so symoff is set to zero. I would say that in this case symtab should be nullptr, and therefore the check you pointed out should not be triggered.

Well, I edited the file and set symoff to 4096 (that is, the beginning of __LINKEDIT) and, guess what, it worked. 😕

I'm not sure I agree with your characterization! There is a symtab, and it has strings in it. So the outer nullptr check will pass, and then we find that symoff is 0, which is certainly less than _linkeditStartOffset. Right?
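
The argument above can be traced with a small Python model of the quoted ld64 block. This is only a sketch: `Symtab` and `check_symtab` are hypothetical names, the field values are the ones from the LC_SYMTAB dump posted earlier, and 4096 is the start of __LINKEDIT as reported in the thread. It mirrors the ld64 check, not dyld's own check, whose sources were not available at the time.

```python
from typing import NamedTuple, Optional


class Symtab(NamedTuple):
    """The four LC_SYMTAB fields shown in the load command dump."""
    symoff: int
    nsyms: int
    stroff: int
    strsize: int


def check_symtab(symtab: Optional[Symtab], linkedit_start: int) -> Optional[str]:
    """Return the message the quoted ld64 block would throw, or None if it passes."""
    if symtab is not None:
        if symtab.symoff < linkedit_start:
            return "malformed mach-o, symbol table not in __LINKEDIT"
        if symtab.stroff < linkedit_start:
            return "malformed mach-o, symbol table strings not in __LINKEDIT"
    return None


# Values from the re-export-only dylib: no symbols, but a string table exists,
# so the load command is present and the first comparison sees symoff == 0.
reexport_only = Symtab(symoff=0, nsyms=0, stroff=4104, strsize=8)
print(check_symtab(reexport_only, 4096))

# The manual edit described earlier: symoff moved to the start of __LINKEDIT.
patched = reexport_only._replace(symoff=4096)
print(check_symtab(patched, 4096))
```

With symoff at 0 the first comparison fires even though nsyms is 0; moving symoff to 4096 lets both comparisons pass, which is consistent with the hand-edited file loading.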

Perhaps take a quick survey of LC_SYMTAB entries on the Mach-O files that Apple ships with 10.12 and compare to their counterparts in 10.10 or something?
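
One possible way to run such a survey is a short Python script that reads the load commands directly. This is a sketch under stated assumptions: it handles only thin, little-endian, 64-bit Mach-O files (fat binaries are rejected), `summarize` is a hypothetical helper name, and the constants and field layout are the ones defined in Apple's `<mach-o/loader.h>`.

```python
import struct
import sys

MH_MAGIC_64 = 0xFEEDFACF   # thin 64-bit Mach-O, little-endian
LC_SYMTAB = 0x2
LC_SEGMENT_64 = 0x19
HEADER_SIZE = 32           # size of mach_header_64


def summarize(data: bytes) -> dict:
    """Collect the __LINKEDIT file offset and the LC_SYMTAB fields of one image."""
    magic, _, _, _, ncmds, _, _, _ = struct.unpack_from("<8I", data, 0)
    if magic != MH_MAGIC_64:
        raise ValueError("not a thin little-endian 64-bit Mach-O image")
    info = {"linkedit_fileoff": None, "symtab": None}
    offset = HEADER_SIZE
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<2I", data, offset)
        if cmdsize < 8:
            raise ValueError("corrupt load command size")
        if cmd == LC_SEGMENT_64:
            # segname is a 16-byte, NUL-padded field right after cmd/cmdsize.
            segname = data[offset + 8:offset + 24].rstrip(b"\0")
            if segname == b"__LINKEDIT":
                # fileoff follows vmaddr and vmsize (8 bytes each).
                (info["linkedit_fileoff"],) = struct.unpack_from("<Q", data, offset + 40)
        elif cmd == LC_SYMTAB:
            symoff, nsyms, stroff, strsize = struct.unpack_from("<4I", data, offset + 8)
            info["symtab"] = {"symoff": symoff, "nsyms": nsyms,
                              "stroff": stroff, "strsize": strsize}
        offset += cmdsize
    return info


if __name__ == "__main__":
    # Usage: python survey.py /path/to/a.dylib /path/to/b.dylib ...
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, summarize(handle.read()))
```

Comparing `symoff` against `linkedit_fileoff` across many libraries would show whether any library shipped by Apple has a symbol table offset below the start of __LINKEDIT.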

Yes, you are right, sorry, I assumed that the strings table is saved to a separate variable, but that’s not the case.

The problem is that all the other dylibs I found are completely different (at least, they have symbols). Actually, libiconv-nocharset.dylib looks exactly like all the other dylibs, but after the linking magic in libiconv/default.nix it changes significantly.

I discovered this whole weird Mach-O world just today, so I’m kind of lost and I have no idea what’s going on or why it changes so much. What the hell is even going on? What is this reexport thing? Do we create a new dynamic library that is effectively “empty” and somehow provides functions from the other two? What is the goal?

All right, so, yeah, you can’t go on with this reexporting weirdness anymore.

I just tested a minimal example: created a trivial library, created a trivial executable, checked that they work together, did the reexporting trick with the library (got the library with 0 symoff) and here goes:

› ./test
dyld: Library not loaded: liblib.dylib
  Referenced from: /Users/kirelagin/tmp/dylib/./test
  Reason: no suitable image found.  Did find:
        liblib.dylib: malformed mach-o image: symbol table underruns __LINKEDIT
zsh: abort      ./test

Aha okay, so they've effectively outlawed umbrella libraries that re-export existing libraries without having any code of their own. A simple fix for now is to just shove some dummy code into libiconv. I don't recall why I did what I'm doing in there but we can probably dig up the history from before we squashed the commits.

Yeah I have just tried to put a stub function into the library. I might be doing something wrong, but my binary now segfaults without saying anything helpful. Could you please check my steps:

› cat stub.c
void nixos_reexport_stub() {}
› clang -c -mmacosx-version-min=10.10 stub.c
› ld -dylib -o liblib.dylib -reexport_library /Users/kirelagin/tmp/dylib/liblib-orig.dylib -dylib_compatibility_version 7.0.0 stub.o
ld: warning: -macosx_version_min not specified, assuming 10.10
› ./test
zsh: segmentation fault  ./test
› mv liblib-orig.dylib liblib.dylib
› ./test  #OK

Am I missing something?

Not sure, feels like a bug in something of theirs. Do you actually end up with any local symbols in your first liblib?

Sorry, my fault. I forgot about install_name_tool. It works now.

oh, great

@copumpkin I’d like to fix this. Could you please try to figure out why this magic was needed? Maybe we could just drop it altogether instead of working around it?

Drop what altogether? I'd probably take one of two approaches:

  1. just add the trivial lump of code to the re-exporting library. That should make it work again on Sierra with minimal other variables changing
  2. Take out the re-export and figure out what goes wrong, fixing whatever the underlying cause was. I'm sorry I don't recall what that was.

Both of these will probably cause a full rebuild and be kind of slow to develop on, but welcome to working on a Nix stdenv 😄

Drop what altogether?

The reexport. I was just hoping that you might be able to recall what the purpose of those reexports was.
I’m not worried about a full rebuild that much, I just have no idea what to look for in case the breakage happens to be non-obvious.

Feel free to post in here if there's something you don't understand and I might be able to help (or others might see something they can help with)

I'm also going to be messing with related parts of the stdenv and pushing some full rebuilds later today, so if you prefer I can make the change.

Yeah, sure, go ahead!

@kirelagin I pushed a fix to staging. If you feel like testing it on Sierra, that would be helpful, but otherwise I'll fire up my VM soon and check.

@copumpkin I'm building from staging now and will let you know how it goes. Thanks for your quick work on this.

@dhess thanks! It seemed to be working in my 10.12 VM and on my 10.11 machine but the more testers the merrier. In principle it really shouldn't depend much on the host OS, but as long as we have impurities (including the kernel) there will always be variations.

nix-env -i nix breaks for me when building out of staging (e6ea302c47c58c2ad525263305853b03739df34d). Looks like an issue with clang/LLVM 3.8.

Log attached.

nix-staging.log.txt

I'll try to build the rest of my world without updating nix.

Ah yes, I haven't tried nix itself. It fails on 10.11 too, and is probably just an incompatibility with clang 3.8

Now it's db. Short story here:

...
./libtool --mode=compile clang -c -I. -I../src  -O3  ../src/os/os_clock.c
libtool: compile:  clang -c -I. -I../src -O3 ../src/os/os_clock.c  -fno-common -DPIC -o .libs/os_clock.o
../src/os/os_clock.c:35:14: warning: implicit declaration of function 'clock_gettime' is invalid in C99 [-Wimplicit-function-declaration]
                RETRY_CHK((clock_gettime(
                           ^
../src/os/os_clock.c:36:7: error: use of undeclared identifier 'CLOCK_REALTIME'
                    CLOCK_REALTIME, (struct timespec *)tp)), ret);
                    ^
../src/os/os_clock.c:38:27: error: use of undeclared identifier 'CLOCK_REALTIME'
        RETRY_CHK((clock_gettime(CLOCK_REALTIME, (struct timespec *)tp)), ret);
                                 ^
1 warning and 2 errors generated.
make: *** [Makefile:2223: os_clock.lo] Error 1
builder for ‘/nix/store/ls61g817i28fk8gnbm86zva84whf8xfy-db-5.3.28.drv’ failed with exit code 2
cannot build derivation ‘/nix/store/zs947y1grrry68r6yvymk8hyxrc7wv08-apr-util-1.5.4.drv’: 1 dependencies couldn't be built
building path(s) ‘/nix/store/z4dlfi6k7vd8fdjp4ixax5ziyzsxryi1-emacs-24.5’
cannot build derivation ‘/nix/store/c4kf9vwlr4afqix7rz23fn3lkymdfdkc-openldap-2.4.44.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/58v9y41vy6xs5gsihiarmgni9h17v52s-gnupg-2.1.15.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/0va86xa0i90kdkd2g9xqw5qgk4l5634x-subversion-1.9.4.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/vbv7b20n2xkpdqylga5w4hnr8ccfnxzs-shell-env.drv’: 1 dependencies couldn't be built
killing process 64265
cannot build derivation ‘/nix/store/c62v2dpzr1fnks0ifqpgbgc1jcbmywjk-meta-env.drv’: 1 dependencies couldn't be built
error: build of ‘/nix/store/c62v2dpzr1fnks0ifqpgbgc1jcbmywjk-meta-env.drv’ failed

I'll post a log in a bit.

(edit: looks like maybe a header has changed. This is something I can look into if you don't have time at the moment.)

(edit 2: log posted)

nix-staging.log2.txt

Yeah, if you don't mind, that'd be helpful! Doesn't seem specific to 10.12. More likely caused by the clang 3.8 upgrade, which I could just undo if it proves to be a huge PITA. I'd rather not though.

Perhaps also worth a separate ticket for db? Also reminds me that we should test it in the standard release-blockers, because it isn't there right now so I didn't notice that it broke before pushing staging 😄

Sure, I'll take a look and file something against db a bit later.

Thanks! Feel free to come hang out in ##nix-darwin on freenode too.

I found some related stuff https://community.oracle.com/thread/3952592 . The error at the bottom of this also looks like what I'm seeing: https://lists.freebsd.org/pipermail/freebsd-ports/2013-July/085046.html

Yikes. Oracle.

Anyway, I poked at it a bit and couldn't make any obvious headway. Would appreciate you or anyone else taking a look. In the meantime, we could hack around it and make more things build by changing db/generic.nix to default cxxSupport to off on Darwin, but I'd rather fix this the right way if we can avoid hackery.

Oddly enough, nix-build -E "let pkgs = import ./. {}; in pkgs.db.override { stdenv = pkgs.llvmPackages_38.stdenv; }" didn't give any trouble on linux, so it seems somewhat specific to darwin, but none of the headers involved in the error are system headers that my stdenv changes would really affect. So basically I'm not sure what's going on but I haven't really looked that deeply either.

I haven't looked carefully yet but it sounds like a major change to db as it involves C++ namespaces. I will take a closer look, but it might not be feasible to fix.

Would the cxxSupport change you mentioned use gcc to compile db rather than clang?

(I'm sorry for the noise in this thread, it's clearly unrelated to libiconv and we should probably move it to a new issue.)

So I don't think the libiconv issue is completely fixed in staging. I'm running into these ghc bootstrapping failures:

http://hydra.nixos.org/build/40406993

except on macOS 10.12, rather than whatever hydra is running.

Here's a question: as far as I understand it, Darwin stdenv goes out of its way to use Apple's libiconv source, but somehow Nix ends up with a different list of global symbols in its libiconv than Apple's provided libiconv. Why is that?

› nm -gU /usr/lib/libiconv.2.dylib
00000000000f2d80 S ___iconv_2VersionNumber
00000000000f2d50 S ___iconv_2VersionString
00000000000f4750 D __libiconv_version
0000000000001c80 T _aliases2_lookup
0000000000001ae5 T _aliases_lookup
0000000000003174 T _iconv
00000000000034b7 T _iconv_canonicalize
0000000000003196 T _iconv_close
0000000000001cdd T _iconv_open
00000000000031a3 T _iconvctl
00000000000032ac T _iconvlist
0000000000015f0d T _libiconv_relocate
0000000000015e54 T _libiconv_set_relocation_prefix
0000000000015ab3 T _locale_charset
000000000000153e T _utf8_decodestr
00000000000011f8 T _utf8_encodestr
00000000000f4730 D _utf_extrabytes

› nm -gU /nix/store/5h0xi7cz5yrf9904jfv1ss54c1gsyzcg-libiconv-osx-10.11.6/lib/libiconv.2.dylib
00000000000e3220 D __libiconv_version
0000000000002800 T _iconv
0000000000002b40 T _iconv_canonicalize
0000000000002830 T _iconv_close
0000000000001320 T _iconv_open
0000000000002840 T _iconvctl
0000000000002950 T _iconvlist
0000000000014690 T _libiconv_set_relocation_prefix

The Apple-provided libiconv shown above is from my 10.12 system and I understand the 10.12 libiconv source isn't available yet, but the same symbol names are exported from the Apple-provided libiconv on 10.11.6, so nothing has changed in this regard.
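
To make the difference between the two listings explicit, here is a minimal Python sketch that diffs the exported names. The two strings are the `nm -gU` outputs pasted above; `exported_names` is a hypothetical helper, and the script only compares names, not addresses or symbol types.

```python
from typing import Set

# `nm -gU` output for Apple's /usr/lib/libiconv.2.dylib, copied from above.
apple_nm = """\
00000000000f2d80 S ___iconv_2VersionNumber
00000000000f2d50 S ___iconv_2VersionString
00000000000f4750 D __libiconv_version
0000000000001c80 T _aliases2_lookup
0000000000001ae5 T _aliases_lookup
0000000000003174 T _iconv
00000000000034b7 T _iconv_canonicalize
0000000000003196 T _iconv_close
0000000000001cdd T _iconv_open
00000000000031a3 T _iconvctl
00000000000032ac T _iconvlist
0000000000015f0d T _libiconv_relocate
0000000000015e54 T _libiconv_set_relocation_prefix
0000000000015ab3 T _locale_charset
000000000000153e T _utf8_decodestr
00000000000011f8 T _utf8_encodestr
00000000000f4730 D _utf_extrabytes
"""

# `nm -gU` output for the libiconv-osx-10.11.6 store path, copied from above.
nix_nm = """\
00000000000e3220 D __libiconv_version
0000000000002800 T _iconv
0000000000002b40 T _iconv_canonicalize
0000000000002830 T _iconv_close
0000000000001320 T _iconv_open
0000000000002840 T _iconvctl
0000000000002950 T _iconvlist
0000000000014690 T _libiconv_set_relocation_prefix
"""


def exported_names(nm_output: str) -> Set[str]:
    """Extract the symbol name (third column) from each `address type name` line."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 3:
            names.add(fields[2])
    return names


missing = sorted(exported_names(apple_nm) - exported_names(nix_nm))
extra = sorted(exported_names(nix_nm) - exported_names(apple_nm))

print("only in Apple's libiconv:")
for name in missing:
    print("  " + name)
print("only in the Nix libiconv:", extra)
```

On these listings the Nix build exports nothing that Apple's library lacks, while the locale, alias-lookup and UTF-8 helper symbols appear only on the Apple side.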

In my local fork, I ended up hacking @kirelagin's stub fix back into the libiconv recipe and that is allowing me to make progress in my build (though now I'm running into other issues). Until someone more knowledgeable can look into why libiconv in staging is missing the locale and UTF-8 symbols, I suggest we put that hack back in. Anything that tries to use them (e.g., Haskell doctests) won't build in staging as-is.

I'm traveling until Sunday, then I should be able to deal with all this stuff and answer actual questions. Sorry!

Please don't close this issue. It's not resolved.

This has been fixed by 77d1fb94f18bc5a8cbc72f0ab8da1a28df1752f5.

I can confirm that building and running haskell programs from 77d1fb9 works.

Closing, thanks!
