Nomad: v0.9.0 appears fundamentally broken on musl/Alpine

Created on 10 Apr 2019 · 8 Comments · Source: hashicorp/nomad

Nomad version

Can't get this far; the resulting binary is not usable. It was built from release tag v0.9.0, though.

Operating system and Environment details

Alpine Linux 64-bit as available in the golang:1.12-alpine docker image.

Issue

Setting aside the FTBFS (fails to build from source) case with the make dev-ui target, which fails because the expected nodejs is very old compared to current LTS and current stable, the resulting binary is not usable because of problems in the nvidia support, which doesn't appear possible to disable.

Reproduction steps

$ docker run --rm -it golang:1.12-alpine /bin/sh
# apk add git bash linux-headers g++ make
# mkdir -p src/github.com/hashicorp/nomad
# cd src/github.com/hashicorp/nomad
# git clone -b v0.9.0 https://github.com/hashicorp/nomad.git .
# make dev
# ./bin/nomad version

Attempting to run the built binary provides the following errors:

/go/src/github.com/hashicorp/nomad # ./bin/nomad 
Error relocating ./bin/nomad: nvmlDeviceGetMaxPcieLinkGeneration: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPowerManagementLimit: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetUUID: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPowerUsage: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetCurrentClocksThrottleReasons: symbol not found
Error relocating ./bin/nomad: nvmlSystemGetProcessName: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMemoryInfo: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMaxClockInfo: symbol not found
Error relocating ./bin/nomad: nvmlInit_v2: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetDisplayMode: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetAccountingBufferSize: symbol not found
Error relocating ./bin/nomad: nvmlEventSetFree: symbol not found
Error relocating ./bin/nomad: nvmlErrorString: symbol not found
Error relocating ./bin/nomad: nvmlShutdown: symbol not found
Error relocating ./bin/nomad: nvmlEventSetCreate: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPerformanceState: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetGraphicsRunningProcesses: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPcieThroughput: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMemoryErrorCounter: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetName: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPciInfo_v3: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetHandleByIndex_v2: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetComputeRunningProcesses: symbol not found
Error relocating ./bin/nomad: nvmlSystemGetDriverVersion: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMinorNumber: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetClockInfo: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetDecoderUtilization: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMaxPcieLinkWidth: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetUtilizationRates: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPersistenceMode: symbol not found
Error relocating ./bin/nomad: nvmlDeviceRegisterEvents: symbol not found
Error relocating ./bin/nomad: nvmlEventSetWait: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetTemperature: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetEncoderUtilization: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetDisplayActive: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetBAR1MemoryInfo: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetCount_v2: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetAccountingMode: symbol not found

Given that these all seem related to GPU scheduling which my organization doesn't need, I assumed I could just pass a build tag and shut this off. To my great disappointment this "feature" seems to be impossible to disable!
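
One quick way to confirm that every missing symbol really belongs to NVML is to filter the loader output. This is only a sketch, assuming the musl error format shown above; the function name missing_symbols is made up for illustration:

```shell
# Hypothetical helper: reads loader errors on stdin and prints the unique
# symbol names reported as missing, one per line.
missing_symbols() {
    sed -n 's/^Error relocating [^:]*: \([A-Za-z0-9_]*\): symbol not found$/\1/p' | sort -u
}

# Example: count missing symbols that do NOT start with "nvml".
# A count of 0 means every missing symbol comes from NVML.
# ./bin/nomad version 2>&1 | missing_symbols | grep -vc '^nvml'
```

Lines that do not match the "Error relocating" pattern are ignored, so the rest of the binary's output can be piped through unchanged.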

Is my best bet to stay on 0.8.7 until these bugs can be resolved?

All 8 comments

Thanks!

Thanks @the-maldridge for raising this. I have added a nonvidia tag for disabling the nvidia integration and confirmed that binaries work when compiled in a musl/alpine image.
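
A rough sketch of what building with that tag could look like, assuming a checkout that already includes the nonvidia tag (so presumably something newer than the v0.9.0 tag itself) and that the main package sits at the repository root. The helper name build_nomad and the GO override variable are illustrative only and not part of the Nomad repo:

```shell
# Hypothetical helper: build nomad with the given build tag(s).
# GO can be set to point at a specific toolchain; it defaults to "go".
build_nomad() {
    # -tags is the standard "go build" flag for selecting build tags.
    "${GO:-go}" build -tags "$1" -o bin/nomad .
}

# Usage, from the repository root:
# build_nomad nonvidia
```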

It's a bit interesting though - I believe the nvidia library is a dynamically loaded shared library, so its presence wasn't required on glibc-based OSes. Do you know what might make musl compilation behave differently?

I believe this error occurs precisely because it's a shared library. Compilation works just fine, but the resulting binary contains dynamically loaded non-relocatable symbols. I would need to do more research to figure out why this is the case, but at the moment I'm still trying to determine if the frontend can be built with a modern yarn/nodejs.

If I had to take a shot in the dark at why this is happening, I'd guess that nvml isn't correctly checking if there are build-time constraints that prevent it from working correctly and so it builds "blind" and then happens to work correctly on glibc systems. The proper fix is probably for them to return a nil implementation if the library can't be loaded.

I see - thanks for the background.

> at the moment I'm still trying to determine if the frontend can be built with a modern yarn/nodejs.

This might be handy for you: https://github.com/hashicorp/nomad/pull/5427. You can skip building the frontend and use the frontend assets from the latest built release by setting the ui go build tag.

@the-maldridge @notnoop we had this problem even on a glibc-based build:
https://github.com/NixOS/nixpkgs/pull/63854

$ ldd ./result-bin/bin/nomad 
        linux-vdso.so.1 (0x00007ffd607d1000)
        libpthread.so.0 => /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib/libpthread.so.0 (0x00007f93de658000)
        libdl.so.2 => /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib/libdl.so.2 (0x00007f93de653000)
        libc.so.6 => /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib/libc.so.6 (0x00007f93de49d000)
        /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib/ld-linux-x86-64.so.2 => /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib64/ld-linux-x86-64.so.2 (0x00007f93de67b000)

$ ./result-bin/bin/nomad -v
./result-bin/bin/nomad: symbol lookup error: ./result-bin/bin/nomad: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses

~Hey @notnoop , I don't think this issue should be closed. It may be the case that you can compile Nomad itself on Alpine for it to work on Alpine, but it seems like a bug if the official Nomad release simply can't execute on Alpine, no?~

Apologies, I didn't see that https://github.com/hashicorp/nomad/issues/5643 was already open. That is the bug I was looking for :).

We had a similar issue on our RHEL7 distribution:
./nomad: symbol lookup error: ./nomad: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses

I worked around this by building a temporary nvml.so that stubs out the functions, as follows:

objdump -T  nomad  | grep -o "nvml.*" | sort -u | sed 's/\(nvml.*\)/extern int \1(){ return 1; }/g' > nvml.c
gcc -c nvml.c -fpic
gcc -shared -o nvml.so nvml.o
rm nvml.o nvml.c

Then run nomad with LD_PRELOAD:
LD_PRELOAD=./nvml.so ./nomad

In fact, I discovered that the issue was caused by having the environment variable LD_BIND_NOW set. Doing
export -n LD_BIND_NOW
fixed it.
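
That would also fit the musl behaviour reported at the top of the thread. glibc binds function symbols lazily by default, so an undefined symbol that is never called is never looked up, while LD_BIND_NOW forces every symbol to be resolved at startup. musl, as far as I know, always resolves symbols at load time. If unsetting the variable for the whole shell is not desirable, it can be removed for a single command instead. A minimal sketch; the function name run_lazy is made up for illustration:

```shell
# Hypothetical wrapper: run a command with LD_BIND_NOW removed from its
# environment, without touching the calling shell's own environment.
run_lazy() {
    (
        unset LD_BIND_NOW
        exec "$@"
    )
}

# Usage:
# run_lazy ./nomad version
```

The subshell keeps the change local, so anything else in the session that relies on LD_BIND_NOW is unaffected.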
