nomad agent dev fails to start with 'undefined symbol: nvmlDeviceGetPciInfo_v3'

Created on 28 Jun 2020 · 19 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.12.0-beta2 (5b80d4e638f1a27eee3ca245f8babb115e4c098d)
Same with Nomad 0.11.3 GA

Operating system and Environment details

Elementary Linux 5.x (based off Ubuntu 18.04)

uname -a
Linux mynodename 5.3.0-61-generic #55~18.04.1-Ubuntu SMP Mon Jun 22 16:40:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

lsb_release -a
No LSB modules are available.
Distributor ID: elementary
Description:    elementary OS 5.1.5 Hera
Release:        5.1.5
Codename:       hera

Issue

Nomad agent fails to start with the following error:

nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
nomad: symbol lookup error: nomad: undefined symbol: nvmlDeviceGetPciInfo_v3

Reproduction steps

run "nomad agent -dev"

Job file (if appropriate)

n/a

Nomad Client logs (if appropriate)

Additional information that might be useful:

lspci | grep -i vga
02:00.0 VGA compatible controller: NVIDIA Corporation G98 [Quadro NVS 295] (rev a1)

The drivers were installed with:

ubuntu-drivers autoinstall

Output of nvidia-smi

nvidia-smi

+------------------------------------------------------+
| NVIDIA-SMI 340.108    Driver Version: 340.108        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro NVS 295      Off  | 0000:02:00.0     N/A |                  N/A |
| N/A   63C   P12    N/A /  N/A |     56MiB /   255MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

All 19 comments

Sorry for the slow response. It looks like nomad requires a more recent driver than the ones that are bundled with the Linux kernel. Can you try upgrading your driver and let us know if that works?

It seems that Linux is bundling legacy drivers by default. Nomad currently requires more recent versions, like the ones bundled with CUDA 9 or 10.
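
One way to confirm this diagnosis is to check whether the installed NVML library exports the symbol named in the error. The following is a minimal sketch, assuming GNU binutils `nm` is installed; `exports_symbol` is a hypothetical helper name, and the library path in the usage comment is an assumption that varies by distribution and driver package.

```shell
# Reads `nm -D --defined-only` output on stdin and succeeds (exit 0)
# only if the symbol given as $1 appears in the list of exported names.
exports_symbol() {
  awk -v sym="$1" '
    { name = $NF; sub(/@.*/, "", name)   # drop any @VERSION suffix
      if (name == sym) found = 1 }
    END { exit !found }
  '
}

# Usage (the library path is an assumption; adjust for your system):
#   nm -D --defined-only /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
#     | exports_symbol nvmlDeviceGetPciInfo_v3 \
#     || echo "installed NVML does not export nvmlDeviceGetPciInfo_v3"
```

If the check fails, the installed driver's NVML is probably too old for the Nomad binary, which would match the symbol lookup error reported above.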

I've tested nomad against driver 384.111 (bundled with CUDA 9) and that worked:

ubuntu@ip-172-31-26-165:~$ nvidia-smi
Tue Jun 30 17:32:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           On   | 00000000:00:03.0 Off |                  N/A |
| N/A   33C    P8    18W / 125W |      0MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-26-165:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
ubuntu@ip-172-31-26-165:~$ ./nomad --version
Nomad v0.12.0-beta2 (5b80d4e638f1a27eee3ca245f8babb115e4c098d)
ubuntu@ip-172-31-26-165:~$ ./nomad agent -dev 2>&1 | head -n5
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 127.0.0.1:4646; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
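
To compare a machine against the driver shown in the transcript above, the version can be pulled out of the `nvidia-smi` banner. This is a rough sketch: `driver_version` and `major_at_least` are hypothetical helper names, and the threshold of 384 is only the version reported working in this thread, not a documented minimum.

```shell
# Print the first "Driver Version" value found in nvidia-smi banner text on stdin.
driver_version() {
  sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if the major component of version $1 is at least $2.
major_at_least() {
  [ "${1%%.*}" -ge "$2" ]
}

# Usage (384 is an assumption taken from the working setup in this thread):
#   v=$(nvidia-smi | driver_version)
#   major_at_least "$v" 384 || echo "driver $v may be too old for Nomad's nvidia plugin"
```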

The way I installed the drivers was ubuntu-drivers autoinstall.
Hopefully installing the latest drivers won't bork my system!!! 🤞

FWIW, these drivers are not _that_ old:

When I search the official website, I get this:

https://www.nvidia.com/object/product_quadro_nvs_295_us.html


Version: 340.108
Release Date: 2019.12.23
Operating System: Linux 64-bit
Language: English (US)
File Size: 66.92 MB

I will try to follow the wizard here: https://developer.nvidia.com/cuda-downloads to get the latest

Update: looks like this won't happen anytime soon ... way too much to download, ~2 GiB ... 😢

EDIT: I cancelled this operation and tried installing from PPA

It's strange - when I looked for 340.108, I noticed it was marked legacy even though it was released in 2019 - e.g. https://forums.developer.nvidia.com/t/linux-solaris-and-freebsd-driver-340-108-legacy-for-geforce-8-and-9-series/109520#5414137 .

Stepping back a bit, let me clarify the use case. Are you actually planning to use this GPU with nomad for machine learning/CUDA-like workloads? Or are you trying to start nomad on a server that just happens to have a GPU, even though the GPU isn't critical to your nomad use case?

Also, would you mind trying the nomad agent build found at https://79969-36653430-gh.circle-artifacts.com/0/builds/nomad_linux_amd64.zip ?

This is not a critical system. I just happened to have an old display card and decided to set it up on an old desktop (which was already running Elementary Linux)

I wouldn't really be using this for any real-world CUDA workloads, though it would be good to have, as I could run trivial CUDA things on my desktop.

That said, if this doesn't fit on the roadmap due to its "non real" use case, I am fine with that. (1)

Though, in that case, what would be the proper way to disable nvidia detection altogether during Nomad startup?
pt. 1 is fine; Nomad not starting at all is super sad (I should check up on disabling drivers using the client blocklist)

Update: I tried adding the nvidia PPA and manually installing the "latest" available driver. This broke the nvidia driver altogether: I am down to VESA mode, but the agent starts now 🙄.

add-apt-repository ppa:graphics-drivers/ppa
apt update
apt install nvidia-384

I'm very sorry that I got your system borked :(. Also, I fully agree that the nomad agent should function with legacy nvidia drivers: the agent should start normally, just without nvidia support. We'll follow up.

One odd thing: in my testing, I noticed that Ubuntu 18.04 offers nvidia-driver-440 (and other versions as well):

ubuntu@ip-172-31-19-213:~$ apt-cache madison nvidia-driver-440
nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages
nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages

> Though, in that case, what would be the proper way to disable nvidia detection altogether during Nomad startup?
> pt. 1 is fine; Nomad not starting at all is super sad (I should check up on disabling drivers using the client blocklist)

This brings up a new question in my mind
Q: I am currently unable to disable the device detection altogether for device nvidia-gpu.
I know about disabling drivers via blacklist, but there doesn't seem to be anything equivalent for device plugins, right?

> I'm very sorry that I got your system borked :(. Also, I fully agree that the nomad agent should function with legacy nvidia drivers: the agent should start normally, just without nvidia support. We'll follow up.
>
> One odd thing: in my testing, I noticed that Ubuntu 18.04 offers nvidia-driver-440 (and other versions as well):
>
> ubuntu@ip-172-31-19-213:~$ apt-cache madison nvidia-driver-440
> nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages
> nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages

_That's OK, always ready for trial-and-error to get Nomad working!_ 😄 🛩️

BTW, did you try with the NVIDIA NVS 295 display card? That is my display card. (maybe that matters, I dunno)

For me this is not "Ubuntu" Ubuntu; it is Elementary Linux (a desktop-oriented distro), hence I would prefer having the display driver working, with higher resolution etc.

OK, after refining my apt search ...

apt search nvidia | grep "^nvidia\-driver\-"
nvidia-driver-390/bionic-updates,bionic-security 390.138-0ubuntu0.18.04.1 amd64
nvidia-driver-410/unknown 410.129-0ubuntu1 amd64
nvidia-driver-415/bionic 415.27-0ubuntu0~gpu18.04.2 amd64
nvidia-driver-418/bionic 430.64-0ubuntu0~gpu18.04.1 amd64
nvidia-driver-430/bionic-updates,bionic-security,bionic 440.100-0ubuntu0.18.04.1 amd64
nvidia-driver-435/bionic-updates,bionic 435.21-0ubuntu0.18.04.2 amd64
nvidia-driver-440/bionic-updates,bionic-security,bionic 440.100-0ubuntu0.18.04.1 amd64
nvidia-driver-450/unknown 450.36.06-0ubuntu1 amd64

I will try with 440 now.

> Also, would you mind trying the nomad agent build found at https://79969-36653430-gh.circle-artifacts.com/0/builds/nomad_linux_amd64.zip ?

This doesn't work with my correctly working display driver, v340.

_I will try with v 440 and try again_

$ nvidia-smi
Wed Jul  1 01:08:09 2020
+------------------------------------------------------+
| NVIDIA-SMI 340.108    Driver Version: 340.108        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro NVS 295      Off  | 0000:02:00.0     N/A |                  N/A |
| N/A   69C   P12    N/A /  N/A |     56MiB /   255MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

$ ./nomad --version
Nomad v0.12.0-dev (9f070e16db5c1aa1d28960a209740d584ab4abc0)

$ ./nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
./nomad: symbol lookup error: ./nomad: undefined symbol: nvmlDeviceGetPciInfo_v3

Newer drivers are not working.
I have reinstalled the supported drivers using ubuntu-drivers autoinstall.
I am back to the higher resolution, etc.
For now I'll let this be, as having a higher resolution on the desktop is needed for now.

Though, I wish there was a clean fix for this! :)

@shantanugadgil We just merged an option for disabling the nvidia driver and it should be out in 0.12.1. Thanks for raising the issue.

: waiting eagerly for 0.12.1 to test on my machine :

For basic testing, you can try the binaries found in https://app.circleci.com/pipelines/github/hashicorp/nomad/10642/workflows/1ff98cc1-e847-434f-aff4-05acfbb6f993/jobs/84842/artifacts along with the config from the PR:

plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}
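
For anyone unsure how to apply this, a minimal sketch is to save the block to a file and pass it to the agent with `-config`. The helper name `write_nvidia_disabled_config` and the file path are hypothetical, and the `enabled` option is only honored by builds that include the change mentioned above (expected in 0.12.1).

```shell
# Write an agent config that disables the nvidia-gpu device plugin to the file $1.
write_nvidia_disabled_config() {
  cat > "$1" <<'EOF'
plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}
EOF
}

# Usage (the path is only an example):
#   write_nvidia_disabled_config /tmp/nomad-no-nvidia.hcl
#   nomad agent -dev -config=/tmp/nomad-no-nvidia.hcl
```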

Please try it and let me know how it goes!

The Nomad agent is starting with the mentioned config above.

Perfect - thanks for letting us know!

Any chances of getting older drivers to work with Nomad in the foreseeable future?

I suspect we're unlikely to try to support older drivers without strong demand; we'd be happy to link to a community driver if one exists ;-).
