Nomad v0.12.0-beta2 (5b80d4e638f1a27eee3ca245f8babb115e4c098d)
Same with Nomad 0.11.3 GA
Elementary Linux 5.x (based off Ubuntu 18.04)
uname -a
Linux mynodename 5.3.0-61-generic #55~18.04.1-Ubuntu SMP Mon Jun 22 16:40:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
lsb_release -a
No LSB modules are available.
Distributor ID: elementary
Description: elementary OS 5.1.5 Hera
Release: 5.1.5
Codename: hera
Nomad agent fails to start with the following error:
nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
nomad: symbol lookup error: nomad: undefined symbol: nvmlDeviceGetPciInfo_v3
run "nomad agent -dev"
n/a
Additional information that might be useful:
lspci | grep -i vga
02:00.0 VGA compatible controller: NVIDIA Corporation G98 [Quadro NVS 295] (rev a1)
The drivers were installed with:
ubuntu-drivers autoinstall
nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 340.108 Driver Version: 340.108 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro NVS 295 Off | 0000:02:00.0 N/A | N/A |
| N/A 63C P12 N/A / N/A | 56MiB / 255MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Sorry for the slow response. It looks like Nomad requires a more recent driver than the ones bundled with the Linux kernel. Can you try upgrading your driver and let us know if that works?
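The `undefined symbol: nvmlDeviceGetPciInfo_v3` error suggests that the NVML library shipped with the installed driver does not export that function. A minimal sketch for checking this, assuming `nm` (binutils) is installed; `lib_exports_symbol` is a hypothetical helper name, and the library is located through `ldconfig -p`, which may not list it on every system:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the shared library at $1 exports symbol $2.
lib_exports_symbol() {
    # -D lists the dynamic symbol table; --defined-only skips imported symbols.
    nm -D --defined-only "$1" 2>/dev/null | grep -qw "$2"
}

# Locate the NVML library via the dynamic linker cache (empty if not found).
nvml_path=$(ldconfig -p 2>/dev/null | awk '/libnvidia-ml\.so\.1/ {print $NF; exit}' || true)

if [ -z "$nvml_path" ]; then
    echo "libnvidia-ml.so.1 not found in the linker cache"
elif lib_exports_symbol "$nvml_path" nvmlDeviceGetPciInfo_v3; then
    echo "$nvml_path exports nvmlDeviceGetPciInfo_v3"
else
    echo "$nvml_path does not export nvmlDeviceGetPciInfo_v3"
fi
```

If the symbol is reported as missing, the agent would be expected to fail at startup in the way shown in the original report.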
It seems that Linux bundles legacy drivers by default. Nomad currently requires a more recent version, like the ones bundled with CUDA 9 or 10.
I've tested nomad against driver 384.111 (bundled with CUDA 9) and that worked:
ubuntu@ip-172-31-26-165:~$ nvidia-smi
Tue Jun 30 17:32:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 On | 00000000:00:03.0 Off | N/A |
| N/A 33C P8 18W / 125W | 0MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-26-165:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
ubuntu@ip-172-31-26-165:~$ ./nomad --version
Nomad v0.12.0-beta2 (5b80d4e638f1a27eee3ca245f8babb115e4c098d)
ubuntu@ip-172-31-26-165:~$ ./nomad agent -dev 2>&1 | head -n5
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: 127.0.0.1:4646; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
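To compare an installed driver version against the one tested here, a version-aware comparison can be sketched as follows. This assumes GNU `sort -V`; `version_ge` is a hypothetical helper, and 384.111 is only the version reported working in this thread, not a documented minimum:

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
version_ge() {
    # sort -V orders version strings; the smaller one comes first.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

tested="384.111"      # driver version reported working in this thread
installed="340.108"   # driver version from the original report

if version_ge "$installed" "$tested"; then
    echo "driver $installed is at least $tested"
else
    echo "driver $installed is older than $tested"
fi
```

A plain string or numeric comparison would mishandle multi-part versions such as 450.36.06, which is why the sketch relies on version sort.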
The way I installed the drivers was ubuntu-drivers autoinstall.
Hopefully installing the latest drivers won't bork my system!!!
FWIW, these drivers are not _that_ old:
when I search the official website, I get this:
https://www.nvidia.com/object/product_quadro_nvs_295_us.html
Version: | 340.108
-- | --
Release Date: | 2019.12.23
Operating System: | Linux 64-bit
Language: | English (US)
File Size: | 66.92 MB
I will try to follow the wizard here: https://developer.nvidia.com/cuda-downloads to get the latest
Update: looks like this won't happen anytime soon ... way too much to download (~2 GiB) ...
EDIT: I cancelled this operation and tried installing from PPA
It's strange - when I looked for 340.108, I noticed it was marked legacy even though it was released in 2019 - e.g. https://forums.developer.nvidia.com/t/linux-solaris-and-freebsd-driver-340-108-legacy-for-geforce-8-and-9-series/109520#5414137 .
Stepping back a bit - let me clarify the use case. Are you actually planning to use this GPU with nomad for machine learning/CUDA-like workloads? Or are you trying to start nomad on a server that just happens to have a GPU, even though the GPU isn't critical to your nomad use case?
Also, would you mind trying the nomad agent build found at https://79969-36653430-gh.circle-artifacts.com/0/builds/nomad_linux_amd64.zip
This is not a critical system. I just happened to have an old display card and decided to set it up on an old desktop (which was already running Elementary Linux)
I wouldn't really be using this for any real-world CUDA workloads, though it would be good to have, as I could run trivial CUDA things on my desktop.
That said, if this doesn't fit on the roadmap due to its "non real" use case, I am fine with that. (1)
Though, in that case, what would be the proper way to disable nvidia detection altogether during Nomad startup?
pt. 1 is fine, Nomad not starting at all is super sad (I should check up on disabling drivers using the client blocklist)
Update: I tried adding the nvidia ppa and manually installing the "latest" available driver. This broke the nvidia driver altogether; I am down to VESA mode, but the agent starts now.
add-apt-repository ppa:graphics-drivers/ppa
apt update
apt install nvidia-384
I'm very sorry that I got your system borked :(. Also, I fully agree that the nomad agent should function with legacy nvidia drivers - the agent should start normally but without nvidia support. We'll follow up.
One odd thing is in my testing, I noticed that Ubuntu 18.04 offers nvidia-driver-440 (and other versions as well):
ubuntu@ip-172-31-19-213:~$ apt-cache madison nvidia-driver-440
nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages
nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages
> Though, in that case, what would be the proper way to disable nvidia detection altogether during Nomad startup.
> pt. 1 is fine, Nomad not starting at all is super sad (I should check up on disabling drivers using the client blocklist)
This brings up a new question in my mind
Q: I am currently unable to disable the device detection altogether for device nvidia-gpu.
I know about disabling drivers via blacklist, but there doesn't seem to be anything equivalent for device plugins, right?
> I'm very sorry that I have your system borked :(. Also, I fully agree that nomad agent should function with legacy nvidia drivers - the agent should start normally but without nvidia support. We'll follow up.
> One odd thing is in my testing, I noticed that Ubuntu 18.04 offers nvidia-driver-440 (and other versions as well):
> ubuntu@ip-172-31-19-213:~$ apt-cache madison nvidia-driver-440
> nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages
> nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages
_That's OK, always ready for trial-and-error to get Nomad working!_
BTW, did you try with the NVIDIA NVS 295 display card? That is my display card. (maybe that matters, I dunno)
For me this is not "Ubuntu" Ubuntu; it is Elementary Linux (a desktop-oriented distro), hence I would prefer having the display driver working, higher resolution, etc.
OK, after refining my apt search ...
apt search nvidia | grep "^nvidia\-driver\-"
nvidia-driver-390/bionic-updates,bionic-security 390.138-0ubuntu0.18.04.1 amd64
nvidia-driver-410/unknown 410.129-0ubuntu1 amd64
nvidia-driver-415/bionic 415.27-0ubuntu0~gpu18.04.2 amd64
nvidia-driver-418/bionic 430.64-0ubuntu0~gpu18.04.1 amd64
nvidia-driver-430/bionic-updates,bionic-security,bionic 440.100-0ubuntu0.18.04.1 amd64
nvidia-driver-435/bionic-updates,bionic 435.21-0ubuntu0.18.04.2 amd64
nvidia-driver-440/bionic-updates,bionic-security,bionic 440.100-0ubuntu0.18.04.1 amd64
nvidia-driver-450/unknown 450.36.06-0ubuntu1 amd64
I will try with 440 now.
> Also, mind if you try running the nomad agent found in https://79969-36653430-gh.circle-artifacts.com/0/builds/nomad_linux_amd64.zip
This build doesn't work with my correctly working display driver (v340).
_I will install v440 and try again._
$ nvidia-smi
Wed Jul 1 01:08:09 2020
+------------------------------------------------------+
| NVIDIA-SMI 340.108 Driver Version: 340.108 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro NVS 295 Off | 0000:02:00.0 N/A | N/A |
| N/A 69C P12 N/A / N/A | 56MiB / 255MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
$ ./nomad --version
Nomad v0.12.0-dev (9f070e16db5c1aa1d28960a209740d584ab4abc0)
$ ./nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
./nomad: symbol lookup error: ./nomad: undefined symbol: nvmlDeviceGetPciInfo_v3
Newer drivers are not working.
I have reinstalled the supported drivers using ubuntu-drivers autoinstall.
I am back to the higher resolution, etc.
For now I'll let this be, as having a higher resolution on the desktop is needed for now.
Though, I wish there was a clean fix for this! :)
@shantanugadgil We just merged an option for disabling the nvidia driver and it should be out in 0.12.1. Thanks for raising the issue.
: waiting eagerly for 0.12.1 to test on my machine :
For basic testing, you can try the binaries found in https://app.circleci.com/pipelines/github/hashicorp/nomad/10642/workflows/1ff98cc1-e847-434f-aff4-05acfbb6f993/jobs/84842/artifacts along with the config from the PR:
plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}
Please try it and let me know how it goes!
The Nomad agent is starting with the mentioned config above.
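A minimal sketch of applying that config: write the `plugin` block to a file and pass it to the agent with `-config`. The file name `disable-nvidia.hcl` and the helper `write_nvidia_disable_config` are hypothetical, and per the comments above the `enabled` option needs a build that contains the merged change (expected in 0.12.1):

```shell
#!/bin/sh
# Hypothetical helper: write a client config that disables the nvidia-gpu
# device plugin to the path given as $1.
write_nvidia_disable_config() {
    cat > "$1" <<'EOF'
plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}
EOF
}

write_nvidia_disable_config disable-nvidia.hcl

# Then start the dev agent with the extra config file:
#   nomad agent -dev -config=disable-nvidia.hcl
```

Keeping the block in its own file makes it easy to drop once a driver that works with both the display and Nomad is installed.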
Perfect - thanks for letting us know!
Any chances of getting older drivers to work with Nomad in the foreseeable future?
I suspect we're unlikely to try to support older drivers without strong demand; we'd be happy to link to community drivers if one exists ;-).