Nixpkgs: Amdgpu driver with newest kernel has no display at all

Created on 28 Jul 2018 · 47 Comments · Source: NixOS/nixpkgs

Issue description

No display even though no obvious error is presented.
Journalctl (full):
https://gist.github.com/Mounium/715146ce2ce1993ade521d11b25766fd

Priority >= 4 logs:

[Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0x22 (or later)
ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
pmd_set_huge: Cannot satisfy [mem 0xf8000000-0xf8200000] with a huge-page mapping due to MTRR override.
usb 3-8: device descriptor read/64, error -71
booting system configuration /nix/store/gx1bd0k70xzsp8js4sxrv6mlskg3kh73-nixos-system-eni-18.09.git.e7d5785
Failed to find module 'snd_pcm_oss'
Specified group 'render' unknown
Specified group 'kvm' unknown
r8169 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
FAT-fs (sda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
CRAT table not found
resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000d3fff window]
caller pci_map_rom+0x58/0xe0 mapping multiple BARs
Process '/nix/store/jhqbxszrw2dfgl6m0ilbf4fp7cq6jrjx-alsa-utils-1.1.6/sbin/alsactl restore 0' failed with exit code 99.
Process '/nix/store/jhqbxszrw2dfgl6m0ilbf4fp7cq6jrjx-alsa-utils-1.1.6/sbin/alsactl restore 1' failed with exit code 99.
platform regulatory.0: Direct firmware load for regulatory.db failed with error -2

Steps to reproduce

Run nixos-rebuild at commit e7d5785, with the amdgpu driver and linuxPackages_latest as the kernel (4.17.10).

Technical details

AMD R9 380 as card

All 47 comments

I can't reproduce on RX570. Display is working fine

my config (kernelParam was to enable audio over displayport)

  boot.kernelModules = [ "kvm-intel" "kvm-amd" ];
  boot.kernelParams = ["amdgpu.dc=1"];

  boot.kernelPackages = pkgs.linuxPackages_4_17;

@Mounium I just updated to master and I may be seeing the same thing. kvm works, but the system hangs when I try to start X. I'm using a 290. My last working config is linux-4.17.3, and the broken one is 4.17.11. I'll try bisecting.

I can confirm this as well. I'm using AMDGPU with the latest kernel, and when X initializes I just get a blank screen with an underscore in the top left.
Of course, the only way to get out of this is either a hard reboot or, preferably, a sysrq combination.

This also happens on the standard kernel.
Anyway, can confirm this on unstable.

@Chiiruno I've been bisecting it in my spare time, no culprit yet though.

I had no luck with sysrq other than rebooting and possibly disk sync. I couldn't get back control or get any output. In the case that journald writes something out, it just looks like xorg is failing to start without any error messages.

drm.debug=0x3f will be my next step if I don't find anything obvious in bisection.
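
For reference, a minimal sketch of how that parameter could be set in a NixOS configuration. The value 0x3f turns on the first six DRM debug message categories, so expect a very noisy journal; the rest of the configuration is assumed to exist elsewhere.

```nix
{ ... }:
{
  # Enable verbose DRM debug logging (core, driver, KMS, PRIME, atomic, vblank).
  # Remove again once the logs are collected.
  boot.kernelParams = [ "drm.debug=0x3f" ];
}
```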

@corngood I meant sysrq for clean rebooting anyway. REISUB
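
If only some SysRq keys respond, the kernel may be running with a restricted kernel.sysrq mask. A sketch, assuming you are fine with enabling every SysRq function on this machine (whether this was the cause of the limited behaviour described above is a guess):

```nix
{ ... }:
{
  # 1 enables all Magic SysRq functions, so the full REISUB sequence is honoured.
  boot.kernel.sysctl."kernel.sysrq" = 1;
}
```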

Yeah I got this too on one of my machines. Last working kernel for me was 4.17.5

There seem to be two different problems, one on 4.14.56 (4.14.55 is the last working one, although I'm not sure the kernel bump is responsible for it) and one on 4.17. On 4.14.56 I get a single _ character in the upper left corner of the screen and that's all; the only way out is REISUB. On 4.17 I get no display at all.

With nomodeset appended to the kernel params, I get a blinking terminal in the former case and a normally working one in the latter. I'm not sure why it blinks, but it only reads keystrokes between two blinks, so typing in my password was frustrating.

As for the 4.17.* error, I think it is actually caused by the kernel: I just installed Gentoo, and with this kernel version I got the same result, no display at all, though I could still log in and reboot from the terminal. I haven't tried 4.14.56 on Gentoo yet.

I tried to locate the 4.14.* bug. Not much progress, but here it is: 6c44deb181a816c2bcf287eada3155a6840f16b3 works and 69affb8d263c186665300824175610733a374330 does not.

first bad commit: [fc0b7c984cfd0467cbfb1a24be249ec441da764a] mesa: 18.0.3 -> 18.1.2

I've been testing the whole time with linux-4.17. I'm just going to try master with mesa 18.0.3 and then dig into mesa changes.

So it's working ok on master with mesa-18.0.3, and linux-4.17.11.

I had a look through the mesa bug database and didn't find anything obvious. Maybe it's worth trying with xorg-server-1.20?

I suppose affected people can work around it on newer systems via:

# Place this in the system configuration (configuration.nix):
nixpkgs.config.packageOverrides = super: {
  mesa_drivers = (super.mesa_drivers.overrideAttrs (attrs: rec {
    version = "18.0.5";
    name = "mesa-noglu-${version}";
    # fetchurl is taken from the package set; a bare `fetchurl` is not in scope here.
    src = super.fetchurl {
      url = "https://mesa.freedesktop.org/archive/mesa-${version}.tar.xz";
      sha256 = "5187bba8d72aea78f2062d134ec6079a508e8216062dce9ec9048b5eb2c4fc6b";
    };
    passthru = attrs.passthru // { inherit version; };
  })).drivers;
};

(assuming 18.0.5 isn't worse than 18.0.3 in this respect). This should rebuild only the drivers and nothing else in the system.

@vcunat 's workaround does work for 4.14.59 on 65c43b44684506b629fd4effeeca59412344eea6, but not for linuxPackages_latest (which is 4.17.11) on the same commit, where I still get no display at all, though I can login and reboot the system, so no need for REISUB.

That sounds like hitting multiple bugs at once.

Yeah, overriding mesa_drivers works for me on 4.17.11, but I have only tried mesa 18.0.3 so far. I'd like to bisect Mesa when I get a chance to set up incremental builds.

Perhaps using ccacheStdenv for the build would be a sufficient speedup, but I've never resorted to that.

It seems we may well need multiple bug reports. I'm seeing something similar with a custom 4.13 kernel from AMD. With GDM, I get a login screen, but when I log in I get a gray screen and am then bounced back to the login screen. journalctl shows X failing to start, but nothing else that I noticed. With sddm, I get a totally unresponsive black screen with an underscore in the top left; no login screen or anything. Both of these are with the mesa reversion suggested above.

Is a bisect of nixpkgs the best way forward for me?
I misapplied the suggested fix. Using mesa_drivers-18.0.5 solved the problem for me.

Bisecting over whole nixpkgs will probably lead to way too many unnecessary rebuilds. I expect you could focus on mesa_drivers (overridden as above) and the kernel.

Has anyone tried xorg-server-1.20?

I was completely wrong: @vcunat's suggestion fixed things for me. I mistakenly put the override in my user's packageOverrides, but putting it in my system configuration got things working again for me. Thank you!

@corngood I did briefly try, but it is challenging since the repo structure has changed so much; see here. The xorgserver build fails quite quickly due to an insufficiently new randrproto, but to get that I think we'd need to change to use the xorgproto mono-repository. This will ultimately simplify things, but it's more than a version bump for the packaging.

Using mesa 18.0.5 works great with the latest kernel, but @vcunat 's solution needed a few tweaks to work in my global configuration. Put this into nixpkgs.config if you already have it defined:

packageOverrides = super: with pkgs; rec {
    mesa_drivers = (super.mesa_drivers.overrideAttrs (attrs: rec {
        version = "18.0.5";
        name = "mesa-noglu-${version}";
        passthru = attrs.passthru // { inherit version; };

        src =  fetchurl {
            url = "https://mesa.freedesktop.org/archive/mesa-${version}.tar.xz";
            sha256 = "5187bba8d72aea78f2062d134ec6079a508e8216062dce9ec9048b5eb2c4fc6b";
        };
    })).drivers;
};

This fixed my problems as well (on linux 4.15.5). Nice catch.

Let's hope it fixes stuff as well on linux 4.17.11.

@acowley I got xorg 1.20 working by adding xorgproto, but it didn't change the behaviour: mesa 18.0 worked, 18.1 hung. We are already on the latest video-amdgpu I think, so I guess bisecting mesa is my next step.

There's a new version of libdrm since today and mesa 18.1.5 was released a few days ago, but I don't expect there's a good chance that either fixes this.

So, funny story.

I started bisecting mesa, marking 18.1.2 as bad and 18.0.3 as good. I figured out the least painful way of testing, which ended up just being a full nixos-rebuild test with fetchgitLocal for mesa. I disabled 32-bit drivers and it was only a few minutes per iteration (no reboot required after a good test).
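
A rough sketch of what such a test configuration might look like, following the shape of the override posted earlier in the thread. The checkout path is hypothetical, and whether the package builds from an unreleased checkout without further changes is not something the thread states.

```nix
{ ... }:
{
  # Skip the 32-bit driver build to shorten each iteration.
  hardware.opengl.driSupport32Bit = false;

  nixpkgs.config.packageOverrides = super: {
    mesa_drivers = (super.mesa_drivers.overrideAttrs (attrs: {
      # Hypothetical path to the local mesa git checkout being bisected.
      src = super.fetchgitLocal /home/user/src/mesa;
    })).drivers;
  };
}
```

Each iteration would then be a `git bisect good` or `git bisect bad` in the checkout, followed by `nixos-rebuild test`.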

I had a mix of good and bad (always the same hang) results, but on a particular bad one I had to reboot and noticed that X wouldn't start on my baseline configuration. I thought that was odd, so I tried the configuration I'd been using for the last several weeks, and X wouldn't start there either.

That's when I remembered .cache/mesa_shader_cache. I booted into multi-user.target, ran sudo rm -rf /root/.cache/mesa_shader_cache, started X, and it worked.

I figured I had blown the bisection by not clearing the cache. So I made a note of the bad tests in the log, and continued testing while clearing the cache each time. All the rest were good, and I landed on a previously bad commit that worked after clearing the cache.

I checked out 18.1.2 again, cleared the cache, and it worked. I reverted all my mesa hacks (back to 18.1.4 from master), cleared the cache, and it worked.

I've run into the odd problem with the shader cache before, but I've only ever had to clear ~/.cache/mesa_shader_cache, and I've only ever had problems when I've been building my own mesa, never with official releases.

Wow, that worked immediately - the display manager started up as soon as I removed that.

Thanks for the dedicated testing!

edit: kernel 4.14.59 ;)

@Pneumaticat thanks for confirming.

I guess this still needs more investigation before 18.1 gets released in nixos.

So far the only bug I can find that sounds somewhat similar is: https://bugs.freedesktop.org/show_bug.cgi?id=105904. You'd think this would be popping up on other distros. Maybe they have hooks to clear the cache on update? Maybe it's because we haven't changed llvm versions?

I'll have to take a look at how the shader cache works. Perhaps we can key the cache using the driver derivation hash or something. It's certainly worrying that it's vulnerable to this sort of problem, especially across minor version releases.

I also had this issue with my Radeon Pro WX 2100. Reverting mesa_drivers to version 18.0.5 solved the no-display problem for me, but now my display is almost unusably slow, with extensive tearing and artifacts.

I have some graphics issue with one or another of my machines every time there is a major version bump of Mesa. We should consider being more conservative about upgrading Mesa, and keeping older versions of the drivers in Nixpkgs longer.

OK, I suppose it will be best. AFAIK major mesa updates used to work mostly without problems (reported back), but apparently the current workflow isn't good anymore.

For example, we can now create mesa_drivers_18_1 (which will keep it in cache, too), and a PR to switch the default which can serve as a thread for interested testers.

@ttuegel does clearing the shader cache fix mesa 18.1 for you?

@vcunat I like the idea of keeping multiple versions, but we'll still have to deal with the cache problem, or people will end up breaking their existing system configurations.

I'll try to find some time over the weekend to figure out how the cache works.

does clearing the shader cache fix mesa 18.1 for you?

@corngood No, clearing the shader cache does not fix the problems with slow display + artifacts.

Ah, I think I found the problem. The shader cache (for radeonsi) is keyed on:

  • cache format version
  • mtime of mesa shared library
  • mtime of llvm shared library
  • gpu name
  • cpu pointer size
  • driver flags

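Obviously the mtimes aren't going to be meaningful in /nix/store/, where every file has its mtime reset to the same fixed value, so two different mesa builds produce the same cache key.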

It would be pretty simple to add the driver derivation path to the cache key. We could do this with a patch in nixpkgs, and we probably should, so we can protect existing releases. As far as upstreaming it, I don't know, maybe have it always include the absolute paths as well as the mtimes? Maybe add a config flag for a hash key?
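
To illustrate the collision, here is a small Nix expression (not Mesa's actual code; the field values, store paths, and helper names are placeholders made up for this sketch). It hashes the listed fields for two different driver builds, once keyed on mtimes only and once with the store path mixed in.

```nix
# Evaluate with: nix-instantiate --eval --strict cache-key.nix
let
  # Hypothetical helper: hash a list of key fields into one cache key.
  hashFields = fields:
    builtins.hashString "sha1" (builtins.concatStringsSep "|" fields);

  # Two different driver builds; in the store both report the same mtime.
  old = { path = "/nix/store/aaaa-mesa-18.0-drivers"; mtime = 1; };
  new = { path = "/nix/store/bbbb-mesa-18.1-drivers"; mtime = 1; };

  # Key built from the fields listed above (placeholder values).
  mtimeKey = d:
    hashFields [ "format-1" (toString d.mtime) "llvm-mtime-1" "gpu" "64" "flags" ];

  # Same key with the driver's store path added.
  pathKey = d: hashFields [ (mtimeKey d) d.path ];
in {
  mtimeKeysCollide = mtimeKey old == mtimeKey new; # true
  pathKeysDiffer = pathKey old != pathKey new;     # true
}
```

With the path included, switching between driver builds would select a different cache key instead of reusing stale entries.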

@corngood Would that mean we could keep the current Mesa version without reverting?

@Chiiruno you should be able to use the current mesa just by deleting your cache. To be clear, that's:

sudo rm -rf /root/.cache/mesa_shader_cache
rm -rf ~/.cache/mesa_shader_cache

The problem is, you may have to do it whenever you switch to a new version of mesa, even if you just want to go back from 18.1 to your previously working system config with 18.0. Fixing the cache logic would essentially clear the cache automatically when switching versions.

For anyone hitting this problem, I'd suggest just deleting the cache and moving to 18.1. Hopefully by the time there's another breaking update, we'll have a fix in place.

@corngood It worked perfectly, I'm on the latest mesa again.
Thanks!

Should be fixed by #44575

Assuming that #44575 is in master (and I assume it is; manually clearing the shader cache didn't work), there's still an unrelated problem on master with a Ryzen 2400G:

  1. Boot using Linux 4.18
  2. stage1 banner is displayed
  3. ^@ is printed after a few seconds.
  4. About half a second later, the screen freezes
  5. The ^@, and other messages previously on the screen are still displayed (cursor stops blinking, though).
  6. SSH is possible

Don't know of another distro with a live ISO on 4.18 to test. 4.14 works, but doesn't support the GPU, so it uses llvmpipe (at full resolution, which may be relevant, but probably not).

nomodeset works, but obviously blocks Xorg, due to amdgpu not supporting it.

Yes, that PR is even in the unstable channels already.

Should I submit mine as a separate issue?

@vcunat Should I submit mine as a separate issue?

@leo60228: yes, that will be better. Discussion here is focused too much around the cache-clearing which should be fixed now.

Sorry to bump an old issue, but I seem to be observing the same behavior. For anyone who has solved this, did you need to add any relevant kernel modules / kernel parameters to get this to work or rather any additional config?

I haven't tried the package override, but I assumed as it's nearly a year old this would be on the stable channel by now?

I've tried the amdgpu, amdgpu-pro and *_unfree driver variants as well, with no luck. I either get a frozen screen or no output at all. The machine does not hang (I can SSH in, etc.), but I can't switch TTY consoles (ctrl-alt-f1 etc.).

@chrissound In case you have a newer AMD card, try adding amdgpu.dc=0 to your kernel parameters

If the GPU is a newer model, I'd also try newer linuxPackages (at least as a general approach when encountering problems).

@chrissound What GPU do you have? I had a Ryzen 5 2400G and fixed my problem by compiling Linux master before 4.19 came out, but I'm using standard 5.0 now.

AMD R7 250.

I tried again today and finally got this to work! Thanks for all the suggestions, not actually sure what the issue was though.

In my hardware-configuration.nix:

  hardware.enableRedistributableFirmware = true;
  boot.initrd.availableKernelModules = [ "nvme" "xhci_pci" "ahci" "usbhid" "sd_mod" ];
  boot.kernelModules = [ "kvm-amd" "fuse"];
  boot.kernelParams = [
    "radeon.si_support=0"
    "amdgpu.si_support=1"
    "amdgpu.dc=1"
  ];
  boot.extraModulePackages = [ ];
  #boot.kernelPackages = pkgs.linuxPackages_4_19;
  boot.kernelPackages = pkgs.linuxPackages_latest;

Using the amdgpu driver.

Kernel:

uname -a
Linux nixos 5.1.3 #1-NixOS SMP Thu May 16 17:35:40 UTC 2019 x86_64 GNU/Linux

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/amd-rx-570-with-triple-monitor-setup/6938/1
