Describe the bug
$ glxinfo
zsh: segmentation fault (core dumped) glxinfo
Bisected to d020ee07cda36787d5bc5fe8a0da0dbbed63859a
To Reproduce
Steps to reproduce the behavior:
git checkout d020ee0
sudo ln -snf $(nix-build . -A mesa.drivers) /run/opengl-driver
glxinfo
Expected behavior
The same test does not segfault with d020ee0^
Additional context
This is on a machine with an Intel iGPU, in case this ends up being a driver-specific problem rather than core Mesa / breaking all drivers.
Notify maintainers
@domenkozar for patchelf
@primeos @vcunat for mesa
Metadata
"x86_64-linux"
Linux 5.6.15-hardened, NixOS, 20.09.git.53bbb34d874 (Nightingale)
yes
yes
nix-env (Nix) 2.3.6
""
/etc/nixpkgs
Maintainer information:
# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
I've bisected it to NixOS/patchelf@c4deb5e9e1ce9c98a48e0d5bb37d87739b8cfee4.
dontPatchELF is not enough as a workaround because the derivation requires some rpath manipulation (done manually in postFixup).
In mesa, iris_disk_cache_init does the following:
const struct build_id_note *note =
build_id_find_nhdr_for_addr(iris_disk_cache_init);
assert(note && build_id_length(note) == 20); /* sha1 */
const uint8_t *id_sha1 = build_id_data(note);
assert(id_sha1);
char timestamp[41];
_mesa_sha1_format(timestamp, id_sha1);
readelf is complaining about the .note.gnu.build-id of the .so when patched with the new patchelf. I assume this is related.
Displaying notes found in: .note.gnu.build-id
Owner Data size Description
readelf: Warning: note with invalid namesz and/or descsz found at offset 0x0
readelf: Warning: type: 0x3, namesize: 0x00000004, descsize: 0x00000014, alignment: 8
patchelf 0.9 doesn't end up causing this warning.
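As a rough illustration of why the reported alignment matters, here is a Python sketch of the standard ELF note layout (three 32-bit words, then the name and the descriptor, each padded to the note alignment). The helper names are hypothetical, and the bounds check is a simplified model rather than readelf's actual code. Whether this exact size mismatch is what the new patchelf triggers is an assumption, not something confirmed in this thread.

```python
import struct

NOTE_HEADER_SIZE = 12  # namesz, descsz, type: three 32-bit words


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def note_size(namesz, descsz, alignment):
    """Bytes one note occupies when name and desc are padded to `alignment`."""
    desc_offset = align_up(NOTE_HEADER_SIZE + namesz, alignment)
    return align_up(desc_offset + descsz, alignment)


def parse_note(blob, alignment=4):
    """Parse one little-endian note; raise ValueError if it overruns the buffer."""
    namesz, descsz, note_type = struct.unpack_from("<III", blob, 0)
    if note_size(namesz, descsz, alignment) > len(blob):
        raise ValueError("note with invalid namesz and/or descsz")
    desc_offset = align_up(NOTE_HEADER_SIZE + namesz, alignment)
    name = blob[NOTE_HEADER_SIZE:NOTE_HEADER_SIZE + namesz]
    desc = blob[desc_offset:desc_offset + descsz]
    return note_type, name, desc


# Same header fields as in the warning: type 0x3, namesz 4 ("GNU\0"), descsz 0x14.
build_id = bytes(range(20))  # placeholder bytes, not a real build ID
blob = struct.pack("<III", 4, 0x14, 3) + b"GNU\0" + build_id

print(len(blob))               # 36
print(note_size(4, 0x14, 4))   # 36
print(note_size(4, 0x14, 8))   # 40
```

In this model a 36-byte note parses cleanly with 4-byte alignment, while a reader that assumes 8-byte alignment expects 40 bytes and rejects it.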
I've sent a PR to fix this in patchelf: https://github.com/NixOS/patchelf/pull/218
I think I stopped this from getting into nixos-unstable by cancelling this job.
For current master I expect we'll use something like https://github.com/NixOS/nixpkgs/pull/91157 ?
Apparently not all cases get affected. For my amdgpu (POLARIS11) the crash doesn't happen.
I think that's only for drivers that have shader caching enabled, since that's the feature which reads build-id from the .so.
Unfortunately that includes all Intel iGPUs as far as I can tell.
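For readers unfamiliar with the Mesa snippet quoted earlier, the following Python sketch mirrors its two steps: check that the note payload is 20 bytes (SHA-1 sized), then render it as 40 hex characters, which is why the C buffer is 41 bytes long (40 digits plus the terminating NUL). The function name is hypothetical and the lowercase output is an assumption; this is not Mesa's implementation.

```python
def format_build_id(note_desc):
    """Render a SHA-1 sized build ID as 40 hex characters.

    Hypothetical helper: the C code uses assert() for the length check,
    here a ValueError is raised instead.
    """
    if note_desc is None or len(note_desc) != 20:
        raise ValueError("expected a 20-byte (SHA-1) build ID")
    return note_desc.hex()


example_id = bytes(range(20))  # placeholder bytes, not a real build ID
print(format_build_id(example_id))
```

If the note cannot be found or has an unexpected length, a driver that relies on this value has nothing valid to format, which fits the observation that only drivers reading the build-id are affected.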
To clarify the current status: with #91157 merged the immediate problem is worked around and Mesa on master isn't broken anymore. I think I'll keep this bug open to track the long-term fix.
I'm quite uncertain about the likelihood of this manifesting in any other package, but you even provided a patch :heart: so for now I'll hope that "someone who knows the context well" will review it soon and we can fix it all.
I would say this particular failure mode is unlikely to hit anything else. The only other thing I know that reads build-id notes is gdb (when using external symbol files), so maybe this patchelf bug could make using gdb on nixos harder for some users. I don't think it's worth worrying about that one.
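To make the gdb case concrete: when gdb looks up separate debug info by build ID, it searches each debug directory for .build-id/NN/REST.debug, where NN is the first two hex characters of the build ID and REST is the remainder. A minimal Python sketch of that path derivation follows. The function name is hypothetical and the default directory is only a placeholder, since the actual debug directory depends on how gdb is configured.

```python
import posixpath


def debug_file_for_build_id(build_id_hex, debug_dir="/usr/lib/debug"):
    """Return the path where gdb would look for a separate debug file.

    `debug_dir` is a placeholder default, not necessarily what a given
    system uses.
    """
    if len(build_id_hex) < 3:
        raise ValueError("build ID too short to split into NN/REST")
    return posixpath.join(
        debug_dir, ".build-id", build_id_hex[:2], build_id_hex[2:] + ".debug"
    )


print(debug_file_for_build_id("abcdef0123456789abcdef0123456789abcdef01"))
```

A binary whose build-id note cannot be read gives gdb no ID to derive this path from, so the lookup by build ID would not find the symbol file.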
I also knew only about the gdb use case and I had no idea about overlaps (but I just don't deal with low-level ELF stuff). Anyway, we'll see... or rather I hope we won't.
This was only a problem for Iris for me. Reverting back to using i915 was fine (Coffee Lake).
To make this issue easier to find via search engines, I'm also appending the backtraces for this case:
This is what the crash looks like in Xorg:
$ sudo coredumpctl debug /nix/store/q71hsmwan6zk0nwh1awbyw6m9js3w62n-xorg-server-1.20.8/bin/Xorg
(gdb) bt
#0 0x00007f207105317a in raise () from /nix/store/c2rlh7xa8fcgg7qz8pl76ipvvb172c6k-glibc-2.30/lib/libc.so.6
#1 0x00007f207103d548 in abort () from /nix/store/c2rlh7xa8fcgg7qz8pl76ipvvb172c6k-glibc-2.30/lib/libc.so.6
#2 0x000000000059e90a in OsAbort ()
#3 0x00000000005a3c43 in AbortServer ()
#4 0x00000000005a4966 in FatalError ()
#5 0x000000000059bc15 in OsSigHandler ()
#6 <signal handler called>
#7 0x00007f206f500d29 in os_read_file () from /run/opengl-driver/lib/dri/iris_dri.so
#8 0x00007fff61fd03e0 in ?? ()
#9 0x827990036893cc00 in ?? ()
#10 0x00000000021d2770 in ?? ()
#11 0x00007fff61fd03f0 in ?? ()
#12 0x00000000021d2770 in ?? ()
#13 0x00000000021d4208 in ?? ()
#14 0x00007f206ff910e1 in ?? () from /run/opengl-driver/lib/dri/iris_dri.so
#15 0x00007f206ff910f2 in ?? () from /run/opengl-driver/lib/dri/iris_dri.so
#16 0x00000000021d4280 in ?? ()
#17 0x00007f206f194a77 in pipe_iris_create_screen () from /run/opengl-driver/lib/dri/iris_dri.so
#18 0x00007f206f194a77 in pipe_iris_create_screen () from /run/opengl-driver/lib/dri/iris_dri.so
#19 0x00007f206f787548 in ?? () from /run/opengl-driver/lib/dri/iris_dri.so
#20 0x00000000021d42a0 in ?? ()
#21 0x827990036893cc00 in ?? ()
#22 0x00000000021d4160 in ?? ()
#23 0x00007f206f1970f8 in dri2_init_screen () from /run/opengl-driver/lib/dri/iris_dri.so
#24 0x00007f206f6b4947 in driCreateNewScreen2 () from /run/opengl-driver/lib/dri/iris_dri.so
#25 0x00007fff61fd0f58 in ?? ()
#26 0x0000000002180d70 in ?? ()
#27 0x00007f20708c25b6 in dri_screen_create_dri2 () from /nix/store/j49s35llg57p2cpshf7xhajzzrmhrkqp-mesa-20.0.8/lib/libgbm.so.1
#28 0x00007f20708c2968 in dri_device_create () from /nix/store/j49s35llg57p2cpshf7xhajzzrmhrkqp-mesa-20.0.8/lib/libgbm.so.1
#29 0x00007f20708c0637 in gbm_create_device () from /nix/store/j49s35llg57p2cpshf7xhajzzrmhrkqp-mesa-20.0.8/lib/libgbm.so.1
#30 0x00007f20708d7393 in glamor_egl_init () from /nix/store/q71hsmwan6zk0nwh1awbyw6m9js3w62n-xorg-server-1.20.8/lib/xorg/modules/libglamoregl.so
#31 0x00007f2070910251 in PreInit () from /nix/store/q71hsmwan6zk0nwh1awbyw6m9js3w62n-xorg-server-1.20.8/lib/xorg/modules/drivers/modesetting_drv.so
#32 0x000000000048262f in InitOutput ()
#33 0x0000000000445344 in dix_main ()
#34 0x00007f207103ed8b in __libc_start_main () from /nix/store/c2rlh7xa8fcgg7qz8pl76ipvvb172c6k-glibc-2.30/lib/libc.so.6
#35 0x000000000042f33a in _start ()
This is the crash in sway (Wayland):
$ sudo coredumpctl debug /nix/store/zqxzz9d0rdgpxwsyivzdrs8qa264yxbs-sway-unwrapped-1.4/bin/sway
(gdb) bt
#0 0x00007f41a158ad29 in os_read_file () from /run/opengl-driver/lib/dri/iris_dri.so
#1 0x00007fffd05d6150 in ?? ()
#2 0x9356315f1737e200 in ?? ()
#3 0x00000000021443d0 in ?? ()
#4 0x00007fffd05d6160 in ?? ()
#5 0x00000000021443d0 in ?? ()
#6 0x0000000002145e68 in ?? ()
#7 0x00007f41a201b0e1 in ?? () from /run/opengl-driver/lib/dri/iris_dri.so
#8 0x00007f41a201b0f2 in ?? () from /run/opengl-driver/lib/dri/iris_dri.so
#9 0x0000000002145bb0 in ?? ()
#10 0x00007f41a121ea77 in pipe_iris_create_screen () from /run/opengl-driver/lib/dri/iris_dri.so
#11 0x00007f41a121ea77 in pipe_iris_create_screen () from /run/opengl-driver/lib/dri/iris_dri.so
#12 0x00007f41a1811548 in ?? () from /run/opengl-driver/lib/dri/iris_dri.so
#13 0x0000000002145bd0 in ?? ()
#14 0x9356315f1737e200 in ?? ()
#15 0x0000000002145dc0 in ?? ()
#16 0x00007f41a12210f8 in dri2_init_screen () from /run/opengl-driver/lib/dri/iris_dri.so
#17 0x00007f41a173e947 in driCreateNewScreen2 () from /run/opengl-driver/lib/dri/iris_dri.so
#18 0x0000000002106060 in ?? ()
#19 0x0000000002106060 in ?? ()
#20 0x00007f41a2e2f5b6 in dri_screen_create_dri2 () from /nix/store/j49s35llg57p2cpshf7xhajzzrmhrkqp-mesa-20.0.8/lib/libgbm.so.1
#21 0x00007f41a2e2f968 in dri_device_create () from /nix/store/j49s35llg57p2cpshf7xhajzzrmhrkqp-mesa-20.0.8/lib/libgbm.so.1
#22 0x00007f41a2e2d637 in gbm_create_device () from /nix/store/j49s35llg57p2cpshf7xhajzzrmhrkqp-mesa-20.0.8/lib/libgbm.so.1
#23 0x00007f41a36c7df5 in init_drm_renderer () from /nix/store/6amfxnsdjnr0r0icwy2a1qy3l21j9fl6-wlroots-0.10.1/lib/libwlroots.so.5
#24 0x00007f41a36c37f2 in wlr_drm_backend_create () from /nix/store/6amfxnsdjnr0r0icwy2a1qy3l21j9fl6-wlroots-0.10.1/lib/libwlroots.so.5
#25 0x00007f41a36c1bb3 in attempt_drm_backend () from /nix/store/6amfxnsdjnr0r0icwy2a1qy3l21j9fl6-wlroots-0.10.1/lib/libwlroots.so.5
#26 0x00007f41a36c2396 in wlr_backend_autocreate () from /nix/store/6amfxnsdjnr0r0icwy2a1qy3l21j9fl6-wlroots-0.10.1/lib/libwlroots.so.5
#27 0x00000000004191e1 in server_privileged_prepare ()
#28 0x000000000040e5fb in main ()
On latest unstable (a45f68ccac476dc37ddf294530538f2f2cce5a92) sway fails with the following:
MESA-LOADER: failed to open radeonsi (search paths /run/opengl-driver/lib/dri)
failed to load driver: radeonsi
MESA-LOADER: failed to open kms_swrast (search paths /run/opengl-driver/lib/dri)
failed to load driver: kms_swrast
MESA-LOADER: failed to open swrast (search paths /run/opengl-driver/lib/dri)
failed to load swrast driver
2020-07-31 09:38:43 - [backend/drm/renderer.c:19] Failed to create GBM device
2020-07-31 09:38:43 - [backend/drm/backend.c:203] Failed to initialize renderer
2020-07-31 09:38:43 - [backend/backend.c:163] Failed to open DRM device 9
2020-07-31 09:38:43 - [backend/backend.c:304] Failed to open any DRM device
2020-07-31 09:38:43 - [sway/server.c:47] Unable to create backend
I'm using the amdgpu driver on my AMD GPU. This is also reproducible on latest master.
Can't check on d020ee0, but it does work on bdac777. Is this the same issue or should I open another one?
Yes. Looks like it.
Thank you for confirming!
Is there any known way to fix this besides rolling back? I tried pinning mesa_drivers and sway to an older version but it didn't help. What else can I try?
Sorry, what I meant to say is that this looks like a different issue. I don't own ATI GPUs. You can open a new issue.
Meanwhile, I seem to also have this issue with the iris driver.
Well, the stacktrace above is from iris, but we don't build mesa with the bad patchelf version (for more than a month now), so it shouldn't be happening anymore.
Yet the iris driver causes a segfault. I guess I should create a different issue with some backtrace.
Yes, a separate one if the trace looks sufficiently different.
@delroth: that trace really does look like a failure when build ID is used by iris.
Agreed, seems like the same stack trace I was dealing with originally. So we were probably just lucky that the layout of the .so didn't trigger the bug with previous versions of patchelf. I expect master patchelf should solve that.
@edolstra can we get a new release for patchelf to fix this properly?
Closing this as it should be fully resolved since #96513 was merged :)