Linux: drm: BUG: unable to handle page fault for address: 17ec6000

Created on 7 Jul 2020 · 18 Comments · Source: ClangBuiltLinux/linux

On the Asus F2A85-M PRO with

00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Richland [Radeon HD 8470D] [1002:9996]

running Debian Sid/unstable with Linux v5.8-rc4-25-gbfe91da29bfad (with some patches for LLVM/Clang/LLD) built with clang-11 and lld-11 1:11~++20200701093119+ffee8040534-1~exp1 from experimental, starting a graphical session (X.Org or Wayland) fails with a page fault:

[  502.044997] BUG: unable to handle page fault for address: 17ec6000
[  502.045650] #PF: supervisor write access in kernel mode
[  502.046301] #PF: error_code(0x0002) - not-present page
[  502.046956] *pde = 00000000 
[  502.047612] Oops: 0002 [#1] SMP
[  502.048269] CPU: 0 PID: 2125 Comm: Xorg.wrap Not tainted 5.8.0-rc4-00105-g4da71f1ee6263 #141
[  502.048967] Hardware name: System manufacturer System Product Name/F2A85-M PRO, BIOS 6601 11/25/2014
[  502.049686] EIP: __srcu_read_lock+0x11/0x20
[  502.050413] Code: 83 e0 03 50 56 68 72 c6 99 dd 68 46 c6 99 dd e8 3a c8 fe ff 83 c4 10 eb ce 0f 1f 44 00 00 55 89 e5 8b 48 68 8b 40 7c 83 e1 01 <64> ff 04 88 f0 83 44 24 fc 00 89 c8 5d c3 90 0f 1f 44 00 00 55 89
[  502.052027] EAX: 00000000 EBX: f36671b8 ECX: 00000000 EDX: 00000286
[  502.052856] ESI: f3f94eb8 EDI: f3e51c00 EBP: f303dd9c ESP: f303dd9c
[  502.053695] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[  502.054543] CR0: 80050033 CR2: 17ec6000 CR3: 2eea2000 CR4: 000406d0
[  502.055402] Call Trace:
[  502.056275]  drm_minor_acquire+0x6f/0x140 [drm]
[  502.057162]  drm_stub_open+0x2e/0x110 [drm]
[  502.058049]  chrdev_open+0xdd/0x1e0
[  502.058937]  do_dentry_open+0x21d/0x330
[  502.059828]  vfs_open+0x23/0x30
[  502.060718]  path_openat+0x947/0xd60
[  502.061610]  ? unlink_anon_vmas+0x53/0x120
[  502.062504]  do_filp_open+0x6d/0x100
[  502.063404]  ? __alloc_fd+0x73/0x140
[  502.064305]  do_sys_openat2+0x1b3/0x2a0
[  502.065217]  __ia32_sys_openat+0x90/0xb0
[  502.066128]  ? prepare_exit_to_usermode+0xa/0x20
[  502.067046]  do_fast_syscall_32+0x68/0xd0
[  502.067970]  do_SYSENTER_32+0x12/0x20
[  502.068902]  entry_SYSENTER_32+0x9f/0xf2
[  502.069839] EIP: 0xb7ef14f9
[  502.070764] Code: Bad RIP value.
[  502.071689] EAX: ffffffda EBX: ffffff9c ECX: bfa6a2ac EDX: 00008002
[  502.072654] ESI: 00000000 EDI: b7ed1000 EBP: bfa6b2c8 ESP: bfa6a1c0
[  502.073630] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[  502.074615] Modules linked in: af_packet k10temp r8169 realtek i2c_piix4 snd_hda_codec_realtek snd_hda_codec_generic ohci_pci ohci_hcd ehci_pci snd_hda_codec_hdmi ehci_hcd radeon i2c_algo_bit snd_hda_intel ttm snd_intel_dspcfg snd_hda_codec drm_kms_helper snd_hda_core snd_pcm cfbimgblt cfbcopyarea cfbfillrect snd_timer sysimgblt syscopyarea sysfillrect snd fb_sys_fops xhci_pci xhci_hcd soundcore acpi_cpufreq drm drm_panel_orientation_quirks agpgart ipv6 nf_defrag_ipv6
[  502.077895] CR2: 0000000017ec6000
[  502.079050] ---[ end trace ced4517b63a6db26 ]---
[  502.080214] EIP: __srcu_read_lock+0x11/0x20
[  502.081392] Code: 83 e0 03 50 56 68 72 c6 99 dd 68 46 c6 99 dd e8 3a c8 fe ff 83 c4 10 eb ce 0f 1f 44 00 00 55 89 e5 8b 48 68 8b 40 7c 83 e1 01 <64> ff 04 88 f0 83 44 24 fc 00 89 c8 5d c3 90 0f 1f 44 00 00 55 89
[  502.083891] EAX: 00000000 EBX: f36671b8 ECX: 00000000 EDX: 00000286
[  502.085148] ESI: f3f94eb8 EDI: f3e51c00 EBP: f303dd9c ESP: f303dd9c
[  502.086406] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[  502.087675] CR0: 80050033 CR2: 17ec6000 CR3: 2eea2000 CR4: 000406d0
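
The `error_code(0x0002)` line in the trace can be read bit by bit. The following is a minimal sketch, assuming the standard x86 page-fault error code layout (bit 0 present/protection, bit 1 write, bit 2 user, bit 3 reserved bit, bit 4 instruction fetch); `decode_pf_error_code` is a hypothetical helper name, not a kernel function.

```python
# Bit layout of the x86 page-fault error code (low five bits only).
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: supervisor (kernel) mode, 1: user mode
PF_RSVD = 1 << 3   # reserved bit set in a paging-structure entry
PF_INSTR = 1 << 4  # fault was an instruction fetch


def decode_pf_error_code(code: int) -> dict:
    """Return a readable breakdown of an x86 #PF error code (hypothetical helper)."""
    return {
        "cause": "protection violation" if code & PF_PROT else "not-present page",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "supervisor",
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


# 0x0002 is the value printed in the oops above.
decoded = decode_pf_error_code(0x0002)
print(decoded)
```

For 0x0002 this gives a supervisor-mode write to a not-present page, which matches the two `#PF:` lines in the log.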

• linux-5.8-rc4+-messages.txt

Reported upstream [BUG] Untriaged

All 18 comments

Should I report this to the DRM folks, or is it a LLVM/Clang issue because it works with GCC just fine?

> Should I report this to the DRM folks

It might be worth getting them involved due to the complexity of the system.

> is it a LLVM/Clang issue because it works with GCC just fine?

Just because it works fine with GCC doesn't mean it is an LLVM/Clang issue. See https://github.com/ClangBuiltLinux/linux/issues/735 for an instance of this with amdgpu.

Other than that, I do not have much else to offer at the moment from staring at the code.

> with some patches for LLVM/Clang/LLD

What does that mean; they may be important to reproduce.

From the trace, it looks like just the 32b registers are being printed? Is this a 32b kernel image, or a 64b kernel image?

> with some patches for LLVM/Clang/LLD
>
> What does that mean; they may be important to reproduce.

Sorry, the two Linux commits for the two issues below are applied.

  1. x86: support i386 with Clang, https://github.com/ClangBuiltLinux/linux/issues/194
  2. x86/boot: allow a relocatable kernel to be linked with lld, https://github.com/ClangBuiltLinux/linux/issues/579

> From the trace, it looks like just the 32b registers are being printed? Is this a 32b kernel image, or a 64b kernel image?

This is a 32-bit (ARCH=i386) Linux kernel image.

$ dmesg | ./scripts/decodecode
[ 55.784870] Code: 83 e0 03 50 56 68 ca c6 99 cf 68 9e c6 99 cf e8 3a c8 fe ff 83 c4 10 eb ce 0f 1f 44 00 00 55 89 e5 8b 48 68 8b 40 7c 83 e1 01 <64> ff 04 88 f0 83 44 24 fc 00 89 c8 5d c3 90 0f 1f 44 00 00 55 89
All code
========
   0:   83 e0 03                and    $0x3,%eax
   3:   50                      push   %eax
   4:   56                      push   %esi
   5:   68 ca c6 99 cf          push   $0xcf99c6ca
   a:   68 9e c6 99 cf          push   $0xcf99c69e
   f:   e8 3a c8 fe ff          call   0xfffec84e
  14:   83 c4 10                add    $0x10,%esp
  17:   eb ce                   jmp    0xffffffe7
  19:   0f 1f 44 00 00          nopl   0x0(%eax,%eax,1)
  1e:   55                      push   %ebp
  1f:   89 e5                   mov    %esp,%ebp
  21:   8b 48 68                mov    0x68(%eax),%ecx
  24:   8b 40 7c                mov    0x7c(%eax),%eax
  27:   83 e1 01                and    $0x1,%ecx
  2a:*  64 ff 04 88             incl   %fs:(%eax,%ecx,4)        <-- trapping instruction
  2e:   f0 83 44 24 fc 00       lock addl $0x0,-0x4(%esp)
  34:   89 c8                   mov    %ecx,%eax
  36:   5d                      pop    %ebp
  37:   c3                      ret    
  38:   90                      nop
  39:   0f 1f 44 00 00          nopl   0x0(%eax,%eax,1)
  3e:   55                      push   %ebp
  3f:   89                      .byte 0x89

Code starting with the faulting instruction
===========================================
   0:   64 ff 04 88             incl   %fs:(%eax,%ecx,4)
   4:   f0 83 44 24 fc 00       lock addl $0x0,-0x4(%esp)
   a:   89 c8                   mov    %ecx,%eax
   c:   5d                      pop    %ebp
   d:   c3                      ret    
   e:   90                      nop
   f:   0f 1f 44 00 00          nopl   0x0(%eax,%eax,1)
  14:   55                      push   %ebp
  15:   89                      .byte 0x89
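
As a cross-check of the decodecode output, the sketch below locates the `<..>`-marked byte in the `Code:` line and computes the offset part of the `%fs:(%eax,%ecx,4)` operand from the EAX and ECX values in the register dump of the original report. The helper names `trapping_offset` and `effective_offset` are hypothetical and only for illustration.

```python
import re

# The Code: line fed to decodecode above, split over three strings for width.
CODE_LINE = (
    "83 e0 03 50 56 68 ca c6 99 cf 68 9e c6 99 cf e8 3a c8 fe ff "
    "83 c4 10 eb ce 0f 1f 44 00 00 55 89 e5 8b 48 68 8b 40 7c 83 e1 01 "
    "<64> ff 04 88 f0 83 44 24 fc 00 89 c8 5d c3 90 0f 1f 44 00 00 55 89"
)


def trapping_offset(code_line: str) -> int:
    """Return the byte offset of the <xx>-marked byte in an oops Code: line."""
    for offset, token in enumerate(code_line.split()):
        if re.fullmatch(r"<[0-9a-f]{2}>", token):
            return offset
    raise ValueError("no trapping byte marked with <..>")


def effective_offset(base: int, index: int, scale: int) -> int:
    """Offset part of an x86 (base,index,scale) operand, wrapped to 32 bits."""
    return (base + index * scale) & 0xFFFFFFFF


trap_at = trapping_offset(CODE_LINE)
# EAX (base) and ECX (index) as printed in the register dump of the oops.
offset = effective_offset(base=0x00000000, index=0x00000000, scale=4)
print(hex(trap_at), hex(offset))
```

The marked byte sits at offset 0x2a, the same position decodecode flags as the trapping instruction. With both registers zero, the operand offset is zero, so the access lands at the %fs segment base itself. Reading that as a NULL pointer loaded by the preceding `mov 0x7c(%eax),%eax` is an interpretation, not something the log states directly.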

This also happens for me on aarch64 when building unpatched Linux 5.8.9 from kernel.org with clang:

[   19.810753] Unable to handle kernel paging request at virtual address ffff802f0735a000
[   19.810755] Mem abort info:
[   19.810756]   ESR = 0x96000005
[   19.810758]   EC = 0x25: DABT (current EL), IL = 32 bits
[   19.810760]   SET = 0, FnV = 0
[   19.810761]   EA = 0, S1PTW = 0
[   19.810762] Data abort info:
[   19.810764]   ISV = 0, ISS = 0x00000005
[   19.810765]   CM = 0, WnR = 0
[   19.810767] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000e3578000
[   19.810768] [ffff802f0735a000] pgd=0000002fdffff003, p4d=0000002fdffff003, pud=0000000000000000
[   19.810773] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[   19.810886] Modules linked in: cdc_ether usbnet r8152 mii nls_utf8 nls_cp437 vfat fat snd_usb_audio aes_ce_blk crypto_simd cryptd snd_usbmidi_lib aes_ce_cipher snd_rawmidi crct10dif_ce ghash_ce snd_seq_device snd_hda_codec_hdmi joydev mousedev gf128mul mc input_leds af_packet sha2_ce efi_pstore evdev snd_hda_intel snd_intel_dspcfg snd_hda_codec sha256_arm64 sha1_ce snd_hda_core efivars sbsa_gwdt snd_hwdep snd_pcm snd_timer snd lm90 soundcore uio_pdrv_genirq uio fan thermal processor dwc3 efivarfs hid_generic usbhid hid amdgpu gpu_sched hwmon ttm drm_kms_helper drm cec fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_algo_bit i2c_core nvme nvme_core ahci_platform libahci_platform libahci libata xhci_plat_hcd xhci_hcd loop usb_storage usbcore
[   19.816393] CPU: 15 PID: 2957 Comm: X Not tainted 5.8.9-1-edge #2-Alpine
[   19.816834] Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II Jul 24 2020
[   19.816988] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[   19.817607] pc : __srcu_read_lock+0x38/0x7c
[   19.832584] lr : drm_minor_acquire+0xa8/0x11c [drm]
[   19.838341] sp : ffff800012f9baa0
[   19.838342] x29: ffff800012f9baa0 x28: 0000000000000041 
[   19.838344] x27: ffff002f100a24c0 x26: 0000000000000030 
[   19.838346] x25: ffff002f08879140 x24: ffff002f10a33328 
[   19.838347] x23: 0000000000000000 x22: ffff800008a5b4e8 
[   19.838349] x21: ffff002f0785c3f0 x20: ffff002f10af6000 
[   19.838351] x19: 0000000000000000 x18: 0000000000000000 
[   19.838352] x17: 0000000000000002 x16: ffffffffffffffff 
[   19.838354] x15: 0000000000000028 x14: 0000000000000005 
[   19.838356] x13: ffff002f08879148 x12: 0000000000000000 
[   19.838357] x11: ffff802f0735a000 x10: 0000000000000001 
[   19.838359] x9 : ffff802f0735a000 x8 : ffff002f100a24c0 
[   19.838360] x7 : 0000000000000000 x6 : 000000000000003f 
[   19.838362] x5 : 0000000000000000 x4 : 0000000000000000 
[   19.838364] x3 : 0000000000000001 x2 : ffff002f082e6400 
[   19.838365] x1 : 0000000000000000 x0 : ffff800008a794d0 
[   19.838368] Call trace:
[   19.838371]  __srcu_read_lock+0x38/0x7c
[   19.838382]  drm_minor_acquire+0xa8/0x11c [drm]
[   19.838391]  drm_stub_open+0x34/0x114 [drm]
[   19.838394]  chrdev_open+0x198/0x1f8
[   19.838396]  do_dentry_open+0x268/0x3a0
[   19.838398]  vfs_open+0x28/0x30
[   19.838399]  path_openat+0x888/0xc0c
[   19.838401]  do_filp_open+0x74/0x11c
[   19.838403]  do_sys_openat2+0x7c/0x14c
[   19.838404]  __arm64_sys_openat+0x68/0x8c
[   19.838407]  el0_svc_common+0x98/0x160
[   19.838409]  do_el0_svc+0x70/0x78
[   19.838412]  el0_sync_handler+0xd4/0x248
[   19.838413]  el0_sync+0x140/0x180
[   19.838416] Code: d538d08b 8b130d49 8b090169 5280002a (c85f7d2c) 
[   19.838419] ---[ end trace 32b105d7eb3c05e2 ]---
[   19.838434] note: X[2957] exited with preempt_count 1
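
The `ESR = 0x96000005` value in the trace above packs the fields that the kernel then prints one by one. A minimal sketch of that split, assuming the architectural ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], and for data aborts WnR in ISS bit 6 and the fault status code in ISS bits [5:0]); `decode_esr` is a hypothetical helper and only decodes translation faults.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields printed in the oops."""
    iss = esr & 0x1FFFFFF          # ISS: bits [24:0]
    dfsc = iss & 0x3F              # data fault status code: ISS bits [5:0]
    fields = {
        "ec": (esr >> 26) & 0x3F,  # exception class: bits [31:26]
        "il": (esr >> 25) & 0x1,   # 1 means a 32-bit instruction trapped
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,   # 0: read, 1: write
        "dfsc": dfsc,
    }
    # DFSC 0b0001LL encodes a translation fault at level LL.
    if dfsc >> 2 == 0b0001:
        fields["fault"] = f"translation fault, level {dfsc & 0x3}"
    else:
        fields["fault"] = "not decoded by this sketch"
    return fields


# 0x96000005 is the ESR value from the aarch64 report above.
esr_fields = decode_esr(0x96000005)
print(esr_fields)
```

The result agrees with the log (EC 0x25, IL set, ISS 0x5, WnR 0) and reads the fault as a level 1 translation fault on a read, which lines up with the `pud=0000000000000000` entry in the page-table walk line.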

I suspect the issue has to do with SRCU.

I will have to build a kernel by hand outside of the Alpine kernel packaging but will sprinkle in some printk() this weekend as requested in that thread.

Thanks for the reports. @kaniini, please attach a disassembly of the bottom-most stack frame when posting traces; the two go hand in hand and we typically need both to understand reports. They also need to come from precisely the same kernel image; rebuilding may change the object file (I'm not sure of the kernel's status as far as fully reproducible builds are concerned).

Paul had some suggestions:

0.      Did someone call srcu_read_lock() before init_srcu_struct()
        had been called on this srcu_struct structure?

Printing the pointer via printk %p with kptr_restrict should help us spot whether we see the same address in both calls, but perhaps in the wrong order.

1.      Did the init_srcu_struct() for this srcu_struct report an error?
        (Though with current mainline, that memory-allocation failure
        would more likely have page-faulted in init_srcu_struct().)

You can check dmesg more closely for any reports. I noticed that some of these functions have different definitions when CONFIG_DEBUG_LOCK_ALLOC is enabled. You could try enabling that config.

2.      Has the srcu_struct in question already been passed to
        cleanup_srcu_struct()?

So in this case, I'd add printk's in cleanup_srcu_struct of the %p pointer. You might need kptr restrict disabled to not see 0x00's in the dmesg. If you see the same address as one that's been

3.      Has the value of %fs been clobbered?  Though that seems
        unlikely given that it also happens on aarch64.  Plus, the
        smoking gun seems to me to be the zero value of %eax.

I don't think this is the case: %fs had a value in @paulmenzel's report, and @kaniini's report is arm64.

4.      If the above three questions fail to provide enlightenment,
        I suggest recording the ->sda value and adding debug checks
        to anything that can unmap memory...  And recording the value
        of ->sda somewhere to check to see if it is being changed (it
        should remain constant from init_srcu_struct()'s return through
        the corresponding call to cleanup_srcu_struct()).

I kind of get the feeling that there may be a dangling reference to a value that's been cleaned up somewhere, too. I wonder if enabling KASAN would help find use after frees here?
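
Once printk() calls have been added as suggested, the ordering checks from questions 0, 2 and 4 can be run over the collected log. The following is only a sketch: the event tuples, the `check_srcu_events` name and the sample addresses are all invented for illustration, and it assumes each logged line has already been reduced to an operation, the srcu_struct address, and the ->sda value.

```python
def check_srcu_events(events):
    """Replay (op, struct_addr, sda) tuples and report suspicious orderings.

    op is one of "init", "lock", "cleanup". The tuple format is hypothetical;
    it stands for whatever the added printk() calls logged.
    """
    state = {}      # struct_addr -> "live" or "cleaned"
    sda_seen = {}   # struct_addr -> sda value recorded at init
    problems = []
    for op, addr, sda in events:
        if op == "init":
            state[addr] = "live"
            sda_seen[addr] = sda
        elif op == "lock":
            if addr not in state:
                problems.append((addr, "lock before init"))
            elif state[addr] == "cleaned":
                problems.append((addr, "lock after cleanup"))
            elif sda != sda_seen[addr]:
                problems.append((addr, "sda changed since init"))
            if sda == 0:
                problems.append((addr, "sda is NULL"))
        elif op == "cleanup":
            state[addr] = "cleaned"
    return problems


# Invented sample data, not taken from any real log.
log = [
    ("init", 0xA000, 0x1000),
    ("lock", 0xA000, 0x1000),     # fine
    ("lock", 0xB000, 0x0),        # never initialised, NULL sda
    ("cleanup", 0xA000, 0x1000),
    ("lock", 0xA000, 0x1000),     # used after cleanup
]
problems = check_srcu_events(log)
print(problems)
```

Question 1 (an error return from init_srcu_struct()) is not modelled here; it would need the return value logged as well.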

arm64 does not have scripts/decodecode available to it, AFAIK. Otherwise I would :)

Also, %fs was originally zeroed in @paulmenzel's report. The second mention of FS is 0xd8, which means clobbering is possible.

So as it turns out, I think that @nickdesaulniers's recent SRCU patch actually fixes this issue... https://lore.kernel.org/lkml/[email protected]/

I can reproduce these warnings on my Raspberry Pi on next-20201002 by just booting it up now that the DRM stack works fine.
As soon as I add Nick's patch, they go away. Further testing would be nice from @paulmenzel and @kaniini and if that patch fixes it, we should ask Paul to fast track it to mainline with a CC stable tag.

Actually, I just decided to reply on the mailing list with that information: https://lore.kernel.org/lkml/20201006065623.GA2418984@ubuntu-m3-large-x86/. Further testing would still be appreciated!

Thank you for the update. I did a test again with

$ git describe --tags clang-lto/clang-lto # x86, build: allow LTO_CLANG and THINLTO to be selected
v5.9-rc8-174-gf37134efda8fd

with the KBUILD_LDFLAGS += -z notext change from https://github.com/ClangBuiltLinux/linux/issues/579 applied on top.

Using the packages clang-11 and lld-11 at version 11.0.0~+rc5-1, with non-versioned symbolic links added, an image built with

make bindeb-pkg -j32 ARCH=i386 LLVM=1

works, and the bug is not visible.

$ more /proc/version
Linux version 5.9.0-rc8+ (root@855cb05d002d) (Debian clang version 11.0.0-+rc5-1 , LLD 11.0.0) #205 SMP Tue Oct 6 08:09:19 UTC 2020

But it looks like Nick’s patch you referenced is not in the branch,

git log --grep srcu -i --author=Nick

so my problem seems to have been something else, and this issue can be closed, and a new one opened for yours on the Raspberry Pi?

Hmmm, good to know that your issue is resolved, although I cannot help but feel that the issues are somehow related, given that the call traces are extremely similar. I do not think we should split the bugs for now.

Applying the SRCU patch does seem to resolve it here in light testing.

> But it looks like, Nick’s patch you referenced is not in the branch,

My patch hasn't landed in mainline yet. You'll need to pick it up and apply it manually.
