general protection fault: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 786 Comm: systemd-udevd Tainted: P O 5.4.0-rc2-1-me-clang+ #3
Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 5007 06/17/2019
RIP: 0010:CalculatePrefetchSourceLines+0xe3/0x190 [amdgpu]
Code: f2 0f 58 05 37 8a 16 00 f2 48 0f 2c c0 31 d2 41 f7 f6 e9 a0 00 00 00 f2 0f 58 05 20 8a 16 00 44 89 f0 0f 57 c9 f2 48 0f 2a c8 <66> 0f 29 4d d0 f2 0f 5e c1 f2 0f 5a c0 f3 0f 10 0d f0 8f 16 00 e8
RSP: 0018:ffff943717476db8 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff9436c8331f38 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9436c8331f38
RBP: ffff943717476de8 R08: 0000000000000000 R09: ffff9436c8336d10
R10: 0000000000000001 R11: ffff9436c83379b8 R12: ffff9436c8336cd0
R13: ffff9436c83367f8 R14: 0000000000000001 R15: ffff9436c8336d10
FS: 00007fb1af330840(0000) GS:ffff94371e800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555e5db20d68 CR3: 0000000817472000 CR4: 00000000003406f0
Call Trace:
dml20v2_ModeSupportAndSystemConfigurationFull+0x47ec/0x6c90 [amdgpu]
? get_page_from_freelist+0x370/0x680
dml_get_voltage_level+0x1ea/0x200 [amdgpu]
dcn20_fast_validate_bw+0x290/0x990 [amdgpu]
? __kmalloc+0x218/0x2b0
dcn20_validate_bandwidth_internal+0x8d/0x1f0 [amdgpu]
dcn20_validate_bandwidth+0x46/0xe0 [amdgpu]
dc_validate_global_state+0x354/0x380 [amdgpu]
amdgpu_dm_atomic_check+0xdc1/0xee0 [amdgpu]
? __alloc_pages_nodemask+0x16b/0x300
? idr_alloc+0x8e/0xc0
? __drm_mode_object_add+0x95/0xb0
drm_atomic_check_only+0x410/0x710
? drm_atomic_add_affected_connectors+0xe6/0x100
drm_atomic_commit+0x18/0x60
drm_client_modeset_commit_atomic+0x158/0x1c0
drm_client_modeset_commit_force+0x43/0x170
drm_fb_helper_set_par+0x79/0xb0
fbcon_init+0x483/0x620
visual_init+0xd1/0x130
do_bind_con_driver+0x340/0x620
do_take_over_console+0x257/0x2a0
fbcon_fb_registered+0x128/0x220
register_framebuffer+0x27d/0x310
__drm_fb_helper_initial_config_and_unlock+0x365/0x480
amdgpu_fbdev_init+0xe3/0x100 [amdgpu]
amdgpu_device_init+0x1935/0x1ca0 [amdgpu]
? __alloc_pages_nodemask+0x16b/0x300
? kmalloc_order_trace+0x77/0x130
amdgpu_driver_load_kms+0x7b/0x230 [amdgpu]
drm_dev_register+0xdb/0x2a0
amdgpu_pci_probe+0xdd/0x130 [amdgpu]
pci_device_probe+0x11f/0x1b0
really_probe+0x283/0x560
driver_probe_device+0x59/0xf0
device_driver_attach+0x66/0xa0
__driver_attach+0xac/0x130
? driver_attach+0x20/0x20
bus_for_each_dev+0x83/0xb0
bus_add_driver+0x11b/0x200
? 0xffffffffc2013000
driver_register+0x86/0x120
do_one_initcall+0x106/0x290
? __free_one_page+0x67/0x380
? _raw_spin_unlock+0x16/0x30
? preempt_count_add+0x56/0xa0
? __vunmap+0x267/0x2a0
? kmem_cache_alloc_trace+0x1ea/0x260
? __vunmap+0x267/0x2a0
do_init_module+0x60/0x230
load_module+0x3996/0x3ce0
? copy_user_generic_string+0x2c/0x40
__se_sys_init_module+0xfe/0x130
do_syscall_64+0x8a/0x120
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb1b0b4ed2e
Code: 48 8b 0d 55 01 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 22 01 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd73618d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 0000555e5dabdf40 RCX: 00007fb1b0b4ed2e
RDX: 0000555e5daba460 RSI: 000000000073a0b8 RDI: 0000555e5e2e29f0
RBP: 0000555e5daba460 R08: 0000555e5d97301a R09: 0000000000000004
R10: 0000555e5d973010 R11: 0000000000000246 R12: 0000555e5e2e29f0
R13: 0000555e5dac03b0 R14: 0000000000020000 R15: 0000555e5dabdf40
Modules linked in: vfat fat amdgpu(+) btusb btrtl btbcm btintel bluetooth mxm_wmi ecdh_generic ecc crc16 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel gpu_sched k10temp snd_intel_nhlt snd_hda_codec snd_usb_audio ttm igb snd_hda_core snd_usbmidi_lib snd_hwdep snd_pcm snd_rawmidi snd_seq_device snd_timer snd soundcore usbkbd usbmouse wmi nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) sg efivarfs ip_tables hid_generic usbhid
---[ end trace bb36ee02fd414cc1 ]---
------------[ cut here ]------------
000000000022c760 <CalculatePrefetchSourceLines>:
22c760: e8 00 00 00 00 call 22c765 <CalculatePrefetchSourceLines+0x5>
22c765: 55 push rbp
22c766: 48 89 e5 mov rbp,rsp
22c769: 41 57 push r15
22c76b: 41 56 push r14
22c76d: 41 54 push r12
22c76f: 53 push rbx
22c770: 48 83 ec 10 sub rsp,0x10
22c774: 4d 89 cf mov r15,r9
22c777: 41 89 ce mov r14d,ecx
22c77a: 48 89 fb mov rbx,rdi
22c77d: f2 0f 58 c8 addsd xmm1,xmm0
22c781: f2 0f 58 0d 00 00 00 addsd xmm1,QWORD PTR [rip+0x0] # 22c789 <CalculatePrefetchSourceLines+0x29>
22c788: 00
22c789: 85 d2 test edx,edx
22c78b: 75 18 jne 22c7a5 <CalculatePrefetchSourceLines+0x45>
22c78d: 40 0f b6 c6 movzx eax,sil
22c791: f2 0f 2a d0 cvtsi2sd xmm2,eax
22c795: f2 0f 59 15 00 00 00 mulsd xmm2,QWORD PTR [rip+0x0] # 22c79d <CalculatePrefetchSourceLines+0x3d>
22c79c: 00
22c79d: f2 0f 59 d0 mulsd xmm2,xmm0
22c7a1: f2 0f 58 ca addsd xmm1,xmm2
22c7a5: 4c 8b 65 10 mov r12,QWORD PTR [rbp+0x10]
22c7a9: f2 0f 59 0d 00 00 00 mulsd xmm1,QWORD PTR [rip+0x0] # 22c7b1 <CalculatePrefetchSourceLines+0x51>
22c7b0: 00
22c7b1: 0f 57 c0 xorps xmm0,xmm0
22c7b4: f2 0f 5a c1 cvtsd2ss xmm0,xmm1
22c7b8: f3 0f 10 0d 00 00 00 movss xmm1,DWORD PTR [rip+0x0] # 22c7c0 <CalculatePrefetchSourceLines+0x60>
22c7bf: 00
22c7c0: e8 00 00 00 00 call 22c7c5 <CalculatePrefetchSourceLines+0x65>
22c7c5: f3 0f 5a c0 cvtss2sd xmm0,xmm0
22c7c9: f2 41 0f 11 07 movsd QWORD PTR [r15],xmm0
22c7ce: 80 bb e0 17 00 00 00 cmp BYTE PTR [rbx+0x17e0],0x0
22c7d5: 74 59 je 22c830 <CalculatePrefetchSourceLines+0xd0>
22c7d7: 44 89 f0 mov eax,r14d
22c7da: 0f 57 c9 xorps xmm1,xmm1
22c7dd: f2 48 0f 2a c8 cvtsi2sd xmm1,rax
22c7e2: f2 0f 11 4d d0 movsd QWORD PTR [rbp-0x30],xmm1
22c7e7: f2 0f 5e c1 divsd xmm0,xmm1
22c7eb: f2 0f 5a c0 cvtsd2ss xmm0,xmm0
22c7ef: f3 0f 10 0d 00 00 00 movss xmm1,DWORD PTR [rip+0x0] # 22c7f7 <CalculatePrefetchSourceLines+0x97>
22c7f6: 00
22c7f7: e8 00 00 00 00 call 22c7fc <CalculatePrefetchSourceLines+0x9c>
22c7fc: f3 48 0f 2c c8 cvttss2si rcx,xmm0
22c801: 41 89 0c 24 mov DWORD PTR [r12],ecx
22c805: f2 41 0f 10 07 movsd xmm0,QWORD PTR [r15]
22c80a: 66 0f 2e 05 00 00 00 ucomisd xmm0,QWORD PTR [rip+0x0] # 22c812 <CalculatePrefetchSourceLines+0xb2>
22c811: 00
22c812: 77 05 ja 22c819 <CalculatePrefetchSourceLines+0xb9>
22c814: f2 0f 58 45 d0 addsd xmm0,QWORD PTR [rbp-0x30]
22c819: f2 0f 58 05 00 00 00 addsd xmm0,QWORD PTR [rip+0x0] # 22c821 <CalculatePrefetchSourceLines+0xc1>
22c820: 00
22c821: f2 48 0f 2c c0 cvttsd2si rax,xmm0
22c826: 31 d2 xor edx,edx
22c828: 41 f7 f6 div r14d
22c82b: e9 a0 00 00 00 jmp 22c8d0 <CalculatePrefetchSourceLines+0x170>
22c830: f2 0f 58 05 00 00 00 addsd xmm0,QWORD PTR [rip+0x0] # 22c838 <CalculatePrefetchSourceLines+0xd8>
22c837: 00
22c838: 44 89 f0 mov eax,r14d
22c83b: 0f 57 c9 xorps xmm1,xmm1
22c83e: f2 48 0f 2a c8 cvtsi2sd xmm1,rax
22c843: 66 0f 29 4d d0 movapd XMMWORD PTR [rbp-0x30],xmm1 <============= here
22c848: f2 0f 5e c1 divsd xmm0,xmm1
22c84c: f2 0f 5a c0 cvtsd2ss xmm0,xmm0
22c850: f3 0f 10 0d 00 00 00 movss xmm1,DWORD PTR [rip+0x0] # 22c858 <CalculatePrefetchSourceLines+0xf8>
22c857: 00
22c858: e8 00 00 00 00 call 22c85d <CalculatePrefetchSourceLines+0xfd>
22c85d: f3 0f 5a c0 cvtss2sd xmm0,xmm0
22c861: f2 0f 10 0d 00 00 00 movsd xmm1,QWORD PTR [rip+0x0] # 22c869 <CalculatePrefetchSourceLines+0x109>
22c868: 00
22c869: f2 0f 58 c1 addsd xmm0,xmm1
22c86d: f2 48 0f 2c c0 cvttsd2si rax,xmm0
22c872: 41 89 04 24 mov DWORD PTR [r12],eax
22c876: f2 41 0f 10 07 movsd xmm0,QWORD PTR [r15]
22c87b: 66 0f 28 5d d0 movapd xmm3,XMMWORD PTR [rbp-0x30]
22c880: f2 0f 58 d8 addsd xmm3,xmm0
22c884: f2 0f c2 c8 01 cmpltsd xmm1,xmm0
22c889: 66 0f 28 d1 movapd xmm2,xmm1
22c88d: 66 0f 55 d3 andnpd xmm2,xmm3
22c891: 66 0f 54 c8 andpd xmm1,xmm0
22c895: 66 0f 56 ca orpd xmm1,xmm2
22c899: f2 0f 58 0d 00 00 00 addsd xmm1,QWORD PTR [rip+0x0] # 22c8a1 <CalculatePrefetchSourceLines+0x141>
22c8a0: 00
22c8a1: f2 48 0f 2c c1 cvttsd2si rax,xmm1
22c8a6: 31 d2 xor edx,edx
22c8a8: 41 f7 f6 div r14d
22c8ab: 0f 57 c0 xorps xmm0,xmm0
22c8ae: f2 48 0f 2a c2 cvtsi2sd xmm0,rdx
22c8b3: 0f 57 c9 xorps xmm1,xmm1
22c8b6: f2 0f 5a c8 cvtsd2ss xmm1,xmm0
22c8ba: f3 0f 10 05 00 00 00 movss xmm0,DWORD PTR [rip+0x0] # 22c8c2 <CalculatePrefetchSourceLines+0x162>
22c8c1: 00
22c8c2: e8 00 00 00 00 call 22c8c7 <CalculatePrefetchSourceLines+0x167>
22c8c7: f3 48 0f 2c d0 cvttss2si rdx,xmm0
22c8cc: 41 8b 0c 24 mov ecx,DWORD PTR [r12]
22c8d0: 41 0f af ce imul ecx,r14d
22c8d4: 01 d1 add ecx,edx
22c8d6: 0f 57 c0 xorps xmm0,xmm0
22c8d9: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
22c8de: 48 83 c4 10 add rsp,0x10
22c8e2: 5b pop rbx
22c8e3: 41 5c pop r12
22c8e5: 41 5e pop r14
22c8e7: 41 5f pop r15
22c8e9: 5d pop rbp
22c8ea: c3 ret
22c8eb: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
rbp-30 is not aligned to 16 bytes boundary. Hopefully I didn't misunderstand anything?
File is drivers/gpu/drm/amd/display/dc/dml/dcn20/display_mode_vba_20v2.c
I believe this is the runtime issue that @nickdesaulniers mentioned in IRC?
Yes, @yshui great job debugging. You've come to the same conclusions I did. While it seems that their driver is compiled w/ 16B stack alignment, the rest of the x86_64 kernel is compiled with 8B stack alignment (as a micro optimization). Thus, there are no guarantees that the stack is actually 16B aligned. I added the -msse2 flag to allow the AMDGPU driver to link (otherwise, clang would generate soft-fp calls to functions that would exist in userspace libgcc_s/compiler_rt, but aren't implemented in the kernel). -msse2 makes clang select instructions that have 16B stack alignment requirements (which now link, but result in GPF's at runtime; I didn't/don't have hardware to test on when I wrote the -msse2 patch). GCC at -msse -mno-sse2 <no further SIMD extensions> doesn't error for the use of float or double, but does make use of xmm registers (I guess without alignment requirements, though I'm unfamiliar with the differences between sse and sse2). Clang frequently crashes during instruction selection in the x86_64 backend when mixing float, double, and -mno-sse2. (Honestly, sse but no sse2 is kind of weird).
I don't know why their kernel uses a 16B stack alignment, since it's not possible to guarantee your stack is actually 16B aligned if called from 8B stack aligned translation units (without additional inline assembly to detect and realign such cases, which I would consider a hack upon a hack).
Even if the 16B alignment was removed from their driver, it would still be incorrect to use -msse2 w/ Clang. (because the sse2 instructions require 16B stack alignment). So that should be reverted at some point, though there's value in being able to at least build it and stop the warnings that are coming into the actively developed and large driver.
If we implemented the soft-fp routines in the kernel (there is precedent for lifting them from libgcc's source), we'd also need to remove the calls to kernel_fpu_begin()/kernel_fpu_end() because then we wouldn't be using the xmm registers in the kernel. We'd need multiple implementations since currently CONFIG_AMDGPU is available from non-x86_64 architectures, unless we'd want to disable it for those, though I suspect someone might be using it for arm64.
The soft-fp routines only make sense if the code is not performance critical. @ajs1984 suspects it has something to do with setting mode lines; I'm not sure if that would be harmed by soft-fp, but it's worth a try. Once soft-fp routines were implemented, we could then remove the -mhard-float -msse that the driver is using for both compilers.
Finally, we could spend more time on x86_64 instruction selection w/ sse but no sse2, but it's a very limited use case that even LLVM's x86_64 backend maintainer had doubts about that approach.
According to the Makefile, clang should get the -mstack-alignment=16 flag. I think the stack should have been aligned to 16 bytes.
Hmm, the amdgpu module (specifically, the dc) is the only place the stack is 16 bytes aligned. All other parts of the kernel have 8 bytes alignment. How does this work with gcc? Is the stack pointer somehow aligned before calling the amdgpu functions?
Normally you would use -mincoming-stack-boundary, but I don't see that being used anywhere.
Probably need input from AMD people to understand.
How does this work with gcc?
I assume that either:
-msse -mno-sse2 does not select such instructionsIs the stack pointer somehow aligned before calling the amdgpu functions?
It is not, hence this GPF w/ Clang. You'd have to audit the translation unit boundaries, and annotate them. Without an automated way of doing so or validating them, it would very easy to break and would be a mental burden for AMDGPU driver developers to remember to do and check. Thus, I consider that brittle.
Normally you would use -mincoming-stack-boundary
Oh, interesting, I did not know about that flag. If clang implements it, I wonder if we can use it for just those translation units? (Normally, I wouldn't adjust the stack alignment, but such is life).
gcc at -msse -mno-sse2 does not select such instructions
If that is true, then passing -mpreferred-stack-boundary=4 to gcc would have been kind of pointless. (I raised question on #radeon@freenode, but I didn't get a reply).
If that is true, then passing -mpreferred-stack-boundary=4 to gcc would have been kind of pointless.
I agree, but maybe there's another reason it's used that I can't think of. I did not go read the commits that added it to see if maybe a good reason was provided, but that's a good place to start.
(I raised question on #radeon@freenode, but I didn't get a reply).
I've asked the AMDGPU maintainer to ask around and set up a call, and I've asked their partner engineer on an internal thread related to CrOS. I've also spoken in person with an AMDGPU LLVM developer about this FWIW.
I haven't heard back from them yet. I give them the benefit of the doubt, but should chase if I don't hear back soon.
I'm unfamiliar with the differences between sse and sse2
To a very rough approximation, SSE is support for float[4], and SSE2 added support for double[2], as well as a replacement for MMX.
because the sse2 instructions require 16B stack alignment
Not quite. The aligned movs (MOVAPD in this case) require a 16b aligned memory operand, which in this case it doesn't get because it references a 16b object misaligned on an 8b stack.
An 8 byte alignment generally, with 16 in a callee is an ABI breakage, unless the function knows it need to cope with misaligned callers.
Either
1) The global switch from 16=>8 alignment caused an ABI breakage by not fixing this driver or declaring the incoming stack alignment, or
2) The incoming stack alignment is being specified, and the function prologue is not correct.
Code generation to use MOVAPD is correct in this case. It is on a temporary variable spilled to the stack, and because the stack is 16 bytes aligned, this is a safe instruction to schedule.
The aligned movs (MOVAPD in this case) require a 16b aligned memory operand, which in this case it doesn't get because it references a 16b object misaligned on an 8b stack.
Well put (more precise).
An 8 byte alignment generally, with 16 in a callee is an ABI breakage
Agreed.
The global switch from 16=>8 alignment caused an ABI breakage by not fixing this driver or declaring the incoming stack alignment, or
I believe the global switch occurred before the driver was upstreamed? Maybe it existed out of tree prior to the switch, and only for older kernels?
The incoming stack alignment is being specified, and the function prologue is not correct.
@yshui pointed out -mincoming-stack-boundary. Either:
cc-option).$ cd linux-next
$ grep -r incoming-stack-boundary
$ echo $?
1
so 1, but also 2: https://godbolt.org/z/N9b67Q
What is the best solution here? It seems quite clear there is a bug in amdgpu at least, since it should have used -mincoming-stack-boundary.
So we need to:
-mincoming-stack-boundary to amdgpu's Makefile?
I don't immediately see anything in the driver which would necessitate the use of a 16 byte stack. I'd start by dropping that and seeing what explodes.
Once Clang sees that the stack is only 8b aligned, its no longer correct to schedule MOVAPD, and it will probably use MOVUPD instead which is the unaligned variant.
so 1, but also 2: https://godbolt.org/z/N9b67Q
I was thinking more of https://godbolt.org/z/TeiLEr but yes. I don't see suitable prologue code being generated even in GCC.
@andyhhp @nickdesaulniers ~https://godbolt.org/z/i-HQwU~ https://godbolt.org/z/XbdshX
-mincoming-stack-boundary= takes the same kind of arguments as -mpreferred-stack-boundary=, so for 8 bytes aligned you need to pass 3
What is the best solution here?
Add -mincoming-stack-boundary to amdgpu's Makefile
Implement equivalent in clang
Add clang equivalent flag to Makefile
This feels like papering over an ABI mismatch which is bad behavior we shouldn't bend over backwards to support IMO. While it is a possible route to a solution, I consider it suboptimal.
I'd start by dropping that and seeing what explodes.
Agreed.
its no longer correct to schedule MOVAPD, and it will probably use MOVUPD instead which is the unaligned variant.
Oh, good point, and if it doesn't, that's a bug in Clang.
I was thinking more of
Oh, I was just showing the flag being unrecognized. The code in the link was irrelevant.
I don't see suitable prologue code being generated even in GCC.
Should the flag be -mincoming-stack-boundary=8 (not -mincoming-stack-boundary=3)? I know GCC likes you to specify 2^X for the stack alignment flag. Not sure if this one is similar.
You both are making me so happy with your use of godbolt! :heart_decoration:
diff --git a/drivers/gpu/drm/amd/display/dc/calcs/Makefile b/drivers/gpu/drm/amd/display/dc/calcs/Makefile
index 985633c08a26..3c3778de33b4 100644
--- a/drivers/gpu/drm/amd/display/dc/calcs/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/calcs/Makefile
@@ -24,13 +24,7 @@
# It calculates Bandwidth and Watermarks values for HW programming
#
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-calcs_ccflags := -mhard-float -msse $(cc_stack_align)
+calcs_ccflags := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
calcs_ccflags += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/Makefile b/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
index ddb8d5649e79..13d1fd2bf8f8 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
@@ -10,13 +10,7 @@ ifdef CONFIG_DRM_AMD_DC_DSC_SUPPORT
DCN20 += dcn20_dsc.o
endif
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o := -mhard-float -msse $(cc_stack_align)
+CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/Makefile b/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
index ef673bffc241..c4042732c8a0 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
@@ -3,13 +3,7 @@
DCN21 = dcn21_hubp.o dcn21_hubbub.o dcn21_resource.o
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o := -mhard-float -msse $(cc_stack_align)
+CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile b/drivers/gpu/drm/amd/display/dc/dml/Makefile
index 5b2a65b42403..a604cdbaf317 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile
@@ -24,13 +24,7 @@
# It provides the general basic services required by other DAL
# subcomponents.
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-dml_ccflags := -mhard-float -msse $(cc_stack_align)
+dml_ccflags := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
dml_ccflags += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dsc/Makefile b/drivers/gpu/drm/amd/display/dc/dsc/Makefile
index b456cd23c6fa..8954ba2c284b 100644
--- a/drivers/gpu/drm/amd/display/dc/dsc/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dsc/Makefile
@@ -1,13 +1,7 @@
#
# Makefile for the 'dsc' sub-component of DAL.
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-dsc_ccflags := -mhard-float -msse $(cc_stack_align)
+dsc_ccflags := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
dsc_ccflags += -msse2
but I still see movaps :sob: :
$ llvm-objdump -d drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.o | grep movaps
9b3: 0f 28 05 00 00 00 00 movaps (%rip), %xmm0
fcc: 0f 28 05 00 00 00 00 movaps (%rip), %xmm0
$ llvm-objdump -d drivers/gpu/drm/amd/display/dc/calcs/dcn_calcs.o | grep movaps
d4c: 0f 28 c4 movaps %xmm4, %xmm0
1160: 0f 28 d0 movaps %xmm0, %xmm2
...
@andyhhp @nickdesaulniers ~https://godbolt.org/z/i-HQwU~ https://godbolt.org/z/XbdshX
Ah - much better.
You both are making me so happy with your use of godbolt! heart_decoration
Godbolt is awesome.
but I still see
movapssob :
Those 4 are all fine. The first two will be to objects store in .data etc, whose alignment doesn't depend on the stack at all. They are instruction-pointer relative accesses without outstanding relocations (hence the 0's) because the object files haven't been linked yet.
The second two are register to register moves, so have no alignment restrictions.
You'll want to grep for 'mova' because for the float variants alone, you've got mov{p,s}{s,d} which are {packed,scalar}{single,double}.
I have a kernel built without the stack alignment option, with clang. Seems to work fine so far.
Those 4 are all fine.
Ah, sorry, I should have included the full output, and used -r to show relocations.
They are instruction-pointer relative accesses without outstanding relocations (hence the 0's) because the object files haven't been linked yet.
You mean with outstanding relocations? If so, that makes sense, and I should have recognized the missing relocation info from the 4B's of zeros.
The second two are register to register moves, so have no alignment restrictions.
Right, but I think there were then stack references later on in the full output, IIRC. (Shall post full output). Maybe those had relocations as well though (I always forget -r to llvm-objdump).
I have a kernel built without the stack alignment option, with clang. Seems to work fine so far.
That's great news @yshui ! Can you test running it with Clang and GCC, and if it seems ok, I'm happy to send the above diff with your tested-by+debugged-by && @andyhhp 's suggested by tags (or whatever you all would like) to LKML.
(And if that works for both, I wonder if we can then further [re-]add -msse2 unconditionally for both compilers as we had originally. That's what I had submitted, but it was reverted when users w/ GCC built kernels reported "instability," which I'm guessing is the same GPF's as were seeing here due to ABI mismatch of stack alignment).
You mean _with_ outstanding relocations?
Yes I did. (Sorry - it was late here.)
If you want a S-by tag, I'm Andrew Cooper <[email protected]>
@nickdesaulniers The gcc side seems to work fine as well, I didn't test adding -msse2.
On the clang side, I hit a GPU timeout, which I don't know if is related to this or not. I still haven't seen a GPF, which is good.
Great, thanks for the report. I've asked the AMD folks on our internal CrOS thread to test the diff as well. I'll send the patch by EOD, hopefully with 1 more tested by tag on it. :checkered_flag:
Upstream inquiry: https://lkml.org/lkml/2019/10/15/871
@nickdesaulniers Isn't enabling sse2 for gcc<7.1 also an option?
Edit: just tried, nope. https://godbolt.org/z/xXbjMr
Another question would be: is -msse absolutely required? Can we just remove it for gcc<7.1?
Another question would be: is -msse absolutely required?
Probably not. They use -mhard-float in the AMDGPU Makefile to seemingly prevent -mno-80387 from making GCC emit soft-fp calls (the way Clang did until I added -msse2).
The patches have been accepted and presumably passed AMD's internal testing because they are on their way for 5.4: https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-5.4-2019-10-30
Nick's patches have been merged and released in 5.4-rc6.
Most helpful comment
The patches have been accepted and presumably passed AMD's internal testing because they are on their way for 5.4: https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-5.4-2019-10-30