I have been chasing down an issue that occasionally crashes our system on 32-bit ARM machines.
The error is a SIGSEGV or SIGABRT on memory that we are certain we own.
We can reproduce this fairly consistently, but only by throwing a lot of work at the machine; I don't have a simple reproduction.
The error always occurs on this line of code: Unsafe.CopyBlockUnaligned()
We have been able to capture this in lldb and have the following information:
(lldb) Process 18813 stopped
* thread dotnet/runtime#3861: tid = 0x49a2, 0x7664d55e, name = 'Raven.Server', stop reason = signal SIGSEGV: address access protected (fault address: 0x520d5000)
frame #0: 0x7664d55e
-> 0x7664d55e: stmdavs r11, {r0, r1, r11, sp, lr}
0x7664d562: stcllt p6, c15, [sp, #-772]!
0x7664d566: .long 0xe92d0000 ; unknown opcode
0x7664d56a: svcge #0x34ff0
The fault address is: 0x520d5000
Looking at smaps, we can confirm that this is indeed an address that we shouldn't access:
520c5000-520d5000 rw-s 05b90000 08:01 1310760 /mnt/external/TmpDataDir/Databases/zz/Temp/scratch.0000000002.buffers
Size: 64 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 64 kB
Pss: 64 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 64 kB
Private_Dirty: 0 kB
Referenced: 64 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
VmFlags: rd wr sh mr mw me ms
520d5000-520d6000 ---p 00000000 00:00 0
Size: 4 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
However, note that we own the memory just before this bit.
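The smaps lookup can be repeated mechanically. The following is only a sketch: the helper names `parse_mappings` and `find_mapping` are made up for illustration, and the two header lines are copied from the excerpt above rather than read from a live /proc/<pid>/smaps.

```python
import re

# Mapping header lines copied from the smaps excerpt in this report.
SMAPS_HEADERS = """\
520c5000-520d5000 rw-s 05b90000 08:01 1310760 /mnt/external/TmpDataDir/Databases/zz/Temp/scratch.0000000002.buffers
520d5000-520d6000 ---p 00000000 00:00 0
"""

# start-end, then the four-character permission field.
HEADER_RE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s")


def parse_mappings(text):
    """Return (start, end, perms) per mapping header; end is exclusive."""
    mappings = []
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            mappings.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return mappings


def find_mapping(mappings, addr):
    """Return the mapping that contains addr, or None if there is none."""
    for start, end, perms in mappings:
        if start <= addr < end:
            return start, end, perms
    return None


maps = parse_mappings(SMAPS_HEADERS)
for addr in (0x520D4FFD, 0x520D5000):
    print(f"{addr:#x} -> {find_mapping(maps, addr)}")
```

The fault address lands in the `---p` guard mapping, while the destination of the failing call is still inside the `rw-s` scratch file mapping.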
We have added additional tracing to the code and we believe that the actual failure happened when we call:
Unsafe.CopyBlockUnaligned(0x520D4FFD,0x4DE6F1D4,2);
The 0x4DE6F1D4 source address is allocated on the stack and is used just before the failure with:
Unsafe.CopyBlockUnaligned(0x520D4FF1,0x4DE6F1D4,8);
Unsafe.CopyBlockUnaligned(0x520D4FF9,0x4DE6F1D4,4);
// dies here
Unsafe.CopyBlockUnaligned(0x520D4FFD,0x4DE6F1D4,2);
We have a stackalloc ulong[1] variable that is used as a buffer to copy to the destination.
We are writing toward the end of the page that we own, but that is expected and should be fine because we aren't going beyond the boundary of the page.
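To back up the claim that none of the traced writes leaves the mapping, here is a small arithmetic sketch. The names `MAPPING_END`, `WRITES`, `write_end` and `fits` are illustrative only; the numbers are the ones from the trace and from smaps.

```python
# Exclusive end of the rw-s mapping, taken from the smaps excerpt.
MAPPING_END = 0x520D5000

# (destination, size) of the three traced calls, in order.
WRITES = [(0x520D4FF1, 8), (0x520D4FF9, 4), (0x520D4FFD, 2)]


def write_end(dest, size):
    """Exclusive end address of a write of `size` bytes at `dest`."""
    return dest + size


def fits(dest, size, mapping_end=MAPPING_END):
    """True when every byte written lies below mapping_end."""
    return write_end(dest, size) <= mapping_end


for dest, size in WRITES:
    print(f"{dest:#x} +{size}: ends at {write_end(dest, size):#x}, fits={fits(dest, size)}")
```

The three writes are contiguous and the last one ends at 0x520d4fff, one byte short of the mapping end, so the requested copy itself never touches the guard page.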
Here is the disassembly at the time of the crash
(lldb) d
-> 0x7664d55e: stmdavs r11, {r0, r1, r11, sp, lr}
0x7664d562: stcllt p6, c15, [sp, #-772]!
0x7664d566: .long 0xe92d0000 ; unknown opcode
0x7664d56a: svcge #0x34ff0
0x7664d56e: .long 0xf8c3b081 ; unknown opcode
0x7664d572: bgt 0x7828157a
0x7664d576: andeq pc, r4, #-2147483648
0x7664d57a: svceq #0xe8b2
And here are the registers at the crash
(lldb) register read
General Purpose Registers:
r0 = 0x520d4ffd
r1 = 0x4de6f1d4
r2 = 0x00000002
r3 = 0x00000000
r4 = 0x520d4ffd
r5 = 0x4de6f1d4
r6 = 0x00000002
r7 = 0x00000000
r8 = 0x5c48b99c
r9 = 0x5c48b9ac
r10 = 0x4de6fba4
r11 = 0x4de6f1a0
r12 = 0x7664d559
sp = 0x4de6f170
lr = 0x50ea66c5
pc = 0x7664d55e
cpsr = 0x20000030
I'm not an expert on ARM assembly, but it looks like the STM instruction is writing to r11, and while r6 looks like it contains the size, I don't see it actually being used here.
Here is the full disassembly from around the location of the crash:
(lldb) di -s 0x7664d500 -e 0x7664d600
0x7664d500: .long 0xf04f462b ; unknown opcode
0x7664d504: .long 0x94000403 ; unknown opcode
0x7664d508: blx 0x7560968a
0x7664d50c: stmdals r10, {r3, r5, r8, r11, r12, sp, pc}
0x7664d510: .long 0xe8bdb001 ; unknown opcode
0x7664d514: .long 0xb0044ff0 ; unknown opcode
0x7664d518: .long 0x46844770 ; unknown opcode
0x7664d51c: .long 0xe8bdb001 ; unknown opcode
0x7664d520: .long 0xbc0f4ff0 ; unknown opcode
0x7664d524: push {r5, r6, r8, r9, r10, lr}
0x7664d528: stc p15, c4, [sp, #-964]!
0x7664d52c: strlt r0, [r2], #-2824
0x7664d530: stmdage r10, {r0, r7, r12, sp, pc}
0x7664d534: mrc2 p7, #0x5, apsr_nzcv, c10, c11, #0x6
0x7664d538: .long 0xbc02b001 ; unknown opcode
0x7664d53c: bleq 0x76888838
0x7664d540: svchi #0xf1e8bd
0x7664d544: strtvc sp, [r4], r0, asr #20
0x7664d548: strtvc sp, [r4], r4, lsl #22
0x7664d54c: svclt #0x82a00
0x7664d550: stmdavs r3, {r4, r5, r6, r8, r9, r10, lr}
0x7664d554: blt 0x7758b060
0x7664d558: svclt #0x82a00
0x7664d55c: stmdavs r3, {r4, r5, r6, r8, r9, r10, lr}
0x7664d560: .long 0xf6c1680b ; unknown opcode
0x7664d564: .long 0x0000bd6d ; unknown opcode
0x7664d568: svcmi #0xf0e92d
0x7664d56c: addlt r10, r1, r3, lsl #30
0x7664d570: andle pc, r0, r3, asr #17
0x7664d574: .long 0xf102ca70 ; unknown opcode
0x7664d578: .long 0xe8b20204 ; unknown opcode
0x7664d57c: strmi r0, [r8, r0, lsl #30]
0x7664d580: .long 0xe8bdb001 ; unknown opcode
0x7664d584: strlt r8, [r0, #0xff0]
0x7664d588: .long 0xf8c3466f ; unknown opcode
0x7664d58c: ldrmi sp, [r0, r0]
0x7664d590: andeq r11, r0, r0, lsl #27
0x7664d594: andle r2, r6, r0, lsl #20
0x7664d598: .long 0x466fb580 ; unknown opcode
0x7664d59c: stmdavc r11, {r0, r1, r11, r12, sp, lr}
0x7664d5a0: ldcl p6, c15, [r0, #-772]
0x7664d5a4: stmdami r3, {r7, r8, r10, r11, r12, sp, pc}
0x7664d5a8: stmdahs r0, {r11, sp, lr}
0x7664d5ac: .long 0xf7a4bf18 ; unknown opcode
0x7664d5b0: .long 0x4770bebf ; unknown opcode
0x7664d5b4: strtvc sp, [r4], r0, asr #20
0x7664d5b8: andeq r0, r0, r0
0x7664d5bc: andeq r0, r0, r0
0x7664d5c0: svclt #0x4770
0x7664d5c4: svclt #0xbf00
0x7664d5c8: svclt #0xbf00
0x7664d5cc: svclt #0xbf00
0x7664d5d0: svchi #0x5ff3bf
0x7664d5d4: .long 0xf2406001 ; unknown opcode
0x7664d5d8: .long 0xf2c00301 ; unknown opcode
0x7664d5dc: addsmi r0, r9, #0, #6
0x7664d5e0: .long 0xf641d30a ; unknown opcode
0x7664d5e4: .long 0xf2c74324 ; unknown opcode
0x7664d5e8: bl 0x7671a324
0x7664d5ec: ldmdavc r8, {r4, r7, r8, r9, sp}
0x7664d5f0: svclt #0x1c28ff
0x7664d5f4: .long 0x701820ff ; unknown opcode
0x7664d5f8: andeq r4, r0, r0, ror r7
0x7664d5fc: andeq r0, r0, r0
@aviviadi unfortunately, the disassembly is garbage. Either it is from some random piece of memory, or lldb thinks it is ARM code while it is in fact THUMB2, or the processor erroneously jumped to an even address.
I actually wonder how you made lldb work on arm32 at all, since I've tried many versions in the past (on different Linux distros) and none of them worked. They either weren't able to start a process at all or they could start it, but they could not hit any breakpoints. What is the distro and lldb version that you are using? And what version of dotnet are you using?
You can try to disassemble from an address that is higher by one (ARM processors use the lowest address bit to distinguish between ARM and THUMB2 modes). That may make lldb produce the right disassembly.
Could you also try to get stack trace at the time of failure using "bt" command?
And finally, the LR register contains the return address. Can you please try to disassemble the function at that address? Use disass -a 0x50ea66c5 or, if the code is managed code, disassemble a range of addresses instead. I would try something like disass -s 0x50ea66a5 -e 0x50ea66d5.
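As a side note on the "lowest address bit" remark, the register dump already carries two hints about the instruction set in use. The sketch below uses made-up helper names (`split_thumb_bit`, `cpsr_is_thumb`) and relies on two architectural facts: bit 0 of an interworking return address selects Thumb state, and bit 5 of the CPSR is the T flag. The pc value itself is reported with bit 0 clear even in Thumb state, so it does not help here.

```python
THUMB_BIT = 1        # bit 0 of an interworking branch target or return address
CPSR_T_BIT = 1 << 5  # T (Thumb state) flag in the CPSR


def split_thumb_bit(addr):
    """Return (instruction_address, is_thumb) for a return/branch address."""
    return addr & ~THUMB_BIT, bool(addr & THUMB_BIT)


def cpsr_is_thumb(cpsr):
    """True when the CPSR value says the core was executing Thumb code."""
    return bool(cpsr & CPSR_T_BIT)


# Values from the first register dump in this issue.
lr = 0x50EA66C5
cpsr = 0x20000030

code_addr, thumb = split_thumb_bit(lr)
print(f"lr={lr:#x} -> code at {code_addr:#x}, thumb={thumb}")
print(f"cpsr={cpsr:#x} -> thumb state={cpsr_is_thumb(cpsr)}")
```

Both hints point at Thumb code, which is consistent with the suspicion that lldb decoded the bytes in the wrong mode.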
"We have a stackalloc ulong[1] variable that is used as a buffer to copy to the destination."
What's the C# for this line? (e.g. are you using an array initializer for the values? If so, that was a C# issue that has since been fixed: https://github.com/dotnet/roslyn/issues/29092)
@benaadams We initially had a ulong value and took the address of that to use in Unsafe.CopyBlockUnaligned.
We changed that to stackalloc ulong[1] with no init to see if it would help.
I'm fairly certain that the actual problem is with the code for Unsafe, not with the stackalloc.
@janvorli - lldb-3.9, Raspbian Stretch (RPi3), runtime 2.1.6 (happens also on earlier runtime versions)
Using ssh debugging (Visual Studio) and lldb (plus in-memory logging of function entries and exits), we see the SIGSEGV while in Unsafe.CopyBlockUnaligned(), always with an address of the form 0x*FFD and while writing 2 bytes, when we are allowed to write up to 3 bytes ahead.
As for lldb-3.9, I am attaching it to the running published -r linux-arm process.
I reproduced it again with the disassembly you requested; it is below. I am going to leave this instance up, so if you wish to see or disassemble more, it can be done on this SIGSEGV reproduction.
(Do you think gdb is better for this investigation?)
Architecture set to: armv6-unknown-unknown.
(lldb) process handle -s false -n false -p false SIGTRAP SIGPIPE
NAME PASS STOP NOTIFY
=========== ===== ===== ======
SIGTRAP false false false
SIGPIPE false false false
(lldb) continue
Process 22200 resuming
Process 22200 stopped
* thread #36: tid = 0x56f1, 0x7664155e, name = 'Raven.Server', stop reason = signal SIGSEGV: address access protected (fault address: 0x4bba3000)
frame #0: 0x7664155e
-> 0x7664155e: stmdavs r11, {r0, r1, r11, sp, lr}
0x76641562: stcllt p6, c15, [sp, #-772]!
0x76641566: .long 0xe92d0000 ; unknown opcode
0x7664156a: svcge #0x34ff0
(lldb) bt
* thread #36: tid = 0x56f1, 0x7664155e, name = 'Raven.Server', stop reason = signal SIGSEGV: address access protected (fault address: 0x4bba3000)
* frame #0: 0x7664155e
frame #1: 0x50eecacf
frame #2: 0x4f53da7d
frame #3: 0x4b90e047
frame #4: 0x4b90bf89
frame #5: 0x4b90b327
frame #6: 0x4b90afbb
frame #7: 0x4b90a5e9
frame #8: 0x4b364a2b
frame #9: 0x4b3647d7
frame #10: 0x4faaef11
frame #11: 0x5751d7f7
frame #12: 0x59da8d2f
frame #13: 0x59d953cf
(lldb) register read
General Purpose Registers:
r0 = 0x4bba2ffe
r1 = 0x4f6c31d4
r2 = 0x00000002
r3 = 0x00000000
r4 = 0x4bba2ffe
r5 = 0x4f6c31d4
r6 = 0x00000002
r7 = 0x5b781dfc
r8 = 0x5b781e0c
r9 = 0x5b780c70
r10 = 0x4f6c3ba4
r11 = 0x4f6c31a0
r12 = 0x76641559
sp = 0x4f6c3170
lr = 0x50ebc14b
pc = 0x7664155e
cpsr = 0x20000030
(lldb) disass -a 0x50ebc14b
-> 0x7664155e: stmdavs r11, {r0, r1, r11, sp, lr}
0x76641562: stcllt p6, c15, [sp, #-772]!
0x76641566: .long 0xe92d0000 ; unknown opcode
0x7664156a: svcge #0x34ff0
0x7664156e: .long 0xf8c3b081 ; unknown opcode
0x76641572: bgt 0x7827557a
0x76641576: andeq pc, r4, #-2147483648
0x7664157a: svceq #0xe8b2
(lldb) disass -s 0x50ebc11b -e 0x50ebc15b
0x50ebc11b: .long 0x00e014f8 ; unknown opcode
0x50ebc11f: ldrbteq r12, [r8], #3472
0x50ebc123: .long 0x442000e0 ; unknown opcode
0x50ebc127: strbgt r4, [lr, #-0x5f6]!
0x50ebc12b: .long 0xf01ec5f6 ; unknown opcode
0x50ebc12f: .long 0x61f64847 ; unknown opcode
0x50ebc133: .long 0x13f2c543 ; unknown opcode
0x50ebc137: subhs r9, r7, r3, asr r8
0x50ebc13b: sublo r2, r6, #1146880
0x50ebc13f: ldmibpl r2!, {r1, r2, r6, r8, lr} ^
0x50ebc143: ldrbtvs r12, [r2], #1884
0x50ebc147: strbmi lr, [r7, #-0x6c]
0x50ebc14b: smmlsrgt r0, r2, r4, r2
0x50ebc14f: mrcmi p2, #0x2, r1, c0, c2, #0x7
0x50ebc153: .long 0xc76391f6 ; unknown opcode
0x50ebc157: stmdals r3!, {r1, r4, r5, r6, r7, r8, r10, r11, r12, lr} ^
(lldb)
And the /proc/pid/smaps around the relevant address:
4bb93000-4bba3000 rw-s 01dc0000 08:01 1310749 /mnt/external/TmpDataDir/Databases/db/Temp/scratch.0000000000.buffers
Size: 64 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 64 kB
Pss: 64 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 64 kB
Private_Dirty: 0 kB
Referenced: 64 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
VmFlags: rd wr sh mr mw me ms
4bba3000-4bba4000 ---p 00000000 00:00 0
Size: 4 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
VmFlags: mr mw me ac
P.S. Using p/invoke memcpy instead of Unsafe.CopyBlockUnaligned "solves" the issue (and slows our app on the little RPi dramatically, by 50-60%, as we have a lot of blocks to copy).
Here is the disassembly with the raw bytes:
Note that 0x7664155e is the location that d marks as the faulting instruction.
(lldb) d -s 0x76641500 -e 0x76641600 -b
0x76641500: 0xf04f462b .long 0xf04f462b ; unknown opcode
0x76641504: 0x94000403 .long 0x94000403 ; unknown opcode
0x76641508: 0xfbbef05e blx 0x755fd68a
0x7664150c: 0x980ab928 stmdals r10, {r3, r5, r8, r11, r12, sp, pc}
0x76641510: 0xe8bdb001 .long 0xe8bdb001 ; unknown opcode
0x76641514: 0xb0044ff0 .long 0xb0044ff0 ; unknown opcode
0x76641518: 0x46844770 .long 0x46844770 ; unknown opcode
0x7664151c: 0xe8bdb001 .long 0xe8bdb001 ; unknown opcode
0x76641520: 0xbc0f4ff0 .long 0xbc0f4ff0 ; unknown opcode
0x76641524: 0xe92d4760 push {r5, r6, r8, r9, r10, lr}
0x76641528: 0xed2d4ff1 stc p15, c4, [sp, #-964]!
0x7664152c: 0xb4020b08 strlt r0, [r2], #-2824
0x76641530: 0xa80ab081 stmdage r10, {r0, r7, r12, sp, pc}
0x76641534: 0xfebaf7db mrc2 p7, #0x5, apsr_nzcv, c10, c11, #0x6
0x76641538: 0xbc02b001 .long 0xbc02b001 ; unknown opcode
0x7664153c: 0x0b08ecbd bleq 0x7687c838
0x76641540: 0x8ff1e8bd svchi #0xf1e8bd
0x76641544: 0x76a41a40 strtvc r1, [r4], r0, asr #20
0x76641548: 0x76a41b04 strtvc r1, [r4], r4, lsl #22
0x7664154c: 0xbf082a00 svclt #0x82a00
0x76641550: 0x68034770 stmdavs r3, {r4, r5, r6, r8, r9, r10, lr}
0x76641554: 0xba3cf6c1 blt 0x7757f060
0x76641558: 0xbf082a00 svclt #0x82a00
0x7664155c: 0x68034770 stmdavs r3, {r4, r5, r6, r8, r9, r10, lr}
0x76641560: 0xf6c1680b .long 0xf6c1680b ; unknown opcode
0x76641564: 0x0000bd6d .long 0x0000bd6d ; unknown opcode
0x76641568: 0x4ff0e92d svcmi #0xf0e92d
0x7664156c: 0xb081af03 addlt r10, r1, r3, lsl #30
0x76641570: 0xd000f8c3 andle pc, r0, r3, asr #17
0x76641574: 0xf102ca70 .long 0xf102ca70 ; unknown opcode
0x76641578: 0xe8b20204 .long 0xe8b20204 ; unknown opcode
0x7664157c: 0x47880f00 strmi r0, [r8, r0, lsl #30]
0x76641580: 0xe8bdb001 .long 0xe8bdb001 ; unknown opcode
0x76641584: 0xb5808ff0 strlt r8, [r0, #0xff0]
0x76641588: 0xf8c3466f .long 0xf8c3466f ; unknown opcode
0x7664158c: 0x4790d000 ldrmi sp, [r0, r0]
0x76641590: 0x0000bd80 andeq r11, r0, r0, lsl #27
0x76641594: 0xd0062a00 andle r2, r6, r0, lsl #20
0x76641598: 0x466fb580 .long 0x466fb580 ; unknown opcode
0x7664159c: 0x780b7803 stmdavc r11, {r0, r1, r11, r12, sp, lr}
0x766415a0: 0xed50f6c1 ldcl p6, c15, [r0, #-772]
0x766415a4: 0x4803bd80 stmdami r3, {r7, r8, r10, r11, r12, sp, pc}
0x766415a8: 0x28006800 stmdahs r0, {r11, sp, lr}
0x766415ac: 0xf7a4bf18 .long 0xf7a4bf18 ; unknown opcode
0x766415b0: 0x4770bebf .long 0x4770bebf ; unknown opcode
0x766415b4: 0x76a41a40 strtvc r1, [r4], r0, asr #20
0x766415b8: 0x00000000 andeq r0, r0, r0
0x766415bc: 0x00000000 andeq r0, r0, r0
0x766415c0: 0xbf004770 svclt #0x4770
0x766415c4: 0xbf00bf00 svclt #0xbf00
0x766415c8: 0xbf00bf00 svclt #0xbf00
0x766415cc: 0xbf00bf00 svclt #0xbf00
0x766415d0: 0x8f5ff3bf svchi #0x5ff3bf
0x766415d4: 0xf2406001 .long 0xf2406001 ; unknown opcode
0x766415d8: 0xf2c00301 .long 0xf2c00301 ; unknown opcode
0x766415dc: 0x42990300 addsmi r0, r9, #0, #6
0x766415e0: 0xf641d30a .long 0xf641d30a ; unknown opcode
0x766415e4: 0xf2c74324 .long 0xf2c74324 ; unknown opcode
0x766415e8: 0xeb03334d bl 0x7670e324
0x766415ec: 0x78182390 ldmdavc r8, {r4, r7, r8, r9, sp}
0x766415f0: 0xbf1c28ff svclt #0x1c28ff
0x766415f4: 0x701820ff .long 0x701820ff ; unknown opcode
0x766415f8: 0x00004770 andeq r4, r0, r0, ror r7
0x766415fc: 0x00000000 andeq r0, r0, r0
Here are the results with raw bytes from lr:
(lldb) disass -s 0x50ebc11b -e 0x50ebc15b -b
0x50ebc11b: 0x00e014f8 .long 0x00e014f8 ; unknown opcode
0x50ebc11f: 0x04f8cd90 ldrbteq r12, [r8], #3472
0x50ebc123: 0x442000e0 .long 0x442000e0 ; unknown opcode
0x50ebc127: 0xc56e45f6 strbgt r4, [lr, #-0x5f6]!
0x50ebc12b: 0xf01ec5f6 .long 0xf01ec5f6 ; unknown opcode
0x50ebc12f: 0x61f64847 .long 0x61f64847 ; unknown opcode
0x50ebc133: 0x13f2c543 .long 0x13f2c543 ; unknown opcode
0x50ebc137: 0x20479853 subhs r9, r7, r3, asr r8
0x50ebc13b: 0x32462946 sublo r2, r6, #1146880
0x50ebc13f: 0x59f24146 ldmibpl r2!, {r1, r2, r6, r8, lr} ^
0x50ebc143: 0x64f2c75c ldrbtvs r12, [r2], #1884
0x50ebc147: 0x4547e06c strbmi lr, [r7, #-0x6c]
0x50ebc14b: 0xc75024f2 smmlsrgt r0, r2, r4, r2
0x50ebc14f: 0x4e5012f2 mrcmi p2, #0x2, r1, c0, c2, #0x7
0x50ebc153: 0xc76391f6 .long 0xc76391f6 ; unknown opcode
0x50ebc157: 0x98635df2 stmdals r3!, {r1, r4, r5, r6, r7, r8, r10, r11, r12, lr} ^
Hmm, lldb's disassembly is really broken. Can you please get me
x/64bx 0x50ebc110
and
x/256bx 0x76641500
It would be easier to put the bytes printed into an online arm disassembler to see what they are.
But given the fact that lldb is broken like this, I would recommend reproducing the issue under gdb, which should work fine including the disassembly.
(lldb) x/64bx 0x50ebc110
0x50ebc110: 0x90 0x60 0xd3 0x60 0x02 0x9a 0x03 0x9b
0x50ebc118: 0x04 0x98 0xdd 0xf8 0x14 0xe0 0x00 0x90
0x50ebc120: 0xcd 0xf8 0x04 0xe0 0x00 0x20 0x44 0xf6
0x50ebc128: 0x45 0x6e 0xc5 0xf6 0xc5 0x1e 0xf0 0x47
0x50ebc130: 0x48 0xf6 0x61 0x43 0xc5 0xf2 0x13 0x53
0x50ebc138: 0x98 0x47 0x20 0x46 0x29 0x46 0x32 0x46
0x50ebc140: 0x41 0xf2 0x59 0x5c 0xc7 0xf2 0x64 0x6c
0x50ebc148: 0xe0 0x47 0x45 0xf2 0x24 0x50 0xc7 0xf2
(lldb) x/256bx 0x76641500
0x76641500: 0x2b 0x46 0x4f 0xf0 0x03 0x04 0x00 0x94
0x76641508: 0x5e 0xf0 0xbe 0xfb 0x28 0xb9 0x0a 0x98
0x76641510: 0x01 0xb0 0xbd 0xe8 0xf0 0x4f 0x04 0xb0
0x76641518: 0x70 0x47 0x84 0x46 0x01 0xb0 0xbd 0xe8
0x76641520: 0xf0 0x4f 0x0f 0xbc 0x60 0x47 0x2d 0xe9
0x76641528: 0xf1 0x4f 0x2d 0xed 0x08 0x0b 0x02 0xb4
0x76641530: 0x81 0xb0 0x0a 0xa8 0xdb 0xf7 0xba 0xfe
0x76641538: 0x01 0xb0 0x02 0xbc 0xbd 0xec 0x08 0x0b
0x76641540: 0xbd 0xe8 0xf1 0x8f 0x40 0x1a 0xa4 0x76
0x76641548: 0x04 0x1b 0xa4 0x76 0x00 0x2a 0x08 0xbf
0x76641550: 0x70 0x47 0x03 0x68 0xc1 0xf6 0x3c 0xba
0x76641558: 0x00 0x2a 0x08 0xbf 0x70 0x47 0x03 0x68
0x76641560: 0x0b 0x68 0xc1 0xf6 0x6d 0xbd 0x00 0x00
0x76641568: 0x2d 0xe9 0xf0 0x4f 0x03 0xaf 0x81 0xb0
0x76641570: 0xc3 0xf8 0x00 0xd0 0x70 0xca 0x02 0xf1
0x76641578: 0x04 0x02 0xb2 0xe8 0x00 0x0f 0x88 0x47
0x76641580: 0x01 0xb0 0xbd 0xe8 0xf0 0x8f 0x80 0xb5
0x76641588: 0x6f 0x46 0xc3 0xf8 0x00 0xd0 0x90 0x47
0x76641590: 0x80 0xbd 0x00 0x00 0x00 0x2a 0x06 0xd0
0x76641598: 0x80 0xb5 0x6f 0x46 0x03 0x78 0x0b 0x78
0x766415a0: 0xc1 0xf6 0x50 0xed 0x80 0xbd 0x03 0x48
0x766415a8: 0x00 0x68 0x00 0x28 0x18 0xbf 0xa4 0xf7
0x766415b0: 0xbf 0xbe 0x70 0x47 0x40 0x1a 0xa4 0x76
0x766415b8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x766415c0: 0x70 0x47 0x00 0xbf 0x00 0xbf 0x00 0xbf
0x766415c8: 0x00 0xbf 0x00 0xbf 0x00 0xbf 0x00 0xbf
0x766415d0: 0xbf 0xf3 0x5f 0x8f 0x01 0x60 0x40 0xf2
0x766415d8: 0x01 0x03 0xc0 0xf2 0x00 0x03 0x99 0x42
0x766415e0: 0x0a 0xd3 0x41 0xf6 0x24 0x43 0xc7 0xf2
0x766415e8: 0x4d 0x33 0x03 0xeb 0x90 0x23 0x18 0x78
0x766415f0: 0xff 0x28 0x1c 0xbf 0xff 0x20 0x18 0x70
0x766415f8: 0x70 0x47 0x00 0x00 0x00 0x00 0x00 0x00
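For anyone who wants to feed these bytes to another disassembler, the sketch below regroups one row of the dump both ways. The helper names `parse_bytes` and `group_le` are made up for illustration. Grouping into 4-byte little-endian words reproduces the raw values lldb printed for 0x76641558 and 0x7664155c; grouping into 2-byte halfwords gives the units a Thumb decoder expects.

```python
# Raw bytes at 0x76641558..0x7664155f, copied from the x/256bx dump.
BYTES_AT_1558 = "0x00 0x2a 0x08 0xbf 0x70 0x47 0x03 0x68"


def parse_bytes(text):
    """Turn a row of '0x..' tokens into a bytes object."""
    return bytes(int(tok, 16) for tok in text.split())


def group_le(data, width):
    """Split data into little-endian integers of `width` bytes each."""
    return [int.from_bytes(data[i:i + width], "little")
            for i in range(0, len(data), width)]


data = parse_bytes(BYTES_AT_1558)
print("as ARM words:      ", [f"{w:#010x}" for w in group_le(data, 4)])
print("as Thumb halfwords:", [f"{h:#06x}" for h in group_le(data, 2)])
```

Read as halfwords, the row is 0x2a00, 0xbf08, 0x4770, 0x6803, which are the 16-bit Thumb encodings of cmp r2, #0; it eq; bx lr; ldr r3, [r0, #0]. That agrees with the gdb disassembly of JIT_MemCpy posted later in this thread.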
I will repro this on another machine with gdb right away.
The disass even in thumb2 mode doesn't make sense at the point of failure. Maybe it is just another lldb issue. Let's see what we'll get in gdb.
Reproduced with gdb:
Thread 30 "Raven.Server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x4c1c4450 (LWP 1096)]
0x7668555e in JIT_MemCpy () from /mnt/ext-lab/RavenDB.regular/libcoreclr.so
(gdb) bt
#0 0x7668555e in JIT_MemCpy () from /mnt/ext-lab/RavenDB.regular/libcoreclr.so
#1 0x50a4bb54 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
(gdb) info registers
r0 0x523b3ffd 1379614717
r1 0x4c1c31d8 1276916184
r2 0x2 2
r3 0x0 0
r4 0x6362dcbc 1667423420
r5 0x523b3ffd 1379614717
r6 0x7 7
r7 0x1c5e1ea5 475930277
r8 0x0 0
r9 0x6448bbf0 1682488304
r10 0x4c1c3ba4 1276918692
r11 0x4c1c31f0 1276916208
r12 0x76685559 1986549081
sp 0x4c1c31b0 0x4c1c31b0
lr 0x50a4bb55 1352973141
pc 0x7668555e 0x7668555e
cpsr 0x20000030 536870960
(gdb) disassemble 0x50a4bb55
No function contains specified address.
(gdb) disassemble 0x7668555e
Dump of assembler code for function JIT_MemCpy:
0x76685558 <+0>: cmp r2, #0
0x7668555a <+2>: it eq
0x7668555c <+4>: bxeq lr
=> 0x7668555e <+6>: ldr r3, [r0, #0]
0x76685560 <+8>: ldr r3, [r1, #0]
0x76685562 <+10>: b.w 0x76547040
End of assembler dump.
(gdb)
So..
(gdb) p/x $r0
$3 = 0x523b3ffd
(gdb) p $_siginfo._sifields._sigfault.si_addr
$1 = (void *) 0x523b4000
And SEGV is on:
=> 0x7668555e <+6>: ldr r3, [r0, #0]
Could it be reading 4 bytes (the register size) starting at $r0, i.e. up to 0x523b3ffd + 4, which ends past the mapped page and causes the segfault, even though we only asked Unsafe.CopyBlock to copy 2 bytes?
Yes, based on the register values, that is what it was doing. And I can see it is a bug in the asm JIT_MemCpy helper. It uses the read to check whether the address is valid before it jumps to memcpy. However, reading 4 bytes is obviously wrong; it should read just a single byte instead. Based on the comment in the function code, it seems there used to be a requirement that this function be called only with 4-byte-aligned addresses, but looking at the Windows version of this helper, the code doesn't require it.
https://github.com/dotnet/coreclr/blob/master/src/vm/arm/crthelpers.S#L44-L58
I'll create a PR with a fix.
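The effect of the probe width can be checked with plain address arithmetic. This is only a model of the access, not the runtime's code: `first_bad_address` and `CRASHES` are hypothetical names, and for the gdb run the mapping end is assumed to be the reported si_addr, since no smaps excerpt was posted for it.

```python
def first_bad_address(addr, width, mapping_end):
    """Model a `width`-byte access at `addr` against a mapping whose exclusive
    end is `mapping_end`. Return the first touched address outside the
    mapping, or None when the whole access stays inside."""
    if addr >= mapping_end:
        return addr
    if addr + width > mapping_end:
        return mapping_end
    return None


# (r0 at the crash, exclusive end of the mapping) for the three reports.
CRASHES = [
    (0x520D4FFD, 0x520D5000),  # first lldb capture
    (0x4BBA2FFE, 0x4BBA3000),  # second lldb capture
    (0x523B3FFD, 0x523B4000),  # gdb capture, end assumed from si_addr
]

for r0, end in CRASHES:
    word = first_bad_address(r0, 4, end)  # current 4-byte probe (ldr)
    byte = first_bad_address(r0, 1, end)  # proposed 1-byte probe
    print(f"r0={r0:#x}: 4-byte probe faults at {word:#x}, 1-byte probe -> {byte}")
```

In all three captures the 4-byte probe runs into the first byte of the following page, which is exactly the reported fault address, while a 1-byte probe at the same destination stays inside the mapping.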
Great! Thanks.
Just to make sure, does this fix both Unsafe.CopyBlock and Unsafe.CopyBlockUnaligned?
Looking at the JIT source, there is only a single place that invokes JIT_MemCpy, and that is where the cpblk IL instruction is compiled. Both Unsafe.CopyBlock and Unsafe.CopyBlockUnaligned use cpblk, as you can see here:
https://github.com/dotnet/corefx/blob/64c6d9fe5409be14bdc3609d73ffb3fea1f35797/src/System.Runtime.CompilerServices.Unsafe/src/System.Runtime.CompilerServices.Unsafe.il#L162-L206
@janvorli Hi again.
With dotnet-sdk-2.2.101 I get:
(gdb) disassemble
Dump of assembler code for function JIT_MemCpy:
0x76644d04 <+0>: cmp r2, #0
0x76644d06 <+2>: it eq
0x76644d08 <+4>: bxeq lr
=> 0x76644d0a <+6>: ldr r3, [r0, #0]
0x76644d0c <+8>: ldr r3, [r1, #0]
0x76644d0e <+10>: b.w 0x76503364 <memcpy@plt>
End of assembler dump.
(gdb) where
#0 0x76644d0a in JIT_MemCpy () from /mnt/external/ravendb/src/Raven.Server/bin/Release/netcoreapp2.2/linux-arm/publish/libcoreclr.so
#1 0x4fec4d5a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
EDIT: I see the changes are not merged. Do we have an estimate of when this will be available?
"I see the changes are not merged. Do we have an estimate of when this will be available?"
I've just asked today in the issue. I hope the branches are open for merging now.
@aviviadi it will be part of the 2.1.8 release, as planned. The 2.1 branch should be open for merging the change after 2.1.7 is out.