Type | Version/Name
--- | ---
Distribution Name | Proxmox
Distribution Version | Proxmox 6.2 (Debian Buster)
Linux Kernel | 5.4.44-2-pve
Architecture | x86_64
ZFS Version | 0.8.4-pve1
SPL Version | 0.8.4-pve1
I'm experiencing a consistent general protection fault after upgrading from 0.7.x (Proxmox 5.4.x) to 0.8.4 (Proxmox 6.2). The fault looks similar to issue #7873, but is not exactly the same. Its result is a complete hang of most processes running within the Linux container that uses the associated ZFS volume, along with various other unresponsive host processes. I cannot regain use of the container until the entire Proxmox host is (hard) rebooted; the normal reboot process fails to proceed because of the hung LXC.
ZFS is my root filesystem (rpool), a mirror of two JBOD drives, with several Linux containers using their own ZFS volumes.
The bug can be encountered through normal use of the system; I can make it maybe two days before the issue forces me to reboot the machine.
I believe Proxmox's scheduled scrub increases the chance of occurrence while it is running, but ordinary daily use also seems to trigger the issue. For me it has consistently been triggered by the Plex Media Server process; I have not seen any other process named in the errors, but I do not see how this could be a Plex problem (all fingers point to ZFS).
I use ECC memory and have run memtest86+ recently to verify that memory is good.
Drives are healthy, pool is as well.
[136422.505658] general protection fault: 0000 [#1] SMP PTI
[136422.505866] CPU: 15 PID: 25510 Comm: Plex Media Serv Tainted: P IO 5.4.44-2-pve #1
[136422.506115] Hardware name: HP ProLiant ML350 G6, BIOS D22 05/05/2011
[136422.506303] RIP: 0010:avl_walk+0x33/0x70 [zavl]
[136422.506437] Code: 10 b9 01 00 00 00 29 d1 4c 01 c6 48 89 e5 48 85 f6 74 48 48 63 d2 48 89 f7 48 8b 04 d6 48 85 c0 74 20 48 63 c9 eb 03 48 89 d0 <48> 8b 14 c8 48 85 d2 75 f4 48 89 c2 48 89 d0 5d 4c 29 c0 c3 39 f1
[136422.507020] RSP: 0018:ffffacf29a163c80 EFLAGS: 00010282
[136422.507171] RAX: e089443875c085d0 RBX: ffffa0aac4d9c150 RCX: 0000000000000000
[136422.507375] RDX: 0000000000000001 RSI: ffffffffc01c1478 RDI: ffffffffc01c1478
[136422.507579] RBP: ffffacf29a163c80 R08: 0000000000000008 R09: ffffa0a8c3406f40
[136422.507784] R10: ffffa0ab33be8000 R11: 0000008000000000 R12: ffffa0ab33be8000
[136422.508031] R13: ffffa0aac4d9c178 R14: 0000000000000000 R15: 0000000000000000
[136422.508236] FS: 00007f6f78ff9700(0000) GS:ffffa0adc39c0000(0000) knlGS:0000000000000000
[136422.508465] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[136422.508630] CR2: 00007f6fbc294278 CR3: 0000000469bfc000 CR4: 00000000000006e0
[136422.508834] Call Trace:
[136422.508910] avl_nearest+0x2a/0x30 [zavl]
[136422.509140] zfs_rangelock_enter+0x405/0x580 [zfs]
[136422.509286] ? spl_kmem_zalloc+0xe9/0x140 [spl]
[136422.509419] ? spl_kmem_zalloc+0xe9/0x140 [spl]
[136422.509587] zfs_get_data+0x157/0x340 [zfs]
[136422.509746] zil_commit_impl+0x9ad/0xd90 [zfs]
[136422.509913] zil_commit+0x3d/0x60 [zfs]
[136422.510062] zfs_fsync+0x77/0xe0 [zfs]
[136422.510252] zpl_fsync+0x68/0xa0 [zfs]
[136422.510364] vfs_fsync_range+0x48/0x80
[136422.510474] ? __fget_light+0x59/0x70
[136422.510582] do_fsync+0x3d/0x70
[136422.510674] __x64_sys_fsync+0x14/0x20
[136422.510786] do_syscall_64+0x57/0x190
[136422.510895] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[136422.511041] RIP: 0033:0x7f6fffef4b17
[136422.511146] Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 04 f5 ff ff 89 df 89 c2 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 46 f5 ff ff 8b 44 24
[136422.511700] RSP: 002b:00007f6f78ff79c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[136422.511916] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6fffef4b17
[136422.512121] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000000000c
[136422.512373] RBP: 00000000038164b8 R08: 0000000000000000 R09: 00007f6fbc447770
[136422.512583] R10: 00007f6fbc0c4460 R11: 0000000000000293 R12: 0000000000000000
[136422.512792] R13: 0000000003825eb8 R14: 0000000000000002 R15: 0000000000000000
[136422.512999] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables mptctl mptbase binfmt_misc iptable_filter bpfilter softdog cpuid nfnetlink_log nfnetlink radeon ttm drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt intel_powerclamp usblp kvm_intel ipmi_ssif input_leds kvm hpilo i7core_edac pcspkr irqbypass serio_raw ipmi_si intel_cstate ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp parport_pc ppdev lp parport sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid hid gpio_ich psmouse pata_acpi uhci_hcd ehci_pci lpc_ich tg3 ehci_hcd hpsa scsi_transport_sas
[136422.515330] ---[ end trace b7f8cd7b0f091d95 ]---
[136422.515472] RIP: 0010:avl_walk+0x33/0x70 [zavl]
[136422.515649] Code: 10 b9 01 00 00 00 29 d1 4c 01 c6 48 89 e5 48 85 f6 74 48 48 63 d2 48 89 f7 48 8b 04 d6 48 85 c0 74 20 48 63 c9 eb 03 48 89 d0 <48> 8b 14 c8 48 85 d2 75 f4 48 89 c2 48 89 d0 5d 4c 29 c0 c3 39 f1
[136422.516197] RSP: 0018:ffffacf29a163c80 EFLAGS: 00010282
[136422.516352] RAX: e089443875c085d0 RBX: ffffa0aac4d9c150 RCX: 0000000000000000
[136422.516577] RDX: 0000000000000001 RSI: ffffffffc01c1478 RDI: ffffffffc01c1478
[136422.516826] RBP: ffffacf29a163c80 R08: 0000000000000008 R09: ffffa0a8c3406f40
[136422.517030] R10: ffffa0ab33be8000 R11: 0000008000000000 R12: ffffa0ab33be8000
[136422.517235] R13: ffffa0aac4d9c178 R14: 0000000000000000 R15: 0000000000000000
[136422.517439] FS: 00007f6f78ff9700(0000) GS:ffffa0adc39c0000(0000) knlGS:0000000000000000
[136422.517669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[136422.517875] CR2: 00007f6fbc294278 CR3: 0000000469bfc000 CR4: 00000000000006e0
I can report a similar problem, also with Plex in a Proxmox 6.2 container. ECC RAM, healthy mirrored pool, etc.
[386809.776800] CPU: 2 PID: 20182 Comm: Plex Media Serv Tainted: P O 5.4.44-2-pve #1
[386809.776823] Hardware name: Dell Inc. PowerEdge T30/07T4MC, BIOS 1.2.0 12/04/2019
[386809.776845] RIP: 0010:avl_walk+0x33/0x70 [zavl]
[386809.776857] Code: 10 b9 01 00 00 00 29 d1 4c 01 c6 48 89 e5 48 85 f6 74 48 48 63 d2 48 89 f7 48 8b 04 d6
48 85 c0 74 20 48 63 c9 eb 03 48 89 d0 <48> 8b 14 c8 48 85 d2 75 f4 48 89 c2 48 89 d0 5d 4c 29 c0 c3 39 f1
[386809.776904] RSP: 0018:ffffa27904d9fc80 EFLAGS: 00010282
[386809.777648] RAX: e089443875c085d0 RBX: ffff966f834a2320 RCX: 0000000000000000
[386809.778441] RDX: 0000000000000001 RSI: ffffffffc0363478 RDI: ffffffffc0363478
[386809.779104] RBP: ffffa27904d9fc80 R08: 0000000000000008 R09: ffff96760a402fc0
[386809.779883] R10: ffff96713cdff500 R11: 0000008000000000 R12: ffff96713cdff500
[386809.780548] R13: ffff966f834a2348 R14: 0000000000000000 R15: 0000000000000000
[386809.781193] FS: 00007fdd287f0700(0000) GS:ffff96760db00000(0000) knlGS:0000000000000000
[386809.781837] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[386809.782488] CR2: 00007fa9226c5968 CR3: 000000023e356004 CR4: 00000000003626e0
[386809.783169] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[386809.783832] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[386809.784515] Call Trace:
[386809.785191] avl_nearest+0x2a/0x30 [zavl]
[386809.785877] zfs_rangelock_enter+0x405/0x580 [zfs]
[386809.786524] ? spl_kmem_zalloc+0xe9/0x140 [spl]
[386809.787152] ? spl_kmem_zalloc+0xe9/0x140 [spl]
[386809.787790] zfs_get_data+0x157/0x340 [zfs]
[386809.788439] zil_commit_impl+0x9ad/0xd90 [zfs]
[386809.789078] zil_commit+0x3d/0x60 [zfs]
[386809.789774] zfs_fsync+0x77/0xe0 [zfs]
[386809.790382] zpl_fsync+0x68/0xa0 [zfs]
[386809.790982] vfs_fsync_range+0x48/0x80
[386809.791541] ? __fget_light+0x59/0x70
[386809.792100] do_fsync+0x3d/0x70
[386809.792684] __x64_sys_fsync+0x14/0x20
[386809.793220] do_syscall_64+0x57/0x190
[386809.793794] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Also seeing this issue, but on Ubuntu 18.04.5 LTS with 0.8.4. I had it with 0.8.3 as well, but not with 0.7.x. I'm also running Plex, and while the effects aren't as bad (no stopped containers, etc.), it does effectively turn the Plex server into a zombie.
[16019.192423] general protection fault: 0000 [#1] SMP PTI
[16019.192454] Modules linked in: xt_mark veth vhost_net vhost tap xt_nat nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables devlink iptable_filter aufs overlay binfmt_misc intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass intel_cstate intel_rapl_perf intel_pch_thermal bridge stp llc joydev input_leds zfs(POE) zunicode(POE) zavl(POE) icp(POE) zlua(POE) ipmi_ssif zcommon(POE) znvpair(POE) spl(OE) ipmi_si ipmi_devintf ipmi_msghandler mac_hid mei_me mei acpi_pad ie31200_edac shpchp lpc_ich sch_fq_codel ib_iser rdma_cm
[16019.192701] iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi jc42 coretemp lp parport ip_tables x_tables autofs4 algif_skcipher af_alg dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ast ttm drm_kms_helper syscopyarea aesni_intel igb sysfillrect sysimgblt aes_x86_64 hid_generic dca e1000e i2c_algo_bit fb_sys_fops crypto_simd usbhid ahci mpt3sas glue_helper drm ptp hid libahci cryptd raid_class pps_core scsi_transport_sas video
[16019.192881] CPU: 3 PID: 4535 Comm: Plex Media Serv Tainted: P OE 4.15.0-112-generic #113-Ubuntu
[16019.192912] Hardware name: Supermicro X10SLM-F/X10SLM-F, BIOS 3.0 04/24/2015
[16019.192938] RIP: 0010:avl_walk+0x33/0x60 [zavl]
[16019.192961] RSP: 0018:ffff9c366f08bc70 EFLAGS: 00010282
[16019.192979] RAX: e089445075c085d0 RBX: ffff8b42a28e5800 RCX: 0000000000000000
[16019.193001] RDX: 0000000000000001 RSI: ffffffffc05cb608 RDI: 0000000000000000
[16019.193024] RBP: ffff9c366f08bc70 R08: 0000000000000008 R09: ffff8b457f003200
[16019.193046] R10: ffff9c366f08bc20 R11: ffff8b457360ae50 R12: ffff8b44163f3918
[16019.193069] R13: ffff8b44163f3940 R14: ffff8b42a28e5300 R15: 0000000000000000
[16019.193092] FS: 00007f5716ffd700(0000) GS:ffff8b459fcc0000(0000) knlGS:0000000000000000
[16019.193118] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16019.193137] CR2: 00007f891cbb2000 CR3: 00000006b812c001 CR4: 00000000001626e0
[16019.193159] Call Trace:
[16019.193173] avl_nearest+0x2b/0x30 [zavl]
[16019.193241] zfs_rangelock_enter+0x3b8/0x550 [zfs]
[16019.193274] ? spl_kmem_zalloc+0xe9/0x150 [spl]
[16019.193306] ? spl_kmem_zalloc+0xe9/0x150 [spl]
[16019.193401] zfs_get_data+0x136/0x350 [zfs]
[16019.193496] zil_commit_impl+0x9b9/0xd60 [zfs]
[16019.193589] zil_commit+0x3d/0x60 [zfs]
[16019.193675] zfs_fsync+0x77/0xe0 [zfs]
[16019.193727] zpl_fsync+0x68/0xa0 [zfs]
[16019.193757] vfs_fsync_range+0x51/0xb0
[16019.193784] do_fsync+0x3d/0x70
[16019.193808] SyS_fsync+0x10/0x20
[16019.195027] do_syscall_64+0x73/0x130
[16019.196163] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[16019.197124] RIP: 0033:0x7f57bb8b6b17
[16019.197888] RSP: 002b:00007f5716ffbb60 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[16019.198537] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f57bb8b6b17
[16019.199280] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000000000c
[16019.199938] RBP: 0000000002dc8aa8 R08: 0000000000000000 R09: 000000000005d0ec
[16019.200576] R10: 00007f57280008d0 R11: 0000000000000293 R12: 0000000000000000
[16019.201304] R13: 0000000002e4ad38 R14: 0000000000000002 R15: 0000000000000000
[16019.201929] Code: 47 10 bf 01 00 00 00 29 d7 48 89 e5 4c 01 c6 48 85 f6 74 40 48 63 d2 48 89 f1 48 8b 04 d6 48 85 c0 74 1a 48 63 cf eb 03 48 89 d0 <48> 8b 14 c8 48 85 d2 75 f4 4c 29 c0 5d c3 39 d7 74 f7 48 8b 41
[16019.203480] RIP: avl_walk+0x33/0x60 [zavl] RSP: ffff9c366f08bc70
[16019.204133] ---[ end trace f50c07bc2d334abc ]---
I've seen these same stack traces in Ubuntu 16.04 (zfs 0.7.13) and in a Debian 10 lxc container (Debian 10 host/zfs 0.8.4).
Also seeing this inside a container running on Proxmox (running Plex):
[1060935.777369] general protection fault: 0000 [#1] SMP PTI
[1060935.778037] CPU: 14 PID: 25452 Comm: Plex Media Serv Tainted: P O 5.4.60-1-pve #1
[1060935.779104] Hardware name: Supermicro Super Server/X10SDV-TLN4F, BIOS 1.0b 09/09/2015
[1060935.780263] RIP: 0010:avl_walk+0x33/0x70 [zavl]
[1060935.781426] Code: 10 b9 01 00 00 00 29 d1 4c 01 c6 48 89 e5 48 85 f6 74 48 48 63 d2 48 89 f7 48 8b 04 d6 48 85 c0 74 20 48 63 c9 eb 03 48 89 d0 <48> 8b 14 c8 48 85 d2 75 f4 48 89 c2 48 89 d0 5d 4c 29 c0 c3 39 f1
[1060935.783639] RSP: 0018:ffffb1cc4e3d3c80 EFLAGS: 00010282
[1060935.784424] RAX: e089443875c085d0 RBX: ffff9c078df44e40 RCX: 0000000000000000
[1060935.785162] RDX: 0000000000000001 RSI: ffffffffc021a478 RDI: ffffffffc021a478
[1060935.785862] RBP: ffffb1cc4e3d3c80 R08: 0000000000000008 R09: ffff9c1b79406f40
[1060935.786565] R10: ffff9c0f10ecca00 R11: 0000008000000000 R12: ffff9c0f10ecca00
[1060935.787276] R13: ffff9c078df44e68 R14: 0000000000000000 R15: 0000000000000000
[1060935.788000] FS: 00007f384bfff700(0000) GS:ffff9c1b7fb80000(0000) knlGS:0000000000000000
[1060935.788704] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1060935.789456] CR2: 00007f392ba31000 CR3: 0000001eaa0b4002 CR4: 00000000003626e0
[1060935.790257] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1060935.791343] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1060935.792038] Call Trace:
[1060935.792726] avl_nearest+0x2a/0x30 [zavl]
[1060935.793518] zfs_rangelock_enter+0x405/0x580 [zfs]
[1060935.794390] ? spl_kmem_zalloc+0xe9/0x140 [spl]
[1060935.795137] ? spl_kmem_zalloc+0xe9/0x140 [spl]
[1060935.795863] zfs_get_data+0x157/0x340 [zfs]
[1060935.796573] zil_commit_impl+0x9ad/0xd90 [zfs]
[1060935.797281] zil_commit+0x3d/0x60 [zfs]
[1060935.797976] zfs_fsync+0x77/0xe0 [zfs]
[1060935.798647] zpl_fsync+0x68/0xa0 [zfs]
[1060935.799300] vfs_fsync_range+0x48/0x80
[1060935.799986] ? __fget_light+0x59/0x70
[1060935.800882] do_fsync+0x3d/0x70
[1060935.801665] __x64_sys_fsync+0x14/0x20
[1060935.802329] do_syscall_64+0x57/0x190
[1060935.802969] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1060935.803587] RIP: 0033:0x7f392a12188b
[1060935.804215] Code: 4a 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 03 f7 ff ff 8b 7c 24 0c 41 89 c0 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2f 44 89 c7 89 44 24 0c e8 41 f7 ff ff 8b 44
[1060935.805905] RSP: 002b:00007f384bffdd40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[1060935.806696] RAX: ffffffffffffffda RBX: 0000000002d017e8 RCX: 00007f392a12188b
[1060935.807452] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000015
[1060935.808436] RBP: 0000000002c86338 R08: 0000000000000000 R09: 000000000004d96e
[1060935.809156] R10: 00007f379008b970 R11: 0000000000000293 R12: 0000000000000000
[1060935.809797] R13: 0000000002d01758 R14: 0000000000000002 R15: 0000000000000000
[1060935.810429] Modules linked in: tcp_diag inet_diag nf_conntrack_netlink xt_nat xt_tcpudp binfmt_misc xt_conntrack xt_MASQUERADE xfrm_user xt_addrtype iptable_nat veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter bonding openvswitch softdog nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ast glue_helper drm_vram_helper ttm rapl drm_kms_helper intel_cstate pcspkr mxm_wmi intel_pch_thermal drm fb_sys_fops syscopyarea sysfillrect joydev input_leds sysimgblt mei_me mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_pad mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc
[1060935.810462] fscache overlay aufs ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c ses enclosure usbmouse usbkbd hid_generic usbhid hid gpio_ich ahci xhci_pci igb mpt3sas ehci_pci ixgbe i2c_i801 libahci lpc_ich i2c_algo_bit xhci_hcd raid_class ehci_hcd xfrm_algo scsi_transport_sas dca mdio wmi
[1060935.820245] ---[ end trace 2f95ade701523fe1 ]---
[1060937.895322] RIP: 0010:avl_walk+0x33/0x70 [zavl]
[1060937.896785] Code: 10 b9 01 00 00 00 29 d1 4c 01 c6 48 89 e5 48 85 f6 74 48 48 63 d2 48 89 f7 48 8b 04 d6 48 85 c0 74 20 48 63 c9 eb 03 48 89 d0 <48> 8b 14 c8 48 85 d2 75 f4 48 89 c2 48 89 d0 5d 4c 29 c0 c3 39 f1
[1060937.899611] RSP: 0018:ffffb1cc4e3d3c80 EFLAGS: 00010282
[1060937.900854] RAX: e089443875c085d0 RBX: ffff9c078df44e40 RCX: 0000000000000000
[1060937.902077] RDX: 0000000000000001 RSI: ffffffffc021a478 RDI: ffffffffc021a478
[1060937.903590] RBP: ffffb1cc4e3d3c80 R08: 0000000000000008 R09: ffff9c1b79406f40
[1060937.904987] R10: ffff9c0f10ecca00 R11: 0000008000000000 R12: ffff9c0f10ecca00
[1060937.906421] R13: ffff9c078df44e68 R14: 0000000000000000 R15: 0000000000000000
[1060937.907870] FS: 00007f384bfff700(0000) GS:ffff9c1b7fb80000(0000) knlGS:0000000000000000
[1060937.909202] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1060937.910201] CR2: 00007f392ba31000 CR3: 0000001eaa0b4002 CR4: 00000000003626e0
[1060937.911049] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1060937.912436] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
root@plex:~# uname -a
Linux plex 5.4.60-1-pve #1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200) x86_64 x86_64 x86_64 GNU/Linux
root@plex:~#
I'm going to try moving Plex to a VM (with NFS) rather than an LXC container to see if that resolves the issue. My guess is this is related to ZFS usage inside an LXC container; Plex is probably just accessing the filesystem in a way that triggers it more consistently.
Your comment reminded me to post my workaround...
As indicated by the stack traces, this appears to be an issue in the ZFS fsync() implementation. Disabling ZFS synchronous writes for the affected dataset (e.g. "zfs set sync=disabled somedataset") resolves the issue. I've been running this way for a month without the issue reappearing.
I presume the primary Plex use of fsync() is its SQLite databases, so doing this undermines Plex's data consistency expectations and isn't something to be done lightly. In my case it's an isolated dataset just for Plex, and I'm not concerned about Plex corruption or even complete data loss.
I wonder if this is related to this: https://github.com/openzfs/zfs/issues/10440 ?
If so, would that mean that this is an issue with zfs and SQLite in general?
Could be. It may not really be SQLite-specific; the I/O pattern SQLite generates in normal use may just be what triggers it, which means other apps and access patterns could hit it as well. Thanks for linking it, though; that issue seems to have more investigation into the root cause and work happening on it.
I am still having this problem on kernels 5.8 and 5.9. Setting sync=disabled does not solve the issue for me. Has anyone found a workaround?
Thanks
For me, moving Plex to a VM instead of a container fixed it; I haven't seen the issue since. All of Plex's data (SQLite and such) lives on the VM's own filesystem (which is ironic, since that virtual disk is a ZFS volume, so you'd think it would have the same issues). The actual video is shared into the VM over NFS.
My guess is that the problem has something to do with ZFS being used by a containerized app with a fast R/W pattern. Not sure why, but there's that. :)
I can provide a data point on the container aspect -- my Plex instance is running outside of a container and exhibits the issue at hand.
Interesting. Perhaps my using a ZFS volume vs a dataset is a factor then? Or, coincidence I guess. :)
If you wanted to test whether the zvol is the issue, @bdaroz, you could make a zvol with, e.g., ext4 on it, then use that ext4 filesystem for the Plex data. If that still exhibits the issue, then it's not a zvol-vs-dataset difference; if it fixes it, then there's another data point suggesting the issue is dataset-specific.
I don't know the ZFS code base at all, so I'm not sure whether that stack trace is general or specific to datasets vs. volumes.
Actually, poking around in the code more (though I am DEFINITELY not well versed in it), this looks like it applies to any commit through the ZIL. I believe the ZIL is used for zvols just as it is for datasets (a write-ahead log for synchronous writes). So if there is any difference in behavior, it's possibly because synchronous writes in the filesystem on top of the zvol don't trigger a corresponding synchronous write on the zvol (or at least change the access pattern).
Regardless of all that, the actual GP fault is in avl_nearest, which is code that isn't even ZFS-specific (the file itself says as much: https://github.com/openzfs/zfs/blob/zfs-0.8-release/module/avl/avl.c). It's a balanced binary search tree (an AVL tree). That it's dying there suggests the problem is in the tree code itself. My guess would be a multi-threading issue, since if it were a more general logic bug I'd expect we'd see this problem far more often. But you never know.
That file has had a lot of changes since the 0.8.4 release: https://github.com/openzfs/zfs/commits/master/module/avl/avl.c. So maybe there has already been a fix since this bug was filed that just hasn't been released yet.
This is all speculation though. :)
Oh, the newer code references this man page: https://illumos.org/man/9f/avl.
Which explicitly calls out:
> AVL trees do not inherently have any internal locking, it is up to the consumer to use locks as appropriate. See mutex(9F) and rwlock(9F) for more information on synchronization primitives.
So, yeah, maybe multi-threading related, if the locking isn't properly protecting the tree. That might point to an issue in the locking around zfs_rangelock_enter.
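To make the man page's point concrete, the expected consumer-side pattern looks roughly like the sketch below. This is purely illustrative: the range_node_t type, comparator, and helper names are made up, and it assumes building in userspace against OpenZFS's libavl with a pthread mutex standing in for the kernel mutex the real code would use.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/avl.h>   /* OpenZFS AVL interface (lib/libavl in userspace) */

/* Hypothetical tree node, for illustration only. */
typedef struct range_node {
	uint64_t   rn_offset;
	avl_node_t rn_link;          /* embedded AVL linkage */
} range_node_t;

static int
range_compare(const void *a, const void *b)
{
	const range_node_t *r1 = a, *r2 = b;

	if (r1->rn_offset < r2->rn_offset)
		return (-1);
	return (r1->rn_offset > r2->rn_offset);
}

static avl_tree_t      range_tree;
static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;

void
range_tree_init(void)
{
	avl_create(&range_tree, range_compare, sizeof (range_node_t),
	    offsetof(range_node_t, rn_link));
}

/* Every lookup/insert pair must happen under the consumer's own lock;
 * the tree itself provides no protection. */
void
range_tree_insert(range_node_t *rn)
{
	avl_index_t where;

	pthread_mutex_lock(&range_lock);
	if (avl_find(&range_tree, rn, &where) == NULL)
		avl_insert(&range_tree, rn, where);  /* 'where' valid only on a miss */
	pthread_mutex_unlock(&range_lock);
}
```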
I recently hit a very similar issue.
I have looked into the core dump, and the problem is not in the AVL code.
The way it happens seems to be like this:
A TX_WRITE was created on a file, but the file was later deleted and a new directory was created with the same object ID.
When zil_commit happened, it went into zfs_get_data, where it got the wrong znode, now pointing to the new directory.
It then tried to take a range lock with size = zp->z_blksz, which is 0.
https://github.com/openzfs/zfs/blob/85703f616ddb83bb94192e24fd8519661578aac2/module/zfs/zfs_vnops.c#L801
When two callers take a range lock on the same offset with size 0,
the second caller gets an uninitialized where:
https://github.com/openzfs/zfs/blob/85703f616ddb83bb94192e24fd8519661578aac2/module/zfs/zfs_rlock.c#L415
This uninitialized where is later dereferenced, because the code thinks the two size-0 ranges don't overlap:
https://github.com/openzfs/zfs/blob/85703f616ddb83bb94192e24fd8519661578aac2/module/zfs/zfs_rlock.c#L326
https://github.com/openzfs/zfs/blob/85703f616ddb83bb94192e24fd8519661578aac2/module/zfs/zfs_rlock.c#L342
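To make that sequence concrete, here is a minimal user-space sketch of the mechanism just described (illustrative only; range_t and tree_find are made-up stand-ins, and the real logic is in the zfs_rlock.c lines linked above). A find that only fills in its insertion cursor on a miss, plus a zero-length overlap check that discards the exact match it just returned, leaves the caller passing uninitialized stack garbage to the equivalent of avl_nearest():

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for locked_range_t and the AVL tree. */
typedef struct range { uint64_t off, len; } range_t;

/* Pretend the tree already holds one zero-length lock at offset 4096. */
static range_t existing = { 4096, 0 };

/* Like avl_find(): returns the match, and sets *cursor ONLY on a miss. */
static range_t *
tree_find(const range_t *key, uintptr_t *cursor)
{
	if (key->off == existing.off)
		return (&existing);     /* hit: *cursor left untouched */
	*cursor = 1;                    /* miss: cursor marks the insert spot */
	return (NULL);
}

int
main(void)
{
	range_t req = { 4096, 0 };      /* z_blksz == 0 gives a 0-length lock */
	uintptr_t cursor;               /* deliberately left uninitialized */
	range_t *prev = tree_find(&req, &cursor);

	/* "Does prev end after our start?"  False for a zero-length prev. */
	if (prev != NULL && prev->off + prev->len <= req.off)
		prev = NULL;            /* exact match discarded as non-overlapping */

	if (prev == NULL) {
		/*
		 * The real code now calls avl_nearest(tree, where, AVL_AFTER),
		 * which walks from whatever stack garbage is in 'where'.
		 */
		printf("would pass uninitialized cursor %#lx to avl_nearest()\n",
		    (unsigned long)cursor);
	}
	return (0);
}
```

That lines up with the traces above: the faulting address in RAX is garbage, and the crash is in avl_walk() called from avl_nearest().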
If anyone here can repro this consistently, can you please try this patch.
https://gist.github.com/tuxoko/e9e08d07d2c182d983057c63d4796551