Nixpkgs: NFS tests are broken

Created on 13 Jul 2018 · 38Comments · Source: NixOS/nixpkgs

It's a recent regression of both unstable and 18.03.

Apparently it's not 100% reproducible: on a local machine, I was able to make it succeed once (but it failed two other times there). Still, the 4.14.55 kernel bump seems to be the trigger, as I'm unable to reproduce the problem after reverting to 4.14.44 (3 successes).

Error log:



client1# checking /dev/vda...

client1# fsck (busybox 1.28.4)

client1# [fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/vda

client1# [    1.349044] usb 1-1: New USB device found, idVendor=0627, idProduct=0001

client1# [    1.350105] usb 1-1: New USB device strings: Mfr=1, Product=3, SerialNumber=5

client1# [    1.351037] usb 1-1: Product: QEMU USB Tablet

client1# [    1.351622] usb 1-1: Manufacturer: QEMU

client1# [    1.352255] usb 1-1: SerialNumber: 42

client1# /dev/vda: recovering journal

client1# /dev/vda contains a file system with errors, check forced.

client1# [    1.375927] hidraw: raw HID events driver (C) Jiri Kosina

client1# /dev/vda: 213/32768 files (0.9% non-contiguous), 10170/131072 blocks

client1# [    1.380052] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100

client1# [    1.380052]

client1# [    1.381262] CPU: 0 PID: 1 Comm: init Not tainted 4.14.55 #1-NixOS

client1# [    1.382092] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-pre

built.qemu-project.org 04/01/2014

client1# [    1.383664] Call Trace:

client1# [    1.384001]  dump_stack+0x5c/0x85

client1# [    1.384426]  panic+0xd5/0x232

client1# [    1.384822]  do_exit+0xad0/0xad0

client1# [    1.385358]  ? handle_mm_fault+0xaa/0x1e0

client1# [    1.385879]  do_group_exit+0x3a/0xa0

client1# [    1.386348]  SyS_exit_group+0x10/0x10

client1# [    1.386798]  do_syscall_64+0x67/0x120

client1# [    1.387251]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

client1# [    1.387887] RIP: 0033:0x7fa23463e676

client1# [    1.388351] RSP: 002b:00007ffd2581c1f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7

client1# [    1.389203] RAX: ffffffffffffffda RBX: 0000000001d1b704 RCX: 00007fa23463e676

client1# [    1.390065] RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001

client1# [    1.391058] RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffff80

client1# [    1.391964] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffd2581c588

client1# [    1.392818] R13: 00007ffd2581c580 R14: 0000000000000000 R15: 0000000000000000

client1# [    1.393775] Kernel Offset: 0x2200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

client1# [    1.395035] Rebooting in 1 seconds..

blocker kernel nixos

Source

vcunat

Most helpful comment

This is presumably fixed in commit "ext4: fix check to prevent initializing reserved inodes" which should be in 4.17.12, 4.14.60, and 4.9.117.

dezgeg on 3 Aug 2018

🎉2

All 38 comments

Oh, interesting! The NFS test is not one I usually run, I shall add it to my list :) I can definitely reproduce the issue locally. Our nfs-utils is pretty old, I will see if bumping it makes a difference.

NeQuissimus on 13 Jul 2018

I wouldn't expect the utils to be able to kill the kernel.

vcunat on 13 Jul 2018

It doesn't kill it, it's just the stage-1 (which is pid 1 at that point) exiting for some reason.

dezgeg on 13 Jul 2018

yeah, I just thought I'd give it a try in case there is come compatibility issue. But an updated nfs-utils is a different beast. The clients don't work at all anymore once I update (which should probably be addressed).

Says this even if I do a $client1->succeed("mkdir /data"). (May be that startAll is too quick)

client1# [    5.173563] mount[660]: mount.nfs: mounting server:/data failed, reason given by server: No such file or directory

It is interesting just because there is nothing in the kernel changelog that really messed with NFS specifically; mostly cifs and ext4 stuff.

Edit: The NFS share sits on top of ext4, I wonder if that causes it
Edit2: It may in fact be the ext4 changes in the kernel. It seems to see corruption now that was never the case in previous test runs:

client1# /dev/vda contains a file system with errors, check forced.
client1# /dev/vda: 213/32768 files (0.9% non-contiguous), 10169/131072 blocks
client1# [    1.245287] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100

No idea why something would then try to kill PID 1...

NeQuissimus on 13 Jul 2018

I figured it out!

e2fsprogs-1.44.2 does not work with the ext4 changes in the kernel.
Bumping it to 1.44.3 solves the issue

NeQuissimus on 13 Jul 2018

Yes, presumably it's marked corrupted by:

client1# [ 13.115644] EXT4-fs error (device vda): ext4_has_uninit_itable:3111: comm mount: Inode table for bg 0 marked as needing zeroing
I doubt that's intentional behaviour by the kernel. Old filesystems should never stop working in a new kernel.

dezgeg on 13 Jul 2018

My PR was a duplicate of #43352 btw

NeQuissimus on 13 Jul 2018

How do you know it solves the issue? The fact that it's not 100% reproducible complicates this, but I managed to get the same oops even with 1.44.3, picked on top of 18.03 (861211217f).

vcunat on 14 Jul 2018

hmmm... really? The error was 100% reproducible for me locally, I ran the test 5 times. With es2fsprogs 1.44.3, I ran the test another 5 and it not once failed.

NeQuissimus on 14 Jul 2018

I've done four or five attempts of the above on this machine now, and all except one ended in a kernel backtrace. I certainly don't feel I understand the problem yet.

vcunat on 14 Jul 2018

The problem is that the filesystem is marked corrupted by this:
client1# [ 13.115644] EXT4-fs error (device vda): ext4_has_uninit_itable:3111: comm mount: Inode table for bg 0 marked as needing zeroing

and if you Google the message it's introduced in the backport of this commit which is new in 4.14.55: https://www.spinics.net/lists/linux-ext4/msg61254.html

Either the original commit or its backport is broken. I guess it would be good to check the latest kernel if the problem happens there as well to figure out which.

It's worth noting that the error happens in stage-2, where the filesystem is only remounted, not mounted. (it's created and mounted in stage-1). So maybe the patch is broken only in the remount case (which would also explain why it was missed if the original patch itself is broken and the bug was not introduced in the 4.14 backport).

dezgeg on 14 Jul 2018

If you run the tests with 4.17.6, the same issue occurs.
It would seem that something is broken with ext4 in all the recent kernels then...

NeQuissimus on 14 Jul 2018

Best to mail Ted Ts'o and the ext4 list then and specifically mention the error happens on remount.

dezgeg on 14 Jul 2018

I sent an email to the ext4 mailing list

NeQuissimus on 15 Jul 2018

That thread is expected to appear here: https://marc.info/?l=linux-ext4&r=1&b=201807&w=2 ? or am I looking on a wrong place? (I can't see anything related ATM.)

vcunat on 16 Jul 2018

I was looking here: https://www.spinics.net/lists/linux-ext4/ But my email does not show up. I have no idea how this mailing list works... Maybe it needs initial approval?! The email did not bounce back, so it's somewhere... I sent it to [email protected]

NeQuissimus on 16 Jul 2018

I had looked on that URL as well, but there the last one was from last Friday, whereas marc.info had some today timestamps.

vcunat on 16 Jul 2018

Do I have to do something special to be able to send mails to the kernel lists? Not sure why it would not show up in the archives...

NeQuissimus on 16 Jul 2018

Can't help with how to send to the list, but did you see this? https://marc.info/?l=linux-ext4&m=153178285402937&w=2

peterhoeg on 17 Jul 2018

Can you guys run the test with the latest kernel updates? I cannot reproduce the issue anymore with our test...

NeQuissimus on 17 Jul 2018

nfs3 and nfs4 tests still fail the same way for me on 796b2b524 (18.03, ATM with 4.14.56 kernel).

vcunat on 17 Jul 2018

Can you grab a master checkout? I have rebuilt the test a bunch of times now and I can no longer make it fail on master (ffaf902ee34d1c85b98903f1bcc9c573f145fd13). On 18.03, it does fail for me...

NeQuissimus on 17 Jul 2018

Same here. I have no idea what relevant thing has changed on master in the meantime. There's been a staging merge (but e.g. e2fsprogs is still 1.44.2).

vcunat on 17 Jul 2018

There are a few differences in the kernel configs but none explain the issue (master vs 18.03):

λ diff /nix/store/8an4didr8fszwl38pv6yh1q6gw8mnw88-linux-config-4.14.56 /nix/store/30pnrb7whyvfrcp5x00bzd1czkiiz3y5-linux-config-4.14.56
3210c3210
< # CONFIG_RT2800USB_RT53XX is not set
---
> CONFIG_RT2800USB_RT53XX=y
8350c8350
< # CONFIG_KEY_DH_OPERATIONS is not set
---
> CONFIG_KEY_DH_OPERATIONS=y
8416c8416
< CONFIG_CRYPTO_KPP=m
---
> CONFIG_CRYPTO_KPP=y
8419c8419
< CONFIG_CRYPTO_DH=m
---
> CONFIG_CRYPTO_DH=y
8704c8704
< CONFIG_MPILIB=m
---
> CONFIG_MPILIB=y

NeQuissimus on 17 Jul 2018

The .55 -> .56 kernel bump apparently didn't make a difference; I just tested on a7d17e287.

vcunat on 17 Jul 2018

https://hydra.nixos.org/build/77598078

Hydra says that something in 8258f5...75942f fixed this. I can't spot anything relevant...

NeQuissimus on 17 Jul 2018

I could see nothing relevant either. This one is non-trivial to bisect over nixpkgs, too.

vcunat on 18 Jul 2018

Note that v4.14.56 does not include the patch mentioned above (https://marc.info/?l=linux-ext4&m=153178285402937&w=2) yet. But based on the description it sounds like it could be the same underlying issue. (And it also mentions that the problem is basically some race condition which should explain why it is intermittent).

dezgeg on 18 Jul 2018

Now I tried that patch with .56 on 18.03 2da449ea6, and the nfs3 and nfs4 tests seem failing the same way. (it applies correctly, etc.)



...

client1# <<< NixOS Stage 1 >>>

client1#

client1# loading module virtio_balloon...

client1# loading module virtio_console...

client1# loading module virtio_rng...

client1# loading module dm_mod...

client1# [    1.045588] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: [email protected]

client1# running udev...

client1# [    1.059339] systemd-udevd[84]: starting version 237

client2: running command: test -e locked

client2: exit status 1

client1# [    1.163723] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12

client1# [    1.172654] serio: i8042 KBD port at 0x60,0x64 irq 1

client1# [    1.223771] SCSI subsystem initialized

client1# [    1.228016] serio: i8042 AUX port at 0x60,0x64 irq 12

client1# [    1.243655] rtc_cmos 00:00: RTC can wake from S4

client1# [    1.245460] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10

client1# [    1.248226] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0

client1# [    1.249286] rtc_cmos 00:00: alarms up to one day, y3k, 114 bytes nvram, hpet irqs

client1# [    1.282994] ACPI: bus type USB registered

client1# [    1.288025] usbcore: registered new interface driver usbfs

client1# [    1.288851] usbcore: registered new interface driver hub

client1# [    1.293668] usbcore: registered new device driver usb

client1# [    1.325410] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver

client1# [    1.332076] uhci_hcd: USB Universal Host Controller Interface driver

client1# [    1.345717] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10

client1# [    1.350565] scsi host0: ata_piix

client1# [    1.351588] scsi host1: ata_piix

client1# [    1.352111] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc160 irq 14

client1# [    1.359569] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc168 irq 15

client1# [    1.382383] uhci_hcd 0000:00:01.2: UHCI Host Controller

client1# [    1.383350] uhci_hcd 0000:00:01.2: new USB bus registered, assigned bus number 1

client1# [    1.384588] uhci_hcd 0000:00:01.2: detected 2 ports

client1# [    1.385455] uhci_hcd 0000:00:01.2: irq 11, io base 0x0000c080

client1# [    1.386283] usb usb1: New USB device found, idVendor=1d6b, idProduct=0001

client1# [    1.387087] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1

client1# [    1.388053] usb usb1: Product: UHCI Host Controller

client1# [    1.388718] usb usb1: Manufacturer: Linux 4.14.56 uhci_hcd

client1# [    1.389479] usb usb1: SerialNumber: 0000:00:01.2

client1# [    1.390874] hub 1-0:1.0: USB hub found

client1# [    1.391333] hub 1-0:1.0: 2 ports detected

client1# [    1.421717] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11

client1# [    1.501578] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0

client1# [    1.520712] ata2.00: ATAPI: QEMU DVD-ROM, 2.5+, max UDMA/100

client1# [    1.522279] ata2.00: configured for MWDMA2

client1# [    1.523401] scsi 1:0:0:0: CD-ROM            QEMU     QEMU DVD-ROM     2.5+ PQ: 0 ANSI: 5

client1# [    1.570988] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray

client1# [    1.572127] cdrom: Uniform CD-ROM driver Revision: 3.20

client1# [    1.599331] random: fast init done

client1# [    1.600105] random: crng init done

client1# [    1.705327] 9pnet: Installing 9P2000 support

client1# [    1.719021] usb 1-1: new full-speed USB device number 2 using uhci_hcd

client1# [    1.824027] tsc: Refined TSC clocksource calibration: 2807.997 MHz

client1# [    1.825214] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2879c38a146, max_idle_ns: 440795318626 ns

client1# [    1.886563] usb 1-1: New USB device found, idVendor=0627, idProduct=0001

client1# [    1.887824] usb 1-1: New USB device strings: Mfr=1, Product=3, SerialNumber=5

client1# [    1.889001] usb 1-1: Product: QEMU USB Tablet

client1# [    1.889854] usb 1-1: Manufacturer: QEMU

client1# [    1.890421] usb 1-1: SerialNumber: 42

client1# [    1.908677] hidraw: raw HID events driver (C) Jiri Kosina

client1# [    1.924210] usbcore: registered new interface driver usbhid

client1# [    1.925158] usbhid: USB HID core driver

client1# [    1.926649] input: QEMU QEMU USB Tablet as /devices/pci0000:00/0000:00:01.2/usb1/1-1/1-1:1.0/0003:0627:0001.0001/input/input2

client1# [    1.928784] hid-generic 0003:0627:0001.0001: input,hidraw0: USB HID v0.01 Mouse [QEMU QEMU USB Tablet] on usb-0000:00:01.2-1/input0

client1# kbd_mode: KDSKBMODE: Inappropriate ioctl for device

client1# %Gstarting device mapper and LVM...

client1# [    1.946983] clocksource: Switched to clocksource acpi_pm

client1# checking /dev/vda...

client1# fsck (busybox 1.28.1)

client1# [fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/vda

client1# /dev/vda: recovering journal

client1# /dev/vda contains a file system with errors, check forced.

client1# /dev/vda: 215/32768 files (0.9% non-contiguous), 10159/131072 blocks

client1# [    2.000867] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100

client1# [    2.000867]

client1# [    2.002099] CPU: 0 PID: 1 Comm: init Not tainted 4.14.56 #1-NixOS

client1# [    2.002894] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014

client1# [    2.004795] Call Trace:

client1# [    2.005181]  dump_stack+0x5c/0x85

client1# [    2.005649]  panic+0xd5/0x232

client1# [    2.006057]  do_exit+0xad0/0xad0

client1# [    2.006496]  ? handle_mm_fault+0xaa/0x1e0

client1# [    2.006992]  do_group_exit+0x3a/0xa0

client1# [    2.007462]  SyS_exit_group+0x10/0x10

client1# [    2.007917]  do_syscall_64+0x67/0x120

client1# [    2.008409]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

client1# [    2.009033] RIP: 0033:0x7f0bc09bc0e8

client1# [    2.009527] RSP: 002b:00007ffe698ebd18 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7

client1# [    2.010470] RAX: ffffffffffffffda RBX: 000000000112f7a0 RCX: 00007f0bc09bc0e8

client1# [    2.011281] RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001

client1# [    2.012118] RBP: 0000000000000004 R08: 00000000000000e7 R09: ffffffffffffff80

client1# [    2.019450] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe698ec0c8

client1# [    2.020694] R13: 00007ffe698ec0c0 R14: 0000000000000000 R15: 0000000000000000

client1# [    2.021976] Kernel Offset: 0x39600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

client1# [    2.023635] Rebooting in 1 seconds..

client2: running command: test -e locked

client2: exit status 1

client2: running command: test -e locked

client2: exit status 1

vde_switch: EOF data port: Interrupted system call

client2: running command: test -e locked

client2: exit status 1

...

vcunat on 18 Jul 2018

Ah, bummer. I still suppose this error shows up earlier on, right?:

client1# [ 13.115644] EXT4-fs error (device vda): ext4_has_uninit_itable:3111: comm mount: Inode table for bg 0 marked as needing zeroing

dezgeg on 19 Jul 2018

Yes.
nfs3.log

vcunat on 19 Jul 2018

This is presumably fixed in commit "ext4: fix check to prevent initializing reserved inodes" which should be in 4.17.12, 4.14.60, and 4.9.117.

dezgeg on 3 Aug 2018

🎉2

I tested locally both tests before 4.14.60 and after (on 18.03), and the results support that.

vcunat on 3 Aug 2018

@GrahamcOfBorg test nfs3 nfs4

NeQuissimus on 4 Aug 2018

My local test runs are still going, my machine is pretty slow :( Let see what Borg says.
Or does it only react to PRs?

NeQuissimus on 4 Aug 2018

I can't see how it would figure out what commit to use on an issue. EDIT: Hydra example: https://hydra.nixos.org/job/nixos/release-18.03/nixos.tests.nfs3.x86_64-linux

vcunat on 4 Aug 2018

LGTM

NeQuissimus on 4 Aug 2018

Tests put back, Hydra seems OK :tada:

vcunat on 4 Aug 2018

🎉1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

libGL not working on non-NixOS (without setting up)

joepie91 · 102Comments

Julia Pkgs in NixOS

nico202 · 70Comments

Zero Hydra Failures for 17.09

globin · 65Comments

nixos/tests: Leftover tasks towards unified Python integration tests

tfc · 68Comments

Bug Fix Bonanza: Triage Vulnerabilities Since 16.03 (Vulnerability Roundup 1)

grahamc · 77Comments