Nixpkgs: busybox fsck breaks NixOS's fsck handling logic

Created on 8 May 2018 · 9Comments · Source: NixOS/nixpkgs

This code in stage-1-init.sh runs fsck $fsckFlags "$device":

https://github.com/NixOS/nixpkgs/blob/c2b668ee7264000b10d554754150ea9305b6ef3d/nixos/modules/system/boot/stage-1-init.sh#L260-L277

This fsck seems to be the generic wrapper that calls the corresponding fsck for the corresponding filesystem type, like fsck.ext4.

Note in the above code how NixOS has (correct) code to handle the exit code, e.g. when the bits of decimal 2 or 4 are set in the exit code (see man fsck).

But the generic fsck wrapper seems to not correctly pass through the exit code.

On my server I get this boot failure:

Checking /dev/disk/by-label/root...
fsck (busybox 1.27.2)
[fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/disk/by-label/root
root: Superblock hint for external superblock should be 0x97f.  FIXED.
root: 73146/122093568 files (0.5% non-contiguous), 8267968/488345344 blocks
Warning: fsck.ext4 /dev/disk/by-label/root terminated by signal 1
fsck on /dev/disk/by-label/root failed.

An error occurred in stage 1 of the boot process, which must mount the
root filesystem on '/mnt-root' and then start stage 2. Press one
of the following keys:

  r) to reboot immediately
  *) to ignore the error and continue

Here fsck.ext4 found a fixable problem and fixed it, then exiting with exit code 1 as expected.
But the generic fsck wrapper tuns that into exit code 8.

As a result, in stage-1-init.sh the wrong if-branch is taken,
and my server is stuck waiting for a key to be pressed instead of booting through.

Source

nh2

Most helpful comment

Upstream patch is at http://lists.busybox.net/pipermail/busybox/2018-May/086417.html

nh2 on 8 May 2018

👍2

All 9 comments

A workaround is to replace the line

fsck $fsckFlags "$device"

fsck.ext4 $fsckFlags "$device"

to directly invoke the fsck for your file system type (in my case ext4) without the generic fsck wrapper.

But we should really find out what the fsck wrapper is doing wrong. What code is running here, is this wrapper from busybox as suggested by the error message?

nh2 on 8 May 2018

FWIW I had issues with busybox utils in stage1 that were "resolved" by using utillinux instead for things like mount, etc. This was w/musl (and shouldn't happen, busybox+musl should work great!) but it would be interesting to see if the same workaround worked here as well.

Is there a NixOS test we can put together to explore this?
Also it says "terminated by signal 1" which seems curious-- it often means exit code lower bits will be 1 but doesn't seem normal behavior. Who's sending SIGHUP to fsck?

Also, FWIW:
https://git.busybox.net/busybox/tree/e2fsprogs/fsck.c?id=72089cf6b4a77214ec4fd21d5ee5bf56958781cb#n70

Which seems to suggest '8' means fsck exit error.

dtzWill on 8 May 2018

Looks like it's a bug in busybox.

I have found the source code in busybox that emits the message

Warning: fsck.ext4 /dev/disk/by-label/root terminated by signal 1

It is https://git.busybox.net/busybox/tree/e2fsprogs/fsck.c?h=1_28_3#n456

 child_died:

    status = WEXITSTATUS(status);
    if (WIFSIGNALED(status)) {
        sig = WTERMSIG(status);
        status = EXIT_UNCORRECTED;
        if (sig != SIGINT) {
            printf("Warning: %s %s terminated "
                "by signal %d\n",
                inst->prog, inst->device, sig);
            status = EXIT_ERROR;
        }
    }

Luckily it is easy to distinguish this from the upstream, non-busybox e2progs fsck wrapper, because that one has the following source code instead: https://github.com/tytso/e2fsprogs/blob/d5bd126cd6a06dd2a8652557776ea298fb442912/misc/fsck.c#L619-L620

The message is different (with signal vs by signal).

Suspiciously, this code is handles the case where fsck was terminated by a signal (WIFSIGNALED), but in my case, fsck.ext4 simply exits with code 1, without any signal being involved. So I suspect that somehow the wrong code path is taken.

I have not confirmed whether the upstream fsck wrapper, too, suffers from the bug of not forwarding the exit code of fsck.ext4 correctly. And I cannot easily try it, because I can't run it from the busybox shell.

But I am reasonably sure that this issue only exists in busybox's fsck wrapper, and that it was introduced in this commit:

https://git.busybox.net/busybox/commit/?id=c4fb8c6ad52e8007c6fa07e40f043bb2e0c043d1

@@ -439,9 +446,8 @@ static int wait_one(int flags)
    }
  child_died:

-   if (WIFEXITED(status))
-       status = WEXITSTATUS(status);
-   else if (WIFSIGNALED(status)) {
+   status = WEXITSTATUS(status);
+   if (WIFSIGNALED(status)) {
        sig = WTERMSIG(status);
        status = EXIT_UNCORRECTED;
        if (sig != SIGINT) {

This change busybox changed the semantics of the original fsck code.

You can only reasonably call WEXITSTATUS() on status if you checked WIFEXITED() before, because as documented in waitstatus.h:

/* If WIFEXITED(STATUS), the low-order 8 bits of the status.  */
#define __WEXITSTATUS(status)   (((status) & 0xff00) >> 8)

Doing status = WEXITSTATUS(status); unconditionally will mask away the parts of status that indicate whether a signal was raised, so that afterwards WIFSIGNALED() is true even if no signal was raised.

So that's why busybox's fsck doesn't forward fsck.ext4's status code correctly.

nh2 on 8 May 2018

👍1

I have written a patch for busybox that fixes the issue (https://github.com/nh2/busybox/commit/b9dbf69a5ad0cb585392a88853a729ee159366c7), tested that it fixes the reboot hang for my NixOS server, and submitted it to the upstream mailing list.

On NixOS I'm using

{
  busybox = super.busybox.overrideAttrs (oldAttrs: rec {
    patches = oldAttrs.patches ++ [
      ./fsck-Fix-incorrect-handling-of-child-exit-code.patch
    ];
  });
}

nh2 on 8 May 2018

Upstream patch is at http://lists.busybox.net/pipermail/busybox/2018-May/086417.html

nh2 on 8 May 2018

👍2

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

Search for maintainers and people that previously touched the related code and @ mention them in a comment.
Ask on the NixOS Discourse.
Ask on the #nixos channel on irc.freenode.net.