This seems to affect every NixOS system with swraid since 15.09, both LVM RAID and mdraid.
Symptoms:
RAID arrays perform much slower than usual on writes.
A good test is postgresql/bin/pg_test_fsync; look at the fsync/s results. The first write(8k)-fsync() loop test shows over 100 fsync/s on normal spinning disks (ext4 over RAID1), and drops below 15 fsync/s with the degraded performance (I get 3.5 fsync/s on one disk for that test).
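For example, something along these lines is enough to get the numbers (just a sketch; it assumes the postgresql package puts pg_test_fsync on the PATH, and "/mnt/on-the-raid" stands for any directory on the affected array):

cd /mnt/on-the-raid                            # pg_test_fsync writes its test file in the current directory
nix-shell -p postgresql --run pg_test_fsync
# Look at the first write(8k)+fsync() results: well over 100 fsync/s is normal for
# spinning disks; single-digit numbers point to the degraded mode described here.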
Moreover, the kernel log will contain messages like these at boot:
[ 1.768469] md: md127 in immediate safe mode
[ 1.769362] md: md126 in immediate safe mode
See the same lines in the swraid NixOS test on Hydra, for 15.09: https://hydra.nixos.org/build/27544644/log/raw and for 16.03: https://hydra.nixos.org/build/35228170
This "immediate safe mode" seems to be for _shutdown_, writing metadata at every data write (some "sync" mode), for cases when the RAID cannot be stopped easily on shutdown (like rootfs). I read about this in this blog post from 2012: http://neil.brown.name/blog/20120615073245
Why this state gets set at boot, I don't know.
I have been requesting help from the LVM developers (for LVM RAID) in this regard: https://bugzilla.redhat.com/show_bug.cgi?id=1332221
Solution:
1) Our shutdown does not seem to deactivate the arrays. It probably should, or at least it should _activate_ that "immediate safe mode" at that point.
2) (LVM) I've also been told that running a broad "vgchange -ay" (for LVM RAIDs) at boot is not a safe thing to do; we should only activate the minimum set of LVs needed for boot ("lvchange -ay rootvg/rootlv") and leave the rest for systemd (stage 2).
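A rough sketch of both ideas (the "rootvg"/"rootlv" names are placeholders, and none of this is what our stage 1 or shutdown does today):

# stage 1: activate only the LV needed to mount the root filesystem...
lvm lvchange -ay rootvg/rootlv    # ...instead of a broad "lvm vgchange -ay"
# ...and let udev/systemd activate the remaining LVs in stage 2.
# shutdown: explicitly deactivate whatever can still be deactivated, so md does not
# have to fall back to "immediate safe mode" for arrays that were never stopped.
vgchange -an || true              # skips VGs that are still in use (e.g. the root VG)
mdadm --stop --scan || true       # stops md arrays that are no longer in use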
Once an LV (LVM RAID) has been started in that mode and runs slowly, it can be made fast again by deactivating it (lvchange -an) and activating it again (lvchange -ay). That requires unmounting first if it holds a filesystem, of course.
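For example (the VG/LV names are just placeholders for an affected LV):

umount /mnt/data                   # only needed if the LV holds a mounted filesystem
lvchange -an raid2/datalv          # deactivate the slow LV
lvchange -ay raid2/datalv          # reactivate it; it runs at normal speed again
mount /dev/raid2/datalv /mnt/data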
From what I could google, only NixOS is affected by these "immediate safe mode" messages at boot. So we have to improve all this.
I found what is putting the RAID into safe mode in stage 1. I added "boot.trace":
mai 05 11:13:12 comanegra stage-1-init: starting device mapper and LVM...
mai 05 11:13:12 comanegra stage-1-init: + lvm vgchange -ay raid2
mai 05 11:13:12 comanegra kernel: md: raid1 personality registered for level 1
mai 05 11:13:12 comanegra kernel: md/raid1:mdX: active with 2 out of 2 mirrors
mai 05 11:13:12 comanegra kernel: created bitmap (697 pages) for device mdX
mai 05 11:13:12 comanegra kernel: mdX: bitmap initialized from disk: read 44 pages, set 198 of 1426
mai 05 11:13:12 comanegra kernel: usb 3-2: new low-speed USB device number 2 using ohci-pci
mai 05 11:13:12 comanegra stage-1-init: 1 logical volume(s) in volume group "raid2" now active
..... MANY IRRELEVANT LINES
mai 05 11:13:12 comanegra stage-1-init: + mkdir -p /run/udev/rules.d
mai 05 11:13:12 comanegra stage-1-init: + echo ACTION=="add|change", SUBSYSTEM=="block", ENV{MAJOR}
mai 05 11:13:12 comanegra stage-1-init: + udevadm control --exit
mai 05 11:13:12 comanegra stage-1-init: + exec
mai 05 11:13:12 comanegra kernel: md: mdX in immediate safe mode
So, the "udevadm control --exit" before switch_to_root is interpreted as a system shutdown, and the RAIDs are put into immediate safe mode. Maybe we MAY NOT call "udevadm control --exit" at stage1.
cc @aszlig
Maybe it's this line that's actually triggering it:
# Kill any remaining processes, just to be sure we're not taking any
# with us into stage 2. But keep storage daemons like unionfs-fuse.
pkill -9 -v -f '@'
... since the kernel code is explicitly checking for signals:
if (signal_pending(current)) {
        if (mddev->pers->sync_request && !mddev->external) {
                printk(KERN_INFO "md: %s in immediate safe mode\n",
                       mdname(mddev));
                mddev->safemode = 2;
        }
        flush_signals(current);
}
I saw that kernel code, but I don't know how to interpret it. A signal to _what_ process would be triggering that?
I don't see any pkill in my "journalctl -b" (from what I copied the lines above).
Maybe it is something in stage 2 that triggers it... I made the correlation to the udevadm control --exit because it sounds like a shutdown.
Ah, I see the pkill in the stage1... the exec redirects the log. grmbl. I thought "boot.trace" would echo the whole stage-1. grr
From a quick look it seems to be called from a kernel thread called raid1d (or the like). It's actually quite surprising if that's true and sending random signals to kernel threads actually works and does something; but after all, the code seems prepared for that.
Tests can be done with "nix-build -A tests.installer.swraid nixos/release.nix", because the line appears there. Let's play with it.
I'll check which process has @ in its name. We should find a pkill line that does not kill the kernel processes (they have PPID=2, it seems).
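Something along these lines, perhaps (just a rough, untested sketch of that PPID idea):

for pid in $(pgrep -v -f '@'); do
    ppid=$(cut -d' ' -f4 "/proc/$pid/stat")    # field 4 of /proc/PID/stat is the PPID
    if [ "$pid" != 2 ] && [ "$ppid" != 2 ]; then
        # not kthreadd and not a child of kthreadd, so presumably not a kernel thread
        kill -9 "$pid"
    fi
done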
By the way, I confirm that it is the pkill. Checked with the swraid nixos test. Removing the pkill, the safe mode message disappears.
Any proposal for a new pkill line that does not signal the kernel processes?
I'd go for the boring solution since pkill doesn't seem capable of too complicated expressions:
for pid in $(pgrep -v -f '@'); do
    # Don't send signals to kernel threads as unexpected things can happen (Issue #15226)
    # Detection method from http://stackoverflow.com/questions/12213445/identifying-kernel-threads
    if [ "$(cat /proc/$pid/cmdline)" != "" ]; then
        kill -9 "$pid"
    fi
done
Not sure if the $() construct creates any temporary processes that the pgrep could catch, though...
A way to match kernel processes is to check for process group 0 and we already should have procfs available at that stage, so it could work with something like this:
kill -9 $(for pid in $(pgrep -v -f @); do [ "$(cut -d' ' -f5 "/proc/$pid/stat")" -ne 0 ] || continue; echo "$pid"; done)
@dezgeg: Ah, so we posted at roughly the same time. I've looked into the implementation of htop for how they identify kernel processes:
https://github.com/hishamhm/htop/blob/3a4c0fa2d60e2354bad25f74fc064c9abf658c48/linux/LinuxProcess.h#L121-L127
The @ is for this:
machine# 177 0 0:00 @bin/unionfs /nix/.rw-store=rw:/nix/.ro-store=ro /mnt-root/nix/store
Why is there a @?
Full ps before pkill:
machine# + ps
machine# PID USER TIME COMMAND
machine# 1 0 0:00 {init} /nix/store/dfhgcb3zhyq1cqkh64x95vvqwnlzqybh-extra-utils/bin/ash /init
machine# 2 0 0:00 [kthreadd]
machine# 3 0 0:00 [ksoftirqd/0]
machine# 4 0 0:00 [kworker/0:0]
machine# 5 0 0:00 [kworker/0:0H]
machine# 6 0 0:00 [kworker/u2:0]
machine# 7 0 0:00 [rcu_sched]
machine# 8 0 0:00 [rcu_bh]
machine# 9 0 0:00 [migration/0]
machine# 10 0 0:00 [kdevtmpfs]
machine# 11 0 0:00 [netns]
machine# 12 0 0:00 [perf]
machine# 13 0 0:00 [khungtaskd]
machine# 14 0 0:00 [writeback]
machine# 15 0 0:00 [ksmd]
machine# 16 0 0:00 [khugepaged]
machine# 17 0 0:00 [crypto]
machine# 18 0 0:00 [kintegrityd]
machine# 19 0 0:00 [bioset]
machine# 20 0 0:00 [kblockd]
machine# 21 0 0:00 [devfreq_wq]
machine# 22 0 0:00 [kworker/0:1]
machine# 23 0 0:00 [kswapd0]
machine# 24 0 0:00 [vmstat]
machine# 25 0 0:00 [fsnotify_mark]
machine# 36 0 0:00 [kthrotld]
machine# 37 0 0:00 [deferwq]
machine# 38 0 0:00 [kworker/u2:1]
machine# 54 0 0:00 tee -i /proc/self/fd/8
machine# 55 0 0:00 {init} /nix/store/dfhgcb3zhyq1cqkh64x95vvqwnlzqybh-extra-utils/bin/ash /init
machine# 89 0 0:00 [ata_sff]
machine# 90 0 0:00 [scsi_eh_0]
machine# 91 0 0:00 [scsi_tmf_0]
machine# 92 0 0:00 [scsi_eh_1]
machine# 93 0 0:00 [scsi_tmf_1]
machine# 94 0 0:00 [kworker/u2:2]
machine# 95 0 0:00 [kworker/u2:3]
machine# 96 0 0:00 [khvcd]
machine# 97 0 0:00 [kworker/0:2]
machine# 98 0 0:00 [bioset]
machine# 99 0 0:00 [bioset]
machine# 100 0 0:00 [bioset]
machine# 101 0 0:00 [kworker/0:3]
machine# 104 0 0:00 [kworker/0:1H]
machine# 145 0 0:00 [jbd2/vdb-8]
machine# 146 0 0:00 [ext4-rsv-conver]
machine# 148 0 0:00 [ext4lazyinit]
machine# 177 0 0:00 @bin/unionfs /nix/.rw-store=rw:/nix/.ro-store=ro /mnt-root/nix/store -o allow_other,cow,nonempty,chroot=/mnt-root,max_files=32768,hide_meta_files,dev,suid
machine# 196 0 0:00 ps
I prefer @dezgeg's method over the cut -f5 on stat, and I'm running it. It seems to work fine!
@viric, @dezgeg: This is quite error-prone, as every user space process can run with an empty cmdline as well; see here for a PoC:
with import <nixpkgs> {};
let
  compile = text: runCommand "test" { inherit text; } ''
    echo "$text" | gcc -o "$out" -xc -
  '';
  shady = compile ''
    #include <stdio.h>
    #include <unistd.h>
    int main(void) {
      char *argv[] = {"", NULL};
      if (fork() > 0) return 0;
      if (fork() > 0) return 0;
      execve("${compile ''
        #include <stdio.h>
        #include <unistd.h>
        int main(void) {
          FILE *f;
          f = fopen("pidfile", "w+");
          fprintf(f, "%d\n", getpid());
          fclose(f);
          return sleep(60);
        }
      ''}", argv, NULL);
      return 0;
    }
  '';
in vmTools.runInLinuxVM (runCommand "test" {} ''
  "${shady}"
  while [ ! -s pidfile ]; do sleep 0.01; done
  upid="$(< pidfile)"
  kpid=2
  echo "User space cmdline: \"$(< "/proc/$upid/cmdline")\""
  cat "/proc/$upid/stat"
  echo "Kthread cmdline: \"$(< /proc/$kpid/cmdline)\""
  cat "/proc/$kpid/stat"
'')
Output:
User space cmdline: ""
145 (p7ppn374d5yj4bz) S 1 0 0 0 -1 1077936128 57 0 0 0 0 0 0 0 20 0 1 0 114 4284416 46 18446744073709551615 4194304 4196812 140726941414416 140726941414120 139674915616640 0 0 4 0 1 0 0 17 0 0 0 0 0 0 6293968 6294568 27910144 140726941417414 140726941417415 140726941417415 140726941417415 0
Kthread cmdline: ""
2 (kthreadd) S 0 0 0 0 -1 2129984 0 0 0 0 0 0 0 0 20 0 1 0 4 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
OK, if you consider the cut -f5 of stat stable enough, let's go with it.
Next steps?
Well, there are plenty of ways to lose; the cut -f on /proc/$pid/stat will also stop working if the binary name contains spaces (since comm is the second field).
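If we really wanted to keep parsing stat, a sketch of a workaround could be to split only after the closing parenthesis of comm (just an idea, untested):

# comm (field 2) is wrapped in parentheses and may itself contain spaces,
# so strip everything up to the last ") " first; the remaining fields start at "state".
pgrp=$(sed 's/.*) //' "/proc/$pid/stat" | cut -d' ' -f3)    # state ppid pgrp ...
[ "$pgrp" -ne 0 ] && echo "$pid"                            # pgrp 0 => kernel thread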
@viric: I don't think it is... another way could be to use /proc/PID/exe with readlink, but we have to be careful that we don't match terminating processes... though on the other hand, we want to kill those processes anyway.
So I guess this one should work better:
kill -9 $(for pid in $(pgrep -v -f @); do readlink "/proc/$pid/exe" &> /dev/null || continue; echo "$pid"; done)
I wonder if it would be a better strategy to write this in a language which doesn't execute other programs as part of determining which programs to kill.
@grahamc: Yeah, it would probably be less racy, because if some of the PIDs acquired by the for loop terminate in between, the kill will fail.
OTOH, maybe we should use cgroups for that. IIUC the pkill is for terminating any still-running processes up to that point. If all of these processes were in the same cgroup, it would be way more selective and we would only kill the right processes.
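Purely illustrative sketch of that (it assumes a cgroup filesystem is already mounted in stage 1, which is not the case right now, and "stage-1-helpers"/"some-helper" are made-up names):

mkdir -p /sys/fs/cgroup/stage-1-helpers                    # dedicated cgroup for stage-1 helper processes
some-helper &                                              # every helper we spawn...
echo "$!" > /sys/fs/cgroup/stage-1-helpers/cgroup.procs    # ...gets moved into that cgroup
# later, just before switching root, kill only what ended up in there:
xargs -r kill -9 < /sys/fs/cgroup/stage-1-helpers/cgroup.procs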
@viric: Also note the commit message from c54eb790633c85bc22c4328e0f035cfa0be06780, which references https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/
I wonder if the processes can be correctly and thoroughly categorized using cgroups. Do you know whether that is the case? Also consider that users can very easily inject additional commands into stage 1, so we need to be careful.
Hm, the problem is that if we're running as root there is no way to prevent a program from circumventing it, except by using namespaces. Unfortunately, using namespaces here has other implications: for example, if we're running in a PID+user namespace we can't easily access the PIDs in the parent namespace, which some command hooks may need. And if that's the case, people might get confused by unexpected behavior (because in a PID namespace processes could have the same PID as processes in the root namespace).
So for now I think we should keep it at the following:
kill -9 $(for pid in $(pgrep -v -f '^@'); do readlink "/proc/$pid/exe" &> /dev/null || continue; echo "$pid"; done)
Note that I changed the regex in pgrep so that it only matches @ as the first character.
Also, I don't think that implementing this in another language would make it less racy, because we still need to gather all the PIDs and kill them one after another. A less racy way would be to use process groups, but again... a process is still able to switch to a new process group.
For the uninformed ... what is the danger of letting processes continue running and leak to the next stage?
I think it increases the chance that the initramfs will be used after the pivot, consuming a bit of RAM.
I think we should go with the simplest code. It's stage 1, root only... a very controlled environment. Let's just try to fix this RAID problem. No need for a one-liner, for my taste, @aszlig. :)
I'm running my RAID systems with @dezgeg's code for now. But we have to put something clean upstream, properly commented on why we kill processes, and why some and not others.
@viric: It doesn't need to be a one-liner; I just find forking kill that often a bit excessive. Here's the same in multiple lines:
for pid in $(pgrep -v -f '^@'); do
    readlink "/proc/$pid/exe" &> /dev/null || continue
    echo "$pid"
done | xargs kill -9
OK. Should I prepare a commit/PR with that, adding some comments to it? What is the next step?
@viric: Hm, I'm going to prepare a change to the installer test first, so that we have a regression test.
Great! I'll wait for it. Do you mean the swraid installer test?
@viric: Yep
@viric: Sorry it took so long, but here is the PR along with a fix: #15275
You were very fast! Now let's get this reviewed, accepted and backported to 16.03 quickly.