This seems to affect every NixOS system with swraid since 15.09, both LVM RAID and mdraid.
Symptoms:
RAID arrays perform much slower than usual on writes.
A good test is postgresql/bin/pg_test_fsync; look at the fsync/s results. The first write(8k)-fsync() loop test shows over 100 fsync/s on normal spinning disks (ext4 over RAID1), and drops below 15 fsync/s with the degraded performance (I get 3.5 fsync/s on one disk for that test).
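For example, something along these lines is enough to get the numbers (just a sketch; it assumes the postgresql package puts pg_test_fsync on the PATH, and "/mnt/on-the-raid" stands for any directory on the affected array):

cd /mnt/on-the-raid                            # pg_test_fsync writes its test file in the current directory
nix-shell -p postgresql --run pg_test_fsync
# Look at the first write(8k)+fsync() results: well over 100 fsync/s is normal for
# spinning disks; single-digit numbers point to the degraded mode described here.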
Moreover, the kernel log will contain messages like these at boot:
[ 1.768469] md: md127 in immediate safe mode
[ 1.769362] md: md126 in immediate safe mode
See the same lines in the swraid NixOS test on Hydra, for 15.09: https://hydra.nixos.org/build/27544644/log/raw and for 16.03: https://hydra.nixos.org/build/35228170
This "immediate safe mode" seems to be for _shutdown_, writing metadata at every data write (some "sync" mode), for cases when the RAID cannot be stopped easily on shutdown (like rootfs). I read about this in this blog post from 2012: http://neil.brown.name/blog/20120615073245
Why this state gets set at boot, I don't know.
I have been requesting help from the LVM developers (for LVM RAID) in this regard: https://bugzilla.redhat.com/show_bug.cgi?id=1332221
Solution:
1) Our shutdown does not seem to deactivate the arrays. It probably should, or at least it should _activate_ that "immediate safe mode" at that point.
2) (LVM) I've also been told that running a broad "vgchange -ay" (for LVM RAIDs) at boot is not a safe thing to do; we should only activate the minimum set of LVs needed for boot ("lvchange -ay rootvg/rootlv") and leave the rest for systemd (stage 2).
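A rough sketch of both ideas (the "rootvg"/"rootlv" names are placeholders, and none of this is what our stage 1 or shutdown does today):

# stage 1: activate only the LV needed to mount the root filesystem...
lvm lvchange -ay rootvg/rootlv    # ...instead of a broad "lvm vgchange -ay"
# ...and let udev/systemd activate the remaining LVs in stage 2.
# shutdown: explicitly deactivate whatever can still be deactivated, so md does not
# have to fall back to "immediate safe mode" for arrays that were never stopped.
vgchange -an || true              # skips VGs that are still in use (e.g. the root VG)
mdadm --stop --scan || true       # stops md arrays that are no longer in use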
Once an LV (LVM RAID) has been started in that mode and runs slowly, it can be made fast again by deactivating it (lvchange -an) and activating it again (lvchange -ay). That requires unmounting first if it holds a filesystem, of course.
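For example (the VG/LV names are just placeholders for an affected LV):

umount /mnt/data                   # only needed if the LV holds a mounted filesystem
lvchange -an raid2/datalv          # deactivate the slow LV
lvchange -ay raid2/datalv          # reactivate it; it runs at normal speed again
mount /dev/raid2/datalv /mnt/data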
From what I could google, only NixOS is affected by these "immediate safe mode" messages at boot. So we have to improve all this.
I found what is putting the RAID into safe mode in stage 1. I added "boot.trace":
mai 05 11:13:12 comanegra stage-1-init: starting device mapper and LVM...
mai 05 11:13:12 comanegra stage-1-init: + lvm vgchange -ay raid2
mai 05 11:13:12 comanegra kernel: md: raid1 personality registered for level 1
mai 05 11:13:12 comanegra kernel: md/raid1:mdX: active with 2 out of 2 mirrors
mai 05 11:13:12 comanegra kernel: created bitmap (697 pages) for device mdX
mai 05 11:13:12 comanegra kernel: mdX: bitmap initialized from disk: read 44 pages, set 198 of 1426
mai 05 11:13:12 comanegra kernel: usb 3-2: new low-speed USB device number 2 using ohci-pci
mai 05 11:13:12 comanegra stage-1-init: 1 logical volume(s) in volume group "raid2" now active
..... MANY IRRELEVANT LINES
mai 05 11:13:12 comanegra stage-1-init: + mkdir -p /run/udev/rules.d
mai 05 11:13:12 comanegra stage-1-init: + echo ACTION=="add|change", SUBSYSTEM=="block", ENV{MAJOR}
mai 05 11:13:12 comanegra stage-1-init: + udevadm control --exit
mai 05 11:13:12 comanegra stage-1-init: + exec
mai 05 11:13:12 comanegra kernel: md: mdX in immediate safe mode
So, the "udevadm control --exit" before switch_to_root is interpreted as a system shutdown, and the RAIDs are put into immediate safe mode. Maybe we MAY NOT call "udevadm control --exit" at stage1.
cc @aszlig
Maybe it's this line that's actually triggering it:
# Kill any remaining processes, just to be sure we're not taking any
# with us into stage 2. But keep storage daemons like unionfs-fuse.
pkill -9 -v -f '@'
... since the kernel code is explicitly checking for signals:
if (signal_pending(current)) {
        if (mddev->pers->sync_request && !mddev->external) {
                printk(KERN_INFO "md: %s in immediate safe mode\n",
                       mdname(mddev));
                mddev->safemode = 2;
        }
        flush_signals(current);
}
I saw that kernel code, but I don't know how to interpret it. A signal to _what_ process would be triggering that?
I don't see any pkill in my "journalctl -b" (from what I copied the lines above).
Maybe it is something in stage 2 that triggers it... I made the correlation to the udevadm control --exit because it sounds like a shutdown.
Ah, I see the pkill in the stage1... the exec redirects the log. grmbl. I thought "boot.trace" would echo the whole stage-1. grr
From a quick look it seems to be called from a kernel thread called raid1d (or the like). It's actually quite surprising if that's true and sending random signals to kernel threads actually works and does something; but after all, the code seems prepared for that.
Tests can be done with "nix-build -A tests.installer.swraid nixos/release.nix", because the line appears there. Let's play with it.
I'll check which process has @ in its name. We should find a pkill line that does not kill the kernel processes (they have PPID=2, it seems).
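Something along these lines, perhaps (just a rough, untested sketch of that PPID idea):

for pid in $(pgrep -v -f '@'); do
    ppid=$(cut -d' ' -f4 "/proc/$pid/stat")    # field 4 of /proc/PID/stat is the PPID
    if [ "$pid" != 2 ] && [ "$ppid" != 2 ]; then
        # not kthreadd and not a child of kthreadd, so presumably not a kernel thread
        kill -9 "$pid"
    fi
done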
By the way, I confirm that it is the pkill. Checked with the swraid nixos test. Removing the pkill, the safe mode message disappears.
Any proposal for a new pkill line that does not signal the kernel processes?
I'd go for the boring solution since pkill doesn't seem capable of too complicated expressions:
for pid in $(pgrep -v -f '@'); do
    # Don't send signals to kernel threads as unexpected things can happen (Issue #15226)
    # Detection method from http://stackoverflow.com/questions/12213445/identifying-kernel-threads
    if [ "$(cat /proc/$pid/cmdline)" != "" ]; then
        kill -9 "$pid"
    fi
done
Not sure if the $() construct creates any temporary processes that the pgrep could catch, though...
A way to match kernel processes is to check for process group 0 and we already should have procfs available at that stage, so it could work with something like this:
kill -9 $(for pid in $(pgrep -v -f @); do [ "$(cut -d' ' -f5 "/proc/$pid/stat")" -ne 0 ] || continue; echo "$pid"; done)
@dezgeg: Ah, so we posted at roughly the same time. I've looked into the implementation of htop for how they identify kernel processes:
https://github.com/hishamhm/htop/blob/3a4c0fa2d60e2354bad25f74fc064c9abf658c48/linux/LinuxProcess.h#L121-L127
The @ is for this:
machine# 177 0 0:00 @bin/unionfs /nix/.rw-store=rw:/nix/.ro-store=ro /mnt-root/nix/store
Why is there a @?
Full ps before pkill:
machine# + ps
machine# PID USER TIME COMMAND
machine# 1 0 0:00 {init} /nix/store/dfhgcb3zhyq1cqkh64x95vvqwnlzqybh-extra-utils/bin/ash /init
machine# 2 0 0:00 [kthreadd]
machine# 3 0 0:00 [ksoftirqd/0]
machine# 4 0 0:00 [kworker/0:0]
machine# 5 0 0:00 [kworker/0:0H]
machine# 6 0 0:00 [kworker/u2:0]
machine# 7 0 0:00 [rcu_sched]
machine# 8 0 0:00 [rcu_bh]
machine# 9 0 0:00 [migration/0]
machine# 10 0 0:00 [kdevtmpfs]
machine# 11 0 0:00 [netns]
machine# 12 0 0:00 [perf]
machine# 13 0 0:00 [khungtaskd]
machine# 14 0 0:00 [writeback]
machine# 15 0 0:00 [ksmd]
machine# 16 0 0:00 [khugepaged]
machine# 17 0 0:00 [crypto]
machine# 18 0 0:00 [kintegrityd]
machine# 19 0 0:00 [bioset]
machine# 20 0 0:00 [kblockd]
machine# 21 0 0:00 [devfreq_wq]
machine# 22 0 0:00 [kworker/0:1]
machine# 23 0 0:00 [kswapd0]
machine# 24 0 0:00 [vmstat]
machine# 25 0 0:00 [fsnotify_mark]
machine# 36 0 0:00 [kthrotld]
machine# 37 0 0:00 [deferwq]
machine# 38 0 0:00 [kworker/u2:1]
machine# 54 0 0:00 tee -i /proc/self/fd/8
machine# 55 0 0:00 {init} /nix/store/dfhgcb3zhyq1cqkh64x95vvqwnlzqybh-extra-utils/bin/ash /init
machine# 89 0 0:00 [ata_sff]
machine# 90 0 0:00 [scsi_eh_0]
machine# 91 0 0:00 [scsi_tmf_0]
machine# 92 0 0:00 [scsi_eh_1]
machine# 93 0 0:00 [scsi_tmf_1]
machine# 94 0 0:00 [kworker/u2:2]
machine# 95 0 0:00 [kworker/u2:3]
machine# 96 0 0:00 [khvcd]
machine# 97 0 0:00 [kworker/0:2]
machine# 98 0 0:00 [bioset]
machine# 99 0 0:00 [bioset]
machine# 100 0 0:00 [bioset]
machine# 101 0 0:00 [kworker/0:3]
machine# 104 0 0:00 [kworker/0:1H]
machine# 145 0 0:00 [jbd2/vdb-8]
machine# 146 0 0:00 [ext4-rsv-conver]
machine# 148 0 0:00 [ext4lazyinit]
machine# 177 0 0:00 @bin/unionfs /nix/.rw-store=rw:/nix/.ro-store=ro /mnt-root/nix/store -o allow_other,cow,nonempty,chroot=/mnt-root,max_files=32768,hide_meta_files,dev,suid
machine# 196 0 0:00 ps
I prefer @dezgeg's method over the cut -f5 on stat, and I'm running it. It seems to work fine!
@viric, @dezgeg: This is quite error-prone, as every user space process can run with an empty cmdline as well; see here for a PoC:
with import <nixpkgs> {};
let
  compile = text: runCommand "test" { inherit text; } ''
    echo "$text" | gcc -o "$out" -xc -
  '';
  shady = compile ''
    #include <stdio.h>
    #include <unistd.h>
    int main(void) {
      char *argv[] = {"", NULL};
      if (fork() > 0) return 0;
      if (fork() > 0) return 0;
      execve("${compile ''
        #include <stdio.h>
        #include <unistd.h>
        int main(void) {
          FILE *f;
          f = fopen("pidfile", "w+");
          fprintf(f, "%d\n", getpid());
          fclose(f);
          return sleep(60);
        }
      ''}", argv, NULL);
      return 0;
    }
  '';
in vmTools.runInLinuxVM (runCommand "test" {} ''
  "${shady}"
  while [ ! -s pidfile ]; do sleep 0.01; done
  upid="$(< pidfile)"
  kpid=2
  echo "User space cmdline: \"$(< "/proc/$upid/cmdline")\""
  cat "/proc/$upid/stat"
  echo "Kthread cmdline: \"$(< /proc/$kpid/cmdline)\""
  cat "/proc/$kpid/stat"
'')
Output:
User space cmdline: ""
145 (p7ppn374d5yj4bz) S 1 0 0 0 -1 1077936128 57 0 0 0 0 0 0 0 20 0 1 0 114 4284416 46 18446744073709551615 4194304 4196812 140726941414416 140726941414120 139674915616640 0 0 4 0 1 0 0 17 0 0 0 0 0 0 6293968 6294568 27910144 140726941417414 140726941417415 140726941417415 140726941417415 0
Kthread cmdline: ""
2 (kthreadd) S 0 0 0 0 -1 2129984 0 0 0 0 0 0 0 0 20 0 1 0 4 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
OK, if you consider the cut -f5 of stat stable enough, let's go with it.
Next steps?
Well, there are plenty of ways to lose; the cut -f on /proc/$pid/stat will also stop working if the binary name contains spaces (since comm is the second field).
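If we really wanted to keep parsing stat, a sketch of a workaround could be to split only after the closing parenthesis of comm (just an idea, untested):

# comm (field 2) is wrapped in parentheses and may itself contain spaces,
# so strip everything up to the last ") " first; the remaining fields start at "state".
pgrp=$(sed 's/.*) //' "/proc/$pid/stat" | cut -d' ' -f3)    # state ppid pgrp ...
[ "$pgrp" -ne 0 ] && echo "$pid"                            # pgrp 0 => kernel thread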
@viric: I don't think it is... another way could be to use /proc/PID/exe with readlink, but we have to be careful that we don't match terminating processes... though on the other hand, we want to kill those processes anyway.
So I guess this one should work better:
kill -9 $(for pid in $(pgrep -v -f @); do readlink "/proc/$pid/exe" &> /dev/null || continue; echo "$pid"; done)
I wonder if it would be a better strategy to write this in a language which doesn't execute other programs as part of determining which programs to kill.
@grahamc: Yeah, it would probably be less racy, because if some of the PIDs acquired by the for loop terminate in between, the kill will fail.
OTOH, maybe we should use cgroups for that. IIUC the pkill is for terminating any still-running processes up to that point. If all of these processes were in the same cgroup, it would be way more selective and we would only kill the right processes.
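Purely illustrative sketch of that (it assumes a cgroup filesystem is already mounted in stage 1, which is not the case right now, and "stage-1-helpers"/"some-helper" are made-up names):

mkdir -p /sys/fs/cgroup/stage-1-helpers                    # dedicated cgroup for stage-1 helper processes
some-helper &                                              # every helper we spawn...
echo "$!" > /sys/fs/cgroup/stage-1-helpers/cgroup.procs    # ...gets moved into that cgroup
# later, just before switching root, kill only what ended up in there:
xargs -r kill -9 < /sys/fs/cgroup/stage-1-helpers/cgroup.procs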
@viric: Also note the commit message from c54eb790633c85bc22c4328e0f035cfa0be06780, which references https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/
I wonder if the processes can be correctly and thoroughly categorized using cgroups. Do you know whether that is the case? Also consider that users can very easily inject additional commands into stage 1, so we need to be careful.
Hm, the problem is that if we're running as root there is no way to prevent a program from circumventing it, except by using namespaces. Unfortunately, using namespaces here has other implications: for example, if we're running in a PID+user namespace we can't easily access the PIDs in the parent namespace, which some command hooks may need. And if that's the case, people might get confused by unexpected behavior (because in a PID namespace processes could have the same PID as processes in the root namespace).
So for now I think we should keep it at the following:
kill -9 $(for pid in $(pgrep -v -f '^@'); do readlink "/proc/$pid/exe" &> /dev/null || continue; echo "$pid"; done)
Note that I changed the regex in pgrep so that it only matches @ as the first character.
Also, I don't think that implementing this in another language would make it less racy, because we still need to gather all the PIDs and kill them one after another. A less racy way would be to use process groups, but again... a process is still able to switch to a new process group.
For the uninformed ... what is the danger of letting processes continue running and leak to the next stage?
I think it increases the chance that the initramfs will be used after the pivot, consuming a bit of RAM.
I think we should go with the simplest code. It's stage 1, root only... a very controlled environment. Let's just try to fix this RAID problem. No need for a one-liner, for my taste, @aszlig. :)
I'm running my RAID systems with @dezgeg's code for now. But we have to put something clean upstream, properly commented on why we kill processes, and why some and not others.
@viric: It doesn't need to be a one-liner; I just find forking kill that often a bit excessive. Here's the same in multiple lines:
for pid in $(pgrep -v -f '^@'); do
    readlink "/proc/$pid/exe" &> /dev/null || continue
    echo "$pid"
done | xargs kill -9
OK. Should I prepare a commit/PR with that, adding some comments to it? What is the next step?
@viric: Hm, I'm going to prepare a change to the installer test first, so that we have a regression test.
Great! I'll wait for it. Do you mean the swraid installer test?
@viric: Yep
@viric: Sorry it took so long, but here is the PR along with a fix: #15275
You were very fast! Now let's get this reviewed, accepted and backported to 16.03 quickly.