Describe the bug
When an md device is in use (most likely because two or more block
devices are set up in a software RAID configuration), the mdadm
package is installed. It is used mainly at boot time to assemble
these block devices and expose the md device.
In 19.03, even though the version of the tool was the same, no
systemd service was installed.
In 19.09, a commit ( #b9b27912 ) explicitly enabled some systemd
service units, which results in the mdmonitor.service unit being
installed in the configuration.
The problem with that commit is that the installed unit has a
completely bogus configuration (i.e. it does not work on a NixOS system),
which leaves the system in a degraded state.
The installed unit has the following source:
```
# systemctl cat mdmonitor
# /nix/store/108j7v1j85c97xwxfgvz11zak49h78x5-mdadm-4.1/lib/systemd/system/mdmonitor.service
# This file is part of mdadm.
#
# mdadm is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
[Unit]
Description=MD array monitor
DefaultDependencies=no
[Service]
Environment= MDADM_MONITOR_ARGS=--scan
EnvironmentFile=-/run/sysconfig/mdadm
ExecStartPre=-/usr/lib/mdadm/mdadm_env.sh
ExecStart=/nix/store/108j7v1j85c97xwxfgvz11zak49h78x5-mdadm-4.1/sbin/mdadm --monitor $MDADM_MONITOR_ARGS
```
And the log of the failing service is:
```
mdadm[5723]: mdadm: No mail address or alert command - not monitoring.
systemd[1]: Starting MD array monitor...
systemd[1]: Started MD array monitor.
systemd[1]: mdmonitor.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: mdmonitor.service: Failed with result 'exit-code'.
systemd[1]: Starting MD array monitor...
systemd[6632]: mdmonitor.service: Executable /usr/lib/mdadm/mdadm_env.sh missing, skipping: No such file or directory
systemd[1]: Started MD array monitor.
mdadm[6633]: mdadm: No mail address or alert command - not monitoring.
systemd[1]: mdmonitor.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: mdmonitor.service: Failed with result 'exit-code'.
```
It appears that, apart from the clearly wrong path expectations, this
unit wants some other missing parameters, such as the email address of
the person to contact.
To Reproduce
Steps to reproduce the behavior:
Configure a 19.09 system with a RAID device
Expected behavior
For the monitoring service to run or for its unit to be missing.
Metadata
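- system: `"x86_64-linux"`
- host os: `Linux 5.3.8, NixOS, 19.09pre-git (Loris)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.3`
- channels(root): `"nixos-18.09.2526.ce0ec21d6cc"`
- nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`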
Maintainer information:
# a list of nixpkgs attributes affected by the problem
attribute:
- mdadm
# a list of nixos modules affected by the problem
module:
cc: @abbradar
Can confirm, I just updated the CI server for static-haskell-nix from 19.03 to 19.09 and found that the new service mdmonitor.service
fails at startup.
I don't know what mdadm_env.sh typically contains, but it is optional. (I don't need to pass extra env into the service.)
I've been running with this config since starting to use mdadm, so I have not encountered this issue:
```
environment.etc."mdadm.conf".text = ''
MAILADDR root
'';
```
One option is to let NixOS configure /etc/mdadm.conf. But not all systems have email sending capabilities, so it might hide the fact that notifications will not be sent (or received). Another option is to have mdmonitor.service be opt-in. I think I like that best.
Could someone please say what the status of this is? The workaround works, but someone who has an mdraid shouldn't have to go searching the net to find a solution to something that should be working out of the box.
What should be the default behaviour? No monitoring or monitoring with notification to (perhaps effectively) /dev/null? (At least the current behaviour makes it obvious that monitoring is off :-D)
My preferred solution is to opt-in to mdmonitor.service. That also sidesteps the issue of whether NixOS or the user should own /etc/mdadm.conf. The mdmonitor service comes from upstream via systemd.packages = [ pkgs.mdadm ]. I _think_ NixOS has a way to override these, but I don't know how.
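On the override question: declaring `systemd.services.<name>` for a unit that already comes from `systemd.packages` is generally expected to produce a drop-in that is merged with the upstream unit, rather than a replacement. A minimal sketch under that assumption, also assuming the unit keeps reading `MDADM_MONITOR_ARGS` as in the unit source quoted in the report. The `root` recipient is a placeholder, and this has not been tested against 19.09.

```nix
{
  # Assumption: NixOS turns this into a drop-in for the upstream
  # mdmonitor.service, so the packaged ExecStart stays in place.
  systemd.services.mdmonitor = {
    # A later Environment= assignment overrides the upstream "--scan" default.
    # "-m root" gives mdadm --monitor a mail address; whether mail to root
    # is actually delivered depends on the machine.
    environment.MDADM_MONITOR_ARGS = "--scan -m root";
  };
}
```

If this works as assumed, the service would no longer exit with "No mail address or alert command", but it still would not be opt-in.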
@bjornfor You're right and I can see your point, but just letting a service fail without more information is not really user-friendly. Maybe a warning that explicitly points out why the service is failing? Or maybe something in the documentation instead?
> Maybe a warning that explicitly points out why the service is failing?
The service says "mdadm: No mail address or alert command - not monitoring." right before exiting. Improving on that message must be done upstream.
"Bjørn" == Bjørn Forsman notifications@github.com writes:
>> Maybe a warning that explicitly points out why the service is failing?
Bjørn> The service says "mdadm: No mail address or alert command -
Bjørn> not monitoring." right before exiting. Improving on that
Bjørn> message must be done upstream.
What? The issue here is that a service I cannot configure via Nix gets
set up without a warning, and that causes a system using mdadm to
appear as degraded. The right thing to do, in my opinion, is to:
1) add a configuration entry where the service can be disabled
2) if it's enabled, require at least another attribute to be filled in
for the mail address
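The two points above could be sketched as a small NixOS module. Everything here is an assumption for illustration: the `services.mdmonitor` option namespace and the `mdadm-monitor` unit name are hypothetical and are not an existing NixOS module. The sketch also assumes the upstream unit is no longer pulled in through `systemd.packages`.

```nix
{ config, lib, pkgs, ... }:

let
  # Hypothetical option namespace, not an existing NixOS module.
  cfg = config.services.mdmonitor;
in
{
  options.services.mdmonitor = {
    # Point 1: the service is off unless explicitly enabled.
    enable = lib.mkEnableOption "monitoring of md arrays with mdadm";

    # Point 2: a mail address is required once the service is enabled.
    mailAddress = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      example = "root";
      description = "Address that mdadm sends array alerts to.";
    };
  };

  config = lib.mkIf cfg.enable {
    # Fail at evaluation time instead of at service start.
    assertions = [
      {
        assertion = cfg.mailAddress != null;
        message = "services.mdmonitor.mailAddress must be set when services.mdmonitor.enable is true.";
      }
    ];

    # Hypothetical unit name, chosen so it does not clash with the upstream unit.
    systemd.services.mdadm-monitor = {
      description = "MD array monitor";
      wantedBy = [ "multi-user.target" ];
      serviceConfig.ExecStart =
        "${pkgs.mdadm}/bin/mdadm --monitor --scan -m ${toString cfg.mailAddress}";
    };
  };
}
```

With a module along these lines, a user would set `services.mdmonitor.enable = true;` and `services.mdmonitor.mailAddress = "root";`, and leaving out the address would be reported when the configuration is evaluated.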
@bjornfor While you're correct that the service alerts and fails with that message, it doesn't mean that the NixOS user knows how to solve it unless they go searching for the above solution you provided here in this very GitHub ticket. The point is that if we want to make NixOS more accessible to people who aren't that "deep" into NixOS (and I can appreciate that "deep" is a relative term here), then we must provide clear hints/pointers to possible issues and their solutions, be it in the documentation or by alerting the user during eval time.
Since NixOS is so different from other systems I'd argue it is very much our responsibility to provide guidance on how to solve possible issues, upstream or not.
Another problem:
When you change e.g. environment.etc."mdadm.conf".text, then mdmonitor.service is not restarted.
Also, once it has failed a few times, it will not restart at all, and in the absence of a NixOS service module one cannot override it to restart forever.
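Both points could be addressed by a unit that NixOS owns. A rough sketch, assuming a separately defined service (the `mdadm-monitor` name is hypothetical and this is not the upstream `mdmonitor.service`) and assuming mdadm reads `/etc/mdadm.conf` as in the workaround earlier in the thread:

```nix
{ config, pkgs, ... }:

{
  environment.etc."mdadm.conf".text = ''
    MAILADDR root
  '';

  # Hypothetical unit name; a separate unit, not the upstream mdmonitor.service.
  systemd.services.mdadm-monitor = {
    description = "Monitor RAID disks";
    wantedBy = [ "multi-user.target" ];

    # Restart the unit on switch when the generated mdadm.conf changes.
    restartTriggers = [ config.environment.etc."mdadm.conf".source ];

    # Disable systemd's start rate limit so it never gives up restarting.
    unitConfig.StartLimitIntervalSec = 0;

    serviceConfig = {
      ExecStart = "${pkgs.mdadm}/bin/mdadm --monitor --scan";
      Restart = "always";
      RestartSec = 30;
    };
  };
}
```

The 30 second delay is an arbitrary choice to avoid a tight restart loop when the configuration is broken.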
Where is the service defined again?
I think we should disable the automatic usage of the upstream systemd unit, defined here:
Because it does not even seem to be correct (for our use cases): it does not get started as part of multi-user.target, so it is not run on switch-to-configuration or nixops deploy. For an important alerting service like mdadm, that is bad.
Also, does anybody know what the NIXOS=1 does? Edit: that's in the patch here.
Do we have means to only pull a NixOS unit in if udev has detected a RAID device but still trigger that detection during activation?
Also we might want to set up a default delivery command or set up local mail delivery for cases like these. We actually lack any way of informing the admin if things are going wrong in the background. We only have the systemd journal that users/admins have to scrape through.
IMO we should just remove this unit for now, as it being on by default is just breakage for lots of people. We may want to reintroduce it with proper configuration as a NixOS module, and maybe even have the NixOS module be auto-activated if things in the configuration that look like a RAID setup are there so that the user is forced to set an email address and doesn't just not know when things go wrong, but… for now this looks like a net liability?
FWIW, the thing I personally use, that doesn't require polluting /etc:
```
{
  systemd.services.mdadm-monitor = {
    description = "Monitor RAID disks";
    wantedBy = [ "multi-user.target" ];
    script = "${pkgs.mdadm}/bin/mdadm --monitor -m root /dev/md126 /dev/md127";
  };
}
```
I agree the unit file should be removed for now. It fails out of the box and terminates, so it does not even run.
On 18:54 15.04.20, Niklas Hambüchen wrote:
> I agree the unit file should be removed for now. It fails out of the
> box and terminates, so it does not even run.
I am not sure removing it is a net win. It surely removes the pressure
for anyone to look into it. We can't reasonably detect from the NixOS
configuration whether such a device exists. They are usually automatically
assembled and then just exist as yet another disk in /dev/mapper.
The way it is currently handled is that it is spawned through udev whenever a
device is found that "needs" monitoring.
Having defaults that just work would make a lot of sense to me. My ideal
solution (if I had a wish) would be to have local mail delivery to root
and just set that as the default for the service. The service would still
only be started through udev.
That all being said I've just deactivated the unit via:
systemd.units."mdmonitor.service".enable = false;
I just ran into this on my personal machine and I think I can work out a patch acceptable to the various interests here. Not sure what the time frame will be, though.