At the time of writing, trying to connect to https://hydra.nixos.org/ gives the following error:
DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=hydra;host=10.254.1.2;user=hydra;','',...) failed: could not connect to server: Connection timed out
Is the server running on host "10.254.1.2" and accepting
TCP/IP connections on port 5432? at /nix/store/bf666x0vqgdaxfal1ywvi9any2vx9q7x-hydra-perl-deps/lib/perl5/site_perl/5.30.0/DBIx/Class/Storage/DBI.pm line 1517. at /nix/store/8271gj7cpcbz4mf6s0kc6905fnfh1bng-hydra-0.1.20191108.4779757/libexec/hydra/lib/Hydra/Helper/CatalystUtils.pm line 418
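For anyone wanting to check the symptom themselves, a quick reachability test against the database host/port from the error above would be something along these lines (a sketch; it would have to be run from a machine that can reach that internal address, e.g. the Hydra frontend):
pg_isready -h 10.254.1.2 -p 5432   # reports whether postgres answers on that host/port
nc -vz -w 5 10.254.1.2 5432        # or a plain TCP check if the postgres client tools aren't installed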
cc @edolstra @grahamc @FRidh (not sure who else is on the infrastructure team)
Hey Jon, thanks for the ping. I noticed this too at about 17:15UTC, when the alarms about failing channel bump jobs started happening:
I went to check out the log on the server which runs these updates, and saw:
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: path=$(curl -LH 'Accept: application/json' https://hydra.nixos.org/job/nix-pills/master/html-split/latest | jq -re '.buildoutputs.out.path') && \
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: nix-store -r $path && \
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: ln -sfn $path/share/doc/nix-pills nixos/nix-pills-raw
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: Dload Upload Total Spent Left Speed
Dec 20 17:02:22 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: [10.1K blob data]
Dec 20 17:02:22 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: make: *** [Makefile:50: nixos/nix-pills-raw] Error 1
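(To reproduce that failing step by hand, the request from the log above boils down to roughly this:)
curl -LH 'Accept: application/json' https://hydra.nixos.org/job/nix-pills/master/html-split/latest | jq -re '.buildoutputs.out.path'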
so I just tested connecting to the hydra.nixos.org server (ceres), which went well. I then started to suspect a database server (chef) problem. I tried to SSH in, and couldn't connect. I took a look at the metrics for chef, and ... well:
https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&var-instance=chef:9100&from=1576857600000&to=1576861200000
it is down and has been for a bit.
I have since escalated to @edolstra and @rbvermaa, who have keys to this particular part of the castle.
I don't see any related incidents: https://www.hetzner-status.de/en.html
@grahamc thanks for the quick response :)
Also, I sent you an email :) to [email protected]; please take a look.
Tried to reach the database server (chef), but couldn't. Performing a hard reset now.
Machine did not come up with hard reset, nor with rescue system. Engaged Hetzner support.
This is incredible timing. The infra team was discussing just last week getting a replacement machine for the database, because:
We had put in the request for the new machine yesterday :).
I updated the "hydra is down" error page at https://hydra.nixos.org/ in https://github.com/NixOS/nixos-org-configurations/commit/72e03ec41f2e146de9c972cb9b9731584079daba to link to the monitoring infra and to tickets with the label "infrastructure". The infra team is discussing options in case Hetzner support doesn't reply soon.
We have heard back from support about the database machine (chef); they say an SSD has failed. The machine has been booted into a rescue console to find out whether the failed disk was part of the root filesystem or the /data array.
We brought chef up successfully and the /data mirror is indeed degraded. We're considering next steps.
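For reference, a quick way to check the state of the degraded mirror (the md device name matches the mdadm output further down) is roughly:
cat /proc/mdstat            # overview of all md arrays, including any rebuild progress
mdadm --detail /dev/md127   # per-array detail: which member failed, which is still active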
We're preparing to take a backup directly off the disk, and then we will replace the failed disk.
The first, block-wise backup is on its way:
pv /dev/md127 | gzip | mbuffer -q | ssh «remote» 'mbuffer -q | gzip -d > /hydradbrecv/storage/dd'
transferring at roughly 100MiB/s, with an ETA of about 1h30m.
The receiving end is a ZFS pool:
# zpool iostat -v 1
                                      capacity     operations     bandwidth
pool                                alloc   free   read  write   read  write
---------------------------------  -----  -----  -----  -----  -----  -----
hydradbrecv                        4.77G   883G      0    238    118  12.8M
  mirror                           2.39G   442G      0    119     59  6.41M
    pci-0000:01:00.0-scsi-0:0:1:0      -      -      0     59     29  3.21M
    pci-0000:01:00.0-scsi-0:0:2:0      -      -      0     60     29  3.21M
  mirror                           2.38G   442G      0    118     59  6.40M
    pci-0000:01:00.0-scsi-0:0:3:0      -      -      0     59     29  3.20M
    pci-0000:01:00.0-scsi-0:0:4:0      -      -      0     59     29  3.20M
gzip was a bad choice, so I switched to lz4, which isn't taking 100% of a core:
pv /dev/md127 | lz4 | mbuffer -q | ssh «remote» 'mbuffer -q | lz4 -d > /hydradbrecv/storage/dd'
transferring at roughly 350MiB/s, with an ETA of about 30m.
The receiving end is a ZFS pool:
# zpool iostat -v 1
                                      capacity     operations     bandwidth
pool                                alloc   free   read  write   read  write
---------------------------------  -----  -----  -----  -----  -----  -----
hydradbrecv                        2.76G   885G      0  2.17K      0   118M
  mirror                           1.38G   443G      0  1.12K      0  59.5M
    pci-0000:01:00.0-scsi-0:0:1:0      -      -      0    572      0  29.8M
    pci-0000:01:00.0-scsi-0:0:2:0      -      -      0    579      0  29.8M
  mirror                           1.38G   443G      0  1.04K      0  58.1M
    pci-0000:01:00.0-scsi-0:0:3:0      -      -      0    531      0  29.1M
    pci-0000:01:00.0-scsi-0:0:4:0      -      -      0    535      0  29.1M
---------------------------------  -----  -----  -----  -----  -----  -----
If you want to watch in real time: https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&refresh=30s&var-instance=chef:9100&from=1576876200000&to=now&fullscreen&panelId=13
The dd copy arrived intact:
backup# md5sum ./dd
5db3000dc3486a2e701e976bef0c988f ./dd
vs.
chef# pv /dev/md127 | md5sum -
447GiB 0:18:26 [ 413MiB/s] [=================================>] 100%
5db3000dc3486a2e701e976bef0c988f -
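As an aside, the second full read for the checksum could have been folded into the transfer itself by hashing in-line on both ends, something like the following (an untested sketch; assumes bash on both ends for the process substitution):
pv /dev/md127 | tee >(md5sum > /tmp/md127.md5) | lz4 | mbuffer -q \
  | ssh «remote» 'mbuffer -q | lz4 -d | tee >(md5sum > /tmp/dd.md5) > /hydradbrecv/storage/dd'
# afterwards, compare /tmp/md127.md5 on chef with /tmp/dd.md5 on the backup host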
Apparently the block device compresses very well:
backup# du --apparent-size -h ./dd
448G ./dd
backup# du -h ./dd
99G ./dd
I'm also replicating this dd backup elsewhere:
# ssh «remote» 'zfs send -c hydradbrecv/storage@2019-12-20T21:51UTC--dd-only | mbuffer' \
| mbuffer | zfs recv -s rpool/hydradbstorage
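(The -s on zfs recv makes the receive resumable, so if this long-haul transfer gets interrupted it can be picked up again rather than restarted from scratch; roughly, a sketch, with «token» as a placeholder:)
# on the receiving side: read the resume token left behind by the interrupted receive
zfs get -H -o value receive_resume_token rpool/hydradbstorage
# then restart the transfer from that token instead of from the beginning
ssh «remote» 'zfs send -t «token» | mbuffer' | mbuffer | zfs recv -s rpool/hydradbstorage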
On to a pg_dumpall-based backup. We'll first block external access to postgres via iptables, to prevent any services from causing IO and stressing the disk further:
ip6tables -I nixos-fw -p tcp --dport 5432 -j DROP -i wg0
ip6tables -I nixos-fw -p tcp --dport 5432 -j DROP
iptables -I nixos-fw -p tcp --dport 5432 -j DROP -i wg0
iptables -I nixos-fw -p tcp --dport 5432 -j DROP
then start postgresql and:
pg_dumpall | lz4 | mbuffer -q | ssh «remote» 'mbuffer -q | lz4 -d > /hydradbrecv/storage/pg_dumpall.sql'
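(Once the dump is done, the temporary firewall rules above can be dropped again by deleting them the same way they were inserted, e.g.:)
iptables  -D nixos-fw -p tcp --dport 5432 -j DROP
iptables  -D nixos-fw -p tcp --dport 5432 -j DROP -i wg0
ip6tables -D nixos-fw -p tcp --dport 5432 -j DROP
ip6tables -D nixos-fw -p tcp --dport 5432 -j DROP -i wg0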
pg_dumpall finished:
# du --apparent-size -h ./pg_dumpall.sql
109G ./pg_dumpall.sql
# du -h ./pg_dumpall.sql
48G ./pg_dumpall.sql
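For the record, restoring from a dump like this onto a freshly-initialised cluster would be roughly the following (a sketch; the user and target database are assumptions, the path is the one from above):
# pg_dumpall output is plain SQL, so it is replayed with psql against any database
psql -U postgres -f /hydradbrecv/storage/pg_dumpall.sql postgres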
I took a snapshot and will be sending this to my secondary backup machine as soon as the first transfer finishes (it likely won't finish before morning, as it is going much further away than Germany-to-Amsterdam, and also not to a fancypants Packet datacenter).
We've now requested that Hetzner replace the failed disk. We should be back online once the RAID array recovers. No word yet on the new server we ordered.
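For reference, once the replacement disk is physically in, getting it back into the mirror is roughly the following (a sketch; GPT partitioning and the sdc/sdd device names are assumptions based on the array layout shown below):
# copy the partition table from the surviving member (sdc) to the new disk (sdd)
sgdisk --replicate=/dev/sdd /dev/sdc && sgdisk --randomize-guids /dev/sdd
# add the new partition back into the degraded mirror and watch the rebuild
mdadm --manage /dev/md127 --add /dev/sdd1
watch -n 5 cat /proc/mdstat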
It looks like Hetzner took the machine down to replace the disk at 2019-12-21 00:20 GMT:
[>....................] recovery = 0.5% (2366080/468719424) finish=39.4min speed=197173K/sec
[root@chef:~]# mdadm -D /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Mon Mar 19 14:49:55 2018
Raid Level : raid1
Array Size : 468719424 (447.01 GiB 479.97 GB)
Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Dec 21 00:39:38 2019
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Consistency Policy : bitmap
Rebuild Status : 12% complete
Name : chef:data (local to host chef)
UUID : c3e8f78f:39d17bae:d1522247:19c60b91
Events : 691
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
2 8 49 1 spare rebuilding /dev/sdd1
A bit later, the rebuild had progressed:
/dev/md127:
Version : 1.2
Creation Time : Mon Mar 19 14:49:55 2018
Raid Level : raid1
Array Size : 468719424 (447.01 GiB 479.97 GB)
Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Dec 21 01:05:43 2019
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Consistency Policy : bitmap
Rebuild Status : 79% complete
Name : chef:data (local to host chef)
UUID : c3e8f78f:39d17bae:d1522247:19c60b91
Events : 1026
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
2 8 49 1 spare rebuilding /dev/sdd1
We have a clean array!
[root@chef:~]# mdadm -D /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Mon Mar 19 14:49:55 2018
Raid Level : raid1
Array Size : 468719424 (447.01 GiB 479.97 GB)
Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Dec 21 01:13:57 2019
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : chef:data (local to host chef)
UUID : c3e8f78f:39d17bae:d1522247:19c60b91
Events : 1126
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
2 8 49 1 active sync /dev/sdd1
We're back!
I'm closing this issue. We will be getting a new machine for the database within the next week, to address the next impending-doom problem:
[root@chef:~]# df -ha /data
Filesystem Size Used Avail Use% Mounted on
/dev/md127 439G 426G 14G 97% /data
Thank you @grahamc for your hard work. I (as well as many others) appreciate it :)