Nixpkgs: 🚨 New commits are not being built by Hydra, and channel updates are stopped because the database server is down. 🚨

Created on 20 Dec 2019 · 21 comments · Source: NixOS/nixpkgs

At the time of writing, trying to connect to https://hydra.nixos.org/ gives the following error:

DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=hydra;host=10.254.1.2;user=hydra;','',...) failed: could not connect to server: Connection timed out
Is the server running on host "10.254.1.2" and accepting
TCP/IP connections on port 5432? at /nix/store/bf666x0vqgdaxfal1ywvi9any2vx9q7x-hydra-perl-deps/lib/perl5/site_perl/5.30.0/DBIx/Class/Storage/DBI.pm line 1517. at /nix/store/8271gj7cpcbz4mf6s0kc6905fnfh1bng-hydra-0.1.20191108.4779757/libexec/hydra/lib/Hydra/Helper/CatalystUtils.pm line 418
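
As a quick first check from a host inside the infrastructure network, the database port itself can be probed directly. This is only a sketch; 10.254.1.2:5432 is taken from the error above, and both tools are standard:

pg_isready -h 10.254.1.2 -p 5432 -t 5   # PostgreSQL's own "is the server answering?" probe
nc -vz -w 5 10.254.1.2 5432             # plain TCP connect test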

cc @edolstra @grahamc @FRidh (not sure who else is on the infrastructure team)

Labels: blocker · channel blocker · infrastructure

Most helpful comment

We're back!

https://status.nixos.org/grafana/d/MJw9PcAiz/hydra-jobs?orgId=1&from=1576888500000&to=1576889100000

I'm closing this issue. We will be getting a new machine for the database within the next week, to address the next impending-doom problem:

[root@chef:~]# df -ha /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      439G  426G   14G  97% /data

All 21 comments

Hey Jon, thanks for the ping. I noticed this too at about 17:15 UTC, when the alarms about failing channel-bump jobs started firing:

https://status.nixos.org/grafana/d/fBW4tL1Wz/scheduled-task-state-channels-website?orgId=1&from=1576855800000&to=1576864200000

I went to check out the log on the server which runs these updates, and saw:

Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: path=$(curl -LH 'Accept: application/json' https://hydra.nixos.org/job/nix-pills/master/html-split/latest | jq -re '.buildoutputs.out.path') && \
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: nix-store -r $path && \
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: ln -sfn $path/share/doc/nix-pills nixos/nix-pills-raw
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 20 17:02:22 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: [10.1K blob data]
Dec 20 17:02:22 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: make: *** [Makefile:50: nixos/nix-pills-raw] Error 1

So I just tested connecting to the hydra.nixos.org server (ceres), which went well. I then started to suspect a problem with the database server (chef). I tried to SSH in, and can't connect. I took a look at the metrics for chef, and ... well:

https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&var-instance=chef:9100&from=1576857600000&to=1576861200000

It is down and has been for a bit.

I have since escalated to @edolstra and @rbvermaa, who have keys to this particular part of the castle.

I don't see any related incidents: https://www.hetzner-status.de/en.html
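
For reference, the failing step of the update-homepage unit can be reproduced by hand; this is just the pipeline from the journal excerpt above, run interactively:

# fetch the latest successful html-split build of nix-pills from Hydra's JSON API,
# then realise the store path locally (this is the request that was timing out)
path=$(curl -LH 'Accept: application/json' \
    https://hydra.nixos.org/job/nix-pills/master/html-split/latest \
  | jq -re '.buildoutputs.out.path') \
  && nix-store -r "$path"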

@grahamc thanks for the quick response :)

Also, I sent you an email :) to [email protected], please take a look

Tried to reach the database server (chef), but couldn't. Performing a hard reset now.

The machine did not come up after the hard reset, nor with the rescue system. Engaged Hetzner support.

This is incredible timing. The infra team was discussing getting a replacement machine for the database just last week, because:

https://status.nixos.org/prometheus/graph?g0.range_input=8w&g0.end_input=2019-12-20%2019%3A02&g0.expr=node_filesystem_avail_bytes%7Bmountpoint%3D%22%2Fdata%22%2Cinstance%3D%22chef%3A9100%22%7D&g0.tab=0

We had put in the request for the new machine yesterday :).

I updated the "hydra is down" error page at https://hydra.nixos.org/ in https://github.com/NixOS/nixos-org-configurations/commit/72e03ec41f2e146de9c972cb9b9731584079daba to link to the monitoring infra and to tickets with the "infrastructure" label. The infra team is discussing options in case Hetzner support doesn't reply soon.

We have heard back from support about the database machine (chef): they say an SSD has failed. The machine has been booted into a rescue console to find out whether the failed disk was part of the root filesystem or the /data array.

We brought chef up successfully and the /data mirror is indeed degraded. We're considering next steps.

We're preparing to take a backup directly off the disk, and then we will replace the failed disk.

The first, block-wise backup is on its way:

pv /dev/md127 | gzip | mbuffer -q | ssh «remote» 'mbuffer -q | gzip -d > /hydradbrecv/storage/dd'

transferring at roughly 100MiB/s, with an ETA of about 1h30m.

The receiving end is a ZFS pool:

# zpool iostat -v 1
                                     capacity     operations     bandwidth
pool                               alloc   free   read  write   read  write
---------------------------------  -----  -----  -----  -----  -----  -----
hydradbrecv                        4.77G   883G      0    238    118  12.8M
  mirror                           2.39G   442G      0    119     59  6.41M
    pci-0000:01:00.0-scsi-0:0:1:0      -      -      0     59     29  3.21M
    pci-0000:01:00.0-scsi-0:0:2:0      -      -      0     60     29  3.21M
  mirror                           2.38G   442G      0    118     59  6.40M
    pci-0000:01:00.0-scsi-0:0:3:0      -      -      0     59     29  3.20M
    pci-0000:01:00.0-scsi-0:0:4:0      -      -      0     59     29  3.20M

gzip was a bad choice (it was pegging a core and bottlenecking the transfer), so I switched to lz4, which isn't taking 100% of a core:

pv /dev/md127 | lz4 | mbuffer -q | ssh «remote» 'mbuffer -q | lz4 -d > /hydradbrecv/storage/dd'

transferring at roughly 350MiB/s, with an ETA of about 30m.

The receiving end is a ZFS pool:

# zpool iostat -v 1
                                     capacity     operations     bandwidth
pool                               alloc   free   read  write   read  write
---------------------------------  -----  -----  -----  -----  -----  -----
hydradbrecv                        2.76G   885G      0  2.17K      0   118M
  mirror                           1.38G   443G      0  1.12K      0  59.5M
    pci-0000:01:00.0-scsi-0:0:1:0      -      -      0    572      0  29.8M
    pci-0000:01:00.0-scsi-0:0:2:0      -      -      0    579      0  29.8M
  mirror                           1.38G   443G      0  1.04K      0  58.1M
    pci-0000:01:00.0-scsi-0:0:3:0      -      -      0    531      0  29.1M
    pci-0000:01:00.0-scsi-0:0:4:0      -      -      0    535      0  29.1M
---------------------------------  -----  -----  -----  -----  -----  -----

If you want to watch in real time: https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&refresh=30s&var-instance=chef:9100&from=1576876200000&to=now&fullscreen&panelId=13
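
For context on the switch from gzip to lz4: a rough way to compare how fast each compressor can drain the source device is to run a 1 GiB sample through each one. This is only a sketch; the sample size and numbers are illustrative:

# pv sits before the compressor, so it reports how quickly each one consumes its input
dd if=/dev/md127 bs=1M count=1024 status=none | pv -rab | gzip > /dev/null
dd if=/dev/md127 bs=1M count=1024 status=none | pv -rab | lz4  > /dev/null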

The dd copy arrived intact:

backup# md5sum ./dd
5db3000dc3486a2e701e976bef0c988f  ./dd

vs.

chef# pv /dev/md127 | md5sum -
 447GiB 0:18:26 [ 413MiB/s] [=================================>] 100%
5db3000dc3486a2e701e976bef0c988f  -

Apparently the block device compresses very well:

backup# du --apparent-size -h ./dd
448G    ./dd

backup# du -h ./dd
99G ./dd
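
The saving comes from compression on the receiving dataset; the effective ratio can also be read straight from ZFS. A sketch, using the dataset name from the transfer above:

# compressratio on the receiving dataset explains the 448G-apparent vs 99G-on-disk gap
zfs get compression,compressratio,logicalused,used hydradbrecv/storage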

I'm also replicating this dd backup elsewhere:

# ssh «remote» 'zfs send -c hydradbrecv/storage@2019-12-20T21:51UTC--dd-only | mbuffer' \
    | mbuffer | zfs recv -s rpool/hydradbstorage
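
The -s on zfs recv makes the receive resumable, so if this long-haul transfer drops it can be continued from a token instead of starting over. A sketch of the standard resume flow:

# read the resume token that the interrupted receive left behind on the local pool
token=$(zfs get -H -o value receive_resume_token rpool/hydradbstorage)

# restart the stream from that token on the sending side
ssh «remote» "zfs send -t $token | mbuffer" | mbuffer | zfs recv -s rpool/hydradbstorage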

On to a pg_dumpall-based backup. We'll first block external access to Postgres via iptables, to prevent any services from causing I/O and stressing the disk further:

ip6tables -I nixos-fw -p tcp --dport 5432 -j DROP -i wg0
ip6tables -I nixos-fw -p tcp --dport 5432 -j DROP
iptables -I nixos-fw -p tcp --dport 5432 -j DROP -i wg0
iptables -I nixos-fw -p tcp --dport 5432 -j DROP

then start PostgreSQL and run:

pg_dumpall | lz4 | mbuffer -q | ssh «remote» 'mbuffer -q | lz4 -d > /hydradbrecv/storage/pg_dumpall.sql'
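
Once the dump has finished, the temporary block can be lifted by deleting the same rules. A sketch; -D takes the identical rule specification that was inserted:

iptables  -D nixos-fw -p tcp --dport 5432 -j DROP
iptables  -D nixos-fw -p tcp --dport 5432 -j DROP -i wg0
ip6tables -D nixos-fw -p tcp --dport 5432 -j DROP
ip6tables -D nixos-fw -p tcp --dport 5432 -j DROP -i wg0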

pg_dumpall finished:

# du --apparent-size -h ./pg_dumpall.sql
109G    ./pg_dumpall.sql

# du -h ./pg_dumpall.sql
48G ./pg_dumpall.sql
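
For completeness, a pg_dumpall dump like this is restored by feeding it back into psql on a fresh cluster. A sketch, assuming a superuser connection:

# restore the cluster-wide SQL dump into a fresh PostgreSQL instance
psql -U postgres -f /hydradbrecv/storage/pg_dumpall.sql postgres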

I took a snapshot and will be sending it to my secondary backup machine as soon as the first transfer finishes (it likely won't finish before morning: this copy is going much further away than Germany-to-Amsterdam, and it isn't going to a fancypants Packet datacenter either).


We've now requested that Hetzner replace the failed disk. We should be back online once the RAID array has recovered. No word yet on the new server we ordered.
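
For reference, once the replacement SSD is physically in place, rejoining it to the mirror is the usual mdadm procedure. A sketch; the device names are assumptions based on the array layout shown below:

# copy the partition table from the surviving member to the replacement disk
sfdisk -d /dev/sdc | sfdisk /dev/sdd

# add the new partition to the array; md starts rebuilding onto it automatically
mdadm --manage /dev/md127 --add /dev/sdd1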

It looks like Hetzner took the machine down to replace the disk at 2019-12-21 00:20 GMT:
https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&var-instance=chef:9100&from=1576882800000&to=1576884600000

[>....................]  recovery =  0.5% (2366080/468719424) finish=39.4min speed=197173K/sec
[root@chef:~]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 00:39:38 2019
             State : clean, degraded, recovering
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 12% complete

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 691

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      spare rebuilding   /dev/sdd1
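
The rebuild can be followed live from /proc/mdstat, and if the rebuild is the only load on the box, the kernel's resync speed limits can be raised. A sketch; the values are examples:

watch -n 5 cat /proc/mdstat

# md throttles its own resync; the limits are in KiB/s per device
sysctl -w dev.raid.speed_limit_min=100000
sysctl -w dev.raid.speed_limit_max=1000000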

/dev/md127:
           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 01:05:43 2019
             State : clean, degraded, recovering
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 79% complete

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 1026

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      spare rebuilding   /dev/sdd1

We have a clean array!

[root@chef:~]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 01:13:57 2019
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 1126

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      active sync   /dev/sdd1

We're back!

https://status.nixos.org/grafana/d/MJw9PcAiz/hydra-jobs?orgId=1&from=1576888500000&to=1576889100000

I'm closing this issue. We will be getting a new machine for the database within the next week, to address the next impending-doom problem:

[root@chef:~]# df -ha /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      439G  426G   14G  97% /data
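
As an aside, the "impending doom" can be put on a timeline with the same Prometheus metric linked earlier, e.g. by asking how much space /data is predicted to have left in four weeks. A sketch, assuming the status.nixos.org Prometheus exposes the standard HTTP query API:

# predict_linear over a week of data, projected 4 weeks ahead; a negative result
# means the filesystem is projected to be full before then
curl -sG 'https://status.nixos.org/prometheus/api/v1/query' \
  --data-urlencode 'query=predict_linear(node_filesystem_avail_bytes{mountpoint="/data",instance="chef:9100"}[1w], 4*7*24*3600)'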

Thank you @grahamc for your hard work. I (as well as many others) appreciate it :)
