Nixpkgs: 🏗️ ⚠️ Hydra database maintenance will stop builds on 2020-01-07.

Created on 7 Jan 2020 · 21 Comments · Source: NixOS/nixpkgs

The infrastructure team will be replacing Hydra's database server tomorrow (2020-01-07). The new host is much faster, with faster disks and more RAM. All good things for a database server!

As mentioned in #76106, Hydra's database server is running out of disk space.

The downtime will likely begin in the morning, America/New_York time. We don't know how long it will take, but expect more than a couple of hours and hopefully fewer than 16.

blocker channel blocker infrastructure

grahamc

🎉6 🚀1

All 21 comments

just curious, how long has that machine been running?

jonringer on 7 Jan 2020

👍1

Thanks for your effort and informing us! :heart:

davidak on 7 Jan 2020

👍1

@jonringer if you mean uptime, about 17 days: https://status.nixos.org/prometheus/graph?g0.range_input=1h&g0.expr=(time()%20-%20node_boot_time_seconds%7Binstance%3D%22chef%3A9100%22%2Crole%3D%22database%22%7D)%20%2F%20(60%20%2060%20%2024)&g0.tab=1

if you mean how long have we had it, it seems 2016-03-08: https://github.com/NixOS/nixos-org-configurations/commit/15f4fd850e5b9ff98dafff5e2e39e1c35023f040#diff-0d35c1f291bb3530413fad1ff7c03b7c
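For anyone decoding the Prometheus link above: the query appears to take the current time minus `node_boot_time_seconds` and divide by the number of seconds in a day (the multiplication signs seem to have been lost from the URL). A minimal sketch of the same arithmetic, assuming two Unix timestamps as input; the function name `seconds_to_days` is made up for illustration:

```shell
# seconds_to_days NOW BOOT
# Print the number of days between two Unix timestamps, to one decimal place.
# (Hypothetical helper, not part of any NixOS tooling.)
seconds_to_days() {
  now=$1
  boot=$2
  awk -v now="$now" -v boot="$boot" \
    'BEGIN { printf "%.1f\n", (now - boot) / (60 * 60 * 24) }'
}
```

On a Linux host, something like `seconds_to_days "$(date +%s)" "$(awk '/^btime/ {print $2}' /proc/stat)"` should give the local equivalent.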

grahamc on 7 Jan 2020

❤1

It is now the afternoon in America/New_York and we haven't started yet. We're working on some tooling to get ready.

grahamc on 7 Jan 2020

The new machine, haumea, is up: https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&refresh=30s&var-instance=haumea:9100&from=1578424500000&to=now

grahamc on 7 Jan 2020

We're going to take Hydra down any minute now.

grahamc on 7 Jan 2020

Haumea's postgres dashboard is here: https://status.nixos.org/grafana/d/rrbV5fdik/postgres-node?orgId=1&refresh=30s

Chef's postgres dashboard is here: https://status.nixos.org/grafana/d/rrbV5fdik/postgres-node?orgId=1&refresh=30s&var-host=postgresql&var-db=All

grahamc on 7 Jan 2020

👀2

those are some sexy dashboards

jonringer on 7 Jan 2020

If you're watching the per-instance-metrics dashboard, you'll have noticed that data transfer tapered off and then completely stalled. There is still a lot of data to transfer. PostgreSQL is still doing a ton of read/write operations on the receiving end, so I expect it has to do with index generation.

grahamc on 7 Jan 2020

I've confirmed this slowdown is due to creating indexes and a WAL not quite tuned for an import scenario.

while true; do printf "\n\n\n\n======> %s\n" "$(date)" >&2; echo "select * from pg_stat_activity where application_name = 'pg_restore';"; sleep 5; done | psql -x postgres

grahamc on 7 Jan 2020

According to:

psql hydra -c "SELECT relname, n_live_tup FROM pg_stat_all_tables where schemaname = 'public'"

all the data has indeed been received. We're still working through indexes and constraints. Once this finishes, we're going to do a snapshot of the postgresql ZFS dataset, start postgres, point Hydra to the new system, and start everything back up. However, if it finishes quite late, I might wait until Eelco wakes up in the morning to do the final bring-up.
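One way to turn that query into a side-by-side check is sketched below. It assumes the same query is run on both hosts and saved as tab-separated `table<TAB>count` files; `dump_counts` and `diff_counts` are hypothetical helper names, not tools that were used during this migration. Note that `n_live_tup` is a statistics-based estimate, so a small difference is not necessarily missing data.

```shell
# dump_counts
# Print "table<TAB>n_live_tup" for every table in the public schema of the
# hydra database. Hypothetical helper; run it once on each host and save the output.
dump_counts() {
  psql hydra -At -F "$(printf '\t')" -c \
    "SELECT relname, n_live_tup FROM pg_stat_all_tables WHERE schemaname = 'public' ORDER BY relname"
}

# diff_counts OLD_FILE NEW_FILE
# Print "table<TAB>old<TAB>new" for every table whose count differs,
# and "table<TAB>old<TAB>missing" for tables absent from NEW_FILE.
diff_counts() {
  awk -F '\t' '
    NR == FNR { new[$1] = $2; next }   # first file read: counts on the new host
    !($1 in new) { printf "%s\t%s\tmissing\n", $1, $2; next }
    new[$1] != $2 { printf "%s\t%s\t%s\n", $1, $2, new[$1] }
  ' "$2" "$1"
}
```

Usage would look like `dump_counts > old.tsv` on the old host, `dump_counts > new.tsv` on the new one, then `diff_counts old.tsv new.tsv`; empty output means the estimates agree.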

grahamc on 8 Jan 2020

Restore finished.

grahamc on 8 Jan 2020

Snapshot taken, and I'm realizing all my previous snapshots are named as if it was 2019.
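A small sketch of how to avoid hand-typed years: build the snapshot name from the system clock. The `snapshot_name` helper and the `post-restore` label are made up for illustration; the dataset name is whatever you pass in.

```shell
# snapshot_name DATASET LABEL
# Print a ZFS snapshot name of the form DATASET@<UTC timestamp>-LABEL,
# so the date always comes from the clock rather than from memory.
# (Hypothetical helper, not what was actually used here.)
snapshot_name() {
  dataset=$1
  label=$2
  printf '%s@%s-%s\n' "$dataset" "$(date -u +%Y-%m-%dT%H%M%SZ)" "$label"
}
```

It would be used along the lines of `zfs snapshot "$(snapshot_name rpool/safe/postgres post-restore)"`.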

grahamc on 8 Jan 2020

😄1

Chef's postgres is stopped, and Ceres (hydra.nixos.org) has had chef's wireguard peer removed to prevent accidental writes while Eelco updates hydra to use Haumea.

grahamc on 8 Jan 2020

ceres systemd[1]: hydra-queue-runner.service:
    Consumed 2w 1d 20h 41min 33.725s CPU time,
    received 4.3T IP traffic, sent 1.4T IP traffic.

grahamc on 8 Jan 2020

😄3

We're up!

grahamc on 8 Jan 2020

🎉6

Very nice! The machine is much more responsive.

markuskowa on 8 Jan 2020

❤1

[root@haumea:~]# zfs get used,logicalused,compressratio rpool/safe/postgres
NAME                 PROPERTY       VALUE  SOURCE
rpool/safe/postgres  used           173G   -
rpool/safe/postgres  logicalused    332G   -
rpool/safe/postgres  compressratio  1.93x  -
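As a quick sanity check on those numbers, dividing `logicalused` by `used` gives roughly the same figure as `compressratio`. A minimal sketch of the arithmetic, with a made-up helper name:

```shell
# ratio LOGICAL USED
# Print LOGICAL / USED to two decimal places.
# (Hypothetical helper for illustration.)
ratio() {
  awk -v logical="$1" -v used="$2" 'BEGIN { printf "%.2f\n", logical / used }'
}
```

`ratio 332 173` prints `1.92`, slightly below the reported 1.93x; the gap is within what rounding of the displayed sizes to whole gigabytes could explain. For exact byte counts, `zfs get -Hp -o value used,logicalused rpool/safe/postgres` should print unrounded values.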

grahamc on 8 Jan 2020

unpinning as this is no longer an active endeavor

jonringer on 8 Jan 2020

👍1

Thank you for the great work @grahamc :).

We appreciate you! :)

jonringer on 8 Jan 2020

❤2 👍1

Big thanks to @edolstra for doing the hard part :)

grahamc on 8 Jan 2020

❤2
