The infrastructure team will be replacing Hydra's database server tomorrow (2020-01-07). This upgrade will be to a much faster host with faster disks and more RAM. All good things for a database server!
As mentioned in #76106, Hydra's database server is running out of disk space:
The downtime will likely begin in the morning, America/New_York time. We don't know how long it will take, but it will be more than a couple of hours and hopefully less than 16.
just curious, how long has that machine been running?
Thanks for your effort and informing us! :heart:
@jonringer if you mean uptime, about 17 days: https://status.nixos.org/prometheus/graph?g0.range_input=1h&g0.expr=(time()%20-%20node_boot_time_seconds%7Binstance%3D%22chef%3A9100%22%2Crole%3D%22database%22%7D)%20%2F%20(60%20%2060%20%2024)&g0.tab=1
if you mean how long have we had it, it seems 2016-03-08: https://github.com/NixOS/nixos-org-configurations/commit/15f4fd850e5b9ff98dafff5e2e39e1c35023f040#diff-0d35c1f291bb3530413fad1ff7c03b7c
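For anyone who wants to check the same number without Prometheus, here is a minimal sketch of the arithmetic behind that query: (now - boot time) divided by the number of seconds in a day. The function name `uptime_days` is made up for illustration, and reading the boot time from `/proc/stat` assumes a Linux host.

```shell
#!/usr/bin/env bash
# uptime_days BOOT_EPOCH [NOW_EPOCH]
# Hypothetical helper: prints elapsed days with one decimal, mirroring
# (time() - node_boot_time_seconds) / (60 * 60 * 24).
uptime_days() {
  local boot="$1"
  local now="${2:-$(date +%s)}"   # default to the current time
  LC_ALL=C awk -v boot="$boot" -v now="$now" \
    'BEGIN { printf "%.1f\n", (now - boot) / (60 * 60 * 24) }'
}

# Example (Linux only, assumes /proc/stat exposes a btime line):
#   uptime_days "$(awk '/^btime/ { print $2 }' /proc/stat)"
```

Passing both timestamps explicitly makes the result reproducible, which is handy when comparing against a graph.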
It is now into the afternoon in America/New_York and we haven't started yet. We're working on some tooling to get ready.
The new machine, haumea, is up: https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&refresh=30s&var-instance=haumea:9100&from=1578424500000&to=now
We're going to take Hydra down any minute now.
Haumea's postgres dashboard is here: https://status.nixos.org/grafana/d/rrbV5fdik/postgres-node?orgId=1&refresh=30s
Chef's postgres dashboard is here: https://status.nixos.org/grafana/d/rrbV5fdik/postgres-node?orgId=1&refresh=30s&var-host=postgresql&var-db=All
those are some sexy dashboards
If you're watching the per-instance-metrics dashboard you'll have noticed data transfer has tapered off and then completely stalled. There is still a lot of data to transfer. postgresql is doing a ton of read/write operations on the receiving end still, so I expect it has to do with index generation.
I've confirmed this slowdown is due to creating indexes and a WAL not quite tuned for an import scenario.
while true; do printf "\n\n\n\n======> %s\n" "$(date)" >&2; echo "select * from pg_stat_activity where application_name = 'pg_restore';"; sleep 5; done | psql -x postgres
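As a rough illustration of what "tuned for an import scenario" could mean, the sketch below prints a few `ALTER SYSTEM` statements of the kind the PostgreSQL bulk-loading guidance suggests: larger WAL headroom, less frequent checkpoints, and more memory for index builds. The function name `import_tuning_sql` and every value are assumptions for illustration, not the settings used on haumea.

```shell
#!/usr/bin/env bash
# import_tuning_sql
# Hypothetical helper: prints SQL that loosens WAL/checkpoint settings for a
# one-off bulk restore. All values are illustrative guesses, not measurements.
import_tuning_sql() {
  cat <<'SQL'
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET maintenance_work_mem = '2GB';
SELECT pg_reload_conf();
SQL
}

# Example: feed the statements to psql on stdin so each one runs on its own
# (ALTER SYSTEM cannot run inside a transaction block):
#   import_tuning_sql | psql postgres
```

Settings changed this way persist across restarts, so they would need to be reverted (for example with `ALTER SYSTEM RESET`) once the import is done.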
According to:
psql hydra -c "SELECT relname, n_live_tup FROM pg_stat_all_tables where schemaname = 'public'"
all the data has indeed been received. We're still working through indexes and constraints. Once this finishes, we're going to do a snapshot of the postgresql ZFS dataset, start postgres, point Hydra to the new system, and start everything back up. However, if it finishes quite late, I might wait until Eelco wakes up in the morning to do the final bring-up.
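One way to make that check less manual is to dump the same query on both hosts and diff the counts. This is only a sketch: `compare_counts` is a hypothetical helper, it expects `relname|n_live_tup` lines as produced by `psql -At`, and `n_live_tup` is an estimate kept by the statistics system, so a mismatch is a hint to look closer rather than proof of missing rows.

```shell
#!/usr/bin/env bash
# compare_counts SOURCE_FILE DEST_FILE
# Hypothetical helper: each file holds "relname|n_live_tup" lines.
# Prints one line per table whose count differs or that exists on one side only.
compare_counts() {
  awk -F'|' '
    NR == FNR { src[$1] = $2; next }   # first file: remember source counts
    {
      seen[$1] = 1
      if (!($1 in src))       printf "%s: missing on source\n", $1
      else if (src[$1] != $2) printf "%s: source=%s dest=%s\n", $1, src[$1], $2
    }
    END {
      for (t in src) if (!(t in seen)) printf "%s: missing on destination\n", t
    }
  ' "$1" "$2"
}

# Example (run the query from the comment above with -At on each host):
#   psql -At hydra -c "SELECT relname, n_live_tup FROM pg_stat_all_tables WHERE schemaname = 'public'" > dest.txt
#   compare_counts source.txt dest.txt
```

No output means the two listings agree. For an exact comparison, `SELECT count(*)` per table is slower but not an estimate.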
Restore finished.
Snapshot taken, and I'm realizing all my previous snapshots are named as if it was 2019.
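A small sketch of how to avoid the wrong-year problem: derive the snapshot name from `date` instead of typing it. The helper name `snapshot_name` and the `post-restore` label are made up for illustration; only the dataset name comes from this thread.

```shell
#!/usr/bin/env bash
# snapshot_name DATASET LABEL
# Hypothetical helper: prints DATASET@YYYY-MM-DD-LABEL using today's UTC date,
# so the year in the name always comes from the clock.
snapshot_name() {
  printf '%s@%s-%s\n' "$1" "$(date -u +%Y-%m-%d)" "$2"
}

# Example:
#   zfs snapshot "$(snapshot_name rpool/safe/postgres post-restore)"
```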
Chef's postgres is stopped, and Ceres (hydra.nixos.org) has had chef's wireguard peer removed to prevent accidental writes while Eelco updates hydra to use Haumea.
ceres systemd[1]: hydra-queue-runner.service:
Consumed 2w 1d 20h 41min 33.725s CPU time,
received 4.3T IP traffic, sent 1.4T IP traffic.
We're up!
Very nice! The machine is much more responsive.
[root@haumea:~]# zfs get used,logicalused,compressratio rpool/safe/postgres
NAME                 PROPERTY       VALUE  SOURCE
rpool/safe/postgres  used           173G   -
rpool/safe/postgres  logicalused    332G   -
rpool/safe/postgres  compressratio  1.93x  -
unpinning as this is no longer an active endeavor
Thank you for the great work @grahamc :).
We appreciate you! :)
Big thanks to @edolstra for doing the hard part :)