The infrastructure team will be replacing Hydra's database server tomorrow (2020-01-07). This upgrade will be to a much faster host with faster disks and more RAM. All good things for a database server!
As mentioned in #76106, Hydra's database server is running out of disk space:
The downtime will likely begin in America/New_York's morning. We don't know how long it will take, but it will be more than a couple of hours and hopefully less than 16.
just curious, how long has that machine been running?
Thanks for your effort and informing us! :heart:
@jonringer if you mean uptime, about 17 days:
if you mean how long have we had it, it seems 2016-03-08:
It is now into the afternoon in America/New_York time and we haven't started yet. We're working on some tooling to get ready.
The new machine, haumea, is up:
We're going to take Hydra down any minute now.
Haumea's postgres dashboard is here:
Chef's postgres dashboard is here:
those are some sexy dashboards
If you're watching the per-instance-metrics dashboard you'll have noticed data transfer has tapered off and then completely stalled. There is still a lot of data to transfer. postgresql is doing a ton of read/write operations on the receiving end still, so I expect it has to do with index generation.
I've confirmed this slowdown is due to creating indexes and a WAL not quite tuned for an import scenario.
while true; do printf "\n\n\n\n======> %s\n" "$(date)" >&2; echo "select * from pg_stat_activity where application_name = 'pg_restore';"; sleep 5; done | psql -x postgres
According to:
psql hydra -c "SELECT relname, n_live_tup FROM pg_stat_all_tables where schemaname = 'public'"
all the data has indeed been received. We're still working through indexes and constraints. Once this finishes, we're going to do a snapshot of the postgresql ZFS dataset, start postgres, point Hydra to the new system, and start everything back up. However, if it finishes quite late, I might wait until Eelco wakes up in the morning to do the final bring-up.
Restore finished.
Snapshot taken, and I'm realizing all my previous snapshots are named as if it was 2019.
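One way to avoid mislabelled snapshots is to derive the name from the clock instead of typing the date by hand. This is only a minimal sketch: the `snapshot_name` helper and the `post-restore` label are hypothetical and are not what was actually run here; `rpool/safe/postgres` is the dataset named in this thread. The last line prints the `zfs snapshot` command rather than executing it.

```shell
# Hypothetical helper: build "<dataset>@<UTC timestamp>-<label>".
snapshot_name() {
  dataset="$1"
  label="$2"
  # date -u keeps the timestamp independent of the operator's timezone.
  printf '%s@%s-%s\n' "$dataset" "$(date -u +%Y-%m-%dT%H%M%SZ)" "$label"
}

# Dry run: print the command. Remove the leading "echo" to take the snapshot.
echo zfs snapshot "$(snapshot_name rpool/safe/postgres post-restore)"
```

Because the year comes from `date`, a snapshot taken in January cannot carry the previous year by accident.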
Chef's postgres is stopped, and Ceres has had chef's wireguard peer removed to prevent accidental writes while Eelco updates hydra to use Haumea.
ceres systemd[1]: hydra-queue-runner.service:
Consumed 2w 1d 20h 41min 33.725s CPU time,
received 4.3T IP traffic, sent 1.4T IP traffic.
We're up!
Very nice! The machine is much more responsive.
[root@haumea:~]# zfs get used,logicalused,compressratio rpool/safe/postgres
rpool/safe/postgres used 173G -
rpool/safe/postgres logicalused 332G -
rpool/safe/postgres compressratio 1.93x -
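As a rough cross-check of those numbers, the ratio and the space saved can be recomputed from the rounded `used` and `logicalused` figures above. This is just a sketch of the arithmetic; the variable names are made up for illustration.

```shell
# Rounded figures from the zfs output above, in GiB.
used_g=173
logical_g=332

# Ratio of logical (uncompressed) size to on-disk size.
ratio="$(awk -v u="$used_g" -v l="$logical_g" 'BEGIN { printf "%.2f", l / u }')"
# Share of the logical size that compression saved, as a whole percent.
saved_pct="$(awk -v u="$used_g" -v l="$logical_g" 'BEGIN { printf "%.0f", (1 - u / l) * 100 }')"

echo "approx ratio: ${ratio}x, approx space saved: ${saved_pct}%"
```

This gives about 1.92x and 48% saved. The small gap from the reported 1.93x is presumably down to rounding in the human-readable sizes.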
unpinning as this is no longer an active endeavor
Thank you for the great work @grahamc :).
We appreciate you! :)
Big thanks to @edolstra for doing the hard part :)