Cockroach: stability: azure clock can't stop drifting

Created on 26 Feb 2017  路  19Comments  路  Source: cockroachdb/cockroach

cockroach-cerulean-0001 was recently brought up as the new host after the previous one went down for some reason. maybe maintenance or failure, nothing in the Azure logs so far.

The node quickly logged fatal on clock skew. ntpd is running, and I tried stopping it to run ntpdate, both against the pool we're currently using (ntp.ubuntu.com) and against time.windows.com. Both show ~430ms offset. ntpdate seems to correctly set the hwclock, but it gets reset shortly after.
Here's the graph from the last hour:
screen shot 2017-02-26 at 1 13 25 pm

The dips down to 0 match the times when I can ntpdate manually.
Not sure there's much we can do other than escale to support, kill the vm, and log this for the continuous saga that is "time sucks".

C-enhancement S-2-temp-unavailability

Most helpful comment

After a bit of back-and-forth with Azure support folks, I've been told the following:
we can unbind the hyperv Time Synchronization device by doing the following:

# Find the ID of the device:
$ curl -O https://raw.githubusercontent.com/torvalds/linux/master/tools/hv/lsvmbus
$ python lsvmbus -vv | grep -w "Time Synchronization" -A 3
VMBUS ID 11: Class_ID = {9527e630-d0ae-497b-adce-e80ab0175caf} - [Time Synchronization]
  Device_ID = {2dd1ce17-079e-403c-b352-a1921ee207ee}
  Sysfs path: /sys/bus/vmbus/devices/2dd1ce17-079e-403c-b352-a1921ee207ee
  Rel_ID=11, target_cpu=0

# Unbind the device, using the device ID displayed above.
$ echo 2dd1ce17-079e-403c-b352-a1921ee207ee | sudo tee /sys/bus/vmbus/drivers/hv_util/unbind

After that, the clock seems to be settable again, manually (eg: date -s ...), and through ntpdate/ntpd.
However, ntp still seems to be drifting more than expected. Will keep checking.

All 19 comments

It is very curious that the ntpdate invocations only resulted in a temporary resetting of the clock offset. I would expect the clock to manually drift upward again after the ntpdate invocation, but instead it jumped right back to the offset it was at previously.

both ntpdate and ntpd are supposed to set the system clock, but they clearly aren't.
Something else is switching the time from under us (beyond the not-so-slow drift).
The last 24h (the machine's been up for about 23) show:
azure_clock

I'll file a support ticket and swap out the machine with another while keeping it around.

Also found the migration in the logs. Well, not the actual logs, it's in "Virtual Machine -> Resource Health -> View History"

This looks like a simple replacement after the existing node went down.

|Start Time | End Time | Status | Description |
|---|---|---|---
|2/26, 9:12 AM | Ongoing | Available | There aren鈥檛 any known Azure platform problems affecting this virtual machine |
|2/26, 8:59 AM | 2/26, 9:12 AM | Unavailable | We're sorry, your virtual machine isn't available because of an unhealthy host |
|2/23, 11:20 AM | 2/26, 8:59 AM | Available | There aren鈥檛 any known Azure platform problems affecting this virtual machine |
|> 14 days ago | 2/23, 11:20 AM | Unknown | We are currently unable to determine the health of this resource |

the last 24h are even more fun:
azure_clock_24h

Quick summary: when running as a guest of hyper-v, it seems we're forced onto the hyperv clock source which resets the system clock every time the host sends a clock message (5 seconds).
Asking for a tsc clock just plain doesn't work, and asking for acpi_pm for some reason still get overridden.

It seems the host time may just become a possible source of time in the future. The pr message seems to indicate that hyperv clocksource and ntp indeed to not play well together: https://patchwork.kernel.org/patch/9525945/

Same thing is currently happening on cockroach-denim-0006. It was also rescheduled due to a bad host:

start time: 3/6, 6:25 AM
end time: 3/6, 6:42 AM
status: Unavailable
description: We're sorry, your virtual machine isn't available because of an unhealthy host

Clock skew over the last 20h (monitoring was broken for a while before that):
denim-0006-clock

cerulean-0002 was rebooted while I was trying to tweak the clocksource. Since then, its clock has been way off:
cerulean-1-2-clock

cerulean-0001 remains bad.

After a bit of back-and-forth with Azure support folks, I've been told the following:
we can unbind the hyperv Time Synchronization device by doing the following:

# Find the ID of the device:
$ curl -O https://raw.githubusercontent.com/torvalds/linux/master/tools/hv/lsvmbus
$ python lsvmbus -vv | grep -w "Time Synchronization" -A 3
VMBUS ID 11: Class_ID = {9527e630-d0ae-497b-adce-e80ab0175caf} - [Time Synchronization]
  Device_ID = {2dd1ce17-079e-403c-b352-a1921ee207ee}
  Sysfs path: /sys/bus/vmbus/devices/2dd1ce17-079e-403c-b352-a1921ee207ee
  Rel_ID=11, target_cpu=0

# Unbind the device, using the device ID displayed above.
$ echo 2dd1ce17-079e-403c-b352-a1921ee207ee | sudo tee /sys/bus/vmbus/drivers/hv_util/unbind

After that, the clock seems to be settable again, manually (eg: date -s ...), and through ntpdate/ntpd.
However, ntp still seems to be drifting more than expected. Will keep checking.

Lovely. Seems like we might need per-cloud provider docs on configuring time synchronization. Cc @jseldess

yeah, this one's weird. It also seems to require a reboot (??) which is odd since that reactivates the driver, and you need to disable it again.
I'm also still waiting on Azure folks to tell me why the host clock seems off, or at least why it's setting a time that's constantly drifting (back and forth, sometimes by as much as +/- 500ms)

cerulean is back up, keeping cockroach-cerulean-000{1,2} which look back in sync, but only time will tell (ah!)

Ditto for cockroach-indigo-0007, it was rescheduled >14 days go (because timestamps are hard!)

screen shot 2017-03-15 at 7 00 58 pm

time-drift graph shows madness starting around 2017-02-23 17:01

Azure support is saying this is a bug in LIS (Linux integration service for Hyper-v) with a fix in the next release, but not exact ETA on the rollout. Until then, they recommend disabling timesync. I've been doing that on a per-node basis, but I may move this to our node init scripts.

A kernel patch making the hyper-v time sync a separate clock source is merged, and making its way into ubuntu: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1676635

This should allow the guest VM to use ntp as usual, or use the host time as reference.

@mberhault Is this still a problem with the hyper-v time sync fix?

That fix is more of a work-around. I'm still waiting for the new LIS to roll out. It's supposed to fix this (see a few comments above).

Thanks for the pointer @mberhault. I'm having this same issue on a Debian 9 machine running Linux 4.9.0-6. Looks like Vitaly's patch made it in too late for me, in 4.10.

I'm running the smallest burst series VM, and am hitting CPU limits due to a cascade effect caused by the time changing every 5 seconds: it gets logged, then several services note the time changed, do their own thing and log it, and then the Azure Linux Agent runs rsyslog to offload the log files for diagnostics, but so much logging is happening that it never catches up and the VM ends up at 100% CPU, and the host eventually kills it. :(

Thanks for the work-around :)

Is there anything else to do here? I'm optimistically closing but @mberhault please check.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

xudongzheng picture xudongzheng  路  3Comments

HeikoOnnebrink picture HeikoOnnebrink  路  4Comments

magaldima picture magaldima  路  3Comments

lwsanty picture lwsanty  路  4Comments

richardanaya picture richardanaya  路  3Comments