Linkerd2: detect clock skew in `linkerd check --pre`

Created on 28 Apr 2019 · 19 Comments · Source: linkerd/linkerd2

A user reported this error from the linkerd-identity pod when installing Linkerd 2.3 on minikube:

Failed to verify issuer credentials for 'identity.linkerd.cluster.local' with trust anchors: x509: certificate has expired or is not yet valid

This was due to the clock of their minikube VM being wrong. Can we detect this as part of linkerd check?

Labels: area/cli, good first issue, help wanted, priority/P1


All 19 comments

Hey, I was looking into this to take as a first issue.

However, I'm not sure what the best way to detect a time sync issue would be.

Hi @matej-g! I haven't spent too much time thinking about this, but my hunch is that we'd want to have the CLI ask the control plane what time it is and, if those are off by more than the issuer configuration's "clock skew allowance", then the check would fail.

One idea--it would be good to have others chime in on this--is to add a timestamp onto the healthcheck response type.

Then, I think the CLI's check logic would be roughly the following:

  • Fetch the Config from the public api to get the clock skew allowance
  • For each controller:

    • Get timestamp t0

    • Healthcheck the controller

    • Get timestamp t1

Then, we'd want to check that the time returned in the healthcheck was between t0 - clockSkewAllowance and t1 + clockSkewAllowance and, if it isn't, fail the healthcheck.

I like the idea of the control plane reporting a timestamp, but I think it would also be useful to report clock skew when running linkerd check --pre, before the control plane is installed.

It's sort of hard to find, but maybe we could look at node heartbeat timestamps to figure out if any nodes in the cluster are out of sync?

This gives me the timestamps of when all of the nodes in my cluster last reported as Ready:

$ date -u
Tue Apr 30 18:58:48 UTC 2019

$ kubectl proxy &
Starting to serve on 127.0.0.1:8001

$ curl -s localhost:8001/api/v1/nodes | jq '.items[].status.conditions[] | select(.type=="Ready") | .lastHeartbeatTime'
"2019-04-30T18:58:49Z"
"2019-04-30T18:58:40Z"
"2019-04-30T18:58:43Z"
"2019-04-30T18:58:41Z"
"2019-04-30T18:58:46Z"

If those timestamps deviate from the local time by more than clockSkewAllowance + readinessReportInterval, then we could warn or error.

@klingerf I was thinking the same (querying nodes for heartbeat timestamp), as I could not find any better way in which actual system time could be obtained from K8s.

@klingerf what are those timestamps? Are these times reported by each respective node or are those the stamps of the apiserver receiving the reports?

@christianhuening That's a great point! Maybe those timestamps will only tell us if the apiserver's clock is out of sync, rather than reflecting whether any individual node's clock is out of sync. Hmm.

@klingerf @christianhuening I'm thinking those are timestamps posted by the kubelet agent from each node, so it should be based on each node's individual system time

@matej-g If that's the case, then we should be able to rely on them to reflect the clock skew of each node. In any case, I don't know of a different way to go about getting timestamps, so I think this is a good approach to start with, unless folks have other ideas about where we could find that info.

@klingerf I mean, the way I understood the main issue here is more about clock skew between the machine running linkerd install and the node(s), not nodes among themselves (the original reporter was installing on a single-node Minikube install AFAIK). If you don't have your nodes synced, you will probably have more issues, including those outside of Linkerd scope.

@matej-g Cool, yep, totally agree.

@matej-g @klingerf Well, I'm not sure about the timestamps. The docs say: "Last time we got an update on a given condition." We'd have to look into the implementation to find the truth. However, for the use case at hand (install client <-> server), the apiserver time is probably enough, since, as you already said, if you have clock skew you usually have bigger problems.

Just FYI, I did some testing with my local cluster (1.13): I manipulated the system time on the worker node, and the timestamp came back as the actual system time on the node, as reported by kubelet. So I think if we base this on node timestamps, we can account for all nodes, in the unlikely event they themselves are out of sync.

Another thing I noticed: if we want to account for the heartbeat interval, there does not seem to be a way to obtain this value from the server, as it is a kubelet-side setting (--node-status-update-frequency). It defaults to 10 seconds (although I noticed that e.g. Minikube uses minute intervals), so that could be used as the default, with a possible manual override to a custom value.

@matej-g Good sleuthing! And good find with the --node-status-update-frequency flag. Maybe @olix0r can weigh in here, but my inclination would be to fail the check if any reported timestamp is more than 70 seconds old (10-second default clock skew allowance + 60-second report interval). I agree that making that time window overrideable via a command line flag could be a good approach, too.

@klingerf @matej-g that sounds reasonable to me, assuming @grampelberg is happy with this plan.

Great, I've started to roughly put the idea in place; I'll report back at the beginning of next week.

+1 from me, sounds legit.

A couple of specifications for this:

  • Should this check be a part of some existing category or not (is it K8s API check? is it pre-install capability? is it a separate category?)
  • Should a failed check produce an error or a warning? Although clock skew is not fatal and a Linkerd install might work despite it, this condition should probably not be overlooked
  • For the hint anchor link, I'm guessing this will require changes in linkerd/website repository, should I prepare this also as a separate PR?

Hey @matej-g, sorry for the delay here.

These are good questions, and I don't have great answers, since I think this check could make sense in a few different places, as either a warning or an error. But my suggestion is that for now we add it to the set of LinkerdPreInstallChecks that are run, and that we make it a fully-fledged error, instead of a warning.

And you're right about the hint anchor -- that will require a change here:

https://github.com/linkerd/website/blob/master/linkerd.io/content/2/tasks/troubleshooting.md

If you don't mind submitting a PR, we can get that PR merged prior to getting the main PR merged.

No worries @klingerf, and thanks for the clarification. I've opened a PR and am looking forward to feedback.

