[this is a mirror of https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/761]
Right now the CAPI test framework only collects logs from the management cluster. We should also collect workload cluster logs.
In CAPZ, for conformance tests, we have https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/hack/log/log-dump.sh#L88 to get the logs directly from the machines.
We should run the same thing in E2E (possibly with a configurable list of commands).
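As a rough sketch of what this could look like in the E2E framework (everything below is illustrative: the daemonset label, namespace, and helper name are assumptions, not existing code), we could exec each configured command inside the pods of a log-collection daemonset and save the output per node:

```go
package e2e

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// dumpNodeLogs execs each configured command inside the pods of a log-collection
// daemonset (assumed to be labelled app=log-dump in kube-system, one container per pod)
// and writes the output under artifactDir/<node-name>/<command-name>.log.
func dumpNodeLogs(ctx context.Context, cfg *rest.Config, cs kubernetes.Interface, artifactDir string, commands map[string][]string) error {
	pods, err := cs.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{LabelSelector: "app=log-dump"})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		nodeDir := filepath.Join(artifactDir, pod.Spec.NodeName)
		if err := os.MkdirAll(nodeDir, 0o750); err != nil {
			return err
		}
		for name, cmd := range commands {
			req := cs.CoreV1().RESTClient().Post().
				Resource("pods").Namespace(pod.Namespace).Name(pod.Name).SubResource("exec").
				VersionedParams(&corev1.PodExecOptions{Command: cmd, Stdout: true, Stderr: true}, scheme.ParameterCodec)
			exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
			if err != nil {
				return err
			}
			var stdout, stderr bytes.Buffer
			if err := exec.Stream(remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
				// Keep going: a failure on one node should not lose the logs from the others.
				fmt.Printf("failed to run %q on node %s: %v\n", name, pod.Spec.NodeName, err)
				continue
			}
			if err := os.WriteFile(filepath.Join(nodeDir, name+".log"), stdout.Bytes(), 0o640); err != nil {
				return err
			}
		}
	}
	return nil
}
```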
/area testing
/milestone v0.3.x
/priority important-longterm
/help
@vincepri:
This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/milestone v0.3.x
/priority important-longterm
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This may be a bit tough to generalize across providers, since the mechanism for connecting to remote hosts differs from provider to provider.
For example, possibly using Azure Bastion to access Azure hosts, AWS Session Manager to connect to AWS instances, docker exec for CAPD, or ssh for more general use cases.
Any common tooling we provide here would have to account for both custom commands and custom remote connection mechanisms.
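A minimal sketch of what such a pluggable hook could look like (names and placement are hypothetical, not an existing framework API):

```go
package framework // hypothetical placement in the CAPI test framework

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// MachineLogCollector is a hypothetical extension point: each infrastructure
// provider supplies its own connection mechanism (Azure Bastion, AWS Session
// Manager, docker exec, plain ssh, ...) and its own command set behind a
// single method the framework can call for every workload cluster machine.
type MachineLogCollector interface {
	// CollectMachineLog gathers logs for one machine and writes them under outputPath.
	CollectMachineLog(ctx context.Context, managementClient client.Client, m *clusterv1.Machine, outputPath string) error
}
```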
@detiber check out https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/hack/log/log-dump.sh#L88, would something like it be generic enough to work across providers (i.e. using a daemonset on the target cluster to collect the logs on each VM)?
@CecileRobertMichon assuming all nodes have bootstrapped properly it could likely work, but I think we'd still need to fall back to provider-specific ways to gather logs in the event of machines that fail to properly bootstrap.
By "failed to bootstrap" do you mean the VM succeeded/is running but the node failed to kubeadm init or join the cluster? In that case we should be able to collect cloud-init logs in the same way, no?
If you're talking about an infrastructure failure (e.g. the instance doesn't come up and is in a failed state), I agree that would be provider-specific, but ideally there would already be events/logs for the InfraMachine to help us diagnose these.
By "failed to bootstrap" do you mean the VM succeeded/is running but the node failed to kubeadm init or join the cluster? In that case we should be able to collect cloud-init logs in the same way, no?
Yes, wasn't sure if we were planning on treating those separately from other logs on the host.
I agree this solution won't work in case a node fails to init/join, but IMO it is acceptable for e2e tests because it will provide more visibility on workload cluster machines as soon as the kubelet is running (right now there is no visibility at all).
The solution is provider-agnostic, but if we can keep the list of commands configurable, we can cover provider-specific logs as well.
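For example, a sketch of what a configurable command set could look like (the commands and units below are illustrative, not the exact list log-dump.sh uses):

```go
// baseCommands would be collected on every provider; the map key becomes the
// output file name, the value is the command to run on the node.
var baseCommands = map[string][]string{
	"kubelet":    {"journalctl", "--no-pager", "-u", "kubelet.service"},
	"containerd": {"journalctl", "--no-pager", "-u", "containerd.service"},
	"cloud-init": {"cat", "/var/log/cloud-init-output.log"},
}

// withProviderCommands layers provider-specific commands on top of the base set.
func withProviderCommands(extra map[string][]string) map[string][]string {
	merged := map[string][]string{}
	for name, cmd := range baseCommands {
		merged[name] = cmd
	}
	for name, cmd := range extra {
		merged[name] = cmd
	}
	return merged
}
```

A provider could then add its own entries (e.g. the Azure VM agent log via `{"waagent": {"journalctl", "--no-pager", "-u", "waagent.service"}}`) before passing the merged map to the dump helper.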
https://github.com/vmware-tanzu/crash-diagnostics/pull/114 might come in handy here
/milestone v0.4.0
/kind feature
@randomvariable you mentioned CAPA doing this slightly differently, can you please share a link to where that's being done?
@CecileRobertMichon, yup.
We have a ticker to scrape logs every 60s in https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/test/e2e/e2e_suite_test.go#L244
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/test/e2e/common.go then contains the implementation, which starts a shell using AWS Session Manager on each machine and scrapes the output of commands. Commands are defined at https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/test/e2e_new/common.go#L121
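The ticker part itself is small; a rough sketch of the pattern (function names are illustrative, not CAPA's actual helpers):

```go
package e2e

import (
	"context"
	"time"
)

// watchMachineLogs periodically invokes a log-dump function until ctx is cancelled,
// e.g. started in the suite setup and stopped in the suite teardown.
func watchMachineLogs(ctx context.Context, interval time.Duration, dump func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			dump(ctx)
		}
	}
}
```

It would be started with something like `go watchMachineLogs(ctx, 60*time.Second, dumpMachineLogs)` during suite setup, where `dumpMachineLogs` is a placeholder for whatever collector the provider uses.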
wdyt about a systemd service that exposes the logs? It would make the solution cloud-agnostic. It might even be possible with systemd Conflicts= properties to switch to a local port once the kubelet comes up and is healthy.
wdyt about a systemd service that exposes the logs? It would make the solution cloud-agnostic. It might even be possible with systemd Conflicts= properties to switch to a local port once the kubelet comes up and is healthy.
I do worry about the potential to expose information to users who should not have access to the logs with this approach. Limiting remote access once the kubelet is available would narrow the window in which that is possible, but I think we need to be cautious here, especially since the output of these logs could contain sensitive information such as bootstrap tokens.
@moshloop isn't using a daemonset already cloud-agnostic? Is the benefit of using a systemd service mostly to be able to dump those logs before the kubelet is up?
isn't using a daemonset already cloud-agnostic? Is the benefit of using a systemd service mostly to be able to dump those logs before the kubelet is up?
Correct, we would want a systemd service up until the daemonset takes over and/or the kubelet is up and healthy.
I do worry about the potential to expose information
I agree this would need to be something that is opt-in only. There are measures that can be taken to ensure it is secure:
This could even result in an overall increased security posture if it replaces SSH entirely
kubectl log machine/machine-a would then be a smallish step away
My intent when writing this issue was really focused on improving the visibility only for E2E tests (right now there is none on machines), and given this context:
kubectl log machine/machine-a or something similar to be used in a production cluster goes far beyond the original scope of this issue, so, personally, I would prefer to stick to the Azure approach (even if it reports data only for nodes where the kubelet has started) and eventually iterate in the future.
/assign
/lifecycle active
In the PR I have switched to a provider-specific approach because, while investigating recent flakes in the v1.19.0 upgrade, we saw that the most critical issues to investigate are the cases where the kubelet does not start.