[this is a mirror of https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/761]
Right now the CAPI test framework only collects logs from the management cluster. We should also collect workload cluster logs.
In CAPZ, for conformance tests, we have https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/hack/log/log-dump.sh#L88 to get the logs directly from the machines.
We should run the same thing in E2E (possibly with a configurable list of commands).
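As a rough sketch of what this could look like in the E2E framework (everything below is illustrative: the daemonset label, namespace, and helper name are assumptions, not existing code), we could exec each configured command inside the pods of a log-collection daemonset and save the output per node:

```go
package e2e

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// dumpNodeLogs execs each configured command inside the pods of a log-collection
// daemonset (assumed to be labelled app=log-dump in kube-system, one container per pod)
// and writes the output under artifactDir/<node-name>/<command-name>.log.
func dumpNodeLogs(ctx context.Context, cfg *rest.Config, cs kubernetes.Interface, artifactDir string, commands map[string][]string) error {
	pods, err := cs.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{LabelSelector: "app=log-dump"})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		nodeDir := filepath.Join(artifactDir, pod.Spec.NodeName)
		if err := os.MkdirAll(nodeDir, 0o750); err != nil {
			return err
		}
		for name, cmd := range commands {
			req := cs.CoreV1().RESTClient().Post().
				Resource("pods").Namespace(pod.Namespace).Name(pod.Name).SubResource("exec").
				VersionedParams(&corev1.PodExecOptions{Command: cmd, Stdout: true, Stderr: true}, scheme.ParameterCodec)
			exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
			if err != nil {
				return err
			}
			var stdout, stderr bytes.Buffer
			if err := exec.Stream(remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
				// Keep going: a failure on one node should not lose the logs from the others.
				fmt.Printf("failed to run %q on node %s: %v\n", name, pod.Spec.NodeName, err)
				continue
			}
			if err := os.WriteFile(filepath.Join(nodeDir, name+".log"), stdout.Bytes(), 0o640); err != nil {
				return err
			}
		}
	}
	return nil
}
```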
/area testing
/milestone v0.3.x
/priority important-longterm
/help
@vincepri:
This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/milestone v0.3.x
/priority important-longterm
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This may be a bit tough to generalize across providers, since the mechanism for connecting to remote hosts differs from provider to provider.
For example, possibly using Azure Bastion to access Azure hosts, AWS Session Manager to connect to AWS instances, docker exec for CAPD, or ssh for more general use cases.
Any common tooling we provide here would have to account for both custom commands and custom remote connection mechanisms.
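A minimal sketch of what such a pluggable hook could look like (names and placement are hypothetical, not an existing framework API):

```go
package framework // hypothetical placement in the CAPI test framework

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// MachineLogCollector is a hypothetical extension point: each infrastructure
// provider supplies its own connection mechanism (Azure Bastion, AWS Session
// Manager, docker exec, plain ssh, ...) and its own command set behind a
// single method the framework can call for every workload cluster machine.
type MachineLogCollector interface {
	// CollectMachineLog gathers logs for one machine and writes them under outputPath.
	CollectMachineLog(ctx context.Context, managementClient client.Client, m *clusterv1.Machine, outputPath string) error
}
```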
@detiber check out https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/hack/log/log-dump.sh#L88, would something like it be generic enough to work across providers (i.e. using a daemonset on the target cluster to collect the logs on each VM)?
@CecileRobertMichon assuming all nodes have bootstrapped properly it could likely work, but I think we'd still need to fall back to provider-specific ways to gather logs in the event of machines that fail to properly bootstrap.
By "failed to bootstrap" do you mean the VM succeeded/is running but the node failed to kubeadm init or join the cluster? In that case we should be able to collect cloud-init logs in the same way, no?
If you're talking about an infrastructure failure (e.g. the instance doesn't come up and is in a failed state), I agree that would be provider-specific, but ideally there would already be events/logs for the InfraMachine to help us diagnose these.
By "failed to bootstrap" do you mean the VM succeeded/is running but the node failed to kubeadm init or join the cluster? In that case we should be able to collect cloud-init logs in the same way, no?
Yes, wasn't sure if we were planning on treating those separately from other logs on the host.
I agree this solution won't work in case a node fails to init/join, but IMO it is acceptable for e2e tests because it will provide more visibility on workload cluster machines as soon as the kubelet is running (right now there is no visibility at all).
The solution is provider-agnostic, but if we can keep the list of commands configurable, we can cover provider-specific logs as well.
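For example, a sketch of what a configurable command set could look like (the commands and units below are illustrative, not the exact list log-dump.sh uses):

```go
// baseCommands would be collected on every provider; the map key becomes the
// output file name, the value is the command to run on the node.
var baseCommands = map[string][]string{
	"kubelet":    {"journalctl", "--no-pager", "-u", "kubelet.service"},
	"containerd": {"journalctl", "--no-pager", "-u", "containerd.service"},
	"cloud-init": {"cat", "/var/log/cloud-init-output.log"},
}

// withProviderCommands layers provider-specific commands on top of the base set.
func withProviderCommands(extra map[string][]string) map[string][]string {
	merged := map[string][]string{}
	for name, cmd := range baseCommands {
		merged[name] = cmd
	}
	for name, cmd := range extra {
		merged[name] = cmd
	}
	return merged
}
```

A provider could then add its own entries (e.g. the Azure VM agent log via `{"waagent": {"journalctl", "--no-pager", "-u", "waagent.service"}}`) before passing the merged map to the dump helper.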
https://github.com/vmware-tanzu/crash-diagnostics/pull/114 might come in handy here
/milestone v0.4.0
/kind feature
@randomvariable you mentioned CAPA doing this slightly differently, can you please share a link to where that's being done?
@CecileRobertMichon, yup.
We have a ticker to scrape logs every 60s in https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/test/e2e/e2e_suite_test.go#L244
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/test/e2e/common.go then contains the implementation, which starts a shell using AWS Session Manager on each machine and scrapes the output of commands. Commands are defined at https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/test/e2e_new/common.go#L121
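The ticker part itself is small; a rough sketch of the pattern (function names are illustrative, not CAPA's actual helpers):

```go
package e2e

import (
	"context"
	"time"
)

// watchMachineLogs periodically invokes a log-dump function until ctx is cancelled,
// e.g. started in the suite setup and stopped in the suite teardown.
func watchMachineLogs(ctx context.Context, interval time.Duration, dump func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			dump(ctx)
		}
	}
}
```

It would be started with something like `go watchMachineLogs(ctx, 60*time.Second, dumpMachineLogs)` during suite setup, where `dumpMachineLogs` is a placeholder for whatever collector the provider uses.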
wdyt about a systemd service that exposes the logs? It would make the solution cloud-agnostic. It might even be possible with systemd Conflicts= properties to switch to a local port once the kubelet comes up and is healthy.
wdyt about a systemd service that exposes the logs? It would make the solution cloud-agnostic. It might even be possible with systemd Conflicts= properties to switch to a local port once the kubelet comes up and is healthy.
I do worry about the potential to expose information to users who should not have access to the logs with this approach. Limiting remote access once the kubelet is available would narrow the window in which that is possible, but I think we need to be cautious here, especially since the output of these logs could contain sensitive information such as bootstrap tokens.
@moshloop isn't using a daemonset already cloud-agnostic? Is the benefit of using a systemd service mostly to be able to dump those logs before the kubelet is up?
isn't using a daemonset already cloud-agnostic? Is the benefit of using a systemd service mostly to be able to dump those logs before the kubelet is up?
Correct, we would want a systemd service up until the daemonset takes over and/or the kubelet is up and healthy.
I do worry about the potential to expose information
I agree this would need to be something that is opt-in only. There are measures that can be taken to ensure it is secure:
This could even result in an overall increased security posture if it replaces SSH entirely
kubectl log machine/machine-a would then be a smallish step away
My intent when writing this issue was really focused on improving the visibility only for E2E tests (right now there is none on machines), and given this context:
kubectl log machine/machine-a or something similar to be used in a production cluster goes far beyond the original scope of this issue, so, personally, I would prefer to stick to the Azure approach (even if it reports data only for nodes where the kubelet has started) and eventually iterate in the future.
/assign
/lifecycle active
In the PR I have switched to a provider-specific approach because, while investigating recent flakes in the v1.19.0 upgrade, we saw that the most critical issues to investigate are the cases where the kubelet does not start.