Sig-release: block releases on non-synthetic clusters

Created on 14 Oct 2019  ·  36 Comments  ·  Source: kubernetes/sig-release

summarizing from #sig-release on Slack:

  • We should block on at least one real cluster doing something at least mostly covering conformance
  • Lacking alternatives, we should use what we use in blocking presubmit (cluster/kube-up.sh)
  • We should seriously consider blocking on more tests than this. lots of reasonably core features have not yet risen to conformance but are useful signal
  • If the test cases fail, please try to identify the sig owning the feature and assign them instead of punting to the owner of the provisioner. all provisioners will not scale to this, including KIND...
  • If cluster bring up or tear down _actually_ starts failing, please poke me (@BenTheElder on all mediums) or one of the OWNERS, we probably have a serious problem

Ideally we should re-introduce more "diversity" of cluster providers long term, e.g. by bringing back OpenStack (cc @dims), cluster-api-provider-* (cc @dims again 🙃) and more.

KIND is awesome 😉 but shouldn't be the only release signal running core tests (~conformance) like "can we kubectl run a pod".

cc @alejandrox1 @tpepper @justaugustus

area/release-eng area/release-team kind/bug kind/feature priority/important-longterm sig/release

/priority critical-urgent
/milestone v1.17
/area release-eng release-team
/sig release
/kind bug feature

For visibility:
@kubernetes/sig-release-admins @kubernetes/release-engineering @kubernetes/release-team

/assign @alejandrox1

Some notes to provide the full story on this issue.

The initial comment that brought this to the attention of SIG Release was the following:

We have no normal (not alpha features, not ingress tests, not skew testing, not scale testing) e2e suites on actual hardware of any sort in release-master blocking

Currently, this is true. The master-blocking dashboard is composed of the following jobs:

  • skew-cluster-latest-kubectl-stable1-gce
  • gce-cos-master-alpha-features
  • gce-device-plugin-gpu-master
  • gci-gce-ingress
  • node-kubelet-master
  • build-master-fast
  • gce-cos-master-scalability-100
  • bazel-build-master
  • bazel-test-master
  • integration-master
  • verify-master
  • kind-master-parallel
  • kind-ipv6-master-parallel

A "normal" test would be one that:

  • uses e2e.test
  • is not disruptive (disruptive tests are known to be likely to intentionally screw up the cluster)
  • is not node reboot testing (similar to above ... these tests involve rebooting a node)
  • is not scale / performance testing
  • is not e2e_node (again, targets the nodes rather than the cluster, also we already block on these)
  • is not just a suite for specific features EG GPUs or Alpha features
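
The criteria above can be read as a simple predicate over a job's properties. Here is a minimal sketch in Python, assuming each job is described by hand-set flags. The `Job` class, its field names, `is_normal`, and the job name `example-gce-conformance` are hypothetical and do not correspond to any test-infra API or real job:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical description of a CI job; every flag is set by hand."""
    name: str
    uses_e2e_test: bool = True      # runs the e2e.test binary
    disruptive: bool = False        # intentionally breaks the cluster
    reboots_nodes: bool = False     # node reboot testing
    scale_or_perf: bool = False     # scale / performance testing
    e2e_node: bool = False          # targets nodes rather than the cluster
    feature_specific: bool = False  # e.g. GPU-only or alpha-feature suites


def is_normal(job: Job) -> bool:
    """Return True when the job satisfies every criterion in the list above."""
    excluded = (
        job.disruptive
        or job.reboots_nodes
        or job.scale_or_perf
        or job.e2e_node
        or job.feature_specific
    )
    return job.uses_e2e_test and not excluded


# Flags below are an illustrative reading of the job names, not project data.
jobs = [
    Job("gce-cos-master-scalability-100", scale_or_perf=True),
    Job("gce-device-plugin-gpu-master", feature_specific=True),
    Job("gce-cos-master-alpha-features", feature_specific=True),
    Job("example-gce-conformance"),  # hypothetical job name
]
normal = [job.name for job in jobs if is_normal(job)]
print(normal)
```

Under these assumed flags, only the conformance-style job passes the predicate, which is the gap this issue describes.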

An example of such a job would be one that runs the conformance suite, one such as https://k8s-testgrid.appspot.com/sig-release-master-informing#GCE,%20master%20(dev)&width=5

It is important to note here that jobs in informing dashboards (i.e., sig-release-master-informing) are constantly monitored and any issues in them are expected to be promptly resolved, see release blocking jobs: release informing dashboard.
The release informing dashboards have less strict criteria (e.g., their jobs can run less frequently than the jobs in master blocking, and they can be more flaky). Otherwise, these jobs SHOULD block a release if there is a known issue involving Kubernetes.

Nevertheless, we require a job that tests core features in an environment that is as close as possible to what most users will experience.
As such, we should proceed and move the GCE conformance job to master blocking. It also meets most of the release blocking criteria (the job runs every 6 hours, although it actually takes 1hr30mins to run; see the GCE master (dev) graph of test duration).


Hopefully this issue highlights the need for broader collaboration and the need for more communication between SIGs.
There are existing issues similar in intent to this one: Decide on procedure for jobs entering (and leaving) Blocking https://github.com/kubernetes/sig-release/issues/774 , and Lots of kubernetes E2E tests are not executed by any specific test suite https://github.com/kubernetes/test-infra/issues/14647 .
We need release-blocking test suites that cover a wide range of tests.

With this in mind, the first and simplest action item is to move the GCE conformance tests to master blocking and to increase the frequency with which they run, in order to completely meet the release blocking criteria.
A second step, a step towards improving quality and assurance would be to investigate what holes our CI has - find out which other tests we are not running.
Furthermore, we need to collaborate with SIGs and investigate what jobs of theirs we can use to improve the quality of Kubernetes or investigate if they have similar issues.
With that, I'll get started on this and will help guide this thing to where it needs to go :smile:

Historically the GCE jobs haven't had anyone responsive owning them, which is one of the reasons they're not in Blocking in the first place. Who owns the gce-conformance job now, and will they be responsive for troubleshooting?

  • Feel free to assign me if cluster up / down / ... breaks in that job. This is a hypothetical problem anyhow. The only outages we've had are with the test-infra (boskos).

    • This same code for up/down is used in blocking presubmit, so it's hard to break

    • We do have other blocking CI already using this same tooling (ingress, GPU, reboot ...)

  • Some of the _test cases_ break or are flaky and the relevant SIGs own their tests. Please assign them. This applies to all methods of provisioning clusters.

AIUI, we had issues with upgrade / downgrade not being owned. I'm not asking to reinstate those. But cluster up/down works fine and is used extremely broadly in our CI.

@alejandrox1 @BenTheElder

@dims just curious: what OpenStack version is this job running against?
Is it running against a real cloud or devstack?

I believe this is running against "real" openstack run by CityNetwork:
https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/jobs.yaml#L568-L581

@kiwik @ZhengZhenyu @wangxiyuan @liusheng - can one of you please confirm?

https://github.com/theopenlab/openlab-zuul-jobs/blob/master/playbooks/cloud-provider-openstack-acceptance-test-e2e-conformance/run.yaml#L85

sudo -E PATH=$PATH SHELLOPTS=$SHELLOPTS ./hack/local-up-cluster.sh -O

I guess that's not providing the signal that Ben is mentioning 😅

Based on the playbooks, it seems that the Kubernetes cluster is installed using ./hack/local-up-cluster.sh inside one VM, and the OpenStack APIs are used to test the OpenStack controller with real APIs. That's a synthetic cluster.

However, the intention is to deploy the cluster as a user would, i.e. create VMs for the Kubernetes nodes and test against a cluster with multiple nodes.

@wangxiyuan is the job testing a real (non-synthetic) cluster?

@aojea, I'm not very familiar with K8S, so I can only speak from the job side. The job installs a k8s cluster (with CPO) using ./hack/local-up-cluster.sh and configures the real cloud Citynetwork as the OpenStack part in CPO. Citynetwork provides standard OpenStack APIs for testing.
I guess @adisky and @ramineni can tell more.

@aojea right now there are no multi-node jobs running, but we have one in progress for CPO migration testing which sets up a multi-node cluster: https://github.com/theopenlab/openlab-zuul-jobs/pull/649
Later, maybe we could migrate other jobs as well.
/cc @adisky

cc @chrigl

Now we have both CAPA and CAPG that are release informing:

I guess a quick summary on this to track progress and see where we can land...

First off, thank you @dims for all the work getting capg and capa in master informing!
Other than that, we have the conformance jobs running in GCE so we have some good coverage with non-synthetic clusters.


The open question right now is regarding the openstack conformance jobs.
As per @aojea's comment:

Based on the playbooks, it seems that the Kubernetes cluster is installed using ./hack/local-up-cluster.sh inside one VM, and the OpenStack APIs are used to test the OpenStack controller with real APIs. That's a synthetic cluster.

However, the intention is to deploy the cluster as a user would, i.e. create VMs for the Kubernetes nodes and test against a cluster with multiple nodes.

The openstack conformance job helps test kubernetes e2e, but Kubernetes is not set up in a way that is recommended for users. This last step is what we need.
There currently is some ongoing work on getting CI with CAPO (cluster api provider openstack) https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/484 which would solve the issues of running CI on non-synthetic clusters.

@ramineni, you mentioned theopenlab/openlab-zuul-jobs#649 , which seems to be related to https://github.com/theopenlab/openlab/issues/342 . From https://github.com/theopenlab/openlab/issues/342 it seems that there are CI jobs that test Kubernetes on OpenStack. Are the results from those jobs available somewhere where we (the kubernetes community) can monitor them?

/priority important-longterm

let's keep this open for tracking the openstack progress:

There currently is some ongoing work on getting CI with CAPO (cluster api provider openstack) kubernetes-sigs/cluster-api-provider-openstack#484 which would solve the issues of running CI on non-synthetic clusters.

Once that work is done, we can probably close this issue.

Just FYI: this will probably take a while. I don't have much time left to work on this this year, and there are some blockers, like getting a new GCB bucket, integrating OpenLab with the CAPO repo, and building images with Packer via OpenLab Zuul.

thanks for the background, @sbueringer !

Moving this to 1.18 milestone so that team can continue tracking this work, as it's longterm at this point.

/cc @kubernetes/ci-signal
could someone please take a look at this, see if there have been any major changes?

@alejandrox1 we have enabled the cpo migration tests on multinode cluster deployed using kubeadm.
job name: cloud-provider-openstack-multinode-csi-migration-test
http://status.openlabtesting.org/builds?project=kubernetes%2Fcloud-provider-openstack&pipeline=periodic-14%09

@dims @BenTheElder -- Where did we land here? Good to close?

@justaugustus The OpenStack test is almost ready. Only a few failures in the Conformance tests to debug & send the results to TestGrid. But I'm not sure if you want to keep this issue open as we already have some tests via other ClusterAPI providers (afaik)

Can we keep this open until the openstack tests are all set? Just so we know what all is out there

@justaugustus @dims @alejandroEsc

OpenStack tests are mostly working now. They are run twice a day against Kubernetes master: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-openstack#capo-conformance-stable-k8s-master

The only remaining issue is that they don't work if they are executed on the wrong OpenLab cluster. There seems to be a difference in the CPU used (for more details see: theopenlab/openlab#504).

But in these runs the cluster cannot be set up, so the job fails before executing the e2e tests and uploading logs to the bucket. So the impact on the runs shown in testgrid is that they are less frequent but not flaky. If that's good enough for now you can close this issue here, I'll try to fix the OpenLab issue anyway.

Btw: is it possible to remove/hide old runs or do they rotate at some point in testgrid? The problem is that I executed the e2e tests in parallel during the first runs, and now I have tests like BeforeSuite in testgrid > https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-openstack#capo-conformance-stable-k8s-master

Btw: is it possible to remove/hide old runs or do they rotate at some point in testgrid?

They're considered stale after num_columns_recent

https://github.com/kubernetes/test-infra/blob/master/testgrid/config.md#what-counts-as-recent

If you're not explicitly setting that value, the defaults are here https://github.com/kubernetes/test-infra/blob/master/config/testgrids/default.yaml

So you should only have to wait 10 runs

I'm sorry, but these CAPO jobs do not seem to test non-synthetic clusters, judging from the jobs and from the issue https://github.com/theopenlab/openlab/issues/504 :

The CAPO tests installs a devstack and then spins up a Kubernetes cluster via CAPO. In failed runs the Kubernetes master node doesn't come up (server boots but Kubernetes master components are in CrashLoop because of CPU starvation). I guess that's either caused by over-provisioning or some problem with nested virtualization

The test does everything in a fully virtualized environment. That's OK for e2e tests, but the goal of this issue is to provide a signal on real environments. In this case, the test should create clusters directly on the OpenStack from CityNetwork, not in devstack, the same way the GCE ones do.

The CAPA and CAPG jobs, for example, seem to create the machines on the cloud providers:

+ kubectl get machines --kubeconfig=/root/.kube/kind-config-clusterapi
NAME                                    PROVIDERID                    PHASE
test-1586834270-controlplane-0          aws:////i-08218c17884e42385   running
test-1586834270-controlplane-1          aws:////i-007b9c4ebbdfd1b70   provisioned
test-1586834270-controlplane-2          aws:////i-06cbe830be2e45c71   running
test-1586834270-md-0-6b49b79599-4vlmm   aws:////i-085de76c494a5c411   provisioned
test-1586834270-md-0-6b49b79599-gwr6q   aws:////i-0d3d76afaa1efcf12   provisioned

and

NAME                          PROVIDERID                                                        PHASE
test1-controlplane-0          gce://k8s-jkns-gci-gce-slow-1-3/us-east4-a/test1-controlplane-0   running
test1-controlplane-1          gce://k8s-jkns-gci-gce-slow-1-3/us-east4-b/test1-controlplane-1   provisioned
test1-controlplane-2          gce://k8s-jkns-gci-gce-slow-1-3/us-east4-c/test1-controlplane-2   provisioned
test1-md-0-585dd5848b-4gx9s   gce://k8s-jkns-gci-gce-slow-1-3/us-east4-a/test1-md-0-26lpc       provisioned
test1-md-0-585dd5848b-l89m8   gce://k8s-jkns-gci-gce-slow-1-3/us-east4-a/test1-md-0-6n2gn       provisioned
[04:08:08] Total machines : 5 / Running : 1 .. waiting for 10 seconds
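
The listings above show which provider created each machine through the scheme of its PROVIDERID (`aws://`, `gce://`). Here is a minimal sketch in Python that tallies those schemes from such output, assuming the three-column layout shown above. The function name `provider_counts` and the `SAMPLE` text (two rows copied from the listings) are illustrative only. The scheme identifies the provider, but it cannot tell a real cloud from a devstack-hosted one, so that still has to be checked against the job setup:

```python
from typing import Dict

# Two rows copied from the listings above, plus header and log noise.
SAMPLE = """\
NAME                                    PROVIDERID                    PHASE
test-1586834270-controlplane-0          aws:////i-08218c17884e42385   running
test1-controlplane-0          gce://k8s-jkns-gci-gce-slow-1-3/us-east4-a/test1-controlplane-0   running
[04:08:08] Total machines : 5 / Running : 1 .. waiting for 10 seconds
"""


def provider_counts(listing: str) -> Dict[str, int]:
    """Count machines per provider-ID scheme in `kubectl get machines` output."""
    counts: Dict[str, int] = {}
    for line in listing.splitlines():
        fields = line.split()
        # Machine rows have exactly NAME, PROVIDERID, PHASE; skip everything else.
        if len(fields) != 3 or "://" not in fields[1]:
            continue
        scheme = fields[1].split("://", 1)[0]
        counts[scheme] = counts.get(scheme, 0) + 1
    return counts


print(provider_counts(SAMPLE))
```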

@aojea Oh yes, you're right. I'm installing Devstack and then an OpenStack cluster inside of devstack. So basically one more layer of virtualization, and only a development distro of OpenStack. I had to use devstack because CityNetwork doesn't provide the OpenStack LBaaS API.

Just for clarification and to avoid misunderstandings, that's perfectly fine for e2e testing, ... , just that it is not the signal that this specific issue is looking for :sweat_smile:

@aojea yup got it 😃

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

  • we can run way more than conformance now on kind and the gap is closing synthetic-wise
  • we do block releases on other clusters

to keep it from showing up in the "things that have been forgotten" list
/remove-lifecycle rotten
