Website: Travis CI is slow and fragile

Created on 12 Mar 2019 · 4 Comments · Source: kubernetes/website

This is a...

  • [x] Feature Request
  • [x] Bug Report

whynotboth

Problem

Currently, k/website uses the free tier of Travis for CI testing. This worked acceptably well in the repository's first years, but as the volume of PRs has increased, so have build times and fragility.

Build times are slow

It's common to see Travis build times greater than 5 minutes, sometimes greater than 10 minutes, and greater than 20 minutes during periods of peak traffic. Other repos across the K8s org experience the same build times.

Travis is increasingly fragile

See #2210, #3360, #13122, https://github.com/kubernetes/website/pull/13128#issuecomment-472058425

In conclusion:

gross

Proposed solution:

It seems our options are either to:

  • Pay for a higher usage tier org-wide on Travis; or
  • Migrate k/website's CI from Travis to an alternative like Prow

I've asked in Slack/#sig-testing about the feasibility of migrating to Prow; any action here is naturally dependent upon input from @kubernetes/sig-testing. 👋

UPDATE: @DanyC97 was way ahead of me on Slack.

/cc @DanyC97 @stevekuznetsov

All 4 comments

Just spent some time investigating how fragile the gate is and pondering on alternatives for acceleration. The findings:

  1. the command git clone https://github.com/kubernetes/kubernetes takes up most of the test time, usually more than 400 seconds, i.e. over 6 minutes.
  2. examples are relatively stable; PRs touching them are very rare.
  3. testing of examples is needed, since our audience may see them as official examples and we encourage people to follow our tasks documentation to learn Kubernetes.
  4. Travis does support some conditionals that we can leverage.
  5. there seem to be some Prow plugins that can add labels to a PR if it touches files under, e.g., 'content/en/examples'.

For bullet 1 above, we can try replacing it with wget https://github.com/kubernetes/kubernetes/archive/v1.13.4.tar.gz, for example, which would reduce the amount of code pulled from 900+ MB to 30 MB.
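A minimal sketch of what that replacement could look like, assuming the tests only need a source snapshot and not the git history. The variable names K8S_VERSION and K8S_DIR and both function names are hypothetical, not taken from the repository's CI configuration.

```shell
#!/usr/bin/env bash
# Sketch only: fetch a pinned source snapshot instead of cloning full history.
set -euo pipefail

# Hypothetical knobs; the version matches the example in the comment above.
K8S_VERSION="${K8S_VERSION:-v1.13.4}"
K8S_DIR="${K8S_DIR:-kubernetes}"

# Build the GitHub archive URL for a given tag.
tarball_url() {
  printf 'https://github.com/kubernetes/kubernetes/archive/%s.tar.gz\n' "$1"
}

# Download the tarball and unpack it into $K8S_DIR.
fetch_k8s() {
  mkdir -p "$K8S_DIR"
  # --strip-components=1 drops the archive's single top-level directory.
  wget -qO- "$(tarball_url "$K8S_VERSION")" \
    | tar -xz -C "$K8S_DIR" --strip-components=1
}
```

Calling fetch_k8s where the git clone step used to run would leave the sources in $K8S_DIR. Note that the snapshot has no .git directory, so if any test step relies on git metadata, a shallow clone (git clone --depth 1 --branch v1.13.4) may be the safer variant.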

Considering bullets 2, 4, and 5 above, maybe we can:

  • have Prow automatically label a PR with "examples" if the PR touches examples, and/or
  • revise the .travis.yml file to make the "Examples Test" job conditional
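One way the conditional could be approximated is with a path check inside the job's script step rather than in Travis's own condition syntax. The sketch below is an assumption, not the project's actual configuration: the function names are hypothetical, and the HEAD~1..HEAD fallback is only for running outside Travis. TRAVIS_COMMIT_RANGE is the commit range Travis exposes for a build.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether the examples tests need to run for this build.
set -eu

# Path prefix mentioned in the thread.
EXAMPLES_PREFIX="content/en/examples/"

# Reads file paths on stdin; succeeds if any path is under the examples directory.
touches_examples() {
  grep "^${EXAMPLES_PREFIX}" >/dev/null
}

# Lists the files changed in the build's commit range.
# The fallback range is a hypothetical default for local runs.
changed_files() {
  git diff --name-only "${TRAVIS_COMMIT_RANGE:-HEAD~1..HEAD}"
}

# Prints "run" if the change set touches examples, otherwise "skip".
examples_gate() {
  if changed_files | touches_examples; then
    echo "run"
  else
    echo "skip"
  fi
}
```

The "Examples Test" job could call examples_gate first and exit early on "skip", so the expensive source download only happens for PRs that actually change examples.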

@tengqm 👋

For bullet 1 above, we can try replacing it with wget https://github.com/kubernetes/kubernetes/archive/v1.13.4.tar.gz

400 seconds is a big chunk of time!

I'm curious what kind of improvement we would see by changing the command only in k/website, given that our Travis account is org-wide. Do you think that, in order to see a marked improvement, we'd have to propagate the change to every org repo using Travis? (23 repos, according to k/test-infra/prow/config.yaml.) Likewise, I wonder whether such an improvement would resolve technical debt or merely postpone it.

I agree that testing examples is necessary. In addition to automatic labeling for PRs to files in the examples path, we could add an OWNERS file to content/en/examples/ with more restrictive approval permissions, thereby ensuring review from a specialized pool. Given the rarity of changes to examples, such restrictions wouldn't seem to impose an undue burden on reviewers.

What do you think?

@zacharysarah I'll spend some time testing the ideas above to see whether, and by how much, we can improve things. Automated testing is still preferred over human inspection (see #13145, where the missing serviceAccountName was detected by the unit tests). Will get back to you after the experiments.

My $0.02:

examples are relatively stable, PRs touching them are very rare

I wouldn't say PRs touching them are rare; I would say they will be rare once I manage to _collect_ all the examples that are currently in the docs but not in files. I've started down that road, but it's not complete (yet).

testing of examples is needed, since our audience may see them as official examples and we encourage people to follow our tasks documentation to learn Kubernetes.

100%

In addition to automatic labeling for PRs to files in the examples path, we could add an OWNERS file to content/en/examples/ with more restrictive approval permissions, thereby ensuring review from a specialized pool

I personally wouldn't go that route; whoever is reviewing and approving can pay extra attention to the examples. That said, I don't see why examples need special treatment; all the content is equal, I think.

Automated testing is still preferred over human inspection (see #13145, where the missing serviceAccountName was detected by the unit tests)

Again, I agree. The issue @tengqm found and fixed in #13145 was not introduced by this PR; it had been in the docs for a while. It is a valid point, however.

And this indeed reinforces what @tengqm said about testing. Hopefully we will have tested examples once I complete the journey I mentioned above.
