GitHub is currently experiencing an outage. Kops uses raw.githubusercontent.com to fetch the channels files.
During this outage, kops fails with:
kops update cluster
Using cluster from kubectl context: <my-cluster>
I0402 13:25:47.824966 5970 context.go:249] hit maximum retries 5 with error unexpected response code "500 Internal Server Error" for "https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable": 500: Internal Server Error
error reading channel "https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable": unexpected response code "500 Internal Server Error" for "https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable": 500: Internal Server Error
We should have kops handle this more gracefully, continuing on whenever that's possible. For example, update cluster should still work and simply not report any channel updates, but create cluster may not be able to, since it might rely on the AMI or kubernetes version info from the channel.
I would go a step further and say that Kops should not have a hard dependency on GitHub at all. We run Kops as part of an automated service and it's alarming to me that our production environment is regularly pulling down data from GitHub.
Could the channel content be hosted in the state bucket and/or the Kops deployment bucket? (By deployment bucket I mean the bucket where Kops pulls nodeup/protokube from. I'm not sure what the Kops team calls that bucket internally.)
Ran into this in a prod environment. GitHub seemed terribly slow and kept failing in my browser too, and the kops update cluster command kept failing to get the channels file. Is there a way to override the URL right now, before we have a better fix for this?
These days GitHub has quite a few issues, and this causes small "outages" because one can't upgrade/provision clusters while it's down. Anything we can help with to get this issue moving?
Will add this on the discussion list for the next office hours and see what would be the best thing to do. :)
We could store the latest downloaded channel information in the S3 state bucket. That would give actions like update cluster a "last known" copy of the channel information to fall back on.
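A rough sketch of what I have in mind (nothing here matches actual kops internals; loadChannel, fetch, and the local cache path standing in for the state bucket are all made up for illustration): try the upstream URL first, stash a copy on success, and fall back to the last stashed copy if the fetch fails.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// fetch downloads a URL and returns the body, failing on non-200 responses.
func fetch(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected response code %q for %q", resp.Status, url)
	}
	return io.ReadAll(resp.Body)
}

// loadChannel tries the upstream channel URL first and keeps a copy at
// cachePath (a local-file stand-in for the S3 state bucket). When the
// fetch fails, it returns the last known copy instead of erroring out.
func loadChannel(url, cachePath string) ([]byte, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	data, err := fetch(client, url)
	if err == nil {
		// Best-effort cache write; a real implementation would write to the state store.
		_ = os.MkdirAll(filepath.Dir(cachePath), 0o755)
		_ = os.WriteFile(cachePath, data, 0o644)
		return data, nil
	}
	cached, cacheErr := os.ReadFile(cachePath)
	if cacheErr != nil {
		return nil, fmt.Errorf("channel fetch failed (%v) and no cached copy available: %w", err, cacheErr)
	}
	return cached, nil
}

func main() {
	data, err := loadChannel(
		"https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable",
		"/tmp/kops-channel-cache/stable",
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d bytes of channel data\n", len(data))
}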
This is a really bad, sad issue, and I'm hitting it:
kops create cluster \
  --name=... \
  --state=s3://$KOPS_BUCKET \
  .... \
  --zones=... \
  --node-count=$NODE_COUNT \
  --node-size=$NODE_SIZE \
  --master-size=$NODE_MASTER_SIZE
I0423 16:08:00.851481 3141 context.go:231] hit maximum retries 5 with error unexpected response code "500 Internal Server Error" for "https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable": 500: Internal Server Error
error reading channel "https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable": unexpected response code "500 Internal Server Error" for "https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable": 500: Internal Server Error
I cannot deploy my environments now... I suppose I am not the only one stuck now.
In fact, I do not understand why kops needs those channels' contents at runtime. Echoing the idea from @geekofalltrades: could this be packaged into the release build itself, so there are no external dependencies at all?
UPDATE: If a hard dependency is needed (GitHub or S3), it would be nice to provide kops with several fallbacks to avoid a single point of failure.
In theory, some if not all of that could be skipped if someone manually sets the following:
If those are set manually, I really see no reason for the channel check to be mandatory; maybe it could even be skipped completely.
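For instance, something along these lines; --kubernetes-version and --image are existing kops create cluster flags, but whether supplying them is actually enough today to make the channel lookup optional is an assumption on my part:
kops create cluster \
  --name=... \
  --state=s3://$KOPS_BUCKET \
  --kubernetes-version=... \
  --image=... \
  --zones=...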
To give some background, the channels files serve a few purposes:
Some of this information is only relevant during certain kops commands, such as create cluster, update cluster, and upgrade cluster. Even in those cases, sometimes the information is required and sometimes it is only a "nice to have".
The intent is for this information to be decoupled from kops binaries and releases. We can update the channels file without needing to release a new kops version. Hopefully this explains why it can't live in the cluster's state store.
A while back Kops started hosting nodeup binaries in multiple locations for redundancy; you can see that in your userdata, for example. I would propose that we do the same with channels: kops could look for the channels files in multiple locations. We'll have to build out the CI tooling to update the channels file in every location whenever it changes on the master branch, but I think that is reasonable.
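To make the idea concrete, here is a minimal sketch of that lookup; the mirror URLs are placeholders (only the first one is the real current location), and none of this reflects actual kops code:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// Hypothetical mirror list; only the first URL is the real current location.
var channelLocations = []string{
	"https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable",
	"https://example-kops-mirror.s3.amazonaws.com/channels/stable",
	"https://example-kops-mirror.storage.googleapis.com/channels/stable",
}

// loadChannelFromMirrors tries each location in order and returns the first
// successful response, so a GitHub outage alone does not block kops commands.
func loadChannelFromMirrors(locations []string) ([]byte, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	var lastErr error
	for _, url := range locations {
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err
			continue
		}
		if resp.StatusCode != http.StatusOK {
			resp.Body.Close()
			lastErr = fmt.Errorf("unexpected response code %q for %q", resp.Status, url)
			continue
		}
		data, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			continue
		}
		return data, nil
	}
	return nil, fmt.Errorf("all channel locations failed, last error: %v", lastErr)
}

func main() {
	data, err := loadChannelFromMirrors(channelLocations)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("loaded %d bytes of channel data\n", len(data))
}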
I've added this to tomorrow's office hours agenda so hopefully we can find a reasonable way forward.
@rifelpet, Thanks for the explanation and good idea.
Following your proposition, I would have kops look for the channel in this order:
This would ensure that kops keeps running whatever happens to the channel file!
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten