kops version: 1.8.0
Kubernetes version: 1.8.0
Cloud provider: AWS
After the cluster state is already in the state store (e.g. the cluster has already been created, or a manifest was uploaded to S3 via kops replace -f ...), try to produce a manifest using the --name (e.g. my-cluster.k8s.local) and --state (e.g. s3://my-cluster-state-store) of the existing cluster:
kops create --dry-run ...
I0117 11:04:51.031445 28032 s3context.go:163] Found bucket "my-cluster-state-store" in region "eu-west-1"
I0117 11:04:51.031751 28032 s3fs.go:176] Reading file "s3://my-cluster-state-store/my-cluster.k8s.local/config"
cluster "my-cluster.k8s.local" already exists; use 'kops update cluster' to apply changes
No error. It should produce a manifest YAML regardless of whether a cluster with that name already exists.
Ideally, a dry run should not require the bucket to exist at all.
I need this behavior in order to automate cluster (re)deploys. I want my script to always start by generating the desired manifest via kops create --dry-run .... Because of the current behavior, the second execution of my script (i.e. after the cluster has already been created) fails.
I found a workaround though:
1) Always provide an empty bucket on kops create --dry-run ...
2) After it produces the manifest, replace the configBase field in the YAML so that it points to the real S3 bucket that will hold the cluster state instead of the empty bucket. See the sketch below.
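For reference, a rough shell sketch of this workaround (bucket names are placeholders, and sed is just one way to do the substitution):

# 1) Generate the manifest against a scratch (always-empty) state store so the
#    "already exists" check never trips.
kops create cluster --dry-run --output yaml \
  --name my-cluster.k8s.local \
  --state s3://my-scratch-state-store \
  > manifest.yaml
# 2) Point configBase back at the real state store before uploading the manifest.
sed -i 's|s3://my-scratch-state-store|s3://my-cluster-state-store|' manifest.yaml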
Related issues: https://github.com/kubernetes/kops/issues/2603, https://github.com/kubernetes/kops/issues/1984, https://github.com/kubernetes/kops/issues/4287.
Huh, so it's trying to run create against an already-created cluster. Sounds like you need a replace dry run? Why do you need a dry run with an existing cluster? I'm trying to figure out the need and a solution.
Essentially I'm writing an "idempotent" function to automate cluster deploy, in pseudocode:
deploy(cluster-config) {...}
So that I can commit the cluster config into Git repo as part of the script (which is then e.g. invoked by CI server):
prod-cluster = {name: "prod.k8s.local", state: "prod-state-store", node-count: 1, node-size: "t2.micro" ...}
deploy(prod-cluster)
Then if I want to increase the number of nodes I make a new commit changing the number in the script:
prod-cluster = {... node-count: 100 ...}
deploy(prod-cluster)
Thus deploy can be called arbitrarily often in an automated way and must depend only on its input arguments.
Currently I have this algorithm working in deploy (inspired by manifests_and_customizing_via_api.md; a bash sketch follows the pseudocode):
// Produce a cluster manifest from scratch based on the function args
manifest = kops create cluster --dry-run --output yaml
--name $name
--state $state
--node-count $node-count
--ssh-public-key $ssh-public-key ...
// + use the "empty bucket" workaround I described in the initial comment
// Upload manifest to S3.
// --force forces any changes, which will also create any non-existing resource.
kops replace --force -f /dev/stdin < manifest
// Create/update SSH public key.
// HACK: by default, if key fingerprint is different then it adds the new key into state store leading to an error on `kops update`:
// "Exactly one 'admin' SSH public key can be specified when running with AWS; please delete a key using `kops delete secret`".
// That's why delete is needed. But delete fails if there are no keys to delete, that's why exceptions are ignored.
// Also see https://github.com/kubernetes/kops/issues/4291.
ignoring-exceptions(kops delete secret sshpublickey "admin")
kops create secret sshpublickey "admin" -i $ssh-public-key
// Create/update the cluster.
kops update cluster --yes
// Apply additional changes if needed. E.g. this is needed when node EC2 size changes and instances have to be restarted.
// After cluster is created it's not available immediately, that's why a retry loop is needed.
retrying(kops rolling-update cluster --yes)
// Wait for cluster to be ready.
retrying(kops validate cluster)
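For what it's worth, here is a bash sketch of the above (names, sizes and retry parameters are placeholders; the empty-bucket workaround is compressed into a comment):

#!/usr/bin/env bash
set -euo pipefail

NAME=prod.k8s.local
STATE=s3://prod-state-store
NODE_COUNT=1
NODE_SIZE=t2.micro
SSH_PUBLIC_KEY=$HOME/.ssh/id_rsa.pub

# Retry a command every 30s, up to 20 times.
retrying() {
  for _ in $(seq 20); do "$@" && return 0 || sleep 30; done
  return 1
}

deploy() {
  # Produce the desired manifest from scratch based on the arguments
  # (in practice --state points at a scratch bucket here and configBase
  # is rewritten afterwards, per the workaround in the first comment).
  kops create cluster --dry-run --output yaml \
    --name "$NAME" --state "$STATE" \
    --node-count "$NODE_COUNT" --node-size "$NODE_SIZE" \
    --ssh-public-key "$SSH_PUBLIC_KEY" > manifest.yaml

  # Upload the manifest; --force also creates the state-store entry if it is missing.
  kops replace --force -f manifest.yaml --name "$NAME" --state "$STATE"

  # Re-create the admin SSH key; ignore the failure when there is nothing to delete.
  kops delete secret sshpublickey admin --name "$NAME" --state "$STATE" || true
  kops create secret sshpublickey admin -i "$SSH_PUBLIC_KEY" --name "$NAME" --state "$STATE"

  # Create/update cloud resources, roll nodes if needed, then wait until ready.
  kops update cluster --yes --name "$NAME" --state "$STATE"
  retrying kops rolling-update cluster --yes --name "$NAME" --state "$STATE"
  retrying kops validate cluster --name "$NAME" --state "$STATE"
}

deploy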
Ideally, I don't want to add any conditional logic based on whether the cluster is already created and/or what state it is in now. Instead, I only want to provide the new state and let kops bring the cluster to this desired state. That desired state must be inferred from the limited number of params I pass into deploy, which is why I currently have to use kops create --dry-run to produce the desired manifest.
edit: added state and name params for clarity.
@robinpercy / @mikesplain any ideas? Should --dry-run work with an existing cluster with kops create?
My initial reaction is, if I run kops create --dry-run I would want all the same checks as a regular kops create, just no actual execution. That would include the check for existing clusters, similar to the way terraform plan takes existing state into account. I think it could get confusing to let a dry run pass a check (like the existing-cluster check) that would break an actual creation.
That said, your automation seems pretty awesome. I personally don't automate creation at that level; we just version our manifests in S3. I have one concern with your constant kops create cluster. @chrislovecnm can keep me honest on this, but I think there have been times where kops assumed certain changes would only be applied automatically to a new cluster, so by running it frequently you may eventually get a new flag defaulted in a way you would not expect.
For instance, at one point:
iam:
  legacy: true
was added to older kops clusters to ensure the change wasn't automatically pushed to existing clusters. If you had a cluster and then regenerated your cluster manifest from scratch, you'd get legacy: false and your cluster could see unexpected changes. I'm not sure whether more changes like that will occur down the road, but it's one concern that comes to mind with the method above.
My preference would be an idempotent command like set (the closest thing I could find in kubectl):
kops set cluster --dry-run --output yaml
--name $name
--state $state
--node-count $node-count
--ssh-public-key $ssh-public-key ...
Instead, this would set those values on an existing cluster, or fall back to creating a new cluster, so a dry run against an existing cluster would be idempotent.
I see the need for a solution, but I'm not sure changing kops create is the right way. I need to think a little more on this but that's what I have so far!
@mikesplain Thanks.
I think it could get confusing allowing a dry run to pass a check (like existing cluster), that would break an actual creation.
I agree.
I can also see the problems with the legacy iam example, didn't think about such cases before.
And kops set looks neat.
@metametadata Exactly! Do you have any other thoughts? If this is something you're interested in, I'd say we should flesh out more ideas here before implementing.
We always love new contributions if you have some time to contribute something like this :)
Is there a comparable kubectl command that we can model this on? I'm not sure whether there is a 'set' in kubectl. We hope to eventually be using kubectl as well as the kops binaries once we have the kops-server running, so if we can have parity that would be awesome.
kubectl has apply, which creates the resource or brings it to the specified state using the passed configuration:
# Apply the configuration in pod.json to a pod.
kubectl apply -f ./pod.json
# Apply the JSON passed into stdin to a pod.
cat pod.json | kubectl apply -f -
The provided configuration doesn't have to contain all the fields; there are many optional fields with sensible default values. This is similar to how kops doesn't require a full manifest when creating a cluster and "infers" values from the limited number of explicitly provided CLI args.
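For example (an illustrative pod spec, not a kops manifest), only a handful of fields are required and everything else is defaulted server-side:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
# restartPolicy, dnsPolicy, imagePullPolicy, resources, etc. all fall back to defaults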
So for a moment let's imagine that kops has apply, something like:
kops apply -f ./manifest.yaml
There are several questions already:
1) I'm not sure that the --state and --name CLI args should be allowed for kops apply, since there are similar fields in the manifest (e.g. metadata/name, configBase, masterPublicName).
2) In manifest.yaml I should be able to specify only the fields I'm interested in, with the rest inferred. But I don't think this is possible, because the manifest is currently very different from what I can do with kops create. E.g. there's no nodes field in the manifest; instead there's a full-blown list of InstanceGroup resources.
3) Because kops currently allows specifying manifest fields via CLI args (e.g. kops create --nodes= ...), some users would expect something similar to be possible with kops apply, e.g. kops apply --nodes ..., or an additional command could be introduced to infer the manifest (e.g. kops generate-manifest --nodes ... > manifest.yaml). Note that there's no such thing in kubectl; it always works with YAMLs/JSONs.
4) All in all: I don't quite grasp how the kops "manifest" and cluster "state" (which is spread between config, cluster.spec and instancegroup/?) map to kubectl "configurations" and the "inner state" of Kube resources.
The good thing about `kubectl` is that users don't have to think at all about how or where the "inner state" is saved and how it's migrated between upgrades, vs. `kops`, which requires a specific state location (an S3 bucket) and encourages replacing/editing it manually.
The `kops` "manifest" is not really the same thing as a `kubectl` "configuration" and seems to be more of a snapshot of the cluster state, while the CLI args of kops (e.g. `--nodes ..`) really correspond to the `kubectl` "configuration".
So yeah, so far I don't have any good ideas on how to design the new API to resemble kubectl, because of the explicit, editable "inner state" of a kops cluster and the differences between the manifest, the S3 state and the CLI args, but I hope this comment can still help move the discussion forward.
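In the meantime, the closest approximation I can see with today's commands is replace-then-update (a sketch with placeholder names; unlike kubectl apply it needs the full manifest, and replace --force creates the state-store entry if it's missing):

kops replace --force -f manifest.yaml --state s3://my-cluster-state-store
kops update cluster --name my-cluster.k8s.local --state s3://my-cluster-state-store --yes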
To chime in, I've been looking for something exactly like kops apply as well.
I'm writing some scripts that automate the creation of our cluster, and the nodes' AWS policy needs additionalPolicies for Route53 (for k8s/external-dns). That option is not available via the CLI, but it would make more sense overall if I could kops create --dry-run a cluster, modify the YAML, and run an idempotent kops apply with the minimally correct cluster config to create or update.
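For context, the field I need to splice into the dry-run output looks roughly like this (the exact IAM statements depend on the external-dns setup):

spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "route53:ChangeResourceRecordSets",
            "route53:ListHostedZones",
            "route53:ListResourceRecordSets"
          ],
          "Resource": ["*"]
        }
      ]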
So kubectl apply is sorta like kops replace btw.
I also need an apply mechanism. I'm working on a multi-tenant cluster provisioning system that works across cloud providers, so I need to automate everything. My current approach is to run kops export kubecfg and check whether stderr contains a "not found" string; if it doesn't, I run kops get cluster -o json, otherwise I run kops create cluster --dry-run -o json. I then update the manifest and run a replace --force.
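In shell that flow looks roughly like this (names are placeholders, and the string match on "not found" is as fragile as it sounds):

if kops export kubecfg --name "$NAME" --state "$STATE" 2>&1 | grep -q "not found"; then
  # Cluster doesn't exist yet: generate a manifest from scratch.
  kops create cluster --dry-run -o json --name "$NAME" --state "$STATE" > manifest.json
else
  # Cluster exists: start from its current spec.
  kops get cluster --name "$NAME" --state "$STATE" -o json > manifest.json
fi
# ...merge in the user config here, then:
kops replace --force -f manifest.json --state "$STATE"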
My desire is to have the helpers used in create_cluster.go exported, so I can generate a mostly complete manifest, fill in the gaps with user config, and then apply that manifest. For instance, I need the code that maps subnet regions for GCP so I don't have to roll that myself. Something like this would greatly reduce the complexity of my system.
It would definitely be very beneficial for us as well to make the workflow around updating existing clusters cleaner. The earlier referenced documentation (https://kops.sigs.k8s.io/manifests_and_customizing_via_api/) sounds like what we want, but this bit:
At this time you must run kops create cluster and then export the YAML from the state store. We plan in the future to have the capability to generate kops YAML via the command line. The following is an example of creating a cluster and exporting the YAML.
is awkward when you have an existing cluster whose full YAML manifest you want to export. Is there another workflow for exporting a complete cluster manifest?
We use a separate bucket for these runs. I believe with a more recent version of kops you can use a local-file state store.