Spinnaker: Refactor and solidify contract for Kubernetes operation inputs and outputs

Created on 14 Apr 2020 · 3 comments · Source: spinnaker/spinnaker

Issue Summary:

After we refactor the manifest deployment operations to remove the need for the liveManifestCalls flag, we should take the opportunity to re-evaluate and strongly type the contracts among Deck's execution details, Orca's manifest-related tasks, and Clouddriver's manifest operations.

Cloud Provider(s):

Kubernetes V2

Feature Area:

Deck, Orca, Clouddriver

Description:

This is not a definite plan; rather, these are possibilities to consider while doing this cleanup:

  • Is it necessary to pass the entire applied manifest from Clouddriver to Orca to Deck? For users with large multi-manifests, this is bloating the execution context to the point where Deck's execution details cannot load.
  • Is Clouddriver's OperationResult the right model for the type of information we need to surface for every Kubernetes operation? For instance, it might be useful to surface more targeted information for Undo Rollout stages.
  • Let's try to remove polymorphism from the Kubernetes operation outputs. For example, the output manifests key currently stores either applied manifests or applied patches depending on the stage type, which I believe is why we poll continuously for the updated manifest rather than using the context we already have when rendering the execution details of Deploy and Patch Manifest stages.
  • Execution details need to be useful when operations fail: let's try to surface as much information as possible, such as which baked manifest we attempted to deploy, when a stage fails.
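One way to make the point about polymorphic outputs concrete would be a sealed hierarchy, so each stage type carries its own explicitly typed fields instead of overloading a single manifests key. This is purely a hypothetical sketch; the type and field names are illustrative, not Clouddriver's actual model:

```kotlin
// Hypothetical coordinates for a deployed Kubernetes resource; enough to
// identify it without carrying the full manifest body around.
data class ManifestCoordinates(
  val kind: String,
  val name: String,
  val namespace: String
)

// Hypothetical per-operation result: stage-specific details live in
// dedicated subtypes rather than in one polymorphic "manifests" key.
sealed class KubernetesOperationOutput {
  data class Deployed(val manifests: List<ManifestCoordinates>) : KubernetesOperationOutput()
  data class Patched(val target: ManifestCoordinates, val patchBody: String) : KubernetesOperationOutput()
  data class RolledBack(val target: ManifestCoordinates, val revision: Int) : KubernetesOperationOutput()
}
```

With a shape like this, Deck could render Deploy and Patch Manifest execution details from the typed fields it already has, rather than inspecting an untyped map whose meaning depends on the stage.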
Labels: no-lifecycle, provider/kubernetes-v2, sig/kubernetes

Most helpful comment

Just wanted to throw this scenario in there as well from a previous issue: https://github.com/spinnaker/spinnaker/issues/3228

In this particular pipeline structure, the fan-in/fan-out nature leads to huge payloads for downstream stages, because parentExecutions contains all the manifests (especially if deploy stages have large manifest deployments). In our case, we can even see pipeline execution context sizes approaching hundreds of megabytes in SQL (though well before that point the system is effectively breaking down, because none of this data can be fetched or updated efficiently).

Just for some additional context, we had implemented a hacky fix in our fork as far back as 1.8 (I think) to help mitigate this issue by removing a bunch of stuff from the context:

orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandler.kt:

  private fun Stage.processTaskOutput(result: TaskResult) {
    // Keys whose values are dropped from task outputs before the stage is
    // persisted, to keep the execution context small.
    val filterKeys: MutableList<String> = mutableListOf("stageTimeoutMs",
                                                        "outputs.createdArtifacts",
                                                        "outputs.manifestNamesByNamespace",
                                                        "outputs.boundArtifacts",
                                                        "outputs.manifests",
                                                        "artifacts")
    // Only retain resolvedExpectedArtifacts when the stage explicitly
    // declares expectedArtifacts in its context.
    if (!context.contains("expectedArtifacts")) {
      filterKeys.add("resolvedExpectedArtifacts")
    }
    val filteredOutputs = result.outputs.filterKeys { !filterKeys.contains(it) }
    if (result.context.isNotEmpty() || filteredOutputs.isNotEmpty()) {
      context.putAll(result.context)
      outputs.putAll(filteredOutputs)
      repository.storeStage(this)
    }
  }

This seemed to be "working" for us until we upgraded to 1.19 recently; since then the code no longer seems to take effect, even though it is still in our fork. I wasn't involved in this change in our fork, so I don't know much of the context around it, but it does seem like what we're doing probably breaks some things, since I see outputs.manifests and outputs.manifestNamesByNamespace used for certain functionality (CleanupArtifacts, traffic management).

Anyway, I just wanted to throw this out there since I've been looking into the exploding-manifests issue again recently. I'm wary of touching things because, as mentioned in this issue, there isn't a well-defined contract for what needs to be in the execution context; having such a contract, and getting manifest contents out of the execution context, would be great. From what I can tell from searching the code, most functionality just needs the manifest name, namespace, and annotations.
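If most downstream functionality really only needs the name, namespace, and annotations, the execution context could carry a trimmed summary instead of full manifest bodies. A rough sketch of that idea, where the ManifestSummary type and summarize function are hypothetical names, not existing Orca code:

```kotlin
// Hypothetical trimmed record kept in the execution context instead of the
// full manifest body: just enough for lookups that key off
// name/namespace/annotations (e.g. cleanup and traffic-management logic).
data class ManifestSummary(
  val name: String,
  val namespace: String,
  val annotations: Map<String, String>
)

// Illustrative reduction from a full (map-shaped) manifest to its summary.
fun summarize(manifest: Map<String, Any?>): ManifestSummary {
  @Suppress("UNCHECKED_CAST")
  val metadata = manifest["metadata"] as? Map<String, Any?> ?: emptyMap()
  @Suppress("UNCHECKED_CAST")
  val annotations = metadata["annotations"] as? Map<String, String> ?: emptyMap()
  return ManifestSummary(
    name = metadata["name"] as? String ?: "",
    namespace = metadata["namespace"] as? String ?: "",
    annotations = annotations
  )
}
```

Storing only summaries like this would also address the payload-size problem above, since the large spec bodies would never enter the execution context in the first place.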

As a side note, getting manifest contents out of the execution context would also help if you're deploying Secrets from a private GitHub repo or similar, since Kubernetes Secrets are effectively plaintext and currently show up there.

All 3 comments

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

(Comment repeated above as "Most helpful comment".)

Unassigning now that this is in our active tracking project.
