Spinnaker: Accounts with liveManifestCalls don't work with dynamic target selection

Created on 21 Mar 2020 · 3 Comments · Source: spinnaker/spinnaker

Overview

spinnaker/clouddriver#3227 added a flag liveManifestCalls to Kubernetes V2 accounts in clouddriver, as a workaround to some pretty drastic performance issues that were present at the time.

When the flag is enabled, the 'Deploy Manifest' stage waits for the newly-deployed resource by directly polling the cluster instead of by checking in Spinnaker's cache. This generally causes the stage to finish more quickly, as it can complete as soon as the resource is ready instead of once the new resource is reflected in the cache. This was particularly important when caching performance was still very poor, as deploys would often time out waiting tens of minutes for the cache to refresh.
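To make the timing difference concrete, here is a minimal sketch that models both waiting strategies. It is not clouddriver or orca code: the function names, the 5-second stage poll interval, and the 90-second caching cycle are all assumptions chosen for illustration.

```python
import math


def live_completion_time(ready_at: int, poll_interval: int) -> int:
    """Time at which a stage polling the cluster directly first sees the resource ready."""
    return math.ceil(ready_at / poll_interval) * poll_interval


def cached_completion_time(ready_at: int, cache_interval: int, poll_interval: int) -> int:
    """Time at which a stage polling the cache first sees the resource ready.

    The cache only reflects readiness after the first refresh that happens
    at or after the moment the resource became ready.
    """
    visible_at = math.ceil(ready_at / cache_interval) * cache_interval
    return math.ceil(visible_at / poll_interval) * poll_interval


# Hypothetical numbers: resource ready after 12s, stage polls every 5s,
# caching cycle of 90s.
live = live_completion_time(ready_at=12, poll_interval=5)
cached = cached_completion_time(ready_at=12, cache_interval=90, poll_interval=5)
print(live, cached)
```

With these assumed numbers the live path finishes at the first poll after readiness, while the cached path cannot finish before the next caching cycle completes.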

Since that flag was added, there have been significant performance improvements to the caching logic (including removing at least one O(N^2) algorithm); this means that deploys should in general complete within a few minutes with the flag disabled. That being said, many users have become accustomed to deploys completing in a matter of seconds; deprecating and removing the flag would increase this time to on the order of a caching cycle (~1-2 minutes), which would appear as a regression to end users. Based on feedback from the Kubernetes V2 SIG, we will no longer move forward with the initial plan to deprecate and remove the flag.

Issues with liveManifestCalls

While directly polling the cluster during a deploy does allow the status of the deploy to be ascertained and reported more quickly than by polling the cache, one significant disadvantage is that the stage completes before the cache reflects these changes. In general, there is a (somewhat implied) contract within Spinnaker that stages mutating infrastructure will not complete until the cache has been updated to reflect these mutations. This is so that downstream stages can use the cache as the source of truth when making decisions about their operations. (This is the origin of the many "Force Cache Refresh" tasks across the many cloud providers/stages.) It is not clear from the comments on the implementation of liveManifestMode whether the significant effects of breaking this contract were considered, or whether there was a plan for working around them.

The result is that any downstream stages that rely on the cache being up-to-date (as stages are generally allowed to do) will either fail or produce incorrect results. Some examples are:

  • Any stages that use dynamic target selection to patch/enable/disable a resource. These will look in the cache to find the oldest/newest/etc. resource, and will act based on the state of the cache when they run (which may omit a newly deployed/deleted/patched resource from a prior stage)
  • Spinnaker traffic management, which is a special case of the above point. As the traffic management functionality relies on looking in the cache for the newest/second newest/etc. replica set, it will fail if the cache does not reflect reality.
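The failure mode in both examples above can be reproduced with a small simulation. This is only a sketch: `ReplicaSet`, `select_newest`, the resource names, and the timestamps are hypothetical stand-ins, not Spinnaker types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaSet:
    name: str
    created_at: int  # hypothetical creation timestamp, in seconds


def select_newest(replica_sets):
    """Dynamic target selection reduced to its core: pick the newest resource."""
    return max(replica_sets, key=lambda rs: rs.created_at)


# State of the cluster, and a cache snapshot taken before the deploy.
cluster = [ReplicaSet("app-v001", 100), ReplicaSet("app-v002", 200)]
cache = list(cluster)

# A deploy stage using live calls completes as soon as the cluster reports
# the new replica set as ready; the cache has not refreshed yet.
cluster.append(ReplicaSet("app-v003", 300))

stale_target = select_newest(cache)    # what a cache-backed downstream stage sees
live_target = select_newest(cluster)   # what is actually newest
print(stale_target.name, live_target.name)
```

A downstream stage that consults the cache acts on `app-v002`, even though `app-v003` is already running in the cluster.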

Plan

If we are going to continue to support liveManifestCalls, we need to figure out a way to fix the above issues. This issue here is to track the general solution.

A reasonable solution might be:

  • Remove the liveManifestCalls flag so that all accounts behave the same way, and instead decide based on the type of information being requested whether to make a live call or to use the cache
  • In the case of deployments, continue to make live calls (to keep deployments fast), but figure out a way to leave the cache in a consistent state at the end of the stage (by making a targeted update) so that downstream stages can still use the cache
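As a rough illustration of the second point, the sketch below shows what a "targeted update" could look like in the abstract. Everything here is assumed for illustration: plain dictionaries stand in for the cluster and the cache, and `deploy_with_targeted_update` is a hypothetical helper, not an existing clouddriver API.

```python
def deploy_with_targeted_update(cluster: dict, cache: dict, key: str, manifest: dict) -> dict:
    """Deploy using live calls, then refresh only the affected cache entry."""
    # Stand-in for applying the manifest and polling the cluster until ready.
    cluster[key] = manifest
    # Live read of the final state of the resource that was just deployed.
    live_state = cluster[key]
    # Targeted update: write only this key instead of waiting for a full caching cycle.
    cache[key] = live_state
    return live_state


cluster = {"replicaSet app-v001": {"replicas": 3}}
cache = dict(cluster)  # cache starts out consistent with the cluster

deploy_with_targeted_update(cluster, cache, "replicaSet app-v002", {"replicas": 3})
print(cache == cluster)
```

The point of the design is that the stage still finishes quickly, but the contract that the cache reflects the mutation by the time the stage completes is preserved for the one resource the stage touched.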

Obviously that is a very vague plan, but I think it could be reasonably implemented based on some initial reading of the code. I'll update this issue as we figure out more details, and will close other issues reporting specific symptoms in favor of this issue.

no-lifecycle sikubernetes

All 3 comments

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

Unassigning now that this is in our active tracking project.

Dynamic target selection now uses live lookups only, and does not rely on the cache. This means that features that rely on dynamic target selection, such as rollout strategies and patching or deleting the oldest or newest member of a cluster, will now behave consistently regardless of whether liveManifestCalls is enabled. The following PRs enabled this fix:

This fix will be released with Spinnaker 1.23, currently scheduled for mid-October.
