spinnaker/clouddriver#3227 added a flag, liveManifestCalls, to Kubernetes V2 accounts in clouddriver as a workaround for some pretty drastic performance issues that were present at the time.
When the flag is enabled, the 'Deploy Manifest' stage waits for the newly-deployed resource by directly polling the cluster instead of by checking in Spinnaker's cache. This generally causes the stage to finish more quickly, as it can complete as soon as the resource is ready instead of once the new resource is reflected in the cache. This was particularly important when caching performance was still very poor, as deploys would often time out waiting tens of minutes for the cache to refresh.
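To make the timing difference concrete, here is a minimal Python simulation. It is only a sketch: FakeCluster, FakeCache, and wait_for_ready are hypothetical names that do not come from clouddriver or orca, and the 7-second readiness time and 90-second caching cycle are assumed numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeCluster:
    """Hypothetical stand-in for a cluster; the resource becomes ready at `ready_at` seconds."""
    ready_at: int

    def is_ready(self, now: int) -> bool:
        return now >= self.ready_at


@dataclass
class FakeCache:
    """Hypothetical stand-in for the cache; it snapshots the cluster every `cycle` seconds."""
    cluster: FakeCluster
    cycle: int

    def is_ready(self, now: int) -> bool:
        # The cache only knows what the cluster looked like at its last refresh.
        last_refresh = (now // self.cycle) * self.cycle
        return self.cluster.is_ready(last_refresh)


def wait_for_ready(source, poll_interval: int = 1, timeout: int = 600) -> int:
    """Poll `source` on a simulated clock and return the time at which it reports ready."""
    for now in range(0, timeout + 1, poll_interval):
        if source.is_ready(now):
            return now
    raise TimeoutError("resource never became ready")


cluster = FakeCluster(ready_at=7)          # assumed: resource is ready after 7 seconds
cache = FakeCache(cluster=cluster, cycle=90)  # assumed: 90-second caching cycle

live_wait = wait_for_ready(cluster)    # polling the cluster directly (flag enabled)
cached_wait = wait_for_ready(cache)    # polling the cache (flag disabled)
print(live_wait, cached_wait)
```

In this toy model the live poll returns as soon as the resource is ready, while the cached poll cannot return before the next cache refresh.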
Since that flag was added, there have been significant performance improvements to the caching logic (including removing at least one O(N^2) algorithm); this means that deploys should in general complete within a few minutes with the flag disabled. That being said, many users have become accustomed to deploys completing in a matter of seconds; deprecating and removing the flag would increase this time to roughly one caching cycle (~1-2 minutes), which would appear as a regression to end users. Based on feedback from the Kubernetes V2 SIG, we will no longer move forward with the initial plan to deprecate and remove the flag.
liveManifestCalls
While directly polling the cluster during a deploy does allow the status of the deploy to be ascertained and reported more quickly than by polling the cache, one significant disadvantage is that the stage completes before the cache reflects these changes. In general, there is a (somewhat implied) contract within Spinnaker that stages mutating infrastructure will not complete until the cache has been updated to reflect these mutations. This is so that downstream stages can use the cache as the source of truth when making decisions about their operations. (This is the origin of the many "Force Cache Refresh" tasks across the many cloud providers/stages.) It is not clear from the comments on the implementation of liveManifestMode whether the significant effects of breaking this contract were considered, or if there was a plan on how to work around them.
The result is that any downstream stages that rely on the cache being up-to-date (as stages are generally allowed to do) will either fail or produce incorrect results. Some examples are:
If we are going to continue to support liveManifestCalls, we need to figure out a way to fix the above issues. This issue tracks the general solution.
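The broken contract described above can be illustrated with a small Python sketch. Everything here is hypothetical: ManifestCache is a toy stand-in for Spinnaker's cache, and the manifest names and images are made up. It is not meant to reproduce any specific reported symptom, only the general failure mode.

```python
from typing import Optional


class ManifestCache:
    """Hypothetical stand-in for Spinnaker's cache of cluster state."""

    def __init__(self) -> None:
        self._snapshot: dict[str, dict] = {}

    def refresh(self, cluster: dict[str, dict]) -> None:
        # Copy the cluster state as it exists right now.
        self._snapshot = {name: dict(m) for name, m in cluster.items()}

    def get(self, name: str) -> Optional[dict]:
        return self._snapshot.get(name)


# Made-up cluster contents, cached before the deploy starts.
cluster: dict[str, dict] = {"deployment my-app": {"image": "my-app:1"}}
cache = ManifestCache()
cache.refresh(cluster)

# Deploy with live calls: the cluster is mutated and readiness is confirmed
# against the cluster itself, so the stage finishes without a cache refresh.
cluster["deployment my-app"] = {"image": "my-app:2"}
cluster["configmap my-app-config"] = {"data": "new"}

# A downstream stage that trusts the cache now sees pre-deploy state.
stale_image = cache.get("deployment my-app")["image"]  # incorrect result
missing = cache.get("configmap my-app-config")         # lookup finds nothing

# Only after the next caching cycle (or a forced refresh) does the cache catch up.
cache.refresh(cluster)
fresh_image = cache.get("deployment my-app")["image"]
print(stale_image, missing, fresh_image)
```

The stale read and the missing entry correspond to the two outcomes mentioned above: an incorrect result and a failure.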
A reasonable solution might be:
liveManifestCalls flag so that all accounts behave the same way, and instead decide based on the type of information being requested whether to make a live call or to use the cache
Obviously that is a very vague plan, but I think it could be reasonably implemented based on some initial reading of the code. I'll update this issue as we figure out more details, and will close other issues reporting specific symptoms in favor of this issue.
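As a rough illustration of deciding per request type, here is a minimal Python sketch. The RequestKind values, the LIVE_KINDS set, and get_manifest are all assumptions made up for this example; which request types should actually go to the cluster is exactly the open question, so treat the split below as a placeholder rather than a proposal.

```python
from enum import Enum, auto
from typing import Callable


class RequestKind(Enum):
    """Hypothetical categories of manifest lookups."""
    DEPLOY_STATUS = auto()
    TARGET_SELECTION = auto()
    UI_LISTING = auto()


# Assumption: lookups that drive pipeline decisions go to the cluster,
# everything else is served from the cache.
LIVE_KINDS = frozenset({RequestKind.DEPLOY_STATUS, RequestKind.TARGET_SELECTION})

Lookup = Callable[[str], dict]


def get_manifest(kind: RequestKind, name: str, live: Lookup, cached: Lookup) -> dict:
    """Route a lookup to the live cluster or the cache based on the request kind."""
    source = live if kind in LIVE_KINDS else cached
    return source(name)


# Toy lookups that just record where the answer came from.
def live_lookup(name: str) -> dict:
    return {"name": name, "source": "live"}


def cached_lookup(name: str) -> dict:
    return {"name": name, "source": "cache"}


status = get_manifest(RequestKind.DEPLOY_STATUS, "deployment my-app", live_lookup, cached_lookup)
listing = get_manifest(RequestKind.UI_LISTING, "deployment my-app", live_lookup, cached_lookup)
print(status["source"], listing["source"])
```

The point of routing on the request rather than on the account is that every account then behaves the same way for a given kind of lookup.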
Unassigning now that this is in our active tracking project.
Dynamic target selection now uses live lookups only, and does not rely on the cache. This means that features that rely on dynamic target selection, such as rollout strategies and patching or deleting the oldest or newest member of a cluster, will now behave consistently regardless of whether liveManifestCalls is enabled. The following PRs enabled this fix:
This fix will be released with Spinnaker 1.23, currently scheduled for mid-October.
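For readers unfamiliar with dynamic target selection, the following Python sketch shows the general shape of picking the newest or oldest member of a cluster from a live listing. It is not the code from the PRs: select_target and list_live are hypothetical names, the ReplicaSet names are made up, and the only Kubernetes detail assumed is that each object carries a metadata.creationTimestamp in RFC 3339 UTC form.

```python
from datetime import datetime
from typing import Callable


def parse_ts(ts: str) -> datetime:
    """Parse a timestamp such as 2020-09-01T12:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def select_target(list_live: Callable[[], list[dict]], criteria: str) -> dict:
    """Pick the newest or oldest member using a fresh listing, never a cached one."""
    members = list_live()  # hypothetical live call against the cluster
    if not members:
        raise LookupError("no members found in cluster")
    pick = {"newest": max, "oldest": min}[criteria]
    return pick(members, key=lambda m: parse_ts(m["metadata"]["creationTimestamp"]))


# Made-up listing standing in for what the cluster would return right now.
def list_replica_sets() -> list[dict]:
    return [
        {"metadata": {"name": "my-app-v001", "creationTimestamp": "2020-09-01T12:00:00Z"}},
        {"metadata": {"name": "my-app-v002", "creationTimestamp": "2020-09-02T08:30:00Z"}},
    ]


newest = select_target(list_replica_sets, "newest")["metadata"]["name"]
oldest = select_target(list_replica_sets, "oldest")["metadata"]["name"]
print(newest, oldest)
```

Because the listing is fetched at selection time, the result does not depend on whether the cache has caught up with a preceding deploy.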