Containers-roadmap: [EKS] [request]: EKS 1.12 to 1.13 Upgrade Didn't create eks.privileged by default

Created on 17 Sep 2019 · 13 comments · Source: aws/containers-roadmap

Tell us about your request
What do you want us to build?

I'm looking for the documentation to be consistent on whether the eks.privileged PodSecurityPolicy is created when upgrading to EKS 1.13.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We recently upgraded EKS from 1.12 to 1.13, which caused a short outage in our cluster when the eks.privileged PSP was not created by default, as documented in this blog post: https://aws.amazon.com/blogs/opensource/using-pod-security-policies-amazon-eks-clusters/ (near the end is the line "For clusters that have been upgraded from previous versions, a fully-permissive PSP is automatically created during the upgrade process").

This PSP not existing blocked any app not using its own PSP from launching, and it also prevented new worker nodes from coming online, because the aws-node DaemonSet responsible for managing the CNI could not launch either.

Are you currently working around this issue?
How are you currently solving this problem?

We manually deployed the PSP to recover the cluster.
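For reference, the recovery manifest looks roughly like the following (a sketch assembled from the AWS EKS pod security policy documentation; annotations and labels may differ slightly by version, so verify against the official docs before applying):

```yaml
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: eks.privileged
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
  labels:
    kubernetes.io/cluster-service: "true"
    eks.amazonaws.com/component: pod-security-policy
spec:
  # Fully permissive policy, equivalent to running without PSP admission
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities: ['*']
  volumes: ['*']
  hostNetwork: true
  hostPorts:
    - min: 0
      max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  readOnlyRootFilesystem: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eks:podsecuritypolicy:privileged
  labels:
    kubernetes.io/cluster-service: "true"
    eks.amazonaws.com/component: pod-security-policy
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    verbs: ['use']
    resourceNames: ['eks.privileged']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eks:podsecuritypolicy:authenticated
  labels:
    kubernetes.io/cluster-service: "true"
    eks.amazonaws.com/component: pod-security-policy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eks:podsecuritypolicy:privileged
subjects:
  # Grants every authenticated user (including service accounts) use of the PSP
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: system:authenticated
```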

Additional context
Anything else we should know?

  • As stated previously, the blog post said this would be created automatically, but there is no such statement in the PSP page of the official documentation: https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html

  • Assuming this wasn't a one-time fluke, this might be reproducible by launching any pre-1.13 cluster and upgrading it. If it somehow matters, our cluster started on 1.10 and has followed each minor version (1.10, 1.11, 1.12, 1.13).

EKS Proposed


All 13 comments

We upgraded from EKS 1.12 to 1.13 today and experienced the same issue. The eks.privileged PodSecurityPolicy and the associated ClusterRole and ClusterRoleBinding were not created automatically.

Please provide cluster ARNs and we will look into it.

We faced the same issue yesterday during an EKS upgrade from 1.12 to 1.13 (in the us-east-1 region), and we ended up adding the 3 missing components (the eks.privileged PodSecurityPolicy, ClusterRole & ClusterRoleBinding) using the YAML files we obtained from the AWS documentation.

We did the same upgrades in other environments on Wed, Sep 11 and Thu, Sep 12, so it looks like this is a recent issue.

We also noticed that on this occasion, our Terraform upgrade implementation took considerably less time.

Hope this helps others having the same issue! @kashook @AlexMichaelJonesNC

@agustin-caylent Could you provide the impacted cluster ARN? We've been trying to reproduce this on our end without luck, so it must be a race condition of some sort.

@kashook @AlexMichaelJonesNC please provide your cluster ARNs as well to help us troubleshoot.

Thanks

@agustin-caylent @kashook @AlexMichaelJonesNC for the failure you've seen, were any PSP objects present while upgrading from 1.12 to 1.13?

Yes, our cluster did have existing PSP objects. We stood up a development copy of our cluster on 1.12 to try the 1.13 upgrade prior to upgrading our production cluster, and this issue did not occur in the dev cluster. (It also had the existing PSP objects).

I can provide our cluster ARN but would prefer not to do so in a public post. Is there a way I can provide it to you privately?

The cluster upgrade process will skip creating the eks.privileged PSP if any PSPs are already present (this is to prevent overriding any existing PSP objects created prior to the upgrade).

@kashook Please open a support case with your dev cluster and prod cluster ARNs so we can take a look.

@leakingtapan, yes we did have existing PSP objects. But we did the same upgrade the week before on the cluster from another environment (same configuration), and the error did not appear there.

I will open a support case in AWS to avoid exposing our customer's cluster ARN. @jqmichael

I created a support case with the ARNs and referenced this issue.

We also hit this issue in us-east-1. New pods aren't scheduled and show this error (events on a DaemonSet):

```
Events:
  Type     Reason        Age                     From                  Message
  ----     ------        ----                    ----                  -------
  Warning  FailedCreate  3m13s (x17 over 8m41s)  daemonset-controller  Error creating: pods "node-problem-detector-" is forbidden: unable to validate against any pod security policy: []
```

⚠ It's a very major bug in the upgrade process ⚠

Consider other tasks that EKS users might perform after an upgrade, like updating kube-proxy, the AMI, or the CNI plugin. If any of these processes is started before the PSP issue is fixed (for example, if the EKS user doesn't notice it), then parts of the cluster will go offline and any new nodes won't become Ready.

You can copy these resources from another EKS cluster: `kubectl get clusterrolebinding,clusterrole,psp -l eks.amazonaws.com/component=pod-security-policy`

Or copy them from here
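As a sketch of the copy-between-clusters approach (the kubeconfig context names `healthy-cluster` and `broken-cluster` are placeholders; this requires access to a cluster that still has the resources):

```shell
# Export the eks.privileged PSP and its ClusterRole/ClusterRoleBinding
# from a cluster that still has them
kubectl --context healthy-cluster get clusterrolebinding,clusterrole,psp \
  -l eks.amazonaws.com/component=pod-security-policy -o yaml > eks-privileged.yaml

# Re-create them on the cluster that lost them during the upgrade
kubectl --context broken-cluster apply -f eks-privileged.yaml
```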

> Could you provide the impacted cluster ARN?

@jqmichael our ARN is in support case 514398213672

> The cluster upgrade process will skip creating eks.privileged psp if any psp already present

Our other 1.12>1.13 upgrades didn't have this problem and they had PSPs already present.

> Our other 1.12>1.13 upgrades didn't have this problem and they had PSPs already present.

When were those cluster upgrades performed?

> When were those cluster upgrades performed?

About 8 weeks ago IIRC.

After creating the PSP by following this issue (https://github.com/aws/amazon-vpc-cni-k8s/issues/695), I was able to create coredns-1.2.6 successfully. Then the 1.13 worker nodes connected to the EKS cluster.
