Cert-manager: Make Route53 dns01 work with IAM roles for service accounts

Created on 3 Oct 2019 · 27 Comments · Source: jetstack/cert-manager

cert-manager should support the new IAM Roles for Service Accounts (IRSA) feature of AWS. Instead of putting the assumed role into the ClusterIssuer, it should go into the service account.

If you set everything up as described in the linked AWS page and put only the web identity role (rather than the assumed role) into cert-manager v0.10.1, you get a lot of errors:

error instantiating route53 challenge solver: unable to assume role: AccessDenied: Access denied\n\tstatus code: 403
area/acme/dns01 kind/feature priority/backlog

All 27 comments

Hey, thanks for the feedback. I'm not personally too familiar with the AWS IAM APIs.

#2083 upgraded our use of AWS client libraries to support IAM Roles for Service Accounts, as far as I am aware, and #1917 is where support for an explicit role field was added.

Would you be able to distill some of the information in those linked AWS docs for us, so we can work out what/if we need to make further changes? 😄

Hi @munnerz, of course. The whole thing works like this:

In addition to your AWS EKS Kubernetes cluster, you run an OpenID Connect identity provider for your AWS account, associated with your cluster. Because Kubernetes service accounts are first-class citizens in IAM now, you can allow them to assume an IAM role by creating a role with a trust relationship like this (identifiers redacted):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Federated":
          "arn:aws:iam::0123456789012:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/FFFFFFFFFFFFFFFF0123456789ABCDEF"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-central-1.amazonaws.com/id/FFFFFFFFFFFFFFFF0123456789ABCDEF:sub": 
            "system:serviceaccount:cert-manager:cert-manager"
        }
      }
    }
  ]
}
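The trust relationship above follows a fixed pattern, so it can be generated rather than hand-edited. The following is a minimal Python sketch under that assumption; the function name `build_trust_policy` and its parameters are hypothetical and not part of cert-manager or any AWS tooling.

```python
import json


def build_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build a trust policy that lets one Kubernetes service account assume
    the role through sts:AssumeRoleWithWebIdentity.

    oidc_provider is the issuer without the https:// prefix, for example
    "oidc.eks.eu-central-1.amazonaws.com/id/<ID>".
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # The subject claim identifies namespace and service account.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}"
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy(
        "0123456789012",
        "oidc.eks.eu-central-1.amazonaws.com/id/FFFFFFFFFFFFFFFF0123456789ABCDEF",
        "cert-manager",
        "cert-manager",
    )
    print(json.dumps(policy, indent=2))
```

The namespace and service account name in the `:sub` condition must match the service account that the cert-manager pod actually runs as.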

This role gets a policy with the permission to modify the DNS zone as you already wrote in your documentation, ideally restricted to the required zones. Let's assume the ARN of this role is arn:aws:iam::0123456789012:role/cert-manager-demo.

To tell cert-manager to actually make use of that granted permission and assume that role, you annotate the service account like this:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::0123456789012:role/cert-manager-demo
  name: cert-manager
  namespace: cert-manager

Now a mutating admission controller will modify all pods running with that service account as follows:

apiVersion: v1
kind: Pod
# ...
spec:
  # ...
  containers:
  - name: ...
    # ...
    env:
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::0123456789012:role/cert-manager-demo
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    volumeMounts:
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
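To make the effect of the injected variables concrete, here is a small Python sketch of what a client inside the pod has to do with them: find the role ARN and read the projected token. It only illustrates the idea and is not the AWS SDK's actual implementation; `load_web_identity` is a hypothetical helper name.

```python
import os
import tempfile


def load_web_identity(env):
    """Return (role_arn, token) if both injected variables are set, else None."""
    role_arn = env.get("AWS_ROLE_ARN")
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if not role_arn or not token_file:
        return None
    # open() raises PermissionError if the process may not read the token file.
    with open(token_file, encoding="utf-8") as fh:
        return role_arn, fh.read().strip()


if __name__ == "__main__":
    # Simulate the projected token with a temporary file.
    with tempfile.TemporaryDirectory() as tmp:
        token_path = os.path.join(tmp, "token")
        with open(token_path, "w", encoding="utf-8") as fh:
            fh.write("example-token\n")
        fake_env = {
            "AWS_ROLE_ARN": "arn:aws:iam::0123456789012:role/cert-manager-demo",
            "AWS_WEB_IDENTITY_TOKEN_FILE": token_path,
        }
        print(load_web_identity(fake_env))
```

A real client would then exchange the token for temporary credentials by calling sts:AssumeRoleWithWebIdentity with the role ARN.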

According to the linked blog post, you need to update the AWS SDK for Go to version 1.23.13 or beyond, which is able to process the injected information. I am not sure whether you need to change anything in the code, as External-DNS actually works and they just call AssumeRole as you do.

The cluster issuer doesn't need the role attribute as this went to the service account:

apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
  namespace: cert-manager
spec:
  acme:
    ...
    solvers:
    - selector:
        dnsZones:
        - "example.com"
      dns01:
        route53:
          region: eu-central-1
          hostedZoneID: DNSZONEIDHERE
          # no more role here

I think in the end it's just updating the AWS SDK dependency, annotating the service account and adjusting the documentation.

Update: I think https://github.com/jetstack/cert-manager/pull/2083 and https://github.com/jetstack/cert-manager/pull/2086 do exactly this, but there hasn't been a release since then :)

We've just today cut the v0.11.0-beta.0 release, which includes this (as well as quite a few other changes!)

If you've got the chance to give this a go and update this issue with the results, it'd be great to find out if there's anything more we need to do to support this 😄

Yes, just tried it, but I am unable to create a cluster issuer due to https://github.com/jetstack/cert-manager/issues/2109. When creating the custom resource definitions, I got this validation error

error: error validating "https://raw.githubusercontent.com/jetstack/cert-manager/release-0.11/deploy/manifests/00-crds.yaml": error validating data: ValidationError(CustomResourceDefinition.spec.validation.openAPIV3Schema.properties.spec.properties.solver.properties.dns01.properties.webhook.properties.config): unknown field "x-kubernetes-preserve-unknown-fields" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.JSONSchemaProps; if you choose to ignore these errors, turn validation off with --validate=false

Workaround is using --validate=false which allows creating the custom resource definitions. Tested on Amazon EKS 1.14.7.

I've added some comments in #2109 😄

And yep, that field is only present in k8s 1.15 onwards. Setting --validate=false is absolutely fine to do. The error occurs because kubectl has validation based on OpenAPI schemas, and in 1.14 that field was not present. The Kubernetes apiserver will just silently drop this field when it gets submitted to an older apiserver 😄

Hello there, I can confirm that opening port 443 as described in https://github.com/jetstack/cert-manager/issues/2109 made creating the cluster issuer work.

I also added the eks.amazonaws.com/role-arn annotation that made the admission controller mutate the cert manager pod. Here is an excerpt from kubectl describe pod:

    Environment:
      POD_NAMESPACE:                cert-manager (v1:metadata.namespace)
      AWS_ROLE_ARN:                 arn:aws:iam::xxxxxxxxxxxx:role/CertManager-demo
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from cert-manager-token-c595c (ro)

However, the challenge fails because the injected token cannot be read:

$ kubectl describe challenge ...
...
Error presenting challenge: Failed to change Route 53 record set: WebIdentityErr: unable to read file at /var/run/secrets/eks.amazonaws.com/serviceaccount/token
...
$ kubectl logs cert-manager-...
...
E1010 12:04:47.578424       1 controller.go:131] cert-manager/controller/challenges "msg"="re-queuing item  due to error processing" "error"="Failed to change Route 53 record set: WebIdentityErr: unable to read file at /var/run/secrets/eks.amazonaws.com/serviceaccount/token\ncaused by: open /var/run/secrets/eks.amazonaws.com/serviceaccount/token: permission denied" "key"="cert-manager/demo-example-com-xxxxxxxxxx-xxxxxxxxxx-xxxxxxxxxx"
...

As the container has no shell I am unable to check the current permissions, but I think it's just a small thing.

Edit: enabling the securityContext solved this. Everything is working as expected as soon as the required changes are made. The only thing left is adjusting the documentation and the Kubernetes manifests accordingly.

Changes to be made:

  • Add eks.amazonaws.com/role-arn annotation to service account
  • Set security context for cert-manager deployment as follows
  securityContext:
    fsGroup: 1001
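The thread does not spell out why fsGroup removes the "permission denied" error. One plausible reading, stated here as an assumption rather than confirmed behaviour, is that the token file is readable only by its owner unless the pod has an fsGroup, in which case it becomes group-readable for that group. The Python sketch below models a simplified POSIX read check to show the difference; the UID 1000, the owner IDs and the file modes are hypothetical values chosen for illustration.

```python
def can_read(mode, owner_uid, owner_gid, uid, gids):
    """Simplified POSIX read check: owner bits, then group bits, then other bits."""
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == owner_uid:
        return bool(mode & 0o400)
    if owner_gid in gids:
        return bool(mode & 0o040)
    return bool(mode & 0o004)


if __name__ == "__main__":
    # defaultMode 420 in the pod spec is decimal; in octal it is 0o644.
    print(oct(420))
    # Assumed case without fsGroup: owner-only file, non-root container user.
    print(can_read(0o600, 0, 0, uid=1000, gids={1000}))
    # Assumed case with fsGroup 1001: group-readable, process has group 1001.
    print(can_read(0o640, 0, 1001, uid=1000, gids={1000, 1001}))
```

Under these assumptions, the first check fails and the second succeeds, which matches the behaviour reported above.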

Hello,

I'm having some issues getting the IAM roles working with v0.11.0, and I fear I'm missing something simple.

  • I am using helm to install on Kubernetes 1.14, and have read through the release notes, and updated my ClusterIssuer to conform to the new API spec.

  • I have followed all the steps to create the IAM Roles for Service Accounts (IRSA) and attached it to the Route53 role.

  • I have removed all previous versions of cert manager, all CRDs, and deleted the cert-manager namespace.

Here is how I'm installing cert manager

  1. Apply v0.11 CRDs
kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.11/deploy/manifests/00-crds.yaml --validate=false
  2. Create the cert-manager namespace and labels (I added both to be backwards compatible)
kubectl create namespace cert-manager
kubectl label namespace cert-manager cert-manager.io/disable-validation=true
kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true
  3. I then install cert manager v0.11.0 using Helm, and a custom values.yaml
helm install --name cert-manager \
    --namespace cert-manager \
    -f values.yaml \
    --version 0.11.0 \
    jetstack/cert-manager

The values.yaml contains the annotation and securityContext @hendrikhalkow mentioned, as well as setting letsencrypt-dev as the defaultIssuer.

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxx:role/my-role

ingressShim:
  defaultIssuerName: "letsencrypt-dev"
  defaultIssuerKind: "ClusterIssuer"

securityContext:
  enabled: true
  fsGroup: 1001

  4. I then add my ClusterIssuer, which looks like this (notice the updated apiVersion)
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: letsencrypt-dev
  namespace: cert-manager
spec:
  acme:
    # The ACME server URL
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: [email protected]
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-dev
    solvers:
    - selector:
        dnsZones:
          - "mydomain"
      dns01:
        route53:
          region: us-east-1

Finally, I deploy my Ingress with the kubernetes.io/tls-acme: "true" annotation.

However, in the logs I'm seeing an error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity:

"msg"="re-queuing item  due to error processing" "error"="Failed to determine Route 53 hosted zone ID: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: c5209031-eb99-11e9-b4ee-d30fa2245e4a" "key"="monitoring/grafana-dev-us-east-1-lets-encrypt-3071181437-3860980709-2148988824" 

Any ideas? Any help would be greatly appreciated.

Hi @cookandy, the fact that you get this error message shows that you already set up your cert-manager correctly because AssumeRoleWithWebIdentity is already being attempted. Please check your trust policy (see code snippet above) where you allow the service account to perform that action.
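When STS answers with "Not authorized to perform sts:AssumeRoleWithWebIdentity", comparing the trust policy against the service account identity is a reasonable first check. The Python sketch below does that for a policy loaded as a dictionary. The helper name `allows_service_account` is hypothetical, and the check is deliberately narrow: it looks only at Effect, Action and the StringEquals condition on the `:sub` claim, and does not evaluate the Principal or other condition operators.

```python
def allows_service_account(policy, oidc_provider, namespace, service_account):
    """Return True if an Allow statement permits AssumeRoleWithWebIdentity
    for system:serviceaccount:<namespace>:<service_account>."""
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # tolerate a single statement object
        statements = [statements]
    for stmt in statements:
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if stmt.get("Effect") != "Allow" or "sts:AssumeRoleWithWebIdentity" not in actions:
            continue
        expected = stmt.get("Condition", {}).get("StringEquals", {}).get(f"{oidc_provider}:sub")
        if isinstance(expected, str):
            expected = [expected]
        if expected and subject in expected:
            return True
    return False


if __name__ == "__main__":
    provider = "oidc.eks.eu-central-1.amazonaws.com/id/FFFFFFFFFFFFFFFF0123456789ABCDEF"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{provider}:sub": "system:serviceaccount:cert-manager:cert-manager"
                    }
                },
            }
        ],
    }
    print(allows_service_account(policy, provider, "cert-manager", "cert-manager"))
    print(allows_service_account(policy, provider, "kube-system", "cert-manager"))
```

A mismatch in namespace, service account name or OIDC provider ID is enough to make the second kind of call fail, so those three values are worth checking first.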

Yes, just tried it, but I am unable to create a cluster issuer due to #2109. When creating the custom resource definitions, I got this validation error

error: error validating "https://raw.githubusercontent.com/jetstack/cert-manager/release-0.11/deploy/manifests/00-crds.yaml": error validating data: ValidationError(CustomResourceDefinition.spec.validation.openAPIV3Schema.properties.spec.properties.solver.properties.dns01.properties.webhook.properties.config): unknown field "x-kubernetes-preserve-unknown-fields" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.JSONSchemaProps; if you choose to ignore these errors, turn validation off with --validate=false

Workaround is using --validate=false which allows creating the custom resource definitions. Tested on Amazon EKS 1.14.7.

Getting the same error on GKE (GitVersion:"v1.14.6-gke.1" server, GitVersion:"v1.15.2" client) and the same workaround solves the immediate issue. It would be nice however if the validation works. Should I raise a separate issue for it or is it already on the todo list?

I had the same problems with 0.11, first with the CRD validation failing and then with cert-manager being unable to read the Service Account token:

unable to read file at /var/run/secrets/eks.amazonaws.com/serviceaccount/token

The same --validate=false and securityContext.enabled: true changes fixed these for me, and DNS challenges are working for the IAM Service Account Role as well as when assuming cross-account roles.

BTW, I noticed the cross-account instructions on the website fail to mention that the policy in account X needs to include the ability to assume the role in account A, as well as for account A to trust the role in account X.

I am having the same issue described here. After following some of the steps above on one cluster, I was able to get the Route53 dns01 challenge to work. However, I was unable to achieve that on another cluster. My conclusion is that this needs explicit documentation explaining how to update the trust relationships, policies, etc., so there is a decent guide beyond an open GitHub issue.

I am getting the error "error instantiating route53 challenge solver: unable to assume role: NoCredentialProviders: no valid providers in chain" while working with cert-manager.

I am using latest version 0.12 and I have created an IAM role for my lets-encrypt ClusterIssuer. Below is my lets-encrypt cluster issuer:

apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: xxxx-issuer
spec:
  acme:
    email: [email protected]
    privateKeySecretRef:
      name: xxxx-issuer
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
    - dns01:
        route53:
          role: arn:aws:iam::xxxx:role/kube2iam-cert_manager_role
          hostedZoneID: xxxx
          region: eu-central-1
      selector:
        dnsZones:
        - sennder.com

Below is my IAM role:
https://cert-manager.io/docs/configuration/acme/dns01/route53/#set-up-a-iam-role

@hendrikhalkow ... I'm trying to follow your instructions above to enable in EKS. Just to confirm -- "arn:aws:iam::0123456789012:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/FFFFFFFFFFFFFFFF0123456789ABCDEF" is the oidc provider associated with the EKS cluster? ... it looks suspiciously simple :) -- I worry I am not understanding something.

So ... the service account requests tokens for the role from the oidc endpoint, and injects them into pods in the cert-manager namespace, which allows the cert manager (and anything else in the namespace?) to use them. (Or is there some further config somewhere limiting what can use the serviceaccount?)

What is the security context? What is the "magic config" fsGroup: 1001? It would be useful for someone to provide links to documentation (and/or write some) for beginners like myself that aren't familiar with all the internals. (I had read https://cert-manager.io/docs/configuration/acme/dns01/route53/ ... then found this issue when I googled for what to put into the assume role policy.)

Update: I'm trying to install via kustomize. To what should I be applying the "securityContext"?

@shaunc it's really that simple. Today, my working Helm values look like this:

securityContext:
  enabled: "true"
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::000000000000:role/CertManagerRoleName

Replace the 0 digits with your actual AWS account ID. The IAM role _CertManagerRoleName_ has a policy attached that looks like this:

{
    "Statement": [
        {
            "Action": "route53:GetChange",
            "Effect": "Allow",
            "Resource": "arn:aws:route53:::change/*"
        },
        {
            "Action": "route53:ListHostedZonesByName",
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": "route53:ChangeResourceRecordSets",
            "Effect": "Allow",
            "Resource": "arn:aws:route53:::hostedzone/YOUR_HOSTED_ZONE_ID"
        }
    ],
    "Version": "2012-10-17"
}

Replace _YOUR_HOSTED_ZONE_ID_ with your actual ID of your Route53 zone. To check if everything works, check your cert manager pod:

kubectl describe pod -n cert-manager cert-manager-...

You should see a volume _aws-iam-token_ which indicates that it's working.
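The permission policy shown above can also be generated when more than one hosted zone is involved. This is a minimal Python sketch with a hypothetical helper name, `build_route53_policy`; it reuses the three actions from the policy above and only varies the hosted zone ARNs.

```python
import json


def build_route53_policy(hosted_zone_ids):
    """Build a Route53 permission policy restricted to the given hosted zones."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "route53:GetChange",
                "Resource": "arn:aws:route53:::change/*",
            },
            {
                "Effect": "Allow",
                "Action": "route53:ListHostedZonesByName",
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": "route53:ChangeResourceRecordSets",
                # One ARN per zone keeps write access limited to those zones.
                "Resource": [f"arn:aws:route53:::hostedzone/{zone}" for zone in hosted_zone_ids],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_route53_policy(["YOUR_HOSTED_ZONE_ID"]), indent=2))
```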

Note: you will still need to use spec.solvers[X].dns01.route53.role if your Route53 zones live inside another account. The reason is that the IAM-linked service roles provide temporary credentials to your pod, but they will not assume the role for you.

The examples listed above from other users assume the role lives inside the same AWS account as the Route53 zones, which means the permissions to modify DNS records are attached to the role directly.

From AWS documentation, they give an example of an AWS config:

[profile account_b_role]
source_profile = account_a_role
role_arn=arn:aws:iam::222222222222:role/account-b-role

[profile account_a_role]
web_identity_token_file = /var/run/secrets/eks.amazonaws.com/serviceaccount/token 
role_arn=arn:aws:iam::111111111111:role/account-a-role
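The two profiles form a chain: the account A profile obtains credentials from the web identity token, and the account B profile uses account A as its source. As an illustration only, the following Python sketch writes such a file with the standard library's configparser; the function name `build_aws_config` is hypothetical, and nothing in the thread says cert-manager itself reads this file.

```python
import configparser
import io


def build_aws_config(account_a_role_arn, account_b_role_arn, token_file):
    """Return AWS config text with two chained profiles."""
    config = configparser.ConfigParser()
    # First hop: exchange the projected token for account A credentials.
    config["profile account_a_role"] = {
        "web_identity_token_file": token_file,
        "role_arn": account_a_role_arn,
    }
    # Second hop: use account A credentials to assume the role in account B.
    config["profile account_b_role"] = {
        "source_profile": "account_a_role",
        "role_arn": account_b_role_arn,
    }
    buf = io.StringIO()
    config.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(
        build_aws_config(
            "arn:aws:iam::111111111111:role/account-a-role",
            "arn:aws:iam::222222222222:role/account-b-role",
            "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
        )
    )
```

For the second hop to succeed, account A's role needs permission to assume the account B role, and the account B role has to trust account A's role, as noted earlier in the thread.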

@sc250024 thanks for the details 😄 would you be able to add a note to our docs to explain this for the next person? It'd probably save them a lot of time digging out and stumbling across this issue in future! 😄

@munnerz Sure.

I'm seeing

unable to read file at /var/run/secrets/eks.amazonaws.com/serviceaccount/token\ncaused by: open /var/run/secrets/eks.amazonaws.com/serviceaccount/token: permission denied"

I haven't been able to apply the securityContext.enabled: true fix since I don't use helm and like @shaunc I cannot find much information about it.

Edit: I was able to grab the manifest for the cert-manager deployment and add

    spec:
      securityContext:
        # "enabled" is a key in the Helm chart values, not a pod securityContext field, so it is omitted here
        fsGroup: 1001
      containers: ...

Seems to work after applying that.

    spec:
      securityContext:
        # "enabled" is a key in the Helm chart values, not a pod securityContext field, so it is omitted here
        fsGroup: 1001
      containers: ...

Please add this to the cert-manager deployment. This is required in AWS so that the cert-manager pod can access the token that has the credentials.

Would somebody be able to create a PR to add this block into our Helm chart by default? I think it's harmless for users running outside of AWS, and should hopefully resolve issues like this across different cloud providers! 😄

@munnerz That's how people get coronavirus. Today, you're forcing Kubernetes annotations on them, even if they don't use it. It's a slippery slope.

@sc250024 I don't think that's very appropriate - please be mindful of your words and how they may affect others. We want to ensure nobody feels excluded here, and comments like these could really upset _anyone_ affected by the virus.


On the topic at hand, it looks like this PR: https://github.com/jetstack/cert-manager/commit/3b838758a34c1c56d20eaaa69246d68484585a2d changed how the securityContext is set.

@thismatters @derrickburns would you be able to elaborate on your solutions above with some full examples, including what version of CM you applied the patch to? As well, would you mind testing it out with the latest (v0.14.0) version of cert-manager too?

I'm not really much of a kubernetes expert, nor does startup life (amidst a pandemic) permit much time for rework. Now that I've gotten everything working just right I'm pretty reticent to touch anything.

I was patching version 0.13.1. I downloaded the manifest (https://github.com/jetstack/cert-manager/releases/download/v0.13.1/cert-manager.yaml), found deployment/cert-manager, added the aforementioned securityContext block, and applied it to my cluster.

@munnerz The PR that you mentioned works for me. There is no more need for changes to work with IAM for Service Accounts. Thanks!

It's not clear from this bug report, but it seems that I still need fsGroup: 1001 to use IRSA with cert-manager? Tested with 0.14.2, where the last-mentioned PR (#2455) is present.

I read @derrickburns's last comment as meaning that the fsGroup setting is no longer needed...

Does this block need to be mentioned in the docs?

It finally worked on my cluster when I carefully read all the instructions on this thread!

kudos to @hendrikhalkow and @munnerz !
