1. What kops version are you running? The command kops version will display this information.
Version 1.15.2 (git-ad595825a)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS GovCloud
4. What commands did you run? What is the simplest way to reproduce this issue?
Creating a cluster in an existing VPC with existing subnets and an existing Route53 Zone. The zone was specified as a Zone ID.
5. What happened after the commands executed?
W0221 23:03:15.427827 12438 executor.go:130] error running task "IAMRolePolicy/<redacted>" (9m59s remaining to succeed): error creating/updating IAMRolePolicy: MalformedPolicyDocument: Partition "aws" is not valid for resource "arn:aws:route53:::hostedzone/<redacted>".
status code: 400, request id: <redacted>
W0221 23:03:15.427860 12438 executor.go:130] error running task "DNSName/<redacted>" (9m59s remaining to succeed): error creating ResourceRecordSets: NoSuchHostedZone: The specified hosted zone does not exist.
6. What did you expect to happen?
The Route53 ARN should use the partition associated with the region, which is aws-us-gov.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-02-21T22:55:11Z"
  generation: 3
  name: <redacted>
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Owner: Ops
    Team: Ops
  cloudProvider: aws
  configBase: s3://<redacted>/<redacted>
  dnsZone: <redacted>.
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-gov-west-1b-1
      name: "1"
    - instanceGroup: master-us-gov-west-1b-2
      name: "2"
    - instanceGroup: master-us-gov-west-1b-3
      name: "3"
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-gov-west-1b-1
      name: "1"
    - instanceGroup: master-us-gov-west-1b-2
      name: "2"
    - instanceGroup: master-us-gov-west-1b-3
      name: "3"
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.15.9
  masterInternalName: api.internal.<redacted>
  masterPublicName: api.<redacted>
  networkCIDR: <redacted>
  networkID: vpc-<redacted>
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - <redacted>
  subnets:
  - cidr: <redacted>
    id: subnet-<redacted>
    name: us-gov-west-1a
    type: Private
    zone: us-gov-west-1a
  - cidr: <redacted>
    id: subnet-<redacted>
    name: us-gov-west-1b
    type: Private
    zone: us-gov-west-1b
  - cidr: <redacted>
    id: subnet-<redacted>
    name: utility-us-gov-west-1a
    type: Utility
    zone: us-gov-west-1a
  - cidr: <redacted>
    id: subnet-<redacted>
    name: utility-us-gov-west-1b
    type: Utility
    zone: us-gov-west-1b
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-02-21T22:55:11Z"
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-gov-west-1b-1
spec:
  additionalSecurityGroups:
  - sg-<redacted>
  image: ami-<redacted>
  machineType: t3.2xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-gov-west-1b-1
  role: Master
  subnets:
  - us-gov-west-1b
  tenancy: dedicated
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-02-21T22:55:11Z"
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-gov-west-1b-2
spec:
  additionalSecurityGroups:
  - sg-<redacted>
  image: ami-<redacted>
  machineType: t3.2xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-gov-west-1b-2
  role: Master
  subnets:
  - us-gov-west-1b
  tenancy: dedicated
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-02-21T22:55:11Z"
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-gov-west-1b-3
spec:
  additionalSecurityGroups:
  - sg-<redacted>
  image: ami-<redacted>
  machineType: t3.2xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-gov-west-1b-3
  role: Master
  subnets:
  - us-gov-west-1b
  tenancy: dedicated
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-02-21T22:55:11Z"
  labels:
    kops.k8s.io/cluster: <redacted>
  name: nodes
spec:
  additionalSecurityGroups:
  - sg-<redacted>
  image: ami-<redacted>
  machineType: t3.2xlarge
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-gov-west-1a
  - us-gov-west-1b
  tenancy: dedicated
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
Relevant output:
I0221 23:11:41.338498 12819 iamrolepolicy.go:147] Creating IAMRolePolicy
I0221 23:11:41.338515 12819 iamrolepolicy.go:175] PutRolePolicy RoleName=masters.<redacted> PolicyName=masters.<redacted>: {
...
  {
    "Effect": "Allow",
    "Action": [
      "route53:ChangeResourceRecordSets",
      "route53:ListResourceRecordSets",
      "route53:GetHostedZone"
    ],
    "Resource": [
      "arn:aws:route53:::hostedzone/<redacted>"
    ]
  },
...
Other ARNs in the same output, such as for S3, properly include the aws-us-gov partition.
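For reference, the expected behavior is to derive the partition from the region instead of hard-coding "aws". A minimal sketch of that idea using the aws-sdk-go endpoints package is below; this is illustrative only, not the actual kops code, and the region and zone ID values are placeholders:

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/endpoints"
)

// route53HostedZoneARN builds the hosted-zone ARN using the partition that
// matches the region, e.g. "aws-us-gov" for us-gov-west-1 rather than "aws".
func route53HostedZoneARN(region, zoneID string) string {
	partition := "aws" // fall back to the default commercial partition
	if p, ok := endpoints.PartitionForRegion(endpoints.DefaultPartitions(), region); ok {
		partition = p.ID()
	}
	// Route53 is a global service, so the ARN has no region or account fields.
	return fmt.Sprintf("arn:%s:route53:::hostedzone/%s", partition, zoneID)
}

func main() {
	// "Z0EXAMPLE" is a placeholder zone ID.
	fmt.Println(route53HostedZoneARN("us-gov-west-1", "Z0EXAMPLE"))
	// Output: arn:aws-us-gov:route53:::hostedzone/Z0EXAMPLE
}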
9. Anything else we need to know?
This was fixed in https://github.com/kubernetes/kops/issues/8359 and is included in kops 1.16, which will be released imminently. Are you able to try the latest kops 1.16 beta to confirm it is fixed?
I've downloaded 1.16.0-beta.2 and it seems to fix this ARN problem, but I suspect there is other code with a similar issue, as I now get this error:
I0221 23:27:39.849867 13631 executor.go:176] Executing task "DNSName/api.<redacted>": *awstasks.DNSName {"Name":"api.<redacted>","Lifecycle":"Sync","ID":null,"Zone":{"Name":"Z08249791WBSC69ZYDIJJ","Lifecycle":"Sync","DNSName":"<redacted>","ZoneID":"<redacted>","Private":true,"PrivateVPC":{"Name":"<redacted>","Lifecycle":"Sync","ID":"<redacted>","CIDR":"<redacted>","EnableDNSHostnames":null,"EnableDNSSupport":true,"Shared":true,"Tags":null}},"ResourceType":"A","TargetLoadBalancer":{"Name":"api.<redacted>","Lifecycle":"Sync","LoadBalancerName":"<redacted>","DNSName":"<redacted>","HostedZoneId":"<redacted>","Subnets":[{"Name":"utility-us-gov-west-1b.<redacted>","ShortName":"utility-us-gov-west-1b","Lifecycle":"Sync","ID":"subnet-8d240de9","VPC":{"Name":"<redacted>","Lifecycle":"Sync","ID":"<redacted>","CIDR":"<redacted>","EnableDNSHostnames":null,"EnableDNSSupport":true,"Shared":true,"Tags":null},"AvailabilityZone":"us-gov-west-1b","CIDR":"<redacted>","Shared":true,"Tags":{"SubnetType":"Utility","kubernetes.io/cluster/<redacted>":"shared","kubernetes.io/role/elb":"1"}},{"Name":"utility-us-gov-west-1a.<redacted>","ShortName":"utility-us-gov-west-1a","Lifecycle":"Sync","ID":"subnet-a2250fd4","VPC":{"Name":"<redacted>","Lifecycle":"Sync","ID":"<redacted>","CIDR":"<redacted>","EnableDNSHostnames":null,"EnableDNSSupport":true,"Shared":true,"Tags":null},"AvailabilityZone":"us-gov-west-1a","CIDR":"<redacted>","Shared":true,"Tags":{"SubnetType":"Utility","kubernetes.io/cluster/<redacted>":"shared","kubernetes.io/role/elb":"1"}}],"SecurityGroups":[{"Name":"api-elb.<redacted>","Lifecycle":"Sync","ID":"<redacted>","Description":"Security group for api ELB","VPC":{"Name":"<redacted>","Lifecycle":"Sync","ID":"<redacted>","CIDR":"<redacted>","EnableDNSHostnames":null,"EnableDNSSupport":true,"Shared":true,"Tags":null},"RemoveExtraRules":["port=443"],"Shared":null,"Tags":{"KubernetesCluster":"<redacted>","Name":"api-elb.<redacted>","kubernetes.io/cluster/<redacted>":"owned"}}],"Listeners":{"443":{"InstancePort":443,"SSLCertificateID":""}},"Scheme":null,"HealthCheck":{"Target":"SSL:443","HealthyThreshold":2,"UnhealthyThreshold":2,"Interval":10,"Timeout":5},"AccessLog":null,"ConnectionDraining":null,"ConnectionSettings":{"IdleTimeout":300},"CrossZoneLoadBalancing":{"Enabled":false},"SSLCertificateID":"","Tags":{"KubernetesCluster":"<redacted>","Name":"api.<redacted>","Owner":"Ops","Team":"Ops","kubernetes.io/cluster/<redacted>":"owned"}}}
I0221 23:27:39.850452 13631 request_logger.go:45] AWS request: route53/ListResourceRecordSets
I0221 23:27:39.887373 13631 dnsname.go:76] Found DNS resource "NS" "<redacted>."
I0221 23:27:39.888108 13631 dnsname.go:76] Found DNS resource "SOA" "<redacted>."
I0221 23:27:39.888532 13631 dnsname.go:76] Found DNS resource "A" "management.<redacted>."
I0221 23:27:39.888625 13631 dnsname.go:178] Updating DNS record "api.<redacted>"
I0221 23:27:39.888854 13631 request_logger.go:45] AWS request: route53/ChangeResourceRecordSets
W0221 23:27:39.908367 13631 executor.go:128] error running task "DNSName/api.<redacted>" (9m29s remaining to succeed): error creating ResourceRecordSets: NoSuchHostedZone: The specified hosted zone does not exist.
status code: 404, request id: <redacted>
I0221 23:27:39.908408 13631 executor.go:143] No progress made, sleeping before retrying 1 failed task(s)
It found the existing records but then failed to create a new record in the same zone.
Having just gotten off a call with Amazon Support, it turns out that Alias records are not supported in GovCloud at this time.
We are looking at augmenting the code to test the possibility of using a CNAME record instead of an Alias, and will report back ASAP.
When attempting to create the same record type via the API, we get the same error stating that the zone is not found. The zone in question is in fact the hosted zone of the internal ELB for the Kubernetes API, not the main hosted zone.
Ah, that's good to know, thanks for the update! We may be able to force kops to use CNAMEs rather than Aliases when in GovCloud.
Thanks @rifelpet! I will let you know this week if we can get it working with some augmented code. I may try it in a non-GovCloud region as well as a GovCloud region.
Out of curiosity, why is kops using Alias records and not CNAMEs? Is there also a reason why it is still using the classic ELB type rather than the more modern ALB/NLB architecture?
If I had to guess, it's because of the advantages of Alias records outlined here: mainly that they avoid an additional round-trip DNS lookup, avoid an intermediate TTL, support health checking of alias targets, don't expose the AWS resource's DNS name, etc.
I'm sure kops' Route53 support wasn't designed with the possibility in mind that Route53 would be supported but Alias records would not be. It may be a minor change; some quick searching reveals this code, but there may be others.
I don't recall the reasoning for keeping classic ELBs; it may just be that no one has put in the effort to switch. There are a few issues that have discussed it in the past; perhaps we could add opt-in support for NLBs sometime soon.
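To illustrate the change being discussed, here's a rough sketch of switching a ChangeResourceRecordSets call from an Alias record to a plain CNAME using the aws-sdk-go route53 client. This is not the actual kops task code; the function and its parameters (zoneID, recordName, elbDNSName, elbHostedZoneID) are hypothetical placeholders:

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

// upsertAPIRecord upserts a record for the API endpoint, either as an Alias
// targeting the ELB or as a plain CNAME pointing at the ELB's DNS name.
func upsertAPIRecord(zoneID, recordName, elbDNSName, elbHostedZoneID string, useAlias bool) error {
	svc := route53.New(session.Must(session.NewSession()))

	rrs := &route53.ResourceRecordSet{Name: aws.String(recordName)}
	if useAlias {
		// Alias A record: resolves directly to the ELB, but its target uses the
		// ELB's canonical hosted zone ID, which is what GovCloud rejected here.
		rrs.Type = aws.String("A")
		rrs.AliasTarget = &route53.AliasTarget{
			DNSName:              aws.String(elbDNSName),
			HostedZoneId:         aws.String(elbHostedZoneID),
			EvaluateTargetHealth: aws.Bool(false),
		}
	} else {
		// Plain CNAME: one extra DNS hop and an explicit TTL, but no Alias
		// support required from the partition.
		rrs.Type = aws.String("CNAME")
		rrs.TTL = aws.Int64(60)
		rrs.ResourceRecords = []*route53.ResourceRecord{{Value: aws.String(elbDNSName)}}
	}

	_, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String(zoneID),
		ChangeBatch: &route53.ChangeBatch{
			Changes: []*route53.Change{{
				Action:            aws.String("UPSERT"),
				ResourceRecordSet: rrs,
			}},
		},
	})
	return err
}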
Good news: I was able to get this working with some manual intervention. I'll write up my override steps and post them here. There are some manual steps, but it is up and running and kops validate cluster returns positively. I'll try to write some automation if possible (likely an Ansible playbook).
@mgs4332 any updates you could provide would be helpful!
@mgs4332 Any update, as @weisjohn requested, would be much appreciated. I am in the same boat, as I need Alias records created.
@ksummersill2 I haven't had this issue since moving to the gossip protocol: https://github.com/kubernetes/kops/blob/master/docs/getting_started/aws.md#configure-dns
I am glad you messaged me so that I can update this. I wrote an article on doing this with gossip DNS, just as you said: https://medium.com/@ksummersill/setup-kops-and-calico-within-aws-gov-cloud-using-gossip-dns-cd6ed5cba36c
@ksummersill2 Glad you found a workaround with the gossip protocol. Apologies, as I changed jobs and no longer have access to the resources I was working on. The way I got around it was wrapping it in Ansible, creating standard CNAME records, and rebooting the worker nodes to have them automatically join the cluster after first boot.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.