Describe the bug:
We're using cert-manager with Let's Encrypt (staging) and ACME DNS-01 issuance for a delegated / unprivileged domain.
The idea is to use, as suggested, a CNAME for _acme-challenge.DOMAIN that points to another location holding the challenge response.
So I've configured cert-manager with the argument --dns01-recursive-nameservers="8.8.8.8:53" and a ClusterIssuer as follows:
apiVersion: cert-manager.io/v1alpha3
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: [email protected]
    privateKeySecretRef:
      name: example-issuer-account-key
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    solvers:
    - dns01:
        cnameStrategy: Follow
        webhook:
          groupName: acme.example.com
          solverName: nullsolver
And a Certificate like this:
apiVersion: cert-manager.io/v1alpha3
kind: Certificate
metadata:
  name: testecert1111
  namespace: bla
  labels:
    env: "qa"
spec:
  secretName: certkatz1111
  issuerRef:
    name: letsencrypt-staging
    kind: ClusterIssuer
  dnsNames:
  - testecert1111.estaleiro.serpro.gov.br
I also created the following record in the authoritative DNS for estaleiro.serpro.gov.br:
_acme-challenge.testecert1111 CNAME testecert1111.bla.solver.rkatz.xyz.
Now, a dig txt for _acme-challenge.testecert1111.estaleiro.serpro.gov.br returns the expected answer:
dig txt _acme-challenge.testecert1111.estaleiro.serpro.gov.br
;_acme-challenge.testecert1111.estaleiro.serpro.gov.br. IN TXT
;; ANSWER SECTION:
_acme-challenge.testecert1111.estaleiro.serpro.gov.br. 300 IN CNAME testecert1111.bla.solver.rkatz.xyz.
testecert1111.bla.solver.rkatz.xyz. 0 IN TXT "NtUtuxC7aAfDCv1J_dbpIqzTmJpqIpvVAXlC6Ea3iCw"
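To make the expected lookup explicit, here is a minimal sketch of following the CNAME to the TXT value. It is a toy model over an in-memory record table copied from the dig output above, not a real resolver and not cert-manager's code; the names `RECORDS` and `resolve_txt` are made up for illustration.

```python
# Toy record table mirroring the dig output above (hypothetical helper names).
RECORDS = {
    ("_acme-challenge.testecert1111.estaleiro.serpro.gov.br.", "CNAME"):
        ["testecert1111.bla.solver.rkatz.xyz."],
    ("testecert1111.bla.solver.rkatz.xyz.", "TXT"):
        ["NtUtuxC7aAfDCv1J_dbpIqzTmJpqIpvVAXlC6Ea3iCw"],
}


def resolve_txt(name, records, max_hops=8):
    """Follow CNAMEs starting at `name` until a TXT record set is found.

    Returns (chain_of_names_visited, txt_values). The TXT list is empty when
    the chain ends without a TXT record.
    """
    chain = [name]
    for _ in range(max_hops):
        txt = records.get((name, "TXT"))
        if txt is not None:
            return chain, txt
        cname = records.get((name, "CNAME"))
        if cname is None:
            return chain, []
        name = cname[0]  # follow the alias to its target
        chain.append(name)
    raise RuntimeError("CNAME chain too long")


chain, values = resolve_txt(
    "_acme-challenge.testecert1111.estaleiro.serpro.gov.br.", RECORDS
)
```

With the table above, `chain` ends at testecert1111.bla.solver.rkatz.xyz. and `values` holds the challenge token, which is what a recursive resolver such as 8.8.8.8 reports.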
But cert-manager keeps reporting that the record has not propagated:
E0701 17:59:39.296714 1 sync.go:183] cert-manager/controller/challenges "msg"="propagation check failed" "error"="DNS record for \"testecert1111.estaleiro.serpro.gov.br\" not yet propagated" "dnsName"="testecert1111.estaleiro.serpro.gov.br" "resource_kind"="Challenge" "resource_name"="testecert1111-69xnn-840791812-3569446908" "resource_namespace"="bla" "type"="dns-01"
Describing the Challenge shows the following:
Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for dns-01 challenge propagation: DNS record for "testecert1111.estaleiro.serpro.gov.br" not yet propagated
  State:       pending
Events:
  Type    Reason     Age   From          Message
  ----    ------     ----  ----          -------
  Normal  Started    25m   cert-manager  Challenge scheduled for processing
  Normal  Presented  25m   cert-manager  Presented challenge using dns-01 challenge mechanism
I started digging further into the code, added some debug messages, and got the following while the challenge was being processed:
W0701 18:10:59.695493 1 wait.go:117] ==Querying testecert1111.bla.solver.rkatz.xyz. in ns1.dreamhost.com.:53==
W0701 18:11:00.078706 1 wait.go:123] ==Query was ;; opcode: QUERY, status: NOERROR, id: 40444
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 2
;; QUESTION SECTION:
;testecert1111.bla.solver.rkatz.xyz. IN TXT
;; AUTHORITY SECTION:
solver.rkatz.xyz. 14358 IN NS solver.rkatz.xyz.
;; ADDITIONAL SECTION:
solver.rkatz.xyz. 14358 IN A 54.146.59.126
;; OPT PSEUDOSECTION:
; EDNS: version 0; flags: ; udp: 2800
So it seems that when following a challenge, cert-manager stops at the authoritative DNS (dreamhost) and does not follow the delegation to solver.rkatz.xyz, which can properly answer for testecert1111.bla.solver.rkatz.xyz.
So I've changed the code to always treat this validation as successful (always return a nil error) and let Let's Encrypt properly validate against the DNS here:
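The debug output above is what a referral looks like: no answer records, and an NS record for a child zone in the authority section. A minimal sketch of telling those cases apart follows. It is a plain-Python model of a DNS response, not cert-manager's implementation; `Record`, `Response`, `classify`, and `delegated_nameservers` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    name: str
    rtype: str
    value: str


@dataclass
class Response:
    answer: list = field(default_factory=list)
    authority: list = field(default_factory=list)


def classify(resp, qtype):
    """Return 'answer', 'referral', or 'empty' for a response to a qtype query."""
    if any(r.rtype == qtype for r in resp.answer):
        return "answer"
    # No matching answer, but NS records in the authority section:
    # the server is pointing at the nameservers of a delegated zone.
    if any(r.rtype == "NS" for r in resp.authority):
        return "referral"
    return "empty"


def delegated_nameservers(resp):
    """Nameservers a client would have to query next to follow the referral."""
    return [r.value for r in resp.authority if r.rtype == "NS"]


# The reply from ns1.dreamhost.com shown in the debug log above.
referral = Response(
    answer=[],
    authority=[Record("solver.rkatz.xyz.", "NS", "solver.rkatz.xyz.")],
)
kind = classify(referral, "TXT")
servers = delegated_nameservers(referral)
```

Here `kind` is "referral" and `servers` contains solver.rkatz.xyz., the server that actually holds the TXT record.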
https://github.com/jetstack/cert-manager/blob/master/pkg/issuer/acme/dns/dns.go#L123-L125
And then, following the flow, Let's Encrypt could validate and issue a certificate.
So I'm thinking that maybe this is related to the PreCheckDNS not following subdomains when using a CNAME?
Expected behaviour:
That I can use a subdomain as the challenge responder (acme CNAME challenge.subdomain.domain.tld)
Steps to reproduce the bug:
:thinking:
About the webhook that should create the DNS entry: since I'm not using a managed DNS but a manual process to create the record once, I'm using the sample webhook provided by cert-manager, which always returns nil :)
Anything else we need to know?:
Environment details:
/kind bug
Thank you!
/cc @meyskens
BTW the solver is actually a gRPC backend for CoreDNS that returns the challenges for certificate.namespace of a k8s cluster, so it might not always be online, as I'm running it in my lab environment :)
As an additional finding, it seems that when pre-checking the DNS (https://github.com/jetstack/cert-manager/blob/master/pkg/issuer/acme/dns/dns.go#L119), cert-manager uses a setting from the context, which seems to be true (apparently the option to check the authoritative nameservers).
So I've forced --dns01-recursive-nameservers-only and it works.
But my env is not air-gapped; the point for me is that lookupNameservers does not seem to follow the subdomain delegation.
Honestly I'm OK with using the option, it's easy for me :) I'll just keep the issue open so someone can tell me whether this is the desired behavior; otherwise I can try to change these functions so cert-manager keeps following the DNS servers until it finds the right one :D
Great to hear that worked! It indeed sounds like some check doesn't follow the NS delegation correctly. IMO this should be looked into.
/priority important-longterm
@meyskens nice.
I can assign this to myself and try to submit a PR; if I fail I can hand it back :D
I opened the issue just to make sure it's really an issue and not just me doing wrong stuff :D
/assign @rikatz
A PR would be very welcome! If you need anything for it you can find us in #cert-manager-dev in the https://slack.k8s.io or in one of our community meetings https://github.com/jetstack/cert-manager#bi-weekly-development-meeting :smile:
I'll try to work on it this week and will keep you updated in Slack.
So, this is embarrassing, but I've managed to discover what really happened here, and maybe a point for improvement is to add some logging in these steps.
What happened is that I was using templates in CoreDNS but had misconfigured the SOA responses as "authority" instead of "answer". Writing this down so future me doesn't forget :D
Wrong configuration:
template IN SOA {
    authority "{{ .Zone }} 60 IN SOA {{ .Zone }} admin1.{{ .Zone }} (1 60 60 60 60)"
    fallthrough
}
Right configuration:
template IN SOA {
    answer "{{ .Zone }} 60 IN SOA {{ .Zone }} admin1.{{ .Zone }} (1 60 60 60 60)"
    fallthrough
}
When asking for the SOA of my misconfigured domain, I got:
;; QUESTION SECTION:
;solver.rkatz.xyz. IN SOA
;; AUTHORITY SECTION:
solver.rkatz.xyz. 59 IN SOA solver.rkatz.xyz. admin1.solver.rkatz.xyz. 1 60 60 60 60
While for a correctly configured domain, I got:
;; QUESTION SECTION:
;subdomain.domain.tld. IN SOA
;; ANSWER SECTION:
subdomain.domain.tld. 21599 IN SOA ns1.subdomain.domain.tld. 2020070202 300 60 86400 60
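This was pretty hard to find, but the failure happens exactly where the code iterates through the Answer records of the DNS message: with the SOA returned in the authority section, that loop apparently never sees it.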
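A minimal sketch of why the section matters, assuming the zone lookup only inspects the answer section as described above. The dictionaries are a simplified stand-in for a DNS message, and `find_soa_in_answer` is a hypothetical helper, not the actual cert-manager function.

```python
def find_soa_in_answer(response):
    """Return the owner name of the first SOA in the answer section, else None."""
    for rr in response["answer"]:
        if rr["type"] == "SOA":
            return rr["name"]
    return None


# SOA reply from the misconfigured CoreDNS template: record lands in AUTHORITY.
misconfigured = {
    "answer": [],
    "authority": [{"name": "solver.rkatz.xyz.", "type": "SOA"}],
}

# SOA reply from a correctly configured zone: record lands in ANSWER.
well_configured = {
    "answer": [{"name": "subdomain.domain.tld.", "type": "SOA"}],
    "authority": [],
}

missed = find_soa_in_answer(misconfigured)
found = find_soa_in_answer(well_configured)
```

`missed` is None even though the SOA is present in the message, while `found` is subdomain.domain.tld., so a lookup written this way treats the misconfigured zone as if it did not exist.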
So I'm closing this, thank you for your patience @meyskens !