Cert-manager: Possible incorrect "kid" header in ACME v2 JWS?

Created on 31 May 2018  路  11Comments  路  Source: jetstack/cert-manager

Is this a BUG REPORT or FEATURE REQUEST?: Bug report

/kind bug

What happened:

Yesterday Let's Encrypt enabled stricter processing of ACME v2 JWS "kid" headers in staging.

In the staging logs I see some instances of the User Agent jetstack-cert-manager/v0.3.0-1e606b3eadfa069332b25feeb3d5aecc46d07ece receiving the described "400 :: malformed :: KeyID header contained an invalid account URL" error.

This is the only commit hash/cert-manager UA I see so its possible this is already fixed, or perhaps specific to a single user's deployment.

What you expected to happen:

I expected to see no KeyID errors for cert-manager's ACME v2 JWS posts.

How to reproduce it (as minimally and precisely as possible):

Both the Let's Encrypt v2 staging environment and Pebble enforce the "kid" header and can be used to reproduce this bug.

Anything else we need to know?:

I quickly skimmed the Cert-manager source and it _seems_ like the doAccount implementation is taking the value that is eventually used as the JWS "kid" header from the newAccount response's Location as is expected: https://github.com/jetstack/cert-manager/blob/eba38b6a891369a9075541a02de9b1cc7cd39baa/third_party/crypto/acme/acme.go#L573-L588

I suspect perhaps there is a corner case where in some circumstances a different value is used. There's a bit of "code distance" between doAccount and jwsEncodeJSON where the kid is used: https://github.com/jetstack/cert-manager/blob/9cebaa604f512387fda859c050574aee43764c9b/third_party/crypto/acme/jws.go#L36

I wasn't able to do much extensive digging, I hope the above is enough to get you started.

Environment:

  • Kubernetes version (use kubectl version): n/a
  • Cloud provider or hardware configuration**: n/a
  • Install tools: n/a
  • Others: n/a
areacme kinbug

All 11 comments

Hey - thanks for reporting this. I'm going to dig into the source code to work out what we use and where throughout.

A cursor glance of acme.go however, shows that we use a variety of different values for the accountURL field, at a variety of points:

Are all of these usages of the accountURL field/kid field valid? I don't see anywhere in our codebase where we actually rely upon the account URL that we record locally before sending off requests. Even when we do know the account URL, we always call CreateAccount instead and follow Location headers (my initial thought here was maybe some users are using their ACMEv1 account URLs as part of the kid header, which would presumably be invalid, but as I say, I don't think we ever use the values for the account URL that we recorded, so I don't know how this could be the case).

The specific version/user-agent you have noted is of the current 'latest stable' release of cert-manager (v0.3.0), so the issue has not been fixed if it does exist!

Are you seeing quite a lot of these messages now that more users have upgraded? Are you able to share some rough details on which endpoints it is against? (i.e. do the kid errors only happen on requests to the 'CreateAccount' endpoint, or is it happening to many different types of requests?)

FWIW, I am definitely not seeing this issue on the 'happy path', so like you say, it appears to be an edge case!

Are all of these usages of the accountURL field/kid field valid?

I'm not sure I entirely followed your message. There's only one usage of this field described: carrying the account's KeyID (e.g., its URL). It sounds like you have a few ways of populating that field/URL - from a new account response, as a result of an update (?), and by fetching an existing account with onlyReturnExisting. Those all seem like valid ways to populate the field.

The specific version/user-agent you have noted is of the current 'latest stable' release of cert-manager (v0.3.0), so the issue has not been fixed if it does exist!

Ok, that's good to know.

Are you seeing quite a lot of these messages now that more users have upgraded? Are you able to share some rough details on which endpoints it is against? (i.e. do the kid errors only happen on requests to the 'CreateAccount' endpoint, or is it happening to many different types of requests?)

Here's a count of how many times I've seen the "KeyID header contained an invalid account URL" error for the "jetstack-cert-manager/v0.3.0-1e606b3eadfa069332b25feeb3d5aecc46d07ece" UA in the last 7 days of staging logs:

| Time | Count |
|-------------|-------|
| 31/05/2018 | 28 |
| 01/06/2018 | 5 |
| 02/06/2018 | 4 |
| 03/06/2018 | 1 |
| 04/06/2018 | 1 |
| 05/06/2018 | 6 |

There are multiple client IPs generating these messages (on the order of ~6 distinct client IPs) so I don't think its likely to be a corner case caused by a specific user configuration. The volume of errors is quite low overall, matching your observation that in the "happy path" things seem to be working OK.

Of the errors above, 24 were for POSTS to the /acme/challenge/ endpoint, and 21 were for /acme/finalize/. Those are the only endpoints with observed JWS KeyID POST errors. Its somewhat notable to me that there are no errors for /acme/new-order/. That endpoint is POSTed ~1:1 in volume with finalize and yet has no errors.

FWIW, I am definitely not seeing this issue on the 'happy path', so like you say, it appears to be an edge case!

Very odd! Let me know if I can help provide more information from our side.

Unfortunately I didn't have the foresight to think of including the sent KeyID in the error message when I implemented the stricter KeyID validation. I fixed that in https://github.com/letsencrypt/boulder/commit/a9847f492e6476288a3e3a27e816f01394a6ca4b and it should start producing better error messages/log lines later this afternoon after the staging Boulder update. I'll reply with more information once I have some data from post-deploy.

Thanks for digging into this! Much appreciated :-)

@munnerz Good news, the logging change I implemented in https://github.com/letsencrypt/boulder/commit/a9847f492e6476288a3e3a27e816f01394a6ca4b has turned up a clue. In the past ~12 hrs since the deploy of this change to the staging environment all of the cert-manager errors are of the form:

KeyID header contained an invalid account URL: "https://acme-v02.api.letsencrypt.org/acme/acct/XXXXXX" 

Where XXXXXX is the account ID.

That's a production kid being sent to the staging environment. I would expect the kid to be of the form: https://acme-staging-v02.api.letsencrypt.org/acme/acct/XXXXXX.

Does this insight help you guess at a root cause? Is it perhaps a configuration issue after all?

The JWS kid enforcement went to prod yesterday. I'm also seeing the same UA sending staging account IDs to the prod environment.

It's certainly possible this can be chalked up to routine misconfiguration. If you think that's the case and you don't think there are guardrails that might help it would be fine with me to close this issue :-) I mostly wanted to make sure it was on folks' radar in case it was a larger issue.

I ran into this with quay.io/jetstack/cert-manager-controller:v0.3.0 when changing the URL of the Issuer from staging to production URL.

I suspect that status.acme.uri on the Issuer was still pointing to the staging account, which lead to this issue.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

/remove-lifecycle stale

I just got here from seeing my (staging) ClusterIssuer reporting these errors (IIUC from the thread above, sending a staging kid to the staging url).

I'm using quay.io/jetstack/cert-manager-controller:v0.5.0.

This sequence of log messages repeat every minute or so:

I1002 14:37:24.287331 1 controller.go:171] certificates controller: syncing item 'service-mesh/mydomain.se'
I1002 14:37:24.287503 1 sync.go:274] Preparing certificate service-mesh/mydomain.se with issuer
I1002 14:37:24.287750 1 logger.go:43] Calling GetOrder
I1002 14:37:24.590493 1 logger.go:73] Calling GetAuthorization
I1002 14:37:24.781341 1 logger.go:93] Calling HTTP01ChallengeResponse
I1002 14:37:24.781384 1 prepare.go:279] Cleaning up old/expired challenges for Certificate service-mesh/mydomain.se
I1002 14:37:24.781398 1 logger.go:68] Calling GetChallenge
I1002 14:37:35.074330 1 prepare.go:488] Accepting challenge for domain "mydomain.se"
I1002 14:37:35.074371 1 logger.go:63] Calling AcceptChallenge
I1002 14:37:35.981988 1 helpers.go:201] Found status change for Certificate "mydomain.se" condition "Ready": "False" -> "False"; setting lastTransitionTime to 2018-10-02 14:37:35.981967776 +0000 UTC m=+10095.159756648
I1002 14:37:35.982034 1 sync.go:276] Error preparing issuer for certificate service-mesh/mydomain.se: acme: urn:ietf:params:acme:error:malformed: KeyID
header contained an invalid account URL: "https://acme-staging-v02.api.letsencrypt.org/acme/acct/XXXXXXXX"
E1002 14:37:35.982056 1 sync.go:197] [service-mesh/mydomain.se] Error getting certificate 'tls-cert.mydomain.se-test': secret "tls-cert.mydomain.se-test" not found
E1002 14:37:36.128478 1 controller.go:180] certificates controller: Re-queuing item "service-mesh/mydomain.se" due to error processing: acme: urn:ietf:params:acme:error:malformed: KeyID header contained an invalid account URL: "https://acme-staging-v02.api.letsencrypt.org/acme/acct/XXXXXXXX"

I am seeing this exact same order of logs and errors in staging and production on 0.5.0

EDIT: This has resolved with a deletion and reinstallation of the clusterissuers and the cert-manager. Not sure what the issue was.

Ah, I think I've got an idea as to what this is caused by.

We store the Order URI on the Certificate resource under status.acme.order.url.

Following this:

1) Create letsencrypt-staging Issuer
2) Create Certificate resource referencing staging issuer
3) before the order is completed: switch the issuer being used, or change the acme server URL in the existing issuer to be the prod endpoint
4) observe errors

This is because the controller will attempt to query the existing order URL, stored on the Certificate resource, using a client constructed using the new prod issuers configuration.

We've changed how we handle order creation on the master branch of cert-manager, but which will resolve this issue in the event that a user changes the issuerRef.name on the Certificate resource, but we need to add some explicit invalidation of Order's if the order's referenced issuer acme server URL does not match the order URL.

I'm going to add this to the 0.6 milestone.

On v0.5, users can clear the certificate.status.acme field on their Certificate resources to resolve this situation, so I don't think this is worth backporting into the v0.5 release 馃槃

@munnerz Nice find! I think that could explain _part of_ the observations I've made.

However, I've also seen this happen for completely new certificate resources, with no changes to the cluster issuers or cert-manager configuration. However, in those cases, I _think_ there has been an old _secret_ lying around (after seeing it not work, removing everything cert-manager related, and re-installing). Do you store metadata also on the secrets?

As for clearing the certificate.status.acme, could you post a kubectl command that would do it, for future reference? :)

I'm going to close this issue now as the original problem is solved as far as I can tell.

I've opened #1331 to track the wider issues described above.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

kragniz picture kragniz  路  4Comments

Stono picture Stono  路  3Comments

caiobegotti picture caiobegotti  路  4Comments

jbouzekri picture jbouzekri  路  4Comments

jbartus picture jbartus  路  4Comments