At Let's Encrypt, we've noticed that cert-manager v0.8.0 and v0.8.1 generate excessive traffic under some circumstances. Since we don't have access to the cert-manager installs, we're not sure what those circumstances are. This is a placeholder bug for cert-manager users to provide details of their setup after they've noticed in their logs that cert-manager is sending excessive traffic (more than about 10 requests per day in steady state).
I've noticed two patterns in the logs so far:
Also, I've found that a lot of affected cert-manager users seem to have multiple accounts created, sometimes with multiple independent cert-manager instances running on the same IP (by accident).
If you've noticed this, please list what cert-manager version you are using, plus any details of your Kubernetes setup and how many instances of cert-manager are currently running in your setup.
(This issue is linked to from https://community.letsencrypt.org/t/blocking-old-cert-manager-versions/98753/2, and from an email we'll send shortly about deprecating older cert-manager versions. Note that even though v0.8 still has some issues, it's still definitely better than previous versions.)
I received an email from Let's Encrypt an hour ago about this.
I'm actually running 0.9.0 but went through several versions during setup because no certificates reached the ready state (it turned out to be an LB issue).
I had one certificate for norad.fr that stayed pending (on 0.9.0) with an HTTP challenge for maybe a week, which could have generated too many requests.
I'm not sure at which step in the logs cert-manager calls Let's Encrypt, but I found:
k -n cert-manager logs cert-manager-6554467ddb-nbb6d | grep norad.fr | grep 'propagation check failed'
2987
The issue that prevented challenge completion was that port 80 on norad.fr is served by an HTTP redirect server, while I actually wanted a certificate for www.norad.fr (where port 80 is served by the kube cluster).
I don't know if that helps.
Running 0.5.2 (old, I know) but no issues with excessive traffic that I can tell. It's been running for ~200 days and primarily uses DNS validation.
I just received an email from LetsEncrypt over running an earlier version of cert-manager. I'm glad for the email and the careful handling of the issue. Thank you LetsEncrypt team!
We just got the email at 4am this morning (GMT), and one of our certs expired at 3pm today, so we can't get a re-issue now due to the 503 (perhaps an IP ban?).
Cert-manager v0.7.2.
I restarted the cert-manager pod and it immediately started a renew loop. Not sure how long it's been doing this, but presumably quite a while, as I believe the default is to renew 30 days before expiry?
Logs are below.
I0813 16:14:35.179589 1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:14:35.182585 1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:14:35.187403 1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:14:35.192966 1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:14:45.193168 1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:14:45.194316 1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:14:45.198492 1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:14:45.198572 1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:14:55.198764 1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:14:55.199461 1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:14:55.204099 1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:14:55.204262 1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:15:05.204370 1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:15:05.205528 1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:15:05.238805 1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:15:05.238841 1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:15:15.238973 1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:15:15.239829 1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:15:15.242087 1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:15:15.242114 1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:15:25.242310 1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:15:25.243184 1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:15:25.248178 1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:15:25.248667 1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:15:35.248809 1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:15:35.249092 1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:15:35.252109 1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
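For context, the 'propagation check' in these logs is cert-manager's own pre-flight self-check: it fetches the HTTP-01 challenge URL itself and expects a 200 before it asks the ACME server to validate. A minimal sketch of that kind of check (not cert-manager's actual code; the domain and token below are hypothetical):

```go
package main

import (
	"fmt"
	"net/http"
)

// selfCheck fetches the HTTP-01 challenge URL for a domain/token pair and
// reports whether it returned the expected 200 status. The controller keeps
// retrying this until it passes before telling the ACME server to validate.
func selfCheck(domain, token string) error {
	url := fmt.Sprintf("http://%s/.well-known/acme-challenge/%s", domain, token)
	resp, err := http.Get(url)
	if err != nil {
		return fmt.Errorf("self check GET request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("wrong status code '%d', expected '200'", resp.StatusCode)
	}
	return nil
}

func main() {
	// Hypothetical domain and token, for illustration only.
	if err := selfCheck("example.com", "some-token"); err != nil {
		fmt.Println("propagation check failed:", err)
	}
}
```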
This issue is preventing us from updating the version: https://github.com/jetstack/cert-manager/issues/1255
Hello, updated mine today to v0.9.0.
While checking whether everything was in place, I noticed that cert-manager has no back-off mechanism to deal with misconfigured certificates. In my case it kept trying to verify a domain that is not mine every minute, from 17:50 until 00:40, when I deleted the misbehaving certificate. If the logs are accurate, it called GetOrder, GetAuthorization and HTTP01ChallengeResponse 1,641 times during that period.
My log for that period:
I tried to use v0.9.1, but for some reason it issues a "temporary certificate", and for a while this untrusted certificate is the one being served. Can this be avoided (with older versions this didn't happen)? Thank you!
Events for v0.6.2:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Generated 2m cert-manager Generated new private key
Normal OrderCreated 2m cert-manager Created Order resource "order-name-2441368539"
Normal OrderComplete 1m cert-manager Order "order-name-2441368539" completed successfully
Normal CertIssued 1m cert-manager Certificate issued successfully
Events for versions > v0.8.1:
Normal Generated 16s cert-manager Generated new private key
Normal GenerateSelfSigned 16s cert-manager Generated temporary self signed certificate
Normal OrderCreated 16s cert-manager Created Order resource "order-name-2441368539"
Normal OrderComplete 13s cert-manager Order "order-name-2441368539" completed successfully
Normal CertIssued 13s cert-manager Certificate issued successfully
Is there any way to skip the "Generated temporary self signed certificate" step?
Thank you!
@RaduRaducuIlie How did you install v0.9.1?
It's giving me this error on an AKS cluster: Error: failed to download "stable/cert-manager" (hint: running 'helm repo update' may help).
I'm on 0.9.0. Cleaned up some test ingresses I had lying around. And noticed (afterwards) that I had a CertIssued event run 1662301 times in the past 4 days. I suspect it's because I had two ingresses fighting over the same TLS secret, but I'm not completely certain, as I didn't check the event count until after I deleted them. The events themselves had the message "Certificate issued successfully".
@rnkhouse, cert-manager is installed with Istio and I just updated the image tag for cert-manager.
Hello,
I want to help with my metrics but I don't know how to count the requests. It would be great if you gave us:
If you post this, I'm pretty sure you'll get a lot of feedback :) (including mine)
I've encountered the second pattern on a fresh install of cert-manager v0.9.1 on k8s v1.15.2
cert-manager.log
@munnerz Is anyone from Jetstack planning to engage with this issue? It's somewhat disheartening to see affected users reporting in (some with data about the problem, some asking for help collecting that data) and no one from Jetstack has replied yet. There are a few posts mentioning that the most recent version shows a pattern of excessive traffic, which may lead to Let's Encrypt having to block that version as well.
@cpu @JoshVanL has been looking through logs to try and find anything suspect - we've also been discussing with others on Slack and gathering info too.
Is there any data on the percentage of unique accounts this is affecting? i.e. N% of accounts registered using cert-manager are showing abusive traffic patterns?
@munnerz @JoshVanL Great, glad to hear that this is on your radar. Do you have any advice for @anderspetersson's questions?
It also sounds like @aparaschiv was able to reproduce this from a brand new install. Could you collaborate with them to reproduce the problem?
Is there any data on the percentage of unique accounts this is affecting? i.e. N% of accounts registered using cert-manager are showing abusive traffic patterns?
Our log analysis platform is not particularly well suited to answering questions like this about a proportion of UAs that meet some other 2nd level criteria like request volume. I'll ask internally to see if we can pull this somehow.
@aparaschiv
E0815 21:03:22.296093 1 base_controller.go:189] cert-manager/controller/clusterissuers "msg"="re-queuing item due to error processing" "error"="Timeout: request did not complete within requested timeout 30s" "key"="letsencrypt-prod"
From your logs, this looks to be the error you're talking about. It is expected that we'd retry this kind of error, as it's a timeout completing a request with the ACME server - this indicates either a network issue or some other problem accessing the Let's Encrypt API.
Looking at the timestamps, it seems like the exponential back-off is being applied correctly (specified here: https://github.com/jetstack/cert-manager/blob/582371a1db8469710437b3900bf533c3b3bdffb6/pkg/controller/util.go#L38):
E0815 21:03:22.296093
E0815 21:03:52.994624
E0815 21:04:23.710262
E0815 21:04:54.452585
E0815 21:05:34.482711
That said, I do notice you have this error at the end:
E0815 21:08:15.247152 1 base_controller.go:189] cert-manager/controller/clusterissuers "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"clusterissuers.admission.certmanager.k8s.io\": the server is currently unable to handle the request" "key"="letsencrypt-prod"
which indicates that the 'webhook' component has not started correctly, which will also cause issues persisting data (which will cause us to retry and apply exponential backoff).
That said, from my understanding, the types of abusive traffic patterns we are looking for are well beyond one request every 5 minutes - more in the region of multiple requests per second.
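For reference, the exponential back-off described above is typically implemented with client-go's workqueue rate limiters; a minimal sketch of the mechanism (the base and maximum delays here are illustrative assumptions, not cert-manager's exact configuration):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential per-item back-off: first retry after ~5s, doubling up to a
	// 5-minute cap. These delays are illustrative only.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 5*time.Minute)

	item := "letsencrypt-prod"
	for i := 0; i < 6; i++ {
		// Each failed sync re-queues the item; When() returns the delay before
		// the next attempt, growing exponentially with each failure.
		fmt.Printf("retry %d after %s\n", i+1, limiter.When(item))
	}
	// A successful sync resets the back-off for that item.
	limiter.Forget(item)
}
```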
@AndresPineros
We expose ACME client library Prometheus metrics which can be used to identify abusive traffic patterns - from there, a full copy of your logs would be appreciated. The prometheus metrics we expose also contain the response status code from the ACME server:
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="GET",path="/directory",scheme="https",status="200"} 1
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="GET",path="/directory",scheme="https",status="999"} 1
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="HEAD",path="/acme/new-nonce",scheme="https",status="200"} 3
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/authz",scheme="https",status="200"} 6
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/cert",scheme="https",status="200"} 2
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/challenge",scheme="https",status="200"} 4
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/finalize",scheme="https",status="200"} 2
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/new-acct",scheme="https",status="200"} 1
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/new-order",scheme="https",status="201"} 2
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/new-order",scheme="https",status="400"} 2
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/order",scheme="https",status="200"} 6
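If you want to pull these numbers yourself, the metrics are plain Prometheus text exposition, so you can scrape the controller's metrics endpoint and filter for the ACME client counters. A rough sketch, assuming 9402 is the metrics port and that you've port-forwarded it locally (e.g. kubectl port-forward -n cert-manager deploy/cert-manager 9402):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumes the cert-manager controller's metrics endpoint is reachable on
	// localhost:9402 (the port is an assumption; adjust to your deployment).
	resp, err := http.Get("http://localhost:9402/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only the ACME client request counters, which show how many calls
	// were made to each Let's Encrypt endpoint and with which status code.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "certmanager_http_acme_client_request_count") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```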
I had a CertIssued event run 1662301 times in the past 4 days. I suspect it's because I had two ingresses fighting over the same TLS secret, but I'm not completely certain, as I didn't check the event count until after I deleted them.
This observation from @ryangrahamnc also seems like a promising avenue for debugging.
I'd be very surprised if the issue I had (logs posted above) was due to two ingresses referencing the same secret as we have only ever seen this once on one certificate and all our resources are formulaic, automated by pulumi and never shared. We disable pulumi autonaming for ingresses and services so there shouldn't be a way for that to have happened, for us.
@ryangrahamnc we added a check for this a little while ago: https://github.com/jetstack/cert-manager/blob/582371a1db8469710437b3900bf533c3b3bdffb6/pkg/controller/certificates/sync.go#L131-L148 this was first in a release in v0.9.0: https://github.com/jetstack/cert-manager/pull/1689
Can you share your log messages? It's odd that you're seeing this...
@jsha I got an email about pre-0.8 cert-manager versions being blocked. However, for some reason the email now has no content in my email system, so I am not sure of the exact wording - was it recalled?
We downgraded to cert-manager 0.7 a while back because of this infinite loop problem and it solved the problem in our case on both Azure and Google cloud. It would be unfortunate if we are forced to upgrade to an unstable cert-manager version since apparently the issue hasn't been fixed.
Does this mean we need to find some solution other than cert-manager?
@mikkelfj the content of the email was also shared in a community forum thread if you need to reference a copy.
@mikkelfj could you also share your log messages using 0.9.1 to help us dig into this for you?
I don't recall what version we were running before rolling back to 0.7; it was probably 6-12 months ago. We are currently running 0.7 reliably and that is all we have logs for. The discussion above suggests that the latest version is still not stable. As for 0.7, it works for us, but now that I think about it, perhaps I did see some suspect log content a while back - at least we get new certs for now.
If I set up a 0.9.1, I'll let you know how it goes.
After upgrading from 0.8.1 to 0.9.1 my logs mentioned something about
Operation cannot be fulfilled on certificates.certmanager.k8s.io "some-cert": StorageError: invalid object, Code: 4
Unfortunately I don't have the logs anymore and won't try to upgrade again (I tried it like 10 times) because I'm trying to recover from hitting the rate-limit (which resulted in quite critical infrastructure being unavailable)
I don't know if it's related, but I'd configured 1 ingress with 2 domains and the same secret
Is there any data on the percentage of unique accounts this is affecting? i.e. N% of accounts registered using cert-manager are showing abusive traffic patterns?
Grouped by account, we get about 2% of accounts showing abusive traffic patterns (last 30 days):
abusive: 949
friendly: 45,634
However, grouped by IP address, we get about 11%:
abusive: 4,149
friendly: 38,997
To me this suggests that having multiple cert-manager instances on the same cluster may be a contributing factor, but that there is also a failure mode that affects solo instances.
Thanks for doing the digging here @jsha!
So we do already perform leader election for the controller, but we allow the leader election parameters to be tweaked in case users' environments differ from the defaults we provide.
There's not a particularly good way to deal with this sort of thing in Kubernetes, and likewise if you were to run multiple instances of a core Kubernetes controller, you'd have undefined results.
We can probably document and call this out better in our installation guide.
It's also worth noting that some users run multiple instances of cert-manager in a compatible way, by scoping each instance to a single Kubernetes namespace. In those cases, I'd expect it to work just fine.
I've recently opened #2041 which also improves our overall ACME Order handling process, including some better error handling when 4xx errors are received.
Thanks for the update!
There's not a particularly good way to deal with this sort of thing in Kubernetes, and likewise if you were to run multiple instances of a core Kubernetes controller, you'd have undefined results.
It sounds like it may be impossible to make cert-manager "do the right thing" if someone runs multiple instances; is that right?
Even so, I would expect the "wrong thing" to be linear - e.g. if you're running two cert-manager instances, you'd be generating twice the normal traffic. But what we're seeing here is distinctly non-linear: We have people running two instances, and generating millions of requests per day. Do you have an idea of why cert-manager would go non-linear in a situation like that?
It's kind of undefined because it might be linear, but also two different instances could potentially 'fight' with each other, especially if those two instances are distinct versions that have different behaviour.
Millions of requests per day sounds to me like two different versions of cert-manager running, as this would cause resources to be pretty much infinitely updated and flip-flopping.
In the next couple of releases, our API format will stabilise which in turn will mean that our changes will become backwards compatible, which should mean that two different versions shouldn't exhibit symptoms quite this bad (although it's still very difficult to reason about).
Also, after some thinking on short term ways to measure our ACME server usage, I'm planning on adding a small change to our e2e test suite that will grab Prometheus metrics from cert-manager at the end of a test run. This will allow us to graph the number of API calls made in each test run over time - whilst this is quite a coarse metric (we can't see per test case usage), it will allow us to:
1) set up periodic jobs against previous versions, measuring their API usage
2) set up periodic jobs against 'master', allowing us to spot increases/decreases
3) compare usage and validate it stays stable over time.
The metrics returned from these test cases will definitely be inflated beyond reality (as Pebble itself rejects some % of requests too), however it will allow us to relatively compare usage over time, which will be super valuable for assessing #2041's performance and suitability.
Once we get this change in, I'm inclined to backport it to v0.10 (our current 'stable' version) at the very least, so we can collect 'signal' on our usage to help inform the next release.
I've opened #2043 to discuss how we can extend our test suite to gather metrics for passing end-to-end tests. It may be worth also collecting failed test run results too, although I think there'll be a lot of noise in the results potentially making them less useful.
Millions of requests per day sounds to me like two different versions of cert-manager running, as this would cause resources to be pretty much infinitely updated and flip-flopping.
This sounds like a promising lead. Which resources are you thinking would be updated?
Opened #2057 to make it clearer that only one copy should be run per cluster.
This sounds like a promising lead. Which resources are you thinking would be updated?
I'm not sure what you mean - do you mean which resource types in the k8s API would be updated whilst this problem is happening? If so, I think it'd be our 'Order' or 'Certificate' resources.
That said, it's not particularly easy for an end-user to confirm that this is happening - there is no one definitive way to be sure. However, once we stabilise the API, these problems should be a lot fewer and further between, so in the meantime I am pushing towards getting 'v1beta1' out!
Could some kind of locking mechanism work, so that if there are 2 instances (maybe by accident) only one would try to get a cert?
OK, I don't know the details of cert-manager, nor how Kubernetes accesses etcd, but I have built my own cert tool before and have also worked with etcd separately.
So: if cert-manager registers itself with an entry in etcd, you can do leader election. The risk is a stale leader, but I think you could work that out.
A simple annotation or an entry in the status could suffice.
cert-manager already performs leader election and is configurable with the options specified here: https://github.com/jetstack/cert-manager/blob/d2cedd50e125a7a4839e5fe8294f9315eb7d0f08/cmd/controller/app/options/options.go#L197-L214
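For readers unfamiliar with how this works: Kubernetes leader election is done by contending for a lock object (e.g. a Lease) in a fixed namespace, and only the lock holder runs the controllers - which is why the election namespace matters so much in the multi-instance scenario. A minimal client-go sketch, purely illustrative (the lock name, namespace and timings are assumptions, not cert-manager's actual values):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	hostname, _ := os.Hostname()

	// All replicas contend for the same Lease in the same namespace; only the
	// holder runs the controllers. Two installs that use *different* lock
	// namespaces never see each other's lock, so both act as leader.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "cert-manager-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 60 * time.Second,
		RenewDeadline: 40 * time.Second,
		RetryPeriod:   15 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader; starting controllers")
				// start controller loops here
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; shutting down")
			},
		},
	})
}
```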
It's just occurred to me that if two users deploy an instance of cert-manager into two different namespaces using Helm, the leader election namespace will actually be set to the namespace that the cert-manager pod runs in, which could result in two competing instances running in two different namespaces: https://github.com/jetstack/cert-manager/blob/d2cedd50e125a7a4839e5fe8294f9315eb7d0f08/deploy/charts/cert-manager/templates/deployment.yaml#L70
We should update this to always use the kube-system namespace, which should hopefully 'increase the guard rails' and require additional user acknowledgement that they are 'going against the grain'. I'll put a PR in shortly.
Hopefully this works by default - I tend not to use Helm as I find it just complicates operations, with different ways to clean up etc.
Based on my email from Let's Encrypt, versions >0.8.0 significantly reduce this excessive traffic issue.
It might be beneficial to change the default in the Helm chart from version v0.6.2 to the latest stable release.
I imagine that Let's Encrypt is seeing traffic from Helm installations that are using the default value for the image version.
@lyndon160 - you can add the Jetstack repository and install/update to the latest version (v0.10.0).
# Add the Jetstack Helm repository if you haven't already
helm repo add jetstack https://charts.jetstack.io
# Ensure the local Helm chart repository cache is up to date
helm repo update
As new versions of cert-manager are released, we will add the non-current versions to our block list after 3 months.
This is a bit distressing. The promise of Let's Encrypt was to have short-lived automatically-renewing certificates, but now we have to manually upgrade our cert-manager instances on the same cadence the certificates are valid for.
I think this would be a lot more palatable if the deprecation were pushed out to >=1 year.
Until cert-manager 1.x, it may be reasonable to require an upgrade cadence that's faster than that, since it can still have bugs?
Thanks,
Kevin
I upgraded from 0.7.0 to 0.9.1 yesterday (after receiving email notification from let's encrypt). I deleted one of my cert secret and checked the logs for ACME requests. There is only one issuance/request cycle. Nothing else in the logs since yesterday.
I only have one cert-manager pod, in a cert-manager namespace. I installed everything with the regular manifests.
I'll keep an eye on cert renewal after the migration to v0.10.0.
ref https://github.com/jetstack/cert-manager/issues/2150
If you look at the output in this issue, the user is seeing Order resources continuously created and deleted, and consequently ~2RPS to the CreateOrder endpoint from cert-manager.
It was later determined that this was caused by running >1 instance of cert-manager in a single cluster.
We're going to add a patch into the v0.11 release to make it harder to modify the leader election namespace, to avoid this in future. I'd hope that we see a substantial drop in the 2% of users who are exhibiting this issue, @jsha.
I've opened #2155 to update this ^
Thanks for the update @munnerz ! I also hope that helps. However, I think it's still not quite enough. What I'd really like to see is a system where, even if someone manages to bypass the guardrails you're adding and run two instances of cert-manager, it doesn't go into pathological traffic mode.
So, for instance, you mentioned that you think the problem is due to the two instances overwriting a single Resource. Can you make it so that each instance names its Resources randomly so that there is very little chance multiple instances will be contending over one?
Hello,
I am using Rancher v2 and I am curious whether we should upgrade cert-manager using the official repo posted here in the comments, or whether the update will also be available via https://github.com/helm/charts/tree/master/stable/cert-manager?
I found my answer here: https://rancher.com/docs/rancher/v2.x/en/installation/options/upgrading-cert-manager/ :)
So, for instance, you mentioned that you think the problem is due to the two instances overwriting a single Resource. Can you make it so that each instance names its Resources randomly so that there is very little chance multiple instances will be contending over one?
Given the way that Kubernetes controllers work, this isn't really possible. These resources are created and named by end-users, not just by cert-manager. Some resources (i.e. Orders) are created by cert-manager in response to 'user actions', but there's no reliable way for us to shard processing in the way you describe without potentially ending up in situations where no instance will process the resource.
The very decoupled nature of Kubernetes is designed around the idea that different actors can modify/manipulate resources, which aids extensibility, however if a user runs two controllers that 'compete' with each other, you've effectively got a situation where one person is turning the heating on, whilst someone else is continuously turning the heating off.
Leader election et al is meant to address this sort of thing to ensure only one instance runs at a time. When users run multiple instances (and worse, when these instances have mismatched versions), it's effectively like running a concurrency sensitive application without any locks.
We've now made the change (and it's rolled out to v0.11) to make it harder to actually configure things in this way (it was too easy in the past), so I'm keen to see how the results look there.
Given that you're seeing approx. 2% of accounts express abusive traffic patterns, I still think that these instances are down to misconfigurations/bad deployments (and also the issues you describe in #2194) - I am confident we can continue to reduce this number with ongoing changes, and I think you'd agree that, compared to a few months ago on earlier releases, we've managed to reduce the total % of abusive accounts fairly significantly (previously, I believe we had a far higher proportion of our users showing abusive patterns).
Happy to set up a call or any other kind of chat to go over it in a bit more depth. I appreciate this isn't the simplest concept, and it's a bit tricky to explain it all here.
These resources are created and named by end-users, not just by cert-manager. Some resources (i.e. Orders) are created by cert-manager in response to 'user actions', but there's no reliable way for us to shard processing in the way you describe without potentially ending out in situations where no instances will process the resource.
It sounds like "resources" is probably the wrong abstraction for cert-manager to store its internal state in. What if cert-manager stored its internal state on disk in its container? I understand cert-manager may want to make a certificate resource available so other components (like Nginx) can consume it, but cert-manager could treat the certificate resource as output-only, treating its on-disk state as authoritative.
BTW, I tried to look up "resources" in the Kubernetes documentation but didn't find something that seemed to match the concept here. Are we talking about Kubernetes Objects?
Given that you're seeing approx. 2% of accounts express abusive traffic patterns, I still think that these instances are down to misconfigurations/bad deployments
I think you're probably right that misconfiguration is the cause of this excessive traffic, but it's a very common misconfiguration, and I can see why - it seems like it's easy in Kubernetes to lose track of the fact that you've already got a cert-manager instance deployed. Even if it were a rare misconfiguration, it would be important that cert-manager fail cleanly, sending zero traffic rather than sending thousands of times more traffic than normal. While only 2% of cert-manager instances sent high traffic, at times those instances represented 40% of all Let's Encrypt API requests.
I think you'd agree compared to a few months ago on earlier releases, we've managed to reduce the total % of abusive accounts fairly significantly (previously, I do believe we had a far higher proportion of our users with abusive patterns).
Yes, I think cert-manager has made a ton of great progress in recent versions. I really appreciate your work on this! I want to get to the point where 0% of cert-manager clients are abusive, and I think we can get there, but it will probably take some significant design changes.
I think the new locking changes will help significantly.
I don't think it can ever be reduced to 0%. Any software can be abused. What is reasonable is to get accidental abuse to 0%, so that all that remains is malicious.
I think one of the next remaining checks would be to ensure that, if a cluster-wide cert-manager is already installed, a namespace-scoped one won't start.
After some more careful consideration, I think the v0.11 release will significantly improve this due to the change we made to use the status subresource on our CRDs (#2097). This change means that any old version of cert-manager, when attempting to persist its state, will not be able to do so, and thus will not interfere with newer versions of cert-manager still running.
This should massively help, as it'll prevent the 'fighting' behaviour, meaning that newer releases will operate just fine. The older release is likely to sit and not do much (depending on the version), as it won't be able to observe its own state changes and so, won't re-sync the resource.
To further insulate us from issues like this in future, I've also opened #2219 which will go a step further and make the ACME Order details immutable once set on our Order resources. This should, once again, prevent fighting as these values will no longer be able to 'flip-flop'. In the event that two controllers do start to do this, the apiserver will actually reject changes to these fields, which will cause a 4xx error to be returned to the UpdateStatus call, which in turn will trigger exponential back-off (and avoid querying ACME in a tight loop!)
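To illustrate the kind of guard #2219 describes (a paraphrase of the idea, not the actual patch, using simplified stand-in types): once the ACME order URL is set, any update that tries to change it is rejected, so a competing writer gets an error and falls back to exponential back-off instead of flip-flopping the value:

```go
package main

import "fmt"

// OrderStatus is a simplified stand-in for cert-manager's Order status;
// the field here is illustrative, not the real API type.
type OrderStatus struct {
	URL string // ACME order URL, set once when the order is created upstream
}

// validateOrderUpdate rejects any update that changes an already-set order
// URL. Rejecting the write returns an error to the caller, which feeds the
// controller's normal exponential back-off instead of a tight retry loop.
func validateOrderUpdate(old, updated OrderStatus) error {
	if old.URL != "" && updated.URL != old.URL {
		return fmt.Errorf("order URL is immutable once set (was %q, got %q)", old.URL, updated.URL)
	}
	return nil
}

func main() {
	old := OrderStatus{URL: "https://acme-v02.api.letsencrypt.org/acme/order/123/456"}
	update := OrderStatus{URL: "https://acme-v02.api.letsencrypt.org/acme/order/123/789"}
	if err := validateOrderUpdate(old, update); err != nil {
		fmt.Println("update rejected:", err)
	}
}
```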
The above 2, plus the leader election changes, I believe will resolve this issue altogether.
I don't think it can ever be reduced to 0%. Any software can be abused. What is reasonable to do is have 0% of it be accidental, but instead, all remaining is malicious.
Yes - agreed.
I think one of the next remaining checks would be to ensure that, if a cluster-wide cert-manager is already installed, a namespace-scoped one won't start.
This is a difficult heuristic to develop IMO. That said, with the new leader election changes, if a user tries to deploy a cluster scoped version of cert-manager as well as a namespace scoped version, they will both have the same leader election namespace set (unless the user explicitly changes it), which will mean they won't 'compete'.
I'd be more in favour of supporting 'namespace scoped cert-manager' as a first-class feature, and then having a --set namespaceToWatch=abc (naming is hard) which would set the leader election namespace as well as disabling any non-namespaced controllers.
Relevant to this are the discussions we've had in the past about switching to controller-runtime, which has better support for running informers against multiple namespaces at once. But this is starting to veer far off the original topic, so I'll not go into too much detail here.
@jsha regarding cert-manager state, our interactions with other tools in the ecosystem etc., I'd be happy to set up a quick call to go over some of these details. I appreciate your suggestions, but I don't think it is fair that due to a number of users that have misconfigured older and newer clients, we should significantly re-architect the entire project.
BTW, I tried to look up "resources" in the Kubernetes documentation but didn't find something that seemed to match the concept here. Are we talking about Kubernetes Objects?
Yes
I think you're probably right that misconfiguration is the cause of this excessive traffic, but it's a very common misconfiguration, and I can see why - it seems like it's easy in Kubernetes to lose track of the fact that you've already got a cert-manager instance deployed.
I am not sure if it's fair to say it's easy to lose track - I think some users do it by accident, similar to how some users may install two copies of the same application on their own computer. Kubernetes is a powerful tool, but it must be used properly (and changes to our leader election config will help users to not burn themselves here).
Even if it were a rare misconfiguration, it would be important that cert-manager fail cleanly, sending zero traffic rather than sending thousands of times more traffic than normal.
:+1: - agreed, and I think we've made some significant changes in v0.11 (and also, v0.12), that are mentioned above. I'm hopeful that this will quash that remaining 2%, and I believe that if you dig into the numbers, you'll observe far fewer users of v0.11 and v0.12 that are showing abusive traffic patterns whilst also running an older version.
I want to get to the point where 0% of cert-manager clients are abusive, and I think we can get there, but it will probably take some significant design changes.
I think we can get there too (although excluding users who are intentionally trying to circumvent the rules/cause problems). That said, I am fairly confident it won't take significant design changes.
the v0.11 release will significantly improve this due to the change we made to use the status subresource on our CRDs (#2097).
I've also opened #2219 which will go a step further and make the ACME Order details immutable once set on our Order resources.
These both look like really positive changes (though I'll admit to not fully understanding how #2097 works). I'm optimistic these will further reduce the problem.
due to a number of users that have misconfigured older and newer clients, we should significantly re-architect the entire project.
I think it depends on how serious you consider this class of bugs, and whether you think it's the user's fault when they hit them. I've tended to consider this an issue with the software rather than the user, because every user I've reached out to doesn't realize what's going on - there's no good way for them to notice.
A big part of why I consider this class of bugs to be serious is that it's non-linear. Yes, a user can always make a mistake and install two copies of a program; that would typically use twice the resources. But under our current understanding, installing two copies of cert-manager can result in 100,000-1,000,000 times as many requests as installing just one copy (based on an expected "normal" traffic of 10 requests per renewal period, or more generously 10 requests per day).
It's not clear to me how big a reorganization it would be to move to internal storage; it may be prohibitive. I'd be curious to hear more. My intuition that it's worthwhile is because, so far, a series of fixes to address specific symptoms haven't succeeded in fully addressing the problem. Usually that means that the problems need a more structural approach.
I'd be happy to set up a quick call to go over some of these details
Thanks! I'll send you an email to schedule.
... 'helpful' GitHub automation closed this issue - re-opening it so that we can explicitly close it when we're happy.
(to clarify, the changes in #2219, and various others, should definitely help significantly in those cases where users are running multiple instances of cert-manager with leader election not properly enabled, but we should wait for some kind of statistical validation of that first!)
Hi,
I'm getting the following error on my GKE cluster. My domain is registered with godaddy.com.
My question is: what is causing the error below, and is there something I need to do on godaddy.com so that this .well-known path can be verified?
cert-manager/controller/challenges "msg"="propagation check failed" "error"="failed to perform self check GET request 'http://www.abc.com/.well-known/acme-challenge/AZl_evY1PscNKi95EdfFNuYG_Gl75-Hi8we7Efbyy7I': Get http://www.abc/.well-known/acme-challenge/AZl_evY1PscNKi95E3EFNxcdfl75-Hi8we7Efbyy7I: dial tcp: lookup www.abc.in on 10.12.244.10:53: no such host" "dnsName"="www.abc.com" "resource_kind"="Challenge" "resource_name"="abc-3611830638-3386842356-1600275291" "resource_namespace"="default" "type"="http-01"
@kushwahashiv this issue isn't the place for general cert-manager support - could you join the #cert-manager channel over on https://slack.k8s.io and we can work through providing support to get this working for you?
@munnerz OK, I have joined the Slack channel. I deleted the whole GKE cluster; let me re-create the cluster and its deployments etc. and then I will connect on Slack if the issue still persists. Thanks for your prompt reply.
/ Shiv
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
I think we can close this one now. There are currently only 2 instances of cert-manager v0.8.x in our top clients (though there are a smattering of other versions showing up). Thanks for all your work on the issue!