Go: x/crypto/acme: ACME client's internal retry implementation results in hanging retries on 429s

Created on 23 Jul 2020  ·  15 Comments  ·  Source: golang/go

What version of Go are you using (go version)?

$ go version
go version go1.14.4 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN="/Users/viola/go/bin"
GOCACHE="/Users/viola/Library/Caches/go-build"
GOENV="/Users/viola/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/viola/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.14.4/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.14.4/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/viola/crypto/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/0s/nyt41p_j69d8vzq1qfkg5nrh0000gn/T/go-build384678003=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Start by triggering a rate limit so that Let's Encrypt returns a 429. I triggered one by exceeding Let's Encrypt's default limit of 5 duplicate certificates per week:

  • Call AuthorizeOrder with a valid AuthzID identifier.
  • Repeat the request 6 times.
  • On the 6th attempt, the request hangs and runs far longer than the earlier ones. It eventually hit a 45-second timeout set by an upstream caller.
  • Unless the context is canceled or times out, or a "non-retriable error status" is received, the client retries indefinitely. In this case the client never returns, because 429 is treated as a retriable error code and the rate limit stays in effect for the rest of the week, so every retry gets the same response.
  • Here is the 429 that the client was receiving and retrying on in this example:
429 urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for exact set of domains: violababola.best: see https://letsencrypt.org/docs/rate-limits/
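
The hang described above can be reproduced in miniature without contacting a real ACME server. The following is a minimal sketch, not the x/crypto/acme implementation: the `always429` transport, the `postUntilDone` loop, the `simulate` helper, and the `https://acme.invalid/new-order` URL are all hypothetical names chosen for illustration. It assumes a client that treats 429 as retriable and has no cap on attempts, so the only thing that ends the loop is the caller's context deadline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// always429 is a stub transport that answers every request with 429, standing
// in for a server whose weekly rate limit is already exhausted.
type always429 struct{}

func (always429) RoundTrip(req *http.Request) (*http.Response, error) {
	if err := req.Context().Err(); err != nil {
		return nil, err
	}
	return &http.Response{
		StatusCode: http.StatusTooManyRequests,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// postUntilDone is a hypothetical retry loop, not the x/crypto/acme code: it
// treats 429 as retriable and has no cap on attempts, so only ctx can stop it.
func postUntilDone(ctx context.Context, hc *http.Client, url string) (int, error) {
	attempts := 0
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
		if err != nil {
			return attempts, err
		}
		attempts++
		res, err := hc.Do(req)
		if err != nil {
			return attempts, err
		}
		res.Body.Close()
		if res.StatusCode != http.StatusTooManyRequests {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-time.After(10 * time.Millisecond): // fixed pause, then retry
		}
	}
}

// simulate runs the loop against the stub and reports how many attempts were
// made and whether the loop ended only because the deadline expired.
func simulate(timeout time.Duration) (attempts int, timedOut bool) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	hc := &http.Client{Transport: always429{}}
	n, err := postUntilDone(ctx, hc, "https://acme.invalid/new-order")
	return n, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	attempts, timedOut := simulate(200 * time.Millisecond)
	fmt.Printf("attempts=%d timedOut=%v\n", attempts, timedOut)
}
```

With a 45-second upstream timeout in place of the 200 ms used here, the same pattern would keep retrying for the full 45 seconds before returning a deadline error.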

What did you expect to see?

  • The client should fail fast and not retry on the client-side error http.StatusTooManyRequests, since in this case it is not a recoverable response code. At the very least, there should be a way to configure this.
  • There should also be some bound on the number of retries.
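
Both expectations can be illustrated with a small sketch. This is not how the ACME client is written; `doBounded`, `retriable`, and `errGaveUp` are hypothetical names, and the policy of retrying only on 5xx responses is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errGaveUp is returned when every allowed attempt got a retriable status.
var errGaveUp = errors.New("retry budget exhausted")

// retriable is an assumed policy: only 5xx responses are worth retrying,
// so 429 (and every other 4xx) fails on the first attempt.
func retriable(code int) bool {
	return code >= 500 && code <= 599
}

// doBounded calls fn up to maxAttempts times. It returns the number of calls
// made and nil on a 2xx status, a status error on a non-retriable status, or
// errGaveUp once the budget is spent.
func doBounded(maxAttempts int, fn func() int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		code := fn()
		if code >= 200 && code <= 299 {
			return attempt, nil
		}
		if !retriable(code) {
			return attempt, fmt.Errorf("giving up on status %d %s", code, http.StatusText(code))
		}
	}
	return maxAttempts, errGaveUp
}

func main() {
	// A 429 fails after a single call.
	n, err := doBounded(5, func() int { return http.StatusTooManyRequests })
	fmt.Println(n, err)

	// A persistent 503 is retried, but only up to the budget.
	n, err = doBounded(3, func() int { return http.StatusServiceUnavailable })
	fmt.Println(n, err)
}
```

A real client would also wait between attempts; the backoff is left out here to keep the two properties (fail fast on 429, bounded attempts) easy to see.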

What did you see instead?

The client retries on the client-side error http.StatusTooManyRequests.
It seems that http.StatusTooManyRequests should not be treated as a retriable response code here, since it is not recoverable.

Related PR Fix

  • This PR introduces a custom ShouldRetry func option that can be set on the ACME client to allow the default set of retriable response error codes to be overridden. This will keep the current behaviour backwards compatible, but provide more flexible retry configuration.
NeedsInvestigation
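
To make the shape of such an option concrete, here is a rough sketch of an overridable retry hook with a backwards-compatible default. The `Client` type below is a hypothetical stand-in, not `acme.Client`, and the `ShouldRetry` signature is an assumption; the actual signature in the PR may differ. The default is modelled on the behaviour this issue describes (429 treated as retriable); treating 5xx as retriable is also an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// Client is a hypothetical stand-in for an ACME client, used only to show
// how an optional retry hook could sit next to a default policy.
type Client struct {
	// ShouldRetry, if non-nil, replaces the default retry decision.
	// The signature is assumed for this sketch.
	ShouldRetry func(res *http.Response) bool
}

// defaultShouldRetry models the behaviour described in the issue:
// 429 is retried, and (by assumption) so are 5xx responses.
func defaultShouldRetry(res *http.Response) bool {
	return res.StatusCode == http.StatusTooManyRequests || res.StatusCode >= 500
}

// serverErrorsOnly is an example override that stops retrying on 429.
var serverErrorsOnly = func(res *http.Response) bool {
	return res.StatusCode >= 500
}

// shouldRetry consults the hook when set and falls back to the default,
// so callers that never set the field keep the existing behaviour.
func (c *Client) shouldRetry(res *http.Response) bool {
	if c.ShouldRetry != nil {
		return c.ShouldRetry(res)
	}
	return defaultShouldRetry(res)
}

// decide is a small helper that builds a response with the given status
// code and asks the client whether it would retry.
func decide(c *Client, code int) bool {
	return c.shouldRetry(&http.Response{StatusCode: code})
}

func main() {
	legacy := &Client{}
	strict := &Client{ShouldRetry: serverErrorsOnly}
	fmt.Println(decide(legacy, http.StatusTooManyRequests)) // true
	fmt.Println(decide(strict, http.StatusTooManyRequests)) // false
}
```

Leaving the field nil keeps the default, which is what makes this kind of option non-breaking for existing callers.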

All 15 comments

/cc @FiloSottile @x1ddos

Hi @viola! Thank you for contributing to the Go project and welcome.

Hi @cagedmantis, good talking to you again! Please let me know if there is anything else I can elaborate on to help this issue move forward.
PS. Go cubs go!

Related issue #40161

cc @FiloSottile 👋 Just doing a gentle nudge on this one 👀

Hello folks and apologies for the ping @FiloSottile @x1ddos. I just wanted to check in on this and see if we could move it forwards? The PR that fixes this is ready for review: https://github.com/golang/crypto/pull/149

Please do let me know if there's anything I can do to help trudge this along.

Hi all. I'm a developer at @1password and we're currently experiencing this exact issue with the library. It'd be a great help if we could get golang/crypto#149 moved along and merged before the upcoming code freeze. Please let me know if any testing is needed to assist with that.

cc @FiloSottile if you have some time, would love to pick this one up. Especially since there is more interest now ☝️

I also hit this issue. Could the fix be prioritized? :)

@icholy @alicethorne-ab @ZhiminXiang thanks for letting me know you've hit the same issue. This further validates that our golang/crypto#149 fix would be really nice to bring over to crypto.
@FiloSottile @x1ddos @cagedmantis folks any eyes on that PR would be greatly appreciated! Please let me know if there is anything I can do to help. ❤️

I would love to see this fixed since 429 is being returned in a case where you will need to retry for days before it would succeed. That doesn't seem to match the purpose of 429.

Hey @viola, sorry it has taken so long to address this. I agree this isn't the correct behavior. As far as I am aware, there is only one class of 429 returned from most ACME servers that is likely to be retriable in the short term (an overall req/s limit). Rather than expanding the API surface of the client, I think it makes sense to just remove the behavior of retrying on 429 responses in general.

Change https://golang.org/cl/272927 mentions this issue: x/crypto/acme: only retry GET requests on 429
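
Going only by the CL title, the policy would amount to a retry decision that depends on the request method as well as the status code. The sketch below is a guess at that shape, not the code in the CL: `retryOn` is a hypothetical name, and retrying on 5xx regardless of method is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryOn is a hypothetical predicate: a 429 is retried only for GET
// requests, while 5xx responses are retried for any method (assumed).
func retryOn(method string, code int) bool {
	switch {
	case code == http.StatusTooManyRequests:
		return method == http.MethodGet
	default:
		return code >= 500
	}
}

func main() {
	fmt.Println(retryOn(http.MethodGet, http.StatusTooManyRequests))  // true
	fmt.Println(retryOn(http.MethodPost, http.StatusTooManyRequests)) // false
}
```

Under a rule like this, a POST such as the one behind AuthorizeOrder would return the 429 to the caller immediately instead of retrying.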

@rolandshoemaker thank you so much for looking into it for me! While https://golang.org/cl/272927 will address my original issue with retrying on 429 in AuthorizeOrder, which uses POST, I still think there is value in my proposed extension to the API, mainly the ability for the caller to control what to retry on if desired or needed. Given that https://github.com/golang/crypto/pull/149 is not a breaking change, would you still consider it? Once again, thank you for taking the time to review my issue and PR! <3

Typically, when we extend the public API of a package, we go through a proposal process before moving to a CL, because once an API has landed it becomes much harder to make any changes (although less so for x/ packages, it is still something we want to follow). If you could write up the proposed API for this change in this issue, it would help clear up a few things. It doesn't need to be super detailed, just what the public methods would look like and what the use case is.

In particular, I'd be interested in what you see as the use cases for this API beyond the initial 429 one.
