Aws-sdk-java: Retry logic needs different kinds of max retries

Created on 24 Dec 2016 · 11Comments · Source: aws/aws-sdk-java

Or maybe clock skew errors should just be retried once? If I have a critical rate limited resource I may want to retry MANY times, maybe even forever. But for 403 kinds of errors I may not want to retry at all (since I'm not authorized to use the resource). It would be a drag for that to be retried many times. ENTER CLOCK SKEW. RetryUtils.isClockSkewError includes AuthFailure, InvalidSignatureException, and SignatureDoesNotMatch. Is there value to retrying a clock skew error more than once? Maybe the solution is to only retry clock skew errors once no matter what MaxErrorRetry is set to. However, if there is value to clock skew errors being retried multiple times then I would like a more complicated enhancement, such as being able to set MaxErrorRetry based on the type or category of error.

Currently, I cannot simply use the Retry mechanism in the SDK. For all my client invocations I have to write another retry loop around them so that RateLimit, Socket, or 500 series errors are retried many many times but so that AuthFailure, etc. are only retried once (in the case of a clock skew). Since clock skew handling is internal to AmazonHttpClient I end up setting the MaxErrorRetry to one and then writing a loop for my max error retry for those errors I want to backoff and retry many times.

guidance response-requested

Source

rdifalco

All 11 comments

Have you looked at using your own custom retry condition? While we have tried to provide a default retry condition that will work for most customers it will not always be what works for everyone. You can provide your own custom retry condition by extending the RetryPolicy class and passing that in to the ClientConfiguration.

This way you can exclude retrying on clock skew errors at all or build in logic to only retry on them if this is the first retry.

spfink on 29 Dec 2016

We could add a check here to only retry on those error codes if a significant enough skew is detected. This would prevent most cases of retrying on legitimate signature mismatch or auth failures that have no hope of succeeding. Would this work for you?

shorea on 3 Jan 2017

👍1

Yes, I think that is what I have to resort to, i.e. just creating my own custom RetryPolicy. That said, it seems like the general case (the case the default retry policy should handle) is to only try once on clock skew. I can't imagine a case where users would want a 403 to be retried as often as (say) a rate limit exception or even an IOException. So yeah, I can work around it which is great, but I just think the default policy does not match what users would want or expect (especially since users can modify Max Retries without changing the default policy).

rdifalco on 4 Jan 2017

@rdifalco The V2 retry policy isn't exposed yet but if you are interested in using it we can expose it. The current retry policy doesn't provide a good mechanism for passing state between invocations which is really what you need here.

I think adding a check for the actual skew amount would help too though. Right now we don't have high confidence that the request failed because of a clock skew issue since the error codes are so generic. Checking the actual skew against some threshold would prevent us from retrying when it doesn't make sense to (and sense we keep the offset subsequent requests should succeed immediately).

shorea on 4 Jan 2017

👍1

@shorea, doubling back on this issue, I came up with something like this to work within the current SDK features. Does this seem basically what I want? FWIW, having to interrogate errors twice seems a little clumsy, but overall I think this works.

    enum RetryAction {
        PAUSE,
        BACKOFF,
        IMMEDIATELY,
        RATE_LIMIT,
        CLOCK_SKEW,
        NOT_RETRYABLE
    }

    private static RetryAction toRetryAction(AmazonClientException exception) {

        // Check for an underlying IOException first, these can be retries without delay
        Throwable throwable = exception.getCause();
        while (throwable != null) {
            // Always retry on client exceptions caused by IOException
            if (throwable instanceof IOException) {
                return IMMEDIATELY;
            }

            if (throwable instanceof XMLStreamException) {
                // This can be caused by a read-timeout occurring while parsing and
                // streaming a large result.
                // A bug for this has been filed here:
                // https://github.com/aws/aws-sdk-java/issues/892
                return IMMEDIATELY;
            }

            throwable = throwable.getCause();
        }

        /*
         * Throttling is reported as a 400 error from newer services. To try
         * and smooth out an occasional throttling error, we'll pause and
         * retry, hoping that the pause is long enough for the request to
         * get through the next time.
         */
        if (RetryUtils.isThrottlingException(exception)) {
            return RATE_LIMIT;
        }

        /*
         * For 500 internal server errors and 503 service
         * unavailable errors, we want to retry, but we need to use
         * an exponential back-off strategy so that we don't overload
         * a server with a flood of retries.
         */
        if (RetryUtils.isRetryableServiceException(exception)) {
            return BACKOFF;
        }

        // Only retry on a subset of service exceptions
        if (RetryUtils.isClockSkewError(exception)) {
            return CLOCK_SKEW;
        }

        // Down to string searches...
        if (exception.getMessage().startsWith("Unable to unmarshall response")) {
            return IMMEDIATELY; // Just a network blip, retry immediately
        }

        return NOT_RETRYABLE;
    }

    private static class ExtensiveRetryCondition implements RetryCondition {

        private final int maxRetries;

        public ExtensiveRetryCondition(int maxRetries) {
            this.maxRetries = maxRetries;
        }

        @Override
        public boolean shouldRetry(AmazonWebServiceRequest amazonWebServiceRequest, AmazonClientException e, int tries) {
            switch (toRetryAction(e)) {
                case CLOCK_SKEW:
                    return tries == 1; // Retry clock skew errors just once

                case NOT_RETRYABLE:
                    return false;
            }

            return true;
        }
    }

    private static class RateLimitedBackoffStrategy implements BackoffStrategy {

        private final RateLimiter rateLimiter;

        public RateLimitedBackoffStrategy() {
            this(null);
        }

        public RateLimitedBackoffStrategy(RateLimiter rateLimiter) {
            this.rateLimiter = rateLimiter;
        }

        @Override
        public long delayBeforeNextRetry(
            AmazonWebServiceRequest request, AmazonClientException exception, int retries) {
            switch (toRetryAction(exception)) {
                case CLOCK_SKEW:
                case IMMEDIATELY:
                    return 0;

                case RATE_LIMIT:
                    if (rateLimiter != null) {
                        rateLimiter.acquire(1);
                        return 0;
                    }

                    // fall through to pause

                case PAUSE:
                    return nextPause();

                case BACKOFF:
                    return nextBackoff(retries);
            }

            throw exception;
        }
    }

rdifalco on 27 Feb 2017

@rdifalco this looks like a good approach to me. There may be cases were you don't retry clock skew errors when perhaps you should but I suspect this will be so rare in occurrence that it won't cause you problems (i.e. if your first request fails due to a IOException and the next request fails with a clock skew error but since you're only retrying on clock skew if it's the first failure, you don't retry).

shorea on 28 Feb 2017

Oh yeah, @shorea good point. I could fix that if that retry condition was state-full per request or sent the last exception as context. Otherwise I don't see a good way around it.

One final question on this @shorea (and thanks for your patience and time), if a connection is hosed will the SDK detect that between retries and get a new connection? I used to do this retry logic _around_ an SDK call. That made it possible for the next retry to get a new connection from the pool for each retry. With this logic _inside_ a RetryPolicy I'm not sure if that will still be the case.

rdifalco on 28 Feb 2017

The SDK will attempt to validate the connection after it has been idle in the pool for a certain period of time. This is done by default every 5 seconds (as of the latest version of the SDK) but is configurable via https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/ClientConfiguration.java#L1821. This is a best effort as there is always the chance the server closes the connection after we check it's validity in which case, some type of IOException is usually thrown (I've seen NoHttpResponseException before) and most retry strategies should treat that as retryable. In practice, I'm not really sure how common this scenario is. I suspect our stale connection checking covers most cases where the connection is closed but I don't have data to back this up.

There is also the idleConnectionReaper which uses a single thread to proactively prune idle or stale connections from the pool. This is disabled by default. https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/ClientConfiguration.java#L1273

shorea on 28 Feb 2017

👍1

@shorea The documentation above is hard for me to parse. If I do not use the IdleConnectionReaper will a pool shrink when not being used? I have long lived clients and would like connections in the pool to expire and get cleaned up. The language in withValidateAfterInactivityMillis makes it unclear to me if it does this job. If the above setting does this job then what are the use cases for the IdleConnectionReaper? From how I read these docs, it looks like pools will only shrink when not used if the idleConnectionReaper is enabled. Is that right? Thanks again.

As an aside, the code in the previous comment won't work because of #1068. Looks like I have to do something like this:

        // Modify the client configuration to only retry once to handle
        // clock skew errors.
        builder.getClientConfiguration()
            .withThrottledRetries(false)
            .withMaxErrorRetry(1)
            .withRetryPolicy(new RetryPolicy(
                (request, e, retries) -> RetryUtils.isClockSkewError(e),
                (request, e, retries) -> 0,
                1, false));

Then I need to put all my Client calls through the following method. It seems like the only way to have a robust idempotent call mechanism because of #1068. But I've been wrong before. 😸

    public <C, U extends AmazonWebServiceRequest, R>
    R apply(C client, BiFunction<C, U, R> function, U request) {

        for (int tries = 0; ; ++tries) {
            try {
                return function.apply(client, request);

            } catch (AmazonClientException e) {
                if (tries == maxRetries) {
                    throw e;
                }

                switch (RetryAction.from(e)) {
                    case CLOCK_SKEW: // Should be handle once in the Client Config RetryPolicy
                    case NOT_RETRYABLE:
                        throw e;

                    case RATE_LIMIT:
                        Uninterruptibles.sleepUninterruptibly(nextPause(), TimeUnit.MILLISECONDS);
                        break;

                    case BACKOFF:
                        Uninterruptibles.sleepUninterruptibly(nextBackoff(tries), TimeUnit.MILLISECONDS);
                        break;
                }
            }
        }
    }

rdifalco on 12 Mar 2017

Sorry about the delay on this.

validateAfterInactivityMillis -> If a connection has been sitting idle for this number of milliseconds when we go to use it, we'll attempt to validate the connection "seems to be valid". This is trying to catch connections that were closed, but we didn't notice.

maxIdleConnectionMillis -> If a connection has been sitting idle for this number of milliseconds when we go to use it, we'll close it and use a newer connection that hasn't been idle this long.

useReaper -> If this is enabled, we perform the cleanup described in maxIdleConnectionMillis above periodically for all connections. This ensures that the idle connections are cleaned up even if we're not actively trying to use them.

Did that answer your question?