Site-kit-wp: Retry failed requests: implement exponential backoff

Created on 9 Sep 2020  ·  31Comments  ·  Source: google/site-kit-wp

Feature Description

Currently when the plugin makes an API request that results in an error we display the error to the user and stop processing. We can improve this behavior for specific errors by implementing exponential backoff. We will retry the request in (roughly) 1, 2 and 4 seconds.

Let's start by turning on the client's default retry behavior (which handles 50x errors and curl/connection errors) and set maximum retries to 3. We can then follow up for other API specific codes that should be retried. Exact conditions/codes/reasons for sending a request will vary by API.

For example, for the Analytics Reporting API, the following error codes should use exponential retry:

429 - Quota Error: Number of recent failed reporting API requests is too high, please implement exponential back off.

For the Analytics Management API, when the error message is "Quota Error: Rate limit for writes exceeded." we should use exponential backoff to retry.


_Do not alter or remove anything below. The following sections will be managed by moderators only._

Acceptance criteria

  • When the PHP client encounters a retry-able errors for API requests, it should use exponential backoff to re-request that data, trying up to 3 times.

Implementation Brief

  • When Google API request are made, the client should handle error conditions with up to 3 retries:

  • In includes/Core/Authentication/Clients/OAuth_Client.php:setup_client configure the $client to add support for retries:

// Enable exponential retries, try up to three times.
$client->setConfig( 'retry', array( 'retries' => 3 ) );

Test Coverage

We can test that the default client uses exponential backoff. We shouldn't need tests for actual retries - the client itself includes these tests.

  • In tests/phpunit/integration/Core/Authentication/Clients/OAuth_ClientTest.php expand the test_get_client test which calls Google_Site_Kit_Client:get_client. Add assertions that verify retries is set to 3:
$retry = $client->getConfig( 'retry' );
assertEquals( $retry[ 'retries' ], 3 );

Visual Regression Changes

  • No changes in VRT.

QA Brief

  • To test that this is working as expected we need to be able to mock API error responses. I will work separately to add this capability in the tester plugin.

Changelog entry

  • Implement exponential backoff to retry Google service API requests a limited amount of time if they fail with temporary errors.
P1 Eng Enhancement

Most helpful comment

@adamsilverstein @aaemnnosttv It's a bit late now to make more changes here like move this behind a feature flag, and I'm also not sure that makes sense since the implementation for the feature is completed - I think we're okay launching this, as only wide usage will give us an idea about the impact on API requests anyway. Once this goes out on Monday, let's monitor whether it has any impact on API requests or quotas and react if needed.

All 31 comments

Noting that this feature is already available in the PHP client. Unless we see a reason to put this in the client, we should be able to leverage the PHP client feature.

Retries AND exponential backup support are built into the Google_Client (php client) we use. I found the feature in the code base first, then here: https://github.com/googleapis/google-api-php-client/pull/416 and https://github.com/googleapis/google-api-php-client/issues/1487

I did not find any documentation on the feature, but in general we should be able to pass a map of retry parameters:

/**
  * @var array $retryMap Map of errors with retry counts.
  */
$client = new Google_Client();
$client->setClassConfig(
    'Google_Service_Exception',
    'retry_map',
    array(
        // Don't retry 500 errors
        '500' => 0,
        // Retry 503 errors once
        '503' => 1,
        // Retry rate limit errors until we hit `retries`
        'rateLimitExceeded' => -1,
        'userRateLimitExceeded' => -1
    )
);
...

The logic here includes the ability to retry based on code or 'reason' which I think covers our use cases, as long as the errors we want to retry on provide unique code/reasons.

This looks like a good addition to make @adamsilverstein, although why not implement it for _all Google API requests_ (i.e. in OAuth_Client::setup_client) rather than using an alternate client configuration just for Analytics requests?

Unless we see a reason to put this in the client, we should be able to leverage the PHP client feature.

While it seems fairly straightforward to implement this on the PHP client, it's arguably more important to protect _the host machine_ from being bombarded with too many requests than the Google APIs. I suppose this is mostly what the batch request architecture is for but we could still do something similar on the client as well; perhaps that's a separate issue though.

why not implement it for all Google API requests

That is the plan ultimately. This idea is to roll it out on a limited basis to makes sure it doesn't have any negative consequences. It it probably safe, I'm just being cautious.

While it seems fairly straightforward to implement this on the PHP client, it's arguably more important to protect the host machine from being bombarded with too many requests than the Google APIs.

I agree protecting the server is important, hence batching... currently though, we don't re-request from the client until a refresh, right?

The main motivation for this work is to reduce user facing errors (eg. blue boxes) when data requests for a reason that might be fixed by a re-request.

This idea is to roll it out on a limited basis to makes sure it doesn't have any negative consequences. It it probably safe, I'm just being cautious.

By trying to isolate this to Analytics only, can potentially cause new problems; see below.

In third-party/google/apiclient-services/src/Google/Service/AnalyticsReporting.php configure the passed $client to add support for retries and a retry map:

$client->setClassConfig( 'Google_Task_Runner', 'retries', '3' );

We can't set things on the provided $client because the same instance is shared between all modules. This would cause Analytics requests to have a side-effect of changing the configuration of the client.

In order to do this safely, we would need to create a new Client instance with all the same configuration in Analytics::setup_services based on the given $client. I'm not sure there is an easy way to do this and we definitely don't want to duplicate what is done in OAuth_Client::setup_client.

This may be as simple as $analytics_client = clone $client before making changes but I'm not sure.

I agree protecting the server is important, hence batching... currently though, we don't re-request from the client until a refresh, right?

That _should_ be the case, yes; a failed request should not be re-requested until the page is loaded again.


I think configuring this globally should be safe to do for 5xx errors – afterall, these are only triggered by Google_Service_Exceptions, correct?

This may be as simple as $analytics_client = clone $client before making changes but I'm not sure.

Yea, I was hoping it might be. Still, I can see the value in applying this globally.

I think configuring this globally should be safe to do for 5xx errors – afterall, these are only triggered by Google_Service_Exceptions, correct?

You are probably right, I am likely being overly cautious here. My main concern is that API responses are not consistent across the APIs, for example how they report Quota errors, which can be 429 or 403 errors. Still, the 500x class errors and also retrying on curl/connection errors makes sense - which is probably why they are defaults.

I will update the IB to apply this globally to the client. The plan is to land the error monitoring first and collect a couple weeks of data before launching this change. That way we can gauge the impact of the change on error rates.

re: testing, Felix suggested we could extend our client to support returning a mocked response. Great idea! This should be as straightforward as having the client handle execute and check for a debug setting before calling the parent method:
https://github.com/googleapis/google-api-php-client/blob/61517efe678b3c20fe806e76d506204261ca2863/src/Google/Client.php#L834-L881

@adamsilverstein

  • Regarding the class referenced in setClassConfig we need to make sure we reference our prefixed-version rather than the instance in the root namespace which may not exist. i.e. \Google\Site_Kit_Dependencies\Google_Task_Runner. In practice, it's better to use Google_Task_Runner::class and import the correct reference in the file though.
  • This issue is missing some of the newer fields in the IB section, it would be great if you could update it to include these
  • Missing a priority label
  • Currently estimated as an 11 for a 1 line change?

@aaemnnosttv -

  • Changed estimate to 3 hrs. We reduced complexity by choosing defaults and applying to all APIs. To clarify: should the estimate include time for QA/testing?
  • Added a priority label.
  • Added IB sections for test coverage and Visual Regression changes. Thinking we only need to test that the default configuration is correct since the client itself includes test for the actual retry functionality.

To clarify: should the estimate include time for QA/testing?

Yes (basically everything from the Execution column and beyond).

Thinking we only need to test that the default configuration is correct since the client itself includes test for the actual retry functionality.

Agreed, we shouldn't need to have automated tests for functionality provided by the official client which is already well tested I'm sure. Wouldn't hurt to add a test to ensure the configuration is present, but I don't think we need to specifically check the number of retries it's set to.

Regarding the (get|set)ClassConfig methods, I don't see these on the Client class we have. There are however, (get|set)Config methods – are you maybe looking at a newer version?

Regarding the (get|set)ClassConfig methods, I don't see these on the Client class we have. There are however, (get|set)Config methods – are you maybe looking at a newer version?

I noticed that it seemed to be missing from the main client as well; I thought I was missing something since all the examples as well as PRs I found use this approach (eg https://hotexamples.com/examples/google/Client/getClassConfig/php-client-getclassconfig-method-examples.html).

Looking at the code, it seems like we need to provide a retry_map to the client and the runner will use it (which contradicts the examples). I will report back here with my findings.

@adamsilverstein The examples you linked to there reference this fork of the client which hasn't been updated in about 5 years and is based on v1.1.5.

@aaemnnosttv I wasn't able to find current documentation, so I dug into the client source code a bit to figure out how to do this. In short, the client's 'retry' config gets passed to Google_Http_REST, then on to Google_Task_Runner. I verified this was working by testing with passing an invalid value (-1) and got the expected error.

IB updated accordingly.

IB ✅

@felixarntz @aaemnnosttv - I suggest we delay implementation of this ticket by at least one release so we can collect API error data and observe how this change impacts those errors/rates.

Researching the PSI API I learned that it returns a 500 error for non-retryable user defined errors (such as a query for a url that doesn't go anywhere or respond). For this reason we should endeavor to not retry 500 errors for PSI request, or skip 500 errors entirely across requests if we need to. Potentially we can turn off retries in the client before making the PSI request?

@adamsilverstein do these errors include a specific reason or anything else we can identify it by rather than just a 500? It seems odd that it wouldn't be possible to return a 500 for a different reason.

possibly, we can check then exclude just for that reason if that works to override the 500 default in the error mapping.

@adamsilverstein after taking a quick look, it seems that we just need to identify the reason and set a custom $retry_map in the config (it doesn't seem to allow for extension) that has {reasonNotToRetry} => Google_Task_Runner::TASK_RETRY_NEVER.

That reason appears to apply to any status code though so that's something to consider.

Researching the PSI API I learned that it returns a 500 error for non-retryable user defined errors (such as a query for a url that doesn't go anywhere or respond). For this reason we should endeavor to not retry 500 errors for PSI request, or skip 500 errors entirely across requests if we need to.

@adamsilverstein – did we ever find out if there was a reason we could use to not retry for this case?

@aaemnnosttv - Looks like we can use reason "lighthouseError" - trying a bad domain I got:

{
  "error": {
    "code": 500,
    "message": "Lighthouse returned error: FAILED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested. Make sure you are testing the correct URL and that the server is properly responding to all requests. (Details: net::ERR_CONNECTION_FAILED)",
    "errors": [
      {
        "message": "Lighthouse returned error: FAILED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested. Make sure you are testing the correct URL and that the server is properly responding to all requests. (Details: net::ERR_CONNECTION_FAILED)",
        "domain": "lighthouse",
        "reason": "lighthouseError"
      }
    ]
  }
}

Hey @aaemnnosttv + @adamsilverstein, I've added a custom $retry_map and pass it to the Google_Client as a config that should now be passed to the REST Runner during the execute function in the library.

Updated here: https://github.com/google/site-kit-wp/pull/2442/files

@adamsilverstein is this one you could help QA? If not, could you please update the QAB? 🙂

Sure, I'm going to look into returning mocked responses so we can directly recreate/test the retry conditions.

I'm still digging into the best way to test this, it might be easiest to mock the responses manually for testing, I am also investigated modifying vendor/google/apiclient/src/Http/REST.php to force an error.

Since we are relying on the client for the correct behavior, perhaps checking that the Runner is passed the correct $retryMap data?

I spent some time debugging this to verify that the underlying PHP client was picking up the correct retry_map that we are trying to set. Unfortunately this is not working correctly as is.

The underlying client looks for a config setting named retry_map when setting up the REST request against the Google API:

https://github.com/googleapis/google-api-php-client/blob/81696e6206322e38c643cfcc96c4494ccfef8a32/src/Client.php#L893-L900

Whereas in Site Kit we are setting a config value named retryMap

https://github.com/google/site-kit-wp/blob/a00b131b703d894d8bdb90e2e2b272c8f97868fc/includes/Core/Authentication/Clients/OAuth_Client.php#L258

I tested changing this to retry_map and verified that after that change the Runner was correctly set with the retry map data:

https://github.com/googleapis/google-api-php-client/blob/fa3641c1a6e3e8832d70e70e93b42327c4efab78/src/Http/REST.php#L62-L64

In summary QA failed, the underlying client is NOT using the retry map settings, we need to change

$client->setConfig( 'retryMap', $this->retry_map );
to
$client->setConfig( 'retry_map', $this->retry_map );

Thanks @adamsilverstein ! @benbowler can you create a quick follow-up PR for the change described above?

@aaemnnosttv follow up PR created and linked.

@adamsilverstein ready for another pass 👍

@aaemnnosttv This worked well in my testing. I verified that:

  • The retry_map configuration is passed to the underlying PHP client, specifically in the Runner class.

I was able to reproduce an API error for lighthouse _by filtering googlesitekit_site_url to use a non existent domain_.

add_filter( 'googlesitekit_site_url', function() { return "not-a-validdomain.co"; } );

  • Noted that in this case, the reason was lighthouseError (500 code) and the retry was not attempted - as expected.
  • Verified that without the 'lighthouseError' => Runner::TASK_RETRY_NEVER line in our config, the Runner retries the API request (based on the 500 error code) three times before finally returning an exception.

✅ QA looks good.

ps. we had previously discussed rolling this change out to users because of the potential for unintended side effects. Should we consider putting the retry config behind a feature flag?

@adamsilverstein @aaemnnosttv It's a bit late now to make more changes here like move this behind a feature flag, and I'm also not sure that makes sense since the implementation for the feature is completed - I think we're okay launching this, as only wide usage will give us an idea about the impact on API requests anyway. Once this goes out on Monday, let's monitor whether it has any impact on API requests or quotas and react if needed.

Our customization of the retry conditions can be removed once this PR is merged: https://github.com/googleapis/google-api-php-client/pull/2010

Was this page helpful?
0 / 5 - 0 ratings