OS: macOS 10.14.3
Python: 3.7.3
google-cloud-bigquery: 1.15.0
I'm trying to reproduce the condition where BigQuery is already executing 100 concurrent queries and refuses to execute further jobs, raising `Forbidden: 403 Exceeded rate limits: too many concurrent queries for this project_and_region`.
Let's say that my BigQuery project is already executing 100 queries and I request a new job execution using `client.query()`. It seems to me that `DEFAULT_RETRY` was designed to retry the operation when the `rateLimitExceeded` error is received, but it doesn't seem to work. What happens is that `client.query()` returns a job with the error already set, and `job.result()` finally raises the error.
What I'd like to happen is for `client.query()` to understand that there are too many concurrent queries and to retry the query later, according to the retry object.
@ralbertazzi BigQuery's retry mechanism does check for `rateLimitExceeded` as a retryable reason:
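(Paraphrasing from memory, the predicate in `google/cloud/bigquery/retry.py` looks roughly like this; this is a sketch, not an exact copy of the source:)

from google.api_core import retry

_RETRYABLE_REASONS = frozenset(
    ["rateLimitExceeded", "backendError", "internalError", "badGateway"]
)

def _should_retry(exc):
    # Only structured errors that carry a retryable "reason" are retried.
    if not hasattr(exc, "errors") or len(exc.errors) == 0:
        return False
    return exc.errors[0]["reason"] in _RETRYABLE_REASONS

DEFAULT_RETRY = retry.Retry(predicate=_should_retry)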
The retry is not infinite: it has a default deadline of 120 seconds. Maybe your code is hitting that limit?
Can you please provide a reproducible code example? At a minimum, can you provide the full traceback?
@tseaver The code is reproducible only if it points at a GCP project where BigQuery is already executing 100 concurrent queries.
from google.cloud.bigquery import Client
client = Client(project="my-gcp-project")
# Since the default is retry=DEFAULT_RETRY, this should wait for BigQuery to finish
# other queries for up to 120 seconds; instead, it returns after ~1 second and the
# job object already contains the error.
job = client.query("SELECT 1")
assert len(job.errors) == 1
assert job.errors[0]["reason"] == "rateLimitExceeded"
job.result() # this raises the error immediately, without calling any API
# Same thing happens if I don't use any retry logic!
job = client.query("SELECT 1", retry=None)
assert len(job.errors) == 1
assert job.errors[0]["reason"] == "rateLimitExceeded"
job.result() # this raises the error immediately, without calling any API
Traceback is the same for both cases:
File "/Users/ra/projects/big-query-data-retrievers-lib-python/tests/pico/utils.py", line 119, in <module>
job.result()
File "/Users/ra/.local/share/virtualenvs/big-query-data-retrievers-lib-python-u6kdBxdG/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 2863, in result
super(QueryJob, self).result(timeout=timeout)
File "/Users/ra/.local/share/virtualenvs/big-query-data-retrievers-lib-python-u6kdBxdG/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 732, in result
return super(_AsyncJob, self).result(timeout=timeout)
File "/Users/ra/.local/share/virtualenvs/big-query-data-retrievers-lib-python-u6kdBxdG/lib/python3.7/site-packages/google/api_core/future/polling.py", line 127, in result
raise self._exception
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too many concurrent queries for this project_and_region. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
There is a retry-with-backoff occurring in your first case: it just isn't retried for long enough. Can you try your example with a bigger deadline?
from google.cloud.bigquery.retry import DEFAULT_RETRY
# Arbitrary deadline: season to taste
query_deadline = 5 * 60 # seconds
query_retry = DEFAULT_RETRY.with_deadline(query_deadline)
job = client.query("SELECT 1", retry=query_retry)
Just to be sure, I tried with a 1-hour deadline:
from google.cloud.bigquery.retry import DEFAULT_RETRY

my_retry = DEFAULT_RETRY.with_deadline(3600)
job = client.query("SELECT 1", retry=my_retry)
assert len(job.errors) == 1
assert job.errors[0]["reason"] == "rateLimitExceeded"
job.result()
As I already said in the previous post, the `client.query()` call returns in 1-2 seconds (__it doesn't wait for 1 hour__) and the returned job contains the error. Calling `job.result()` then raises the `Forbidden` error.
I also tried putting a breakpoint in the `_should_retry` function defined here, and it never gets hit. So I think the problem is not wrong retry logic but the retry logic never being invoked.
To help with debugging: the `api_response` returned here is a dict with a `status` field, which I've copied below.
# 'status' field of api_response
{
'errorResult': {
'reason': 'rateLimitExceeded',
'location': 'dremel_connection',
'message': 'Exceeded rate limits: too many concurrent queries for this project_and_region. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors'
},
'errors': [
{
'reason': 'rateLimitExceeded',
'location': 'dremel_connection',
'message': 'Exceeded rate limits: too many concurrent queries for this project_and_region. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors'
}
],
'state': 'DONE'
}
I see the problem now: the retry logic gets triggered only if an exception is raised, and that exception would be raised in the `api_request` method defined in the core library (here).
Unfortunately, even when the rate limit is exceeded, the API response comes back with `status_code = 200`!! Therefore no exception is raised and the retry logic is never triggered.
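To illustrate, here is a simplified paraphrase of the retry loop in `google.api_core.retry` (the real `retry_target` also tracks a deadline; this sketch keeps only the behavior relevant here):

import time

def retry_target(target, predicate, sleep_generator):
    for sleep in sleep_generator:
        try:
            # A 200 response with an error payload returns normally here,
            # so the predicate below is never even consulted.
            return target()
        except Exception as exc:
            if not predicate(exc):
                raise
        time.sleep(sleep)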
An API that returns `status_code = 200` and a 403 error as a string in the JSON response seems like a badly designed API to me... do you have any control over the BigQuery APIs?
> An API that returns `status_code = 200` and a 403 error as a string in the JSON response seems like a badly designed API to me... do you have any control over the BigQuery APIs?
Nope, but we can bug the people who do. Summoning @tswast to do the bugging.
@ralbertazzi And thanks for your persistence in chasing that down!
No problem. I hope this gets fixed, since I'd rather not have to write custom retry logic on top of the library :(
Also, please check if the other errors ("backendError", "internalError", "badGateway") are actually captured or not.
This is a tough one. There are actually several API calls that happen when `query()` is called: the first inserts the job, the second gets the query results, and then one gets the job resource. If the job has failed, it's actually the call to get the query results that fails.
I'm guessing that's what's going on here. The job insert succeeds, but then the job status fails due to rate limiting.
@shollyman Do you have any thoughts? Should we be retrying the whole query job in certain job failures?
@tswast Actually, when you call `query()` there is only one API call, which is performed in the `_begin` method. And that single API call is the one that returns the error (as a string).
Here I dump all the parameters used to make the request:
url = 'https://www.googleapis.com/bigquery/v2/projects/pico-testing/jobs'
headers = {
'Accept-Encoding': 'gzip',
'Content-Type': 'application/json',
'X-Goog-API-Client': 'gl-python/3.7.2 gax/1.14.0 gapic/1.17.0 gccl/1.17.0',
'User-Agent': 'gl-python/3.7.2 gax/1.14.0 gapic/1.17.0 gccl/1.17.0'
}
method = 'POST'
data = '{"jobReference": {"jobId": "4ee391f6-0274-476a-addb-3470e645a681", "projectId": "my-project"}, "configuration": {"query": {"useLegacySql": false, "query": "SELECT 1"}}}'
The thing is, I shouldn't even be allowed to call `query_job.result()` if `client.query()` returned an errored `QueryJob`.
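In the meantime, a defensive pattern one could use (a minimal sketch of my own, not anything the library does for you):

job = client.query("SELECT 1")
if job.error_result:
    # Surface the failure immediately instead of deferring it to result().
    raise RuntimeError(job.error_result["message"])
rows = job.result()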
@ralbertazzi If the call to the jobs.insert API (inside _AsyncJob.begin) returns a 200 response, then the payload will be used to set the properties of the job. Can you verify that job._properties actually contains an error response after the _begin call, rather than a jobs resource?
@tseaver Yes, I can confirm that: `query_job._properties["status"]` is the dict that I reported here.
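(For anyone following along, a quick way to verify this, assuming the `job` object from the earlier snippet:)

print(job._properties["status"]["errorResult"]["reason"])  # prints 'rateLimitExceeded'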
Btw, if you need to clog your BQ project and do it for free, you can spawn 100 jobs that run this query :)
CREATE TEMP FUNCTION is_prime_slow(x INT64)
RETURNS BOOL
LANGUAGE js AS '''
for(var i = 2; i < x; i++)
if(x % i === 0) return false;
return true;
''';
SELECT num
FROM UNNEST(GENERATE_ARRAY(1000000, 2000000)) AS num
WHERE is_prime_slow(num)
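For example, a small (hypothetical) driver, where `CLOG_SQL` is a string holding the statement above:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
# Fire off enough interactive queries to hit the 100-concurrent-query cap.
jobs = [client.query(CLOG_SQL) for _ in range(100)]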
@ralbertazzi Thanks for clarifying: I had missed that the response was actually a jobs resource.
@tswast Can you confirm whether it is expected behaviour that jobs.insert would return a 200 response containing resource with that status, rather than returning the 403 response?
@ralbertazzi Thank you for the code to reproduce.
Unfortunately, this is expected behavior. The `rateLimitExceeded` error you are encountering is due to a limit on the number of concurrent INTERACTIVE-priority queries. In the backend, the job is successfully created but then fails, which means we can't retry the job in the same way we retry HTTP error codes.
To work around this restriction, it's recommended that you run your queries with BATCH priority instead.
from google.cloud import bigquery

client = bigquery.Client()

jobs = []
for i in range(200):
    job_config = bigquery.QueryJobConfig(priority="BATCH")
    job = client.query("""
        CREATE TEMP FUNCTION is_prime_slow(x INT64)
        RETURNS BOOL
        LANGUAGE js AS '''
        for(var i = 2; i < x; i++)
            if(x % i === 0) return false;
        return true;
        ''';
        SELECT num
        FROM UNNEST(GENERATE_ARRAY(1000000, 2000000)) AS num
        WHERE is_prime_slow(num)
    """, job_config=job_config)
    jobs.append(job)

error_jobs = [job for job in jobs if job.error_result]
print("Tried {} jobs, got {} errors.".format(len(jobs), len(error_jobs)))
BATCH would be good if the guaranteed execution window were much less than the 24 hours it unfortunately is now. It's not an option if you have a stricter SLA with your own customers, e.g. a pipeline that must complete within 1 hour.
@ralbertazzi if job execution (not insertion) fails, implement your own backoff-and-retry by submitting a new job; see https://github.com/googleapis/google-cloud-python/issues/5555
Note the library has changed since I posted that, so replace the relevant call with
job_ref = bq.get_job(job_id).to_api_repr()
_Btw, this serves as a reminder that that issue is still open..._
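For instance, a minimal backoff-and-resubmit sketch (the helper name and the set of retryable reasons are my choices, not library API):

import time

from google.api_core.exceptions import GoogleAPICallError

RETRYABLE_REASONS = {"rateLimitExceeded", "backendError", "internalError", "badGateway"}

def query_with_resubmit(client, sql, max_attempts=5, base_delay=2.0):
    for attempt in range(max_attempts):
        job = client.query(sql)  # each attempt is a brand-new job
        try:
            return job.result()
        except GoogleAPICallError:
            reasons = {e.get("reason") for e in (job.errors or [])}
            if not (reasons & RETRYABLE_REASONS) or attempt == max_attempts - 1:
                raise
        time.sleep(base_delay * 2 ** attempt)  # exponential backoff between submissions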