dbt: Allowing for steps to retry

Created on 24 Jul 2019 · 4 Comments · Source: fishtown-analytics/dbt

Issue

BigQuery can hit transient errors, so it's important to be able to configure a number of retries for a step in a DAG.

Issue description

During a production run, we hit a transient BigQuery error. It failed that particular query and every subsequent query that depended on its output.

Results

I expected dbt to have some configuration that would let us set a number of allowable retries.

System information

This was a run on dbt Cloud.

Steps to reproduce

This was a transient error on BigQuery's side, so it is hard to reproduce.

Feature

Feature description

Allow retry logic to be configured for an individual step or for the whole DAG.
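To make the request concrete, here is a minimal Python sketch of how a per-step retry count could override a DAG-wide default. It is not dbt code: `DAG_DEFAULT_RETRIES`, `STEP_RETRIES`, `retries_for`, `run_step`, and the step name `orders_incremental` are all hypothetical names made up for illustration, and `RuntimeError` only stands in for a transient warehouse error.

```python
from typing import Callable, Dict, Optional

# Hypothetical settings: none of these names are real dbt configuration keys.
DAG_DEFAULT_RETRIES = 1
STEP_RETRIES: Dict[str, int] = {"orders_incremental": 3}


def retries_for(step: str) -> int:
    """Return the retry count for a step; a per-step value overrides the default."""
    return STEP_RETRIES.get(step, DAG_DEFAULT_RETRIES)


def run_step(step: str, action: Callable[[], str]) -> str:
    """Run one step, re-running it up to retries_for(step) extra times."""
    attempts = retries_for(step) + 1
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return action()
        except RuntimeError as exc:  # stand-in for a transient warehouse error
            last_error = exc
    # Every attempt failed, so surface the most recent error.
    assert last_error is not None
    raise last_error
```

With these values, `orders_incremental` gets up to four attempts, while any step without its own entry gets two.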

Who will this benefit?

Anyone who relies on dbt for production and can't have transient errors killing the whole DAG.

bigquery enhancement

Source

zahink

👍3

All 4 comments

We have similar issues with Redshift (especially Redshift Spectrum), where retries would be very beneficial.

advincze on 9 Oct 2019

👍1

It seems #1579 is also related.

advincze on 9 Oct 2019

👍1

We also ran into similar challenges in our BigQuery dbt run.

For example, in production, backfills of historical data and scheduled incremental runs sometimes happen at the same time and occasionally update the same table.

We then got errors like this, which could be mitigated with some retries:

  domain: "cloud.helix.ErrorDomain" code: "QUERY_ERROR" argument: "Could not serialize access 
to table projectA:dataset1.table_A1 due to concurrent update" debug_info: "
[CONCURRENT_UPDATE] Table modified by concurrent UPDATE/DELETE/MERGE DML or truncation 
at 1578010970837. Storage set job_uuid: 3ca02cc1-8d32-4c3c-afdc-76f429c1add1_00008, 
instance_id: InsertedData, Reason: code=CONCURRENT_UPDATE message=Could not serialize 
access to table projectA:dataset1.table_A1 due to concurrent update debug=Table modified by 
concurrent UPDATE/DELETE/MERGE DML or truncation at 1578010970837. Storage set job_uuid: 
3ca02cc1-8d32-4c3c-afdc-76f429c1add1_00008, instance_id: InsertedData

hui-zheng on 3 Jan 2020

👍2
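One way to act on an error like the one above is to decide from its text whether a retry is worth attempting. The sketch below is only an illustration in Python: the patterns are copied from the logged message, treating them as retryable is an assumption, and `RETRYABLE_PATTERNS` and `is_retryable` are hypothetical names, not part of dbt or of any BigQuery client.

```python
import re

# Markers copied from the logged error; treating them as retryable is an assumption.
RETRYABLE_PATTERNS = [
    re.compile(r"CONCURRENT_UPDATE"),
    re.compile(r"Could not serialize access to table .+ due to concurrent update"),
]


def is_retryable(message: str) -> bool:
    """Return True if the error text looks like a concurrent-update conflict."""
    # Collapse line breaks and repeated spaces first, since the logged
    # message is wrapped across several lines.
    flat = " ".join(message.split())
    return any(pattern.search(flat) for pattern in RETRYABLE_PATTERNS)
```

A check like this would let a retry loop re-run the conflicting statement while still failing fast on errors that a retry cannot fix, such as a SQL syntax error.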

FYI #1963 adds retries to BigQuery when queries fail with a 500 status code (internal server error).

I'm going to close out this issue, as BigQuery is really the only place where 1) we see transient errors like this and 2) we receive a status code indicating that retrying can solve the problem. Happy to re-open if anyone has any further thoughts on this topic.
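For readers who want a picture of the behaviour described here, the following is a rough Python sketch of retrying only when a query fails with a 500 status code. It is not the code from #1963: `QueryError`, `run_with_retries`, and the exponential backoff between attempts are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QueryError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code


def run_with_retries(
    query: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call query(), retrying with exponential backoff on status code 500 only."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        try:
            return query()
        except QueryError as exc:
            # Only a 500 is treated as transient; any other status, or
            # running out of retries, re-raises the error immediately.
            if exc.status_code != 500 or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the waiting behaviour can be swapped out, for example in tests; with the defaults, the waits between attempts are 1, 2, and 4 seconds.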