BigQuery can hit transient issues, so it's important to be able to configure a number of retries for a step in a DAG.
During a production run, we hit a transient BigQuery error that failed that particular BigQuery query and all subsequent ones that depended on its output.
What I expected was that dbt would have some configuration that would allow us to set a number of allowable retries.
This was a run on dbt Cloud.
It was a transient error on BigQuery's side, so it is hard to reproduce.
Allow for configuring retry logic for an individual step or for the whole DAG.
Anyone who relies on dbt for production and can't have transient errors killing the whole DAG.
We have similar issues with Redshift (especially Redshift Spectrum), where retries would be very beneficial.
It seems #1579 is also related.
We also ran into similar challenges in our BigQuery dbt run.
For example, in production we have situations where a backfill of historical data and a scheduled incremental run happen at the same time and sometimes update the same table.
We then got errors like this, which could be mitigated with some retries:
domain: "cloud.helix.ErrorDomain" code: "QUERY_ERROR" argument: "Could not serialize access
to table projectA:dataset1.table_A1 due to concurrent update" debug_info: "
[CONCURRENT_UPDATE] Table modified by concurrent UPDATE/DELETE/MERGE DML or truncation
at 1578010970837. Storage set job_uuid: 3ca02cc1-8d32-4c3c-afdc-76f429c1add1_00008,
instance_id: InsertedData, Reason: code=CONCURRENT_UPDATE message=Could not serialize
access to table projectA:dataset1.table_A1 due to concurrent update debug=Table modified by
concurrent UPDATE/DELETE/MERGE DML or truncation at 1578010970837. Storage set job_uuid:
3ca02cc1-8d32-4c3c-afdc-76f429c1add1_00008, instance_id: InsertedData
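For illustration, here is a minimal sketch of the kind of retry being asked for: re-run a step a bounded number of times with exponential backoff when it fails with a transient error such as the concurrent-update one above. This is not dbt's implementation or configuration. `TransientError`, `run_with_retries`, `max_retries`, and `base_delay` are hypothetical names made up for this example, and in practice you would need your own rule for deciding which warehouse errors count as transient.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable warehouse error
    (for example, a concurrent-update or internal server error)."""


def run_with_retries(step, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call step() and retry it on TransientError.

    Waits base_delay * 2**attempt seconds between attempts and re-raises
    the last error once max_retries retries have been used up.
    The sleep function is injectable so the backoff can be tested.
    """
    attempt = 0
    while True:
        try:
            return step()
        except TransientError:
            if attempt >= max_retries:
                raise  # out of retries: let the failure propagate
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

Any other exception type propagates immediately, so only errors explicitly classified as transient are retried; a genuine SQL error would still fail the step on the first attempt.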
FYI #1963 adds retries to BigQuery when queries fail with a 500 status code (internal server error).
I'm going to close out this issue, as BigQuery is really the only place where 1) we see transient errors like this and 2) we receive a status code indicating that retrying can solve the problem. Happy to re-open if anyone has any further thoughts on this topic.