Apache Airflow version: 1.10.*
Backport packages version: 2020.10.5rc1
What happened:
Although good work has been done to improve BigQueryInsertJobOperator by assigning a job_id based on a combination of dag_id, task_id, execution_date and an additional uniqueness_suffix, other operators that create BigQuery jobs (e.g. GCSToBigQueryOperator) do not take advantage of this and still create a job_id based on a timestamp.
https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/hooks/bigquery.py#L1475
If more than one task based on this operator is launched at the same time, this causes an error due to a duplicate job_id:
Already Exists: Job my-project:EU.airflow_1601894411
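The collision can be reproduced in isolation. The sketch below is not the hook's actual code: `timestamp_job_id` is a hypothetical stand-in that assumes the id is built as `airflow_` plus a Unix timestamp truncated to whole seconds, which is what the error message above suggests.

```python
def timestamp_job_id(now: float) -> str:
    # Hypothetical stand-in: a job id with one-second resolution,
    # shaped like the "airflow_1601894411" id in the error above.
    return f"airflow_{int(now)}"


# Two tasks starting half a second apart within the same second.
start = 1601894411.25
first = timestamp_job_id(start)
second = timestamp_job_id(start + 0.5)

# Both tasks end up with the same id, so the second submission
# would be rejected as "Already Exists".
print(first, second, first == second)
```

Any two tasks that start within the same wall-clock second would get the same id under this assumption, regardless of which DAG or task they belong to.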
What you expected to happen:
Any task that starts a BigQuery job should create a unique job_id.
Perhaps the method of job_id creation needs to move into the BigQuery hook, if possible?
https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/operators/bigquery.py#L2060
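As a rough illustration of what a shared helper might look like if the logic moved into the hook, here is a minimal sketch. The function name `build_job_id`, its parameters, and the choice of an MD5 digest of the job configuration as the uniqueness suffix are all assumptions for illustration, not the existing Airflow API. The final substitution reflects BigQuery's rule that job IDs may contain only letters, digits, underscores and dashes.

```python
import hashlib
import json
import re
from datetime import datetime


def build_job_id(dag_id: str, task_id: str, execution_date: datetime,
                 configuration: dict) -> str:
    """Hypothetical helper: derive a job id from the task's identity."""
    # Uniqueness suffix: a digest of the job configuration (an assumption).
    config_repr = json.dumps(configuration, sort_keys=True)
    suffix = hashlib.md5(config_repr.encode("utf-8")).hexdigest()
    raw = f"airflow_{dag_id}_{task_id}_{execution_date.isoformat()}_{suffix}"
    # BigQuery job ids allow only letters, digits, underscores and dashes.
    return re.sub(r"[^0-9a-zA-Z_\-]+", "_", raw)


when = datetime(2020, 10, 5, 10, 40, 11)
config = {"load": {"sourceUris": ["gs://example-bucket/data.csv"]}}
job_a = build_job_id("my_dag", "load_a", when, config)
job_b = build_job_id("my_dag", "load_b", when, config)
print(job_a)
print(job_b)
```

With this approach two tasks launched in the same second get different ids because their task_id differs, while a retry of the same task with the same configuration yields the same id. Whether that last property is desirable depends on how retries should be handled.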
cc: @turbaszek - WDYT ?
And @nathadfield, maybe you or someone from your team would like to fix it? I am sure we can help with hashing out the details and with review.
@potiuk We're more than happy to give it a try but we've just got to find the time.
I think that the simple fix is to use uuid or more detailed timestamp, @nathadfield WDYT?
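A minimal sketch of the uuid variant, assuming the existing `airflow_` prefix is kept. `random_job_id` is a hypothetical name, not a function in the provider package.

```python
import uuid


def random_job_id(prefix: str = "airflow") -> str:
    # uuid4().hex is 32 hexadecimal characters, so the result contains
    # only letters, digits and underscores.
    return f"{prefix}_{uuid.uuid4().hex}"


# Generate many ids back to back; none should repeat.
ids = [random_job_id() for _ in range(1000)]
print(ids[0])
print(len(set(ids)) == len(ids))
```

Unlike a timestamp, the id no longer depends on when the task starts, so concurrent tasks do not collide. The trade-off is that the id carries no information about which DAG or task created the job.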
Not sure how you want to categorise it. Yes, all the old operators use a random id.
This came about because I'm trying to use the new operators in existing production DAGs in preparation for Airflow v2 and this is an issue that is preventing me from doing so.
@turbaszek Extending the timestamp - perhaps to milliseconds - would probably fix it but I feel it would be better overall if there was a consistent methodology for creating the job_id across all operators that create BigQuery jobs.
I agree but I have no capacity now and the backports are going to be released in next few days. @TobKed maybe you have some space?
I understand. It's not critical for us right now as we'll just continue to use the current operators. Perhaps this is something for the next round of backports then?
@nathadfield I can push a simple fix and then we can improve it... better done than perfect 😄
Totally agree!
@nathadfield #11287 I think this should do, WDYT?
Probably done in #11287