Airflow: GCSToBigQueryOperator - Not generating the Unique BQ Job Name

Created on 19 Oct 2020 · 13 Comments · Source: apache/airflow

I was using GoogleCloudStorageToBigQueryOperator and then wanted to switch to GCSToBigQueryOperator. When I run parallel data loads from GCS to BQ (I generate the tasks dynamically in a for loop), every task gets the same BQ job name, test-composer:us-west2.airflow_1603109319 (I think it is built from the node name plus the current timestamp).

Error

ERROR - 409 POST https://bigquery.googleapis.com/bigquery/v2/projects/centili-prod/jobs: Already Exists: Job test-composer:us-west2.airflow_1603109319
Traceback (most recent call last)

This prevents the second table from being imported; it has to wait for a minute (the retry delay in the DAG) before it is imported.

The older operator, however, produces a proper job ID such as Job_someUUID.

With 3 parallel table imports:

  • _GCSToBigQueryOperator_ - BQ Job ID for all the load jobs: Job test-composer:us-west2.airflow_1603109319
  • _GoogleCloudStorageToBigQueryOperator_ - BQ job id for table1(job_NYEBXXXXXvoflDiEj2j), table2(job_9xGl7WlVXXXXXWBriaqbhLQY), table3(job_aqmVLXXXXXL2YqVCGAqb_5EtW)
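The difference between the two sets of IDs above can be sketched in a few lines. This is only an illustration, not the provider's actual code: `timestamp_job_id` and `uuid_job_id` are hypothetical helpers, and the assumption that the new operator derives the ID from the epoch time in whole seconds is inferred solely from the `airflow_1603109319` shape in the error message.

```python
import uuid


def timestamp_job_id(now_seconds: float) -> str:
    # Hypothetical helper: mimics the `airflow_<epoch seconds>` shape from the log.
    return f"airflow_{int(now_seconds)}"


def uuid_job_id() -> str:
    # Hypothetical helper: a random ID, similar in spirit to `job_<random>`.
    return f"job_{uuid.uuid4().hex}"


# Three tasks that start within the same second.
start = 1603109319.0
timestamp_ids = [timestamp_job_id(start + offset) for offset in (0.0, 0.2, 0.7)]
uuid_ids = [uuid_job_id() for _ in range(3)]

print(len(set(timestamp_ids)))  # 1: all three tasks share one job ID
print(len(set(uuid_ids)))       # 3: every task gets its own job ID
```

If the assumption holds, any two load jobs submitted in the same second would collide, which would match the 409 "Already Exists" response.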
bug Google

All 13 comments

I just started using this operator via the backports package a few days ago and I hit this at least one in ten times I invoke the operator, making it unusable without manual supervision. I do not use a dynamic dag but I do have a few GCSToBigQueryOperators.

[2020-10-22 08:40:58,732] {base_task_runner.py:113} INFO - Job 59262: Subtask TASKNAME [2020-10-22 08:40:58,730] {taskinstance.py:1135} ERROR - 409 POST https://bigquery.googleapis.com/bigquery/v2/projects/PROJECTNAME/jobs: Already Exists: Job PROJECTNAME:EU.airflow_1603356058@-@{"workflow": "DAGNAME", "task-id": "TASKNAME", "execution-date": "2020-10-22T08:29:42.258293+00:00"}

Yeah, maybe a bug. Try the previous version (GoogleCloudStorageToBigQueryOperator); it works.

This seems to be the method that incorrectly generates the job IDs but the format does not seem to match exactly what I get in the logs.
Meanwhile I will indeed switch to the older operator since you report it always generates job IDs correctly, thanks!

Are you sure you're running the current backport versions? I was getting this issue in 2020.6.24, but it looks like it is solved in 2020.10.5. Unfortunately, 2020.10.5 isn't compatible with the new bigquery/pubsub libraries (2.0.0), so I can't test.

I'm using apache-airflow-backport-providers-google==2020.10.5 and seem to get a similar error as well.

[2020-11-08 23:31:58,200] {taskinstance.py:1150} ERROR - 409 POST https://bigquery.googleapis.com/bigquery/v2/projects/banksalad/jobs: Already Exists: Job test:US.airflow_1604878317

Even the BigQuery hook has the same issue. I'm not a coder :) so I'm not able to find the exact cause.

With contrib.bigquery there is no issue; the new providers.bigquery has this problem everywhere (the hook, the BigQuery operator, GCS to BQ, the empty table creator, BQ to BQ, etc.).

Apparently they upgraded and fixed it in the Airflow backport 2020.10.29 version.

Awesome!!! So I'm closing this now.

https://github.com/apache/airflow/issues/11282

For Cloud Composer users like me, trying to use 2020.10.29 with the composer-1.12.5-airflow-1.10.10 image fails with a very cryptic error:

The command '/bin/sh -c bash installer.sh $COMPOSER_PYTHON_VERSION  fail' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: step exited with non-zero status: 1

This happens in the Cloud Build step. A warning from earlier in the log may provide some hints as to why:

+ python3 -m pipdeptree --warn fail
Warning!!! Possibly conflicting dependencies found:
* google-cloud-memcache==0.2.0
 - google-api-core [required: >=1.17.0,<2.0.0dev, installed: 1.16.0]

So for now, at least, Cloud Composer users must either use the old non-buggy operator or wait for Google to patch this.
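To make the pipdeptree warning concrete, here is a minimal sketch of the range check it reports. The `parse_version` and `satisfies` helpers are hypothetical and deliberately simplified: they compare only the leading numeric components, so `2.0.0dev` is treated as `2.0.0`, which is not full PEP 440 ordering.

```python
import re


def parse_version(text: str) -> tuple:
    # Simplification: keep only leading digits of each dot-separated part,
    # so "2.0.0dev" becomes (2, 0, 0).
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed: str, lower: str, upper: str) -> bool:
    # True when lower <= installed < upper.
    return parse_version(lower) <= parse_version(installed) < parse_version(upper)


# google-cloud-memcache==0.2.0 requires google-api-core >=1.17.0,<2.0.0dev
print(satisfies("1.16.0", "1.17.0", "2.0.0dev"))  # False: installed version is too old
print(satisfies("1.17.0", "1.17.0", "2.0.0dev"))  # True
```

The installed google-api-core 1.16.0 falls below the required lower bound, which is consistent with the conflict flagged in the log.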

Can you try the latest Composer image, composer-1.13.0-airflow-1.10.12, and the latest backport provider, apache-airflow-backport-providers-google==2020.10.29?

@muscovitebob I believe it is indeed a dependency problem that is very likely to be addressed in the latest version of the Composer image. From what we know, the Composer team keeps the images updated with the releases of Apache Airflow and the providers, and the next image will even include the latest Google providers baked in, but you should try to install the latest provider there. Note that there is a new version of the Google backport provider available as a release candidate (voting on it finishes on Thursday), so you might even try installing that version instead:

https://pypi.org/project/apache-airflow-backport-providers-google/2020.11.13rc1/

It has even more fixes:

Commit | Committed | Subject
-- | -- | --
b2a28d159 | 2020-11-09 | Moves provider packages scripts to dev (#12082)
fcb6b00ef | 2020-11-08 | Add authentication to AWS with Google credentials (#12079)
2ef3b7ef8 | 2020-11-08 | Fix ERROR - Object of type 'bytes' is not JSON serializable when using store_to_xcom_key parameter (#12172)
0caec9fd3 | 2020-11-06 | Dataflow - add waiting for successful job cancel (#11501)
cf9437d79 | 2020-11-06 | Simplify string expressions (#12123)
91a64db50 | 2020-11-04 | Format all files (without excepions) by black (#12091)
fd3db778e | 2020-11-04 | Add server side cursor support for postgres to GCS operator (#11793)
f1f194026 | 2020-11-04 | Add DataflowStartSQLQuery operator (#8553)
41bf172c1 | 2020-11-04 | Simplify string expressions (#12093)
5f5244b74 | 2020-11-04 | Add template fields renderers to Biguery and Dataproc operators (#12067)
4e8f9cc8d | 2020-11-03 | Enable Black - Python Auto Formmatter (#9550)
8c42cf1b0 | 2020-11-03 | Use PyUpgrade to use Python 3.6 features (#11447)
45ae145c2 | 2020-11-03 | Log BigQuery job id in insert method of BigQueryHook (#12056)
e324b37a6 | 2020-11-03 | Add job name and progress logs to Cloud Storage Transfer Hook (#12014)
6071fdd58 | 2020-11-02 | Improve handling server errors in DataprocSubmitJobOperator (#11947)
2f703df12 | 2020-10-30 | Add SalesforceToGcsOperator (#10760)
e5713e00b | 2020-10-29 | Add drain option when canceling Dataflow pipelines (#11374)
37eaac3c5 | 2020-10-29 | The PRs which are not approved run subset of tests (#11828)
79cb77199 | 2020-10-28 | Fixing re pattern and changing to use a single character class. (#11857)
5a439e84e | 2020-10-26 | Prepare providers release 0.0.2a1 (#11855)
240c7d4d7 | 2020-10-26 | Google Memcached hooks - improve protobuf messages handling (#11743)
8afdb6ac6 | 2020-10-26 | Fix spellings (#11825)
872b1566a | 2020-10-25 | Generated backport providers readmes/setup for 2020.10.29 (#11826)
b680bbc0b | 2020-10-24 | Generated backport providers readmes/setup for 2020.10.29

> Can you try the latest Composer image, composer-1.13.0-airflow-1.10.12, and the latest backport provider, apache-airflow-backport-providers-google==2020.10.29?

Upgrade with these went well, thanks much! Did not realise there was a new release.
