Airflow: Transfer table/data from mssql to mssql

Created on 17 Sep 2020  路  9Comments  路  Source: apache/airflow

Description

Sync table from business database to datawarehouse database incrementally (both mssql).

Use case / motivation

I want to keep two database tables sync. One is the business database. Another one is the datawarehouse database. (They are installed separaterly, i.e., in different machine).

It seems GenericTransfer it desinged to do this kind of task. But I see there is only insert in DbApiHook (which OdbcHook and MsSqlHook inherit from).

I guess something like https://github.com/apache/airflow/pull/8625 is needed for mssql.

providers good first issue feature

Most helpful comment

@turbaszek I would like to work on this.

All 9 comments

Thanks for opening your first issue here! Be sure to follow the issue template!

@turbaszek I would like to work on this.

@shizidushu I have been trying to create a proper sollution for this issue with my newborn :D (thats why it took so long).
Solution: The equvialent of "REPLACE INTO" or "INSERT ... ON CONFLICT DO UPDATE" in Sql Server is built-in MERGE operation.
I can provide that by adding _generate_insert_sql method into MsSqlHook. After that it will be available for upserting ("small") data.

But in the description it is said that, this will be used for incremental load to datawarehouse. According to my experience, using MERGE directly on datawarehouse table may produce unwanted results (in terms of concurrency and performance). I have just written a blog post about this topic.
https://akocukcu.github.io/incremental-etl-pipeline-airflow.html
In a nutshell, we should know that it has specific drawbacks in wide usage.

My suggestion for incremental etl is to use GenericTransfer combined with Staging table and additional MsSqlHooks for executing procedures.

At last, I can add upsert functionality by using Merge.

@akocukcu Thank you for your post. I didn't realize there are problems with MERGE operation before.

In our case, we have a data warehouse using Sql Server, there are many merge in the current code base (low performance). I was considering refactor the etl with airflow in september. Now our team decide to switch to airflow/hive/spark from scratch. And now I'm struggling with how to do incremental etl with hive (without merge, insert overwrite partition approach).

@shizidushu for your case again I kindly suggest using "Control Table" pattern with a Staging table.
You can insert data directly (maybe bulk load) to empty Staging table.
After that since staging table and datawarehouse table are in the same server, upsert operation will complete at lightning speed.

BTW, I think in the same sense there are preoperator parameters inside a lot of transfer operators.
For example if you look at the definition of mysql_preoperator parameter of PrestoToMySqlOperator, it says:

"... typically use to truncate of delete in place of the data coming in ...".

So basically, first you can truncate the staging table, second insert the (only new or updated) oltp data to staging table, finally you can upsert staging data to datawarehouse table.

Just like in the blog post, you can test that approach in your environment and data.

Independent from your case I can add upsert functionality by adding MERGE to MsSqlHook. But I have concerns, if folks use it for upserting large data.
Maybe I can add a comment just like they did in GenericTransfer:

"... This is meant to be used on small-ish datasets ..."

@shizidushu Also since you are planning to use Hive, HiveToMySqlOperator is a very elegant example (Which can be adapted to use with MsSql).

@akocukcu Thanks for you suggestion.

@shizidushu you are welcome :)
So do you think we should close this issue or add upsert functionality to MsSqlHook.

@akocukcu It's time to close it.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ryanahamilton picture ryanahamilton  路  3Comments

grbinho picture grbinho  路  3Comments

JavierLopezT picture JavierLopezT  路  4Comments

ephraimbuddy picture ephraimbuddy  路  3Comments

mik-laj picture mik-laj  路  3Comments