Dbt: SQL select statements for source freshness and tests are using unnecessary DB transaction logic

Created on 10 Sep 2020 · 2 comments · Source: fishtown-analytics/dbt

Describe the bug

When running dbt source snapshot-freshness checks and dbt tests, I see in the Snowflake query logs that all of these SELECT queries are wrapped in database transaction logic:

For source freshness, I see a BEGIN and COMMIT per SQL statement
For DBT tests I see a BEGIN and ROLLBACK

All these steps consume unnecessary CPU cycles. Since these are _only_ SELECT statements, wrapping them in transaction logic is unnecessary as far as I am concerned. As we plan to run a lot of source checks and tests, all these CPU cycles _will add up eventually_, costing us money.

Steps To Reproduce

Can be reproduced by running dbt source snapshot-freshness or dbt test and then looking at the Snowflake query log.

Expected behavior

No explicit database transaction commands on SELECT-only queries, specifically source freshness checks and tests.
IMO transactional logic is only needed for a sequence of INSERT / UPDATE statements that together form a transaction. An example is a DELETE plus INSERT operation.

System information

Which database are you using dbt with?

[ ] postgres
[ ] redshift
[ ] bigquery
[X] snowflake
[ ] other (specify: ____________)

The output of dbt --version:
Running DBT version 17.0 on the cloud IDE

bug performance snowflake

bashyroger

All 2 comments

I knew this issue was coming someday :) Thanks for being the one to open it @bashyroger.

dbt knows how to operate on databases that do have transactions (Redshift, Postgres), and databases that don't (BigQuery, Spark). Much trickier are databases that _sometimes_ have transactions (Presto, depending on the connector) or where we only _sometimes_ need to use them, e.g. Snowflake.

So, Snowflake supports a lot of atomic operations (create or replace table, merge), but (depending on one's autocommit parameter) it also requires commit for DML statements to take effect. As you rightly note, the delete+insert incremental strategy on Snowflake needs to happen inside a transaction, lest we succeed in deleting records but fail upon inserting their replacements.
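
To make the delete+insert point concrete, here is a minimal Python sketch. It uses SQLite from the standard library purely as a stand-in so the example runs anywhere; it is not Snowflake and not dbt's code, and the `target` table and `replace_rows` helper are hypothetical names chosen for illustration.

```python
import sqlite3


def replace_rows(conn, ids, new_rows):
    """Delete rows by id, then insert their replacements, as one transaction."""
    conn.execute("BEGIN")
    try:
        conn.executemany("DELETE FROM target WHERE id = ?", [(i,) for i in ids])
        conn.executemany("INSERT INTO target (id, val) VALUES (?, ?)", new_rows)
        conn.execute("COMMIT")
    except sqlite3.Error:
        # Undo the delete if the insert (or anything else) failed.
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the sqlite3 module in autocommit mode,
# so the only transactions are the ones we open explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT NOT NULL)")
conn.execute("INSERT INTO target VALUES (1, 'old')")

# The insert violates NOT NULL, so the whole delete+insert is rolled back.
try:
    replace_rows(conn, [1], [(1, None)])
except sqlite3.IntegrityError:
    pass
rows_after_failure = conn.execute("SELECT id, val FROM target").fetchall()

# A valid replacement commits both steps together.
replace_rows(conn, [1], [(1, "new")])
rows_after_success = conn.execute("SELECT id, val FROM target").fetchall()
```

Without the explicit transaction, the failed insert would leave the table with the old row already deleted, which is exactly the failure mode described above.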

Here's an approach that would only require a handful of code changes:

At the connection level, make Snowflake more like BigQuery, a database that doesn't know anything about transactions: reimplement execute sans add_begin_query / add_commit_query, and make clear_transaction a no-op.
In the specific places where we need to run multiple DML statements, in sequence, within a single transaction (within dbt, I think that may only be the incremental materialization), we explicitly run begin and commit queries.
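
The two steps above can be sketched roughly as follows. This is only an illustration of the idea, not dbt's actual adapter code: `RecordingConnection`, `execute_auto_begin`, `execute_plain`, and `run_in_transaction` are all hypothetical names, and the "connection" simply records the SQL it is handed.

```python
class RecordingConnection:
    """Hypothetical stand-in for a warehouse connection; it only logs SQL."""

    def __init__(self):
        self.log = []

    def execute(self, sql):
        self.log.append(sql)


def execute_auto_begin(conn, sql):
    # Behaviour reported in this issue: every statement gets its own wrapper.
    conn.execute("BEGIN")
    conn.execute(sql)
    conn.execute("COMMIT")


def execute_plain(conn, sql):
    # Proposed default: send the statement as-is, with no transaction wrapper.
    conn.execute(sql)


def run_in_transaction(conn, statements):
    # Proposed opt-in: wrap only sequences that must succeed or fail together.
    conn.execute("BEGIN")
    try:
        for sql in statements:
            conn.execute(sql)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


auto = RecordingConnection()
execute_auto_begin(auto, "select 1")

plain = RecordingConnection()
execute_plain(plain, "select 1")
run_in_transaction(plain, ["delete from t where id = 1", "insert into t values (1)"])
```

In this sketch a lone SELECT costs one round trip instead of three, while the delete+insert pair still runs between an explicit BEGIN and COMMIT.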

This would be a breaking change for anyone who currently uses statement blocks with auto_begin=True to run multiple DML statements within a single transaction. It's not a _huge_ breaking change (they'd just need to add begin; and commit;, before and after), but it's worth calling out that this may affect many custom SQL headers, hooks, and operations.

Before we change this code, I think it'd be worth estimating the actual cost associated with all the begin, commit, and rollback queries explicitly executed by dbt over the course of a standard run. Here's a _really_ unscientific measurement using Snowflake's query history for our internal analytics account. (Unfortunately, it seems to be limited to returning up to 100 queries at a time, so this sample is limited to exactly 15 seconds of me running dbt test with 8 threads.)

select
    min(end_time) as first_query,
    max(end_time) as last_query,
    sum(case when query_text in ('COMMIT', 'BEGIN', 'ROLLBACK') then total_elapsed_time else 0 end) / 1000 as excess_txn_time_s,
    sum(total_elapsed_time) / 1000 as total_query_time_s,
    excess_txn_time_s / total_query_time_s as pct_excess
from table(information_schema.query_history_by_user(
    user_name => 'JERCO',
    end_time_range_start => '2020-09-11 09:31:44.000'::timestampltz,
    end_time_range_end => '2020-09-11 09:32:08.000'::timestampltz
))

FIRST_QUERY | LAST_QUERY | EXCESS_TXN_TIME_S | TOTAL_QUERY_TIME_S | PCT_EXCESS
-- | -- | -- | -- | --
2020-09-11 12:31:53 | 2020-09-11 12:32:07 | 12.45 | 55.05 | 0.23

I managed to run 55 compute-seconds of tests in 15 perceived seconds. Neat! It also looks like ~20% of my overall query time was BEGIN, COMMIT, or ROLLBACK. So over the long run—and please take this with a _major_ grain of salt—there's some opportunity for marginal savings.
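
For anyone who wants to repeat this arithmetic on exported query-history rows, here is a small Python sketch of the same aggregation. The `excess_txn_share` helper and the sample rows are hypothetical (the rows are made up, not real query history); only the two totals at the end come from the table above.

```python
TXN_STATEMENTS = {"BEGIN", "COMMIT", "ROLLBACK"}


def excess_txn_share(rows):
    """Return the fraction of elapsed time spent on BEGIN/COMMIT/ROLLBACK.

    rows: iterable of (query_text, total_elapsed_time_ms) pairs.
    Unlike the exact string match in the SQL, this ignores case and
    surrounding whitespace.
    """
    rows = list(rows)
    excess_ms = sum(ms for text, ms in rows if text.strip().upper() in TXN_STATEMENTS)
    total_ms = sum(ms for _, ms in rows)
    return excess_ms / total_ms if total_ms else 0.0


# Made-up rows for illustration only.
sample = [("BEGIN", 100), ("select count(*) from t", 700), ("ROLLBACK", 200)]
sample_share = excess_txn_share(sample)

# Totals taken from the result table above (seconds).
reported_share = 12.45 / 55.05
```

The reported totals work out to roughly 0.23, matching the PCT_EXCESS column.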

We have some performance work slated for the end of the year. While that's going to focus more on speeding up things like project parsing and startup compilation time, I'll keep this in mind as well.

jtcohen6 on 11 Sep 2020

Hi @jtcohen6 , thanks for this extensive investigation!
As you found out by looking at the query history, the time spent on non-required work is quite significant, with a potential to reduce compute time by 23%.
Depending on how one configures Snowflake, how jobs are scheduled and run, and the ratio of real work to overhead, the savings could be more than marginal.

I just ran a slightly adapted version of your query on our job that checks the freshness of 1015 sources of which (currently) 525 are empty _(making the ratio a percentage and changing the row limit to the max of 10000)_

select
    min(end_time) as first_query,
    max(end_time) as last_query,
    sum(case when query_text in ('COMMIT', 'BEGIN', 'ROLLBACK') then total_elapsed_time else 0 end) / 1000 as excess_txn_time_s,
    sum(total_elapsed_time) / 1000 as total_query_time_s,
    100 * excess_txn_time_s / total_query_time_s as pct_excess
from table(information_schema.query_history_by_user(
   user_name => 'DWH_PROD_TL_RUNNER',
   end_time_range_start => '2020-09-16 13:07:19.000'::timestampltz,
   RESULT_LIMIT => 10000
))

The total is 2800 rows of raw data.

Results are as follows:

EXCESS_TXN_TIME_S: 324
TOTAL_QUERY_TIME_S: 712
PCT_EXCESS: 46

So, quite significant for this workload...

bashyroger on 16 Sep 2020
