When running DBT source snapshot-freshness checks and DBT tests, I see in the Snowflake query logs that all these SELECT queries are wrapped in database transaction logic:
All these steps consume unnecessary CPU cycles. As they '_only_' are select statement, wrapping them in transaction logic is unnecessary as far as I am concerned. As we plan to run a lot of source checks and tests, all these CPU cycles _will add up eventually_, costing us money.
Can be reproduced by running DBT source snapshot-freshness or running DBT test and then look at the snowflake query log
Which database are you using dbt with?
The output of dbt --version:
Running DBT version 17.0 on the cloud IDE
I knew this issue was coming someday :) Thanks for being the one to open it @bashyroger.
dbt knows how to operate on databases that do have transactions (Redshift, Postgres), and databases that don't (BigQuery, Spark). Much trickier are databases that _sometimes_ have transactions (Presto, depending on the connector) or where we only _sometimes_ need to use them, e.g. Snowflake.
So, Snowflake supports a lot of atomic operations (create or replace table, merge), but (depending on one's autocommit parameter) it also requires commit for DML statements to take effect. As you rightly note, the delete+insert incremental strategy on Snowflake needs to happen inside a transaction, lest we succeed in deleting records but fail upon inserting their replacements.
Here's an approach that would only require a handful code changes:
execute sans add_begin_query/ add_commit_query, make clear_transaction a noop.begin and commit queries.This would be a breaking change for anyone who currently uses statement blocks with auto_begin=True to run multiple DDL statements within a single transaction. It's not a _huge_ breaking change鈥攖hey'd just need to add begin; and commit;, before and after鈥攂ut worth calling out that this may affect many custom SQL headers, hooks, and operations.
Before we change this code, I think it'd be worth estimating the actual cost associated with all the begin, commit, and rollback queries explicitly executed by dbt over the course of a standard run. Here's a _really_ unscientific measurement using Snowflake's query history for our internal analytics account. (Unfortunately, it seems to be limited to returning up to 100 queries at a time, so this sample is limited to exactly 15 seconds of me running dbt test with 8 threads.)
select
min(end_time) as first_query,
max(end_time) as last_query,
sum(case when query_text in ('COMMIT', 'BEGIN', 'ROLLBACK') then total_elapsed_time else 0 end) / 1000 as excess_txn_time_s,
sum(total_elapsed_time) / 1000 as total_query_time_s,
excess_txn_time_s / total_query_time_s as pct_excess
from table(information_schema.query_history_by_user(
user_name => 'JERCO',
end_time_range_start => '2020-09-11 09:31:44.000'::timestampltz,
end_time_range_end => '2020-09-11 09:32:08.000'::timestampltz
))
FIRST_QUERY | LAST_QUERY | EXCESS_TXN_TIME_S | TOTAL_QUERY_TIME_S | PCT_EXCESS
-- | -- | -- | -- | --
2020-09-11 12:31:53 | 2020-09-11 12:32:07 | 12.45 | 55.05 | 0.23
I managed to run 55 compute-seconds of tests in 15 perceived seconds. Neat! It also looks like ~20% of my overall query time was BEGIN, COMMIT, or ROLLBACK. So over the long run鈥攁nd please take this with a _major_ grain of salt鈥攖here's some opportunity for marginal savings.
We have some performance work slated for the end of the year. While that's going to focus more on speeding up things like project parsing and startup compilation time, I'll keep this in mind as well.
Hi @jtcohen6 , thanks for this extensive investigation!
As you found out by looking at the query history, the time spend on non-required work is quite significant with a potential to reduce compute time with 23%.
Depending on how one configures snowflake, schedules /runs jobs, and the real work VS overhead ratio the savings could be more then marginal.
I just ran a slightly adapted version of your query on our job that checks the freshness of 1015 sources of which (currently) 525 are empty _(making the ratio a percentage and changing the row limit to the max of 10000)_
select
min(end_time) as first_query,
max(end_time) as last_query,
sum(case when query_text in ('COMMIT', 'BEGIN', 'ROLLBACK') then total_elapsed_time else 0 end) / 1000 as excess_txn_time_s,
sum(total_elapsed_time) / 1000 as total_query_time_s,
excess_txn_time_s / total_query_time_s as pct_excess
from table(information_schema.query_history_by_user(
user_name => 'DWH_PROD_TL_RUNNER',
end_time_range_start => '2020-09-16 13:07:19.000'::timestampltz,
RESULT_LIMIT => 10000
))
Total is 2800 rows of raw data,
Results are as follows:
So, quite significant for this workload...