Dbt: Add data processed info into dbt run logs for all statement types

Created on 10 Jun 2020 · 16Comments · Source: fishtown-analytics/dbt

Describe the feature

When using BigQuery dbt run logs should show the data scan value for create table statement and not just for script.

Describe alternatives you've considered

No alternative

Additional context

This specific improvement is about BigQuery. However, this might apply for other database too I do not have experience to give specific examples.

Who will this benefit?

Everyone using dbt to get easier access to usage.

bigquery enhancement good first issue

Source

alepuccetti

Most helpful comment

Huh, I think that this is a string templating issue, and I think Jinja could be a good tool to solve that kind of problem :)

I think we should keep the focus narrow for this particular PR, but in the future, I would like to:

Provide a config option that accepts a templated string for the model status
Provide a context that has variables in scope (like job, materialization, status, runtime, etc)

What do y'all think?

drewbanin on 22 Jun 2020

👍3

All 16 comments

Very cool idea!

The code for printing out bytes billed is fairly straightforward:
https://github.com/fishtown-analytics/dbt/blob/c9b34681b756ccd4b5c4f0a3bd8f12f48f23081d/plugins/bigquery/dbt/adapters/bigquery/connections.py#L250-L252

I think it would just be a matter of reporting the total_bytes_processed for other query_job statement types.

@alepuccetti Is this something you'd be interested in working on? I think this could be a great opportunity for a first contribution :)

jtcohen6 on 10 Jun 2020

@jtcohen6 Yes, I would be interested. You can assigned to me if this is the procedure.

alepuccetti on 10 Jun 2020

It's all yours! Happy to help if you want more direction or get stuck

jtcohen6 on 10 Jun 2020

Hi Guys,

While the bytes processed is an interesting KPI for those bigquery customers that are on an on-demand pricing, the average slot usage of a query is interesting for those customers on a slot commitment.

It is very similar to calculate from the query job object you get when launching a query: you have to calculate the duration of the query in milliseconds using query_job.ended - query_job.started and then divide query_job.slot_millis by that value.

Could we add this info in the same PR, or do we create a new issue for this?

bodschut on 22 Jun 2020

@bodschut Thanks for the suggestion! We're packing a lot of info into limited horizontal space right now. Your point has crystallized for me that we should give the user more control.

Here's what makes sense to me: a set of additional configuration parameters in profiles.yml to control what dbt includes in its info output.

config:
  bytes_processed: true | false
  slot_duration: true | false

@drewbanin @alepuccetti What do you think here? I know this throws one more wrench into #2530, though I feel a lot better knowing that we'd be:

supporting both BQ pricing mechanisms
displaying a less overwhelming amount of information by default

jtcohen6 on 22 Jun 2020

@jtcohen6 Having the user configure it is a good idea. I second that. I am up for implementing the configuration parameters, might take a while, I am doing this in my spare time.

About the average slots usage. I had a discussion with google engineers on how to measure slots utilisation to estimate if it is worth it moving to flat-rate pricing and how to evaluate the performance implications. The answer was that there is no formula but only try your workload and look how it performs.
Because of this I am not not convinced about the metric as it is proposed if it actually serve a purpose of measuring the slots utilisation.
I can think of an example: if you have multiple queries in your model that can run in parallel is very difficult to understand what is happening because query running at the same time share the slots but the metrics as it is prosed will not account for this concurrency.

I am in favour having a slots metric but I think it should be discussed more of what it should be and how to interpret it in a separate issue. For instance, I think that even just slot_millis is a good metric.

alepuccetti on 22 Jun 2020

Hi @alepuccetti

You are right to say that the slot usage is not the only parameter to consider a move to a slot commitment. But, it's an interesting indicator when you are on a slot commitment and need to optimize your slot reservations for certain projects. At that point, knowing which queries consume how many slots on average can be interesting to optimize or to decide on changing a certain slot reservation or even upgrading your commitment with another package of 500 slots.

You could indeed include the slot millis like that, without calculating the query duration, but the average slot usage as calculated above is the best indicator for the 'slot load' of the query. I don't know if the model duration that dbt logs by default is the actual query duration or whether it includes the setup time. If it is the actual query duration, you could indeed calculate the average slots used yourself.

bodschut on 22 Jun 2020

Huh, I think that this is a string templating issue, and I think Jinja could be a good tool to solve that kind of problem :)

I think we should keep the focus narrow for this particular PR, but in the future, I would like to:

Provide a config option that accepts a templated string for the model status
Provide a context that has variables in scope (like job, materialization, status, runtime, etc)

What do y'all think?

drewbanin on 22 Jun 2020

👍3

@drewbanin This is a great idea. I am not sure I will be able to implement something like that with my limited knowledge of dbt internals.

Huh, I think that this is a string templating issue, and I think Jinja could be a good tool to solve that kind of problem :)

I think we should keep the focus narrow for this particular PR, but in the future, I would like to:

Provide a config option that accepts a templated string for the model status

Provide a context that has variables in scope (like job, materialization, status, runtime, etc)

What do y'all think?

@drewbanin This is a great idea.

alepuccetti on 22 Jun 2020

@bodschut I like the idea of having a slot usage metrics. I just think it needs more discussion. Because it is not straightforward like number of rows or byte processed.

alepuccetti on 22 Jun 2020

Agreed, let's move it to another discussion. But with the suggestion of @drewbanin it could all be included: if you have a job object in your variable scope, you can output the slot usage using macro's for example.

bodschut on 22 Jun 2020

I opened a new issue, let's keep the discussion going over there: https://github.com/fishtown-analytics/dbt/issues/2580

This issue will (rightly) be closed when we merge #2530.

jtcohen6 on 22 Jun 2020

Hi there,
Thanks for adding this.
I was wondering whether total_bytes_billed can be added as well?
I'm assuming that the data comes from _INFORMATION_SCHEMA.JOBS_BY_PROJECT_ and total_bytes_billed is now available too.

Thanks again,
Meurant

meurant-naude on 11 Sep 2020

@meurant-naude I opened a new issue (#2747) to add the bytes_processed to other dbt commands, maybe better to take this question there.

bodschut on 11 Sep 2020

@meurant-naude I opened a new issue (#2747) to add the bytes_processed to other dbt commands, maybe better to take this question there.

Agree, I will answer there.

alepuccetti on 11 Sep 2020

Thanks all. Much appreciated

meurant-naude on 13 Sep 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

minimize downtime for incremental models with full-refresh

drewbanin · 4Comments

Relation doesn't exist for dbt test after `dbt run -m state:modified` if the test was modified but the model wasn't

joellabes · 3Comments

support persist_docs on all plugins

drewbanin · 3Comments

KeyError while using deferred flag and model filters

nave91 · 3Comments

ref'ing an ephemeral model inside of a statement does not work

drewbanin · 3Comments