Currently we don't have any quantifiable means of measuring stats/cost model quality. It would be great to have one so that we can compare stats/cost calculator changes via A/B testing or by aggregating historical data.
Cumulative cost quality
Let's call the cumulative estimated cost of query q:
EstimatedUserCpu(q), EstimatedMemory(q), EstimatedNetwork(q).
Let's call the actual cost of query q:
CostCpu(q), CostMemory(q), CostNetwork(q).
All of those are random variables with respect to q.
Observations:
- f(EstimatedXXX(q)) = CostXXX(q) + E(q), where E(q) is the estimation error.
- f should be a linear transformation (otherwise the sum of subplan estimates won't be the estimate of the whole plan).
- E(q) should be proportional to CostXXX(q) or EstimatedXXX(q), because we expect larger errors for bigger costs/estimates.

Let's say we gather queries qi and their estimates EstimatedXXX(qi) and costs CostXXX(qi). We then have data points composed of pairs:
<x1=EstimatedXXX(q1), y1=CostXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN=CostXXX(qN)>
for i = 1...N.
Plotted with estimates on the x-axis and actual costs on the y-axis, those points form a scatter whose spread grows with the estimate (figure omitted).
Let's transform those points by dividing each yi by xi. We now get:
<x1=EstimatedXXX(q1), y1'=CostXXX(q1)/EstimatedXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN'=CostXXX(qN)/EstimatedXXX(qN)>
Now those points form a roughly flat band of cost/estimate ratios (figure omitted).
We can now perform regression analysis, as the conditions for OLS regression should be satisfied (errors come from the same distribution and have constant variance).
We can derive quantifiable model properties, e.g. the regression slope, intercept, and goodness of fit. Such metrics allow us to compare the quality of cost estimates between model versions.
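As a rough illustration, here is a minimal Python sketch of that regression step, assuming the (estimate, actual) pairs have already been gathered; the synthetic arrays below are only stand-ins for real query measurements:

```python
import numpy as np
from scipy import stats

# Stand-in for gathered data points: x_i = EstimatedXXX(q_i), y_i = CostXXX(q_i).
rng = np.random.default_rng(0)
estimated = rng.uniform(1e6, 1e9, size=200)           # hypothetical estimates
actual = estimated * rng.normal(1.3, 0.2, size=200)   # hypothetical measured costs

# Transform: y_i' = CostXXX(q_i) / EstimatedXXX(q_i); after this the error should
# have roughly constant variance, making the OLS assumptions plausible.
ratio = actual / estimated

# OLS of the ratio against the estimate: the intercept approximates the average
# over/underestimation factor, the slope shows how that factor drifts with query
# size, and r^2 measures goodness of fit.
slope, intercept, r_value, p_value, std_err = stats.linregress(estimated, ratio)
print(f"intercept={intercept:.3f} slope={slope:.3g} r2={r_value ** 2:.3f} stderr={std_err:.3g}")
```

An intercept near 1 with a near-zero slope would indicate well-calibrated estimates; deviations quantify systematic bias.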
Note: I think there will be different error distributions depending on how complex the plans are. For instance, sophisticated plans with many plan nodes will accumulate error. To account for that, we could derive plan complexity (e.g. number of nodes, depth) as an additional input parameter to the model and perform the analysis for each complexity level.
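If plan complexity is recorded alongside each data point, the bucketing could be as simple as in this sketch (the node counts and the complexity threshold are assumptions):

```python
import numpy as np

# Hypothetical records: (number of plan nodes, EstimatedXXX, CostXXX).
records = [(3, 1e6, 1.2e6), (4, 2e6, 2.1e6), (25, 5e7, 9e7), (30, 8e7, 1.6e8)]

buckets = {}
for nodes, estimated, actual in records:
    bucket = "simple" if nodes <= 10 else "complex"  # assumed complexity threshold
    buckets.setdefault(bucket, []).append(actual / estimated)

# Separate ratio statistics (and, further on, separate regressions) per complexity level.
for bucket, ratios in buckets.items():
    print(bucket, np.mean(ratios), np.std(ratios))
```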
Individual operator cost quality
Let's call the estimated cost of operator o:
OperatorEstimatedUserCpu(o), OperatorEstimatedMemory(o), OperatorEstimatedNetwork(o).
Let's call the actual cost of operator o:
OperatorCostCpu(o), OperatorCostMemory(o), OperatorCostNetwork(o).
Let's call the estimated input data size of operator o:
EstimatedInputSize(o).
Let's call the measured input data size of operator o:
MeasuredInputSize(o).
All of those are random variables with respect to o.
Observations:
Both OperatorEstimatedXXX(o) and OperatorCostXXX(o) depend on the operator's input data size. Because of that, let's normalize them:
NormalizedOperatorEstimatedXXX(o) = OperatorEstimatedXXX(o)/EstimatedInputSize(o)
NormalizedOperatorCostXXX(o) = OperatorCostXXX(o)/MeasuredInputSize(o)
Intuitively, these tell how much of a resource (e.g. memory) an operator needs per input byte. We can now define the operator estimation error as:
OperatorEstimationErrorXXX(o) = NormalizedOperatorEstimatedXXX(o) - NormalizedOperatorCostXXX(o)
Its distribution will look like a histogram centered around the operator's typical bias (figure omitted).
We can now:
- Compute the average OperatorEstimationErrorXXX, e.g. how much (per input byte) an operator is under-/overestimating cost.
- If OperatorEstimationErrorXXX has a normal distribution, make more sophisticated conclusions (e.g. if we added a constant to OperatorEstimatedXXX, then for 90% of queries we would overestimate memory usage by this much).

Note that we should perform such analysis for each operator type, as they have different profiles.
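As a sketch (not existing Presto tooling), the per-operator analysis could look like this; the tuple layout and the sample numbers are assumptions for illustration:

```python
from collections import defaultdict
import numpy as np

# Hypothetical data points: (operator type, estimated memory, estimated input bytes,
#                            measured memory, measured input bytes)
sample_points = [
    ("HashAggregationOperator", 4.0e8, 1.0e9, 6.0e8, 1.2e9),
    ("HashAggregationOperator", 2.0e8, 5.0e8, 1.5e8, 4.0e8),
    ("HashBuilderOperator", 8.0e8, 2.0e9, 9.0e8, 2.1e9),
]

errors_by_operator = defaultdict(list)
for op_type, est_mem, est_input, cost_mem, measured_input in sample_points:
    normalized_estimate = est_mem / est_input    # NormalizedOperatorEstimatedMemory(o)
    normalized_cost = cost_mem / measured_input  # NormalizedOperatorCostMemory(o)
    errors_by_operator[op_type].append(normalized_estimate - normalized_cost)

for op_type, errs in errors_by_operator.items():
    errs = np.array(errs)
    # Mean error: how much the operator under-/overestimates memory per input byte.
    # The 90th percentile of -error approximates the constant that would have to be
    # added to the normalized estimate to overestimate memory in ~90% of observations.
    print(op_type, errs.mean(), np.percentile(-errs, 90))
```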
Stats quality
A similar approach can be applied to plan nodes that filter data (e.g. actual filtering factor vs. estimated filtering factor). This way we can evaluate the quality of FilterStatsCalculator and of rules that perform predicate estimation.
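A minimal sketch of that comparison with hypothetical row counts (none of these names come from an existing Presto API):

```python
def filtering_factor_error(estimated_output_rows, estimated_input_rows,
                           actual_output_rows, actual_input_rows):
    """Difference between the estimated and the observed filtering factor."""
    estimated_factor = estimated_output_rows / estimated_input_rows
    actual_factor = actual_output_rows / actual_input_rows
    return estimated_factor - actual_factor

# Example: FilterStatsCalculator predicted 10% of rows passing the predicate,
# but 25% actually did, so the filtering factor was underestimated by 0.15.
print(filtering_factor_error(1_000, 10_000, 2_500, 10_000))  # -> -0.15
```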
Adjusting model
The initial goal is to provide quality metrics for stats/cost model so that we could test and quantify model changes.
However, having gathered historical query and model data, we could introduce additional variables to our stats/cost computations that we would adjust to achieve desired model properties.
For instance, having such variables we could adjust the model for different hardware, data, and workloads. Specifically, we could adjust the model for each client so that the CBO produces better plans.
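A minimal sketch of what such an adjustment could look like: fit a single hypothetical calibration coefficient k so that k * EstimatedCpu(q) best matches CostCpu(q) over historical queries (least squares through the origin); per-hardware or per-client coefficients would be fitted the same way on the corresponding slice of data. The numbers below are made up.

```python
import numpy as np

def fit_calibration(estimated, actual):
    """Least-squares slope for actual ~ k * estimated (no intercept)."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(estimated @ actual) / float(estimated @ estimated)

# Hypothetical historical (EstimatedCpu, CostCpu) pairs for one client/workload.
k = fit_calibration([1e6, 5e6, 2e7], [1.4e6, 6.5e6, 3.1e7])
print(k)  # multiply future CPU estimates by this factor for this workload
```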
Plan Of Attack
The steps are as follows:
- Expose cost estimates and measured costs in query (v1/query) JSON so that they can be gathered and analyzed. This would be used to estimate the model on a predefined set of queries (e.g. TPCH/TPCDS) during benchmarks. Note that operators don't match 1-1 to plan nodes; for instance, to obtain HashAggregationOperator stats we would need to get stats for the join build side.
- Expose the same data via EventListener so that we can gather and store historical data points.

CC: @findepi @rschlussel @mbasmanova @arhimondr @martint
FYI: @kbajda
In the future would we want to perform this analysis on each phase of the query, since some phases are much more resource intensive than others? Specifically, having an understanding of resource use over the life of the query will be important for bin packing a cluster.
In the future would we want to perform this analysis on each phase of the query, since some phases are much more resource intensive than others
This is more about the quality of the metrics and adjusting them against measured data.
For instance, peak memory should be derived as in @findepi's https://github.com/prestodb/presto/pull/11591, which accounts for operator timelines.
We could imagine more metrics that account for stage timelines, etc.
@sopel39 Karol, I like this proposal and have some follow-up questions.
CostCpu(q), CostMemory(q), CostNetwork(q)
Currently, we report total CPU for a query, which could be defined as CostCpu, but we don't have anything for memory or network. Do you have any ideas about how we could measure/report these?
I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.
CC: @rschlussel @arhimondr
Currently, we report total CPU for a query, which could be defined as CostCpu, but we don't have anything for memory or network.
PlanNodeCostEstimate contains both memory and networkCost.
I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.
Hopefully yes. There are more operators in the final plan than just those present at the join reordering phase (where the CBO is mostly used). For instance, partial aggregation stats are hard to estimate, see https://github.com/prestodb/presto/pull/11595.
However, even without a known cost for the entire plan, we should still be able to obtain the cumulative cost of plan subtrees and the individual cost of operators.
@sopel39
PlanNodeCostEstimate contains both memory and networkCost.
These are the estimates, but I'm looking for actual costs. Do we have these?
These are the estimates, but I'm looking for actual costs. Do we have these?
I think so. We have measured exchange input (for network) and operator peak memory (we can compute actual subplan peak memory from that).
and operator peak memory (we can compute actual subplan peak memory from that)
we cannot.
QueryStats contains observed peak memory for a query, so the query-level and operator-level comparison is covered.
@sopel39 @findepi Are we saying that memory cost equals peak memory? What about network cost? I don't believe we measure it today, do we?
@mbasmanova actual network data size can be obtained via: StageStats#getOutputPositions (PlanPrinter uses those metrics).
Are we saying that memory cost equals peak memory?
As @findepi mentioned, QueryStats contains peak memory for a query, and with https://github.com/prestodb/presto/pull/11591 we want to make the memory cost estimate an estimate of subplan peak memory that accounts for operator execution timelines.
@sopel39 Karol, thanks for explaining. One more question. You are saying that #11591 makes memory cost = peak memory. What is memory cost today?
What is memory cost today?
@mbasmanova Currently the memory cost estimate is the sum of peak memory estimates for all operators. That doesn't account for operator execution timelines. I think the discussion under https://github.com/prestodb/presto/pull/11495 is a great summary of the issue.
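A toy illustration of the difference (made-up numbers, assuming the two operators never hold their peak memory at the same time):

```python
# Per-operator peak memory estimates, in bytes (hypothetical).
operator_peaks = {"HashBuilderOperator": 10 * 2**30, "HashAggregationOperator": 8 * 2**30}

sum_of_peaks = sum(operator_peaks.values())    # 18 GiB: current-style estimate
timeline_aware = max(operator_peaks.values())  # 10 GiB: if the peaks never overlap

print(sum_of_peaks / 2**30, timeline_aware / 2**30)
```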