Currently we don't have any quantifiable means of measuring stats/cost model quality. It would be great to have one so that we can compare stats/cost calculator changes via A/B testing or by aggregating historical data.
Cumulative cost quality
Let's call the cumulative estimated cost of query q:
EstimatedUserCpu(q), EstimatedMemory(q), EstimatedNetwork(q).
Let's call the actual cost of query q:
CostCpu(q), CostMemory(q), CostNetwork(q).
All of those are random variables with respect to q.
Observations:
- f(EstimatedXXX(q)) = CostXXX(q) + E(q), where E(q) is the estimation error.
- f should be a linear transformation (otherwise the sum of subplan estimates won't be the estimate of the whole plan).
- E(q) should be proportional to CostXXX(q) or EstimatedXXX(q), because we expect larger errors for bigger costs/estimates.

Let's say we gather queries qi and their estimates EstimatedXXX(qi) and costs CostXXX(qi). We then have data points composed of pairs:
<x1=EstimatedXXX(q1), y1=CostXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN=CostXXX(qN)>
for i = 1...N.
Plotted with estimates on the x-axis and actual costs on the y-axis, those points form a scatter whose spread grows with the estimate (figure omitted).
Let's transform those points by dividing each yi by xi. We now get:
<x1=EstimatedXXX(q1), y1'=CostXXX(q1)/EstimatedXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN'=CostXXX(qN)/EstimatedXXX(qN)>
Now those points form a roughly flat band of cost/estimate ratios (figure omitted).
We can now perform regression analysis, as the conditions for OLS regression should be satisfied (errors come from the same distribution and have constant variance).
We can derive quantifiable model properties, e.g. the regression slope, intercept, and goodness of fit. Such metrics allow us to compare the quality of cost estimates between model versions.
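As a rough illustration, here is a minimal Python sketch of that regression step, assuming the (estimate, actual) pairs have already been gathered; the synthetic arrays below are only stand-ins for real query measurements:

```python
import numpy as np
from scipy import stats

# Stand-in for gathered data points: x_i = EstimatedXXX(q_i), y_i = CostXXX(q_i).
rng = np.random.default_rng(0)
estimated = rng.uniform(1e6, 1e9, size=200)           # hypothetical estimates
actual = estimated * rng.normal(1.3, 0.2, size=200)   # hypothetical measured costs

# Transform: y_i' = CostXXX(q_i) / EstimatedXXX(q_i); after this the error should
# have roughly constant variance, making the OLS assumptions plausible.
ratio = actual / estimated

# OLS of the ratio against the estimate: the intercept approximates the average
# over/underestimation factor, the slope shows how that factor drifts with query
# size, and r^2 measures goodness of fit.
slope, intercept, r_value, p_value, std_err = stats.linregress(estimated, ratio)
print(f"intercept={intercept:.3f} slope={slope:.3g} r2={r_value ** 2:.3f} stderr={std_err:.3g}")
```

An intercept near 1 with a near-zero slope would indicate well-calibrated estimates; deviations quantify systematic bias.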
Note: I think there will be different error distributions depending on how complex the plans are. For instance, sophisticated plans with many plan nodes will accumulate error. To account for that, we could derive plan complexity (e.g. number of nodes, depth) as an additional input parameter to the model and perform the analysis for each complexity level.
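If plan complexity is recorded alongside each data point, the bucketing could be as simple as in this sketch (the node counts and the complexity threshold are assumptions):

```python
import numpy as np

# Hypothetical records: (number of plan nodes, EstimatedXXX, CostXXX).
records = [(3, 1e6, 1.2e6), (4, 2e6, 2.1e6), (25, 5e7, 9e7), (30, 8e7, 1.6e8)]

buckets = {}
for nodes, estimated, actual in records:
    bucket = "simple" if nodes <= 10 else "complex"  # assumed complexity threshold
    buckets.setdefault(bucket, []).append(actual / estimated)

# Separate ratio statistics (and, further on, separate regressions) per complexity level.
for bucket, ratios in buckets.items():
    print(bucket, np.mean(ratios), np.std(ratios))
```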
Individual operator cost quality
Let's call the estimated cost of operator o:
OperatorEstimatedUserCpu(o), OperatorEstimatedMemory(o), OperatorEstimatedNetwork(o).
Let's call the actual cost of operator o:
OperatorCostCpu(o), OperatorCostMemory(o), OperatorCostNetwork(o).
Let's call the estimated input data size of operator o:
EstimatedInputSize(o).
Let's call the measured input data size of operator o:
MeasuredInputSize(o).
All of those are random variables with respect to o.
Observations:
Both OperatorEstimatedXXX(o) and OperatorCostXXX(o) depend on the operator's input data size. Because of that, let's normalize them:
NormalizedOperatorEstimatedXXX(o) = OperatorEstimatedXXX(o)/EstimatedInputSize(o)
NormalizedOperatorCostXXX(o) = OperatorCostXXX(o)/MeasuredInputSize(o)
Intuitively, these tell how much of a resource (e.g. memory) an operator needs per input byte. We can now define the operator estimation error as:
OperatorEstimationErrorXXX(o) = NormalizedOperatorEstimatedXXX(o) - NormalizedOperatorCostXXX(o)
Its distribution will look like a histogram centered around the operator's typical bias (figure omitted).
We can now:
- Compute the average OperatorEstimationErrorXXX, e.g. how much (per input byte) an operator is under-/overestimating cost.
- If OperatorEstimationErrorXXX has a normal distribution, make more sophisticated conclusions (e.g. if we added a constant to OperatorEstimatedXXX, then for 90% of queries we would overestimate memory usage by this much).

Note that we should perform such analysis for each operator type, as they have different profiles.
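As a sketch (not existing Presto tooling), the per-operator analysis could look like this; the tuple layout and the sample numbers are assumptions for illustration:

```python
from collections import defaultdict
import numpy as np

# Hypothetical data points: (operator type, estimated memory, estimated input bytes,
#                            measured memory, measured input bytes)
sample_points = [
    ("HashAggregationOperator", 4.0e8, 1.0e9, 6.0e8, 1.2e9),
    ("HashAggregationOperator", 2.0e8, 5.0e8, 1.5e8, 4.0e8),
    ("HashBuilderOperator", 8.0e8, 2.0e9, 9.0e8, 2.1e9),
]

errors_by_operator = defaultdict(list)
for op_type, est_mem, est_input, cost_mem, measured_input in sample_points:
    normalized_estimate = est_mem / est_input    # NormalizedOperatorEstimatedMemory(o)
    normalized_cost = cost_mem / measured_input  # NormalizedOperatorCostMemory(o)
    errors_by_operator[op_type].append(normalized_estimate - normalized_cost)

for op_type, errs in errors_by_operator.items():
    errs = np.array(errs)
    # Mean error: how much the operator under-/overestimates memory per input byte.
    # The 90th percentile of -error approximates the constant that would have to be
    # added to the normalized estimate to overestimate memory in ~90% of observations.
    print(op_type, errs.mean(), np.percentile(-errs, 90))
```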
Stats quality
A similar approach can be applied to plan nodes that filter data (e.g. actual filtering factor vs. estimated filtering factor). This way we can evaluate the quality of FilterStatsCalculator and of rules that perform predicate estimation.
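A minimal sketch of that comparison with hypothetical row counts (none of these names come from an existing Presto API):

```python
def filtering_factor_error(estimated_output_rows, estimated_input_rows,
                           actual_output_rows, actual_input_rows):
    """Difference between the estimated and the observed filtering factor."""
    estimated_factor = estimated_output_rows / estimated_input_rows
    actual_factor = actual_output_rows / actual_input_rows
    return estimated_factor - actual_factor

# Example: FilterStatsCalculator predicted 10% of rows passing the predicate,
# but 25% actually did, so the filtering factor was underestimated by 0.15.
print(filtering_factor_error(1_000, 10_000, 2_500, 10_000))  # -> -0.15
```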
Adjusting model
The initial goal is to provide quality metrics for stats/cost model so that we could test and quantify model changes.
However, having gathered historical query and model data, we could introduce additional variables to our stats/cost computations that we would adjust to achieve desired model properties.
For instance, having such variables we could adjust the model for different hardware, data, and workloads. Specifically, we could adjust the model for each client so that the CBO produces better plans.
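A minimal sketch of what such an adjustment could look like: fit a single hypothetical calibration coefficient k so that k * EstimatedCpu(q) best matches CostCpu(q) over historical queries (least squares through the origin); per-hardware or per-client coefficients would be fitted the same way on the corresponding slice of data. The numbers below are made up.

```python
import numpy as np

def fit_calibration(estimated, actual):
    """Least-squares slope for actual ~ k * estimated (no intercept)."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(estimated @ actual) / float(estimated @ estimated)

# Hypothetical historical (EstimatedCpu, CostCpu) pairs for one client/workload.
k = fit_calibration([1e6, 5e6, 2e7], [1.4e6, 6.5e6, 3.1e7])
print(k)  # multiply future CPU estimates by this factor for this workload
```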
Plan Of Attack
The steps are as follows:
- Expose cost estimates and measured costs in query (v1/query) JSON so that they can be gathered and analyzed. This would be used to estimate the model on a predefined set of queries (e.g. TPCH/TPCDS) during benchmarks. Note that operators don't match 1-1 to plan nodes; for instance, to obtain HashAggregationOperator stats we would need to get stats for the join build side.
- Expose the same data via EventListener so that we can gather and store historical data points.

CC: @findepi @rschlussel @mbasmanova @arhimondr @martint
FYI: @kbajda
In the future would we want to perform this analysis on each phase of the query, since some phases are much more resource intensive than others? Specifically, having an understanding of resource use over the life of the query will be important for bin packing a cluster.
In the future would we want to perform this analysis on each phase of the query, since some phases are much more resource intensive than others
This is more about the quality of the metrics and adjusting them against measured data.
For instance, peak memory should be derived as in @findepi's https://github.com/prestodb/presto/pull/11591, which accounts for operator timelines.
We could imagine more metrics that account for stage timelines, etc.
@sopel39 Karol, I like this proposal and have some follow-up questions.
CostCpu(q), CostMemory(q), CostNetwork(q)
Currently, we report total CPU for a query, which could be defined as CostCpu, but we don't have anything for memory or network. Do you have any ideas about how we could measure/report these?
I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.
CC: @rschlussel @arhimondr
Currently, we report total CPU for a query, which could be defined as CostCpu, but we don't have anything for memory or network.
PlanNodeCostEstimate contains both memory and networkCost.
I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.
Hopefully yes. There are more operators in the final plan than just those present at the join reordering phase (where the CBO is mostly used). For instance, partial aggregation stats are hard to estimate, see https://github.com/prestodb/presto/pull/11595.
However, even without a known cost for the entire plan, we should still be able to obtain the cumulative cost of plan subtrees and the individual cost of operators.
@sopel39
PlanNodeCostEstimate contains both memory and networkCost.
These are the estimates, but I'm looking for actual costs. Do we have these?
These are the estimates, but I'm looking for actual costs. Do we have these?
I think so. We have measured exchange input (for network) and operator peak memory (we can compute actual subplan peak memory from that).
and operator peak memory (we can compute actual subplan peak memory from that)
we cannot.
QueryStats contains observed peak memory for a query, so the query-level and operator-level comparison is covered.
@sopel39 @findepi Are we saying that memory cost equals peak memory? What about network cost? I don't believe we measure it today, do we?
@mbasmanova actual network data size can be obtained via: StageStats#getOutputPositions (PlanPrinter uses those metrics).
Are we saying that memory cost equals peak memory?
As @findepi mentioned, QueryStats contains peak memory for a query, and with https://github.com/prestodb/presto/pull/11591 we want to make the memory cost estimate an estimate of subplan peak memory that accounts for operator execution timelines.
@sopel39 Karol, thanks for explaining. One more question. You are saying that #11591 makes memory cost = peak memory. What is memory cost today?
What is memory cost today?
@mbasmanova Currently the memory cost estimate is the sum of peak memory estimates for all operators. That doesn't account for operator execution timelines. I think the discussion under https://github.com/prestodb/presto/pull/11495 is a great summary of the issue.
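A toy illustration of the difference (made-up numbers, assuming the two operators never hold their peak memory at the same time):

```python
# Per-operator peak memory estimates, in bytes (hypothetical).
operator_peaks = {"HashBuilderOperator": 10 * 2**30, "HashAggregationOperator": 8 * 2**30}

sum_of_peaks = sum(operator_peaks.values())    # 18 GiB: current-style estimate
timeline_aware = max(operator_peaks.values())  # 10 GiB: if the peaks never overlap

print(sum_of_peaks / 2**30, timeline_aware / 2**30)
```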