Tidb: Enhance the method `Selectivity`

Created on 6 Sep 2018 · 9Comments · Source: pingcap/tidb

It only return a float number tell you the selectivity of the given filters.
The NDV(number of distinct value) of one column cannot be maintained easily if we don't know this column's distribution after the given filters though we have histogram information.
Return these information from Selectivity seems to be a good choice.

siplanner

Source

winoros

Most helpful comment

ndv

tianjiqx on 6 Sep 2018

👍4

All 9 comments

ndv

tianjiqx on 6 Sep 2018

👍4

Sounds fair, I like this idea.

eurekaka on 6 Sep 2018

Sounds reasonable

zz-jason on 6 Sep 2018

Examining every bucket is a little heavy. But we can have a try in the near future.
And sadly that the bucket in TiDB's histogram doesn't holds the NDV information for now. We only know the global NDV of one Index or Column.
BTW the graph you paste is not complete...

winoros on 6 Sep 2018

@tianjiqx Em, there's problem that when the bucket is not fully in the ranges, we decide the selectivity of this bucket by the uniform distribution assumption.
So $$NDV_i = NDV_i * (1 - (1-sel_i)^{freq_i}) * (1 - (1 - sel_{other})^{freq_i})$$ is not very usable since the sel_i is already calcualted by assumption.

winoros on 26 Sep 2018

@winoros
"the bucket is not fully in the ranges", is not collecting histogram information on the join column？
This formula is mainly intended to deal with data skew.The whole does not satisfy the uniform distribution, but the local satisfaction is satisfied.

tianjiqx on 26 Sep 2018

@tianjiqx
What does is not collecting histogram information on the join column mean?

but the local satisfaction is satisfied.

Also what does this mean?
If there a bucket representing interval [10, 100) and its ndv is 10, row count is 200. The range after filter is [60, +∞).
How do you calculate the selectivity of this bucket? Could this selectivity be reused to calculate the ndv after filter?

winoros on 26 Sep 2018

We assume that the histogram of each bucket obeys uniform distribution.,so c2>60 in [10,100) bucket sel1=0.44. @winoros

tianjiqx on 26 Sep 2018

ndv2
ndv3