Tidb: Enhance the method `Selectivity`

Created on 6 Sep 2018  Â·  9Comments  Â·  Source: pingcap/tidb

It only return a float number tell you the selectivity of the given filters.
The NDV(number of distinct value) of one column cannot be maintained easily if we don't know this column's distribution after the given filters though we have histogram information.
Return these information from Selectivity seems to be a good choice.

siplanner

Most helpful comment

ndv

All 9 comments

ndv

Sounds fair, I like this idea.

Sounds reasonable

Examining every bucket is a little heavy. But we can have a try in the near future.
And sadly that the bucket in TiDB's histogram doesn't holds the NDV information for now. We only know the global NDV of one Index or Column.
BTW the graph you paste is not complete...

@tianjiqx Em, there's problem that when the bucket is not fully in the ranges, we decide the selectivity of this bucket by the uniform distribution assumption.
So $$NDV_i = NDV_i * (1 - (1-sel_i)^{freq_i}) * (1 - (1 - sel_{other})^{freq_i})$$ is not very usable since the sel_i is already calcualted by assumption.

@winoros
"the bucket is not fully in the ranges", is not collecting histogram information on the join column?
This formula is mainly intended to deal with data skew.The whole does not satisfy the uniform distribution, but the local satisfaction is satisfied.

@tianjiqx
What does is not collecting histogram information on the join column mean?

but the local satisfaction is satisfied.

Also what does this mean?
If there a bucket representing interval [10, 100) and its ndv is 10, row count is 200. The range after filter is [60, +∞).
How do you calculate the selectivity of this bucket? Could this selectivity be reused to calculate the ndv after filter?

We assume that the histogram of each bucket obeys uniform distribution.,so c2>60 in [10,100) bucket sel1=0.44. @winoros

ndv2
ndv3

Was this page helpful?
0 / 5 - 0 ratings