Cartodb: Big mismatch in actual vs estimated row count

Created on 13 Mar 2017 · 26 Comments · Source: CartoDB/cartodb

Context

On the analysis quota warning, consumption estimation shows a huge mismatch between rows and credits.

Steps to Reproduce

Open a two-point layer
Add AOI / Georeference analysis and select settings

Current Result

Quota notification shows a 510 credit consumption estimation

screenshot 2017-03-13 at 14 15 30

Expected result

2 credit consumption estimation (or at least something close)

.csv file

two_point_table.zip

Additional info

Frontend bug

Source

noguerol

All 26 comments

I can't reproduce it in production under my team account. Could you give me more details?

nobuti on 13 Mar 2017

try with this .carto:
quota test map (on 2017-03-13 at 18.14.41).zip

noguerol on 13 Mar 2017

This seems to happen only in @noguerol's account. The explain query returns:

{"rows":[{"QUERY PLAN":"Seq Scan on untitled_table_4  (cost=0.00..11.53 rows=510 width=132)"}],"time":0.002,"fields":{"QUERY PLAN":{"type":"string"}},"total_rows":1}

with an isoline with only 1 tract.
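For readers following along, here is a minimal sketch of how the row estimate could be pulled out of a response shaped like the one quoted above. The function name `estimated_rows` and the regular expression are illustrative assumptions, not the code the frontend actually uses.

```python
import json
import re

# Response body exactly as quoted in this thread.
response_body = (
    '{"rows":[{"QUERY PLAN":"Seq Scan on untitled_table_4  '
    '(cost=0.00..11.53 rows=510 width=132)"}],"time":0.002,'
    '"fields":{"QUERY PLAN":{"type":"string"}},"total_rows":1}'
)

# The planner prints its estimate as "rows=<integer>" on each plan line.
ROWS_RE = re.compile(r"\brows=(\d+)")


def estimated_rows(body: str) -> int:
    """Return the row estimate from the first (top-level) plan line.

    Hypothetical helper: assumes the text-format EXPLAIN output is returned
    one plan line per row under the "QUERY PLAN" key, as in the sample above.
    """
    payload = json.loads(body)
    top_line = payload["rows"][0]["QUERY PLAN"]
    match = ROWS_RE.search(top_line)
    if match is None:
        raise ValueError(f"no row estimate in plan line: {top_line!r}")
    return int(match.group(1))


print(estimated_rows(response_body))
```

The number it extracts is the planner's guess, which is why it matches the 510 shown in the quota notification rather than the real row count.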

any clue @rafatower?

nobuti on 13 Mar 2017

@noguerol what's the username of the account you used?
Edit: never mind, I managed without it.

rafatower on 14 Mar 2017

There's a difference between importing a table and populating an empty table with a couple of rows:

When importing, ANALYZE is run as part of the process.
When populating an empty table, ANALYZE is not run after inserting the rows. That's why you cannot reproduce the issue by importing a .carto file.

E.g., if you run this:

CREATE TABLE test_row_estimate_stats (my_attr text);
ANALYZE test_row_estimate_stats;
INSERT INTO test_row_estimate_stats (my_attr) VALUES ('foo'), ('bar');
EXPLAIN SELECT * FROM test_row_estimate_stats;

most likely you're going to get an unexpected number of rows:


                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on test_row_estimate_stats  (cost=0.00..11.83 rows=610 width=104)
(1 row)

Off the top of my head, I see a few alternatives:

tweak configs to trigger autovacuum analyze operations more often, namely autovacuum_analyze_threshold. But this can have a negative effect on DB performance.
run ANALYZE "manually" upon insertion/deletion of features on a layer.
use actual counts instead of stats estimates.
assume this as a corner case and live with it.

rafatower on 14 Mar 2017

Options two (running ANALYZE after geometry edits) and three (doing a direct count on what is always a simple layer) sound good to me.

noguerol on 14 Mar 2017

I wonder how badly a basic SELECT count(*) FROM whatever would perform.

xavijam on 14 Mar 2017

I ran some benchmarks here

rafatower on 14 Mar 2017

There's something we have to keep in mind. The goal of PG stats is to help the query planner to make decisions.

There are a couple of corner cases in which EXPLAIN can return wrong estimates:

ANALYZE not run yet
empty table (no matter whether ANALYZE is run or not, it will not keep any actual stats in pg_stats)

This happens because PG tries to make some reasonable assumptions when true stats do not exist and the query planner needs them. Here is a thread about this matter. For the planner there is little difference if a particular table has 0 or 500 rows and many times it cannot accept 0 rows as an input.

A way to work around several corner cases:

Check if stats exist. If not, run ANALYZE on the table.
Check the number of rows in stats. If estimated_rows < N return an actual count.
Otherwise return estimated_rows.

N can be set to 1000 or 10000 based on the benchmarks. It's a judgment call between accuracy and speed.
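A minimal sketch of the three steps above for the single-table case. The database calls are injected as plain callables, and every name here (`row_count`, `FakeTable`, `threshold`) is hypothetical; it only illustrates the decision logic, not an actual implementation.

```python
from typing import Callable, Optional


def row_count(
    stats_rows: Callable[[], Optional[int]],
    run_analyze: Callable[[], None],
    actual_count: Callable[[], int],
    threshold: int = 1000,
) -> int:
    """Return an estimate for big tables and an exact count for small ones."""
    estimate = stats_rows()
    if estimate is None:
        # Step 1: no stats yet, so gather them and look again.
        run_analyze()
        estimate = stats_rows()
    if estimate is None or estimate < threshold:
        # Step 2: small (or still stat-less, e.g. empty) table: count for real.
        return actual_count()
    # Step 3: large table, the estimate is good enough.
    return estimate


class FakeTable:
    """In-memory stand-in for a table, for demonstration only."""

    def __init__(self, n_rows: int, analyzed: bool = False) -> None:
        self.n_rows = n_rows
        self.analyzed = analyzed
        self.analyze_calls = 0

    def stats_rows(self) -> Optional[int]:
        # Mimics the corner case above: an empty table keeps no usable stats.
        return self.n_rows if self.analyzed and self.n_rows > 0 else None

    def run_analyze(self) -> None:
        self.analyzed = True
        self.analyze_calls += 1

    def actual_count(self) -> int:
        return self.n_rows


two_points = FakeTable(2)
print(row_count(two_points.stats_rows, two_points.run_analyze, two_points.actual_count))
```

With the fake two-row table the function runs the analyze step once and then falls back to the exact count, because 2 is below the threshold.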

@nobuti in our case, are we always requesting the row count estimate of a table or a query?

rafatower on 15 Mar 2017

@rafatower yes, we always do request the estimation

nobuti on 15 Mar 2017

I was not expecting a yes or no :sweat_smile:

Let me explain myself better: do we need to estimate a) tables or b) queries?

rafatower on 15 Mar 2017

Sorry, always queries.

nobuti on 15 Mar 2017

That's bad, cause it makes the problem much harder to solve :confused:

Just to confirm: is this query ultimately taken from the analysis model?

rafatower on 15 Mar 2017

Yes, it comes from here.

nobuti on 15 Mar 2017

For the record, that has some implications:

We cannot use actual counts for the general case (otherwise it could run against actual services). That option is definitely ruled out.
We'd have to parse the query to determine the affected tables (there's already some code for that)
We'd have to deal with the SELECT * FROM fresh_empty_table special case. To be honest I don't know how to do that at all :disappointed:
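As an aside on the second point: one possible alternative to parsing the SQL text is to ask PostgreSQL for the plan with EXPLAIN (FORMAT JSON) and walk the plan tree, collecting the "Relation Name" of each scan node. This is not the existing code mentioned above, just a sketch of a different technique; the helper names and the sample plan are hand-written for illustration, not captured output.

```python
from typing import Any, Dict, Iterator, List, Set

PlanNode = Dict[str, Any]


def iter_plan_nodes(node: PlanNode) -> Iterator[PlanNode]:
    """Yield a plan node and, recursively, all of its children."""
    yield node
    for child in node.get("Plans", []):
        yield from iter_plan_nodes(child)


def scanned_relations(explain_json: List[Dict[str, Any]]) -> Set[str]:
    """Collect the relation names that appear in an EXPLAIN (FORMAT JSON) result.

    The JSON format is a one-element array whose object holds the root
    node under the "Plan" key; scan nodes carry a "Relation Name" entry.
    """
    root = explain_json[0]["Plan"]
    return {
        node["Relation Name"]
        for node in iter_plan_nodes(root)
        if "Relation Name" in node
    }


# Hand-written sample in the shape of EXPLAIN (FORMAT JSON) output.
sample_plan = [
    {
        "Plan": {
            "Node Type": "Hash Join",
            "Plan Rows": 1360,
            "Plans": [
                {
                    "Node Type": "Seq Scan",
                    "Relation Name": "fresh_empty_table",
                    "Plan Rows": 1360,
                },
                {
                    "Node Type": "Hash",
                    "Plan Rows": 2,
                    "Plans": [
                        {
                            "Node Type": "Seq Scan",
                            "Relation Name": "two_point_table",
                            "Plan Rows": 2,
                        }
                    ],
                },
            ],
        }
    }
]

print(sorted(scanned_relations(sample_plan)))
```

A caveat with this approach: tables that are only read inside functions called by the query do not show up in the plan, so it would not cover every case.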

rafatower on 15 Mar 2017

Regarding the "fresh empty table" stats issue, it can be reproduced like this:

tests=# CREATE TABLE fresh_empty_table (foo text);
CREATE TABLE
tests=# EXPLAIN SELECT * FROM fresh_empty_table;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Seq Scan on fresh_empty_table  (cost=0.00..14.08 rows=1360 width=32)

The root cause of it is in the function estimate_rel_size of file plancat.c (I reproduced it in 9.6 but it is also present in 9.5 and in master):

curpages is set to 10 even though the relation occupies zero blocks: https://github.com/postgres/postgres/blob/18dc2aee5f303447bef48dee596a664d90f6939a/src/backend/optimizer/util/plancat.c#L915-L951
it never reaches this "quick exit" code that correctly sets *tuples = 0: https://github.com/postgres/postgres/blob/18dc2aee5f303447bef48dee596a664d90f6939a/src/backend/optimizer/util/plancat.c#L955-L961
later on it estimates the number of tuples based on "page density" and number of pages (curpages) here: https://github.com/postgres/postgres/blob/18dc2aee5f303447bef48dee596a664d90f6939a/src/backend/optimizer/util/plancat.c#L1008
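The arithmetic behind those three steps can be replayed outside the database. The sketch below is one reading of the linked code, assuming a default build (8 kB blocks, 8-byte alignment); the constant names are paraphrased and the function name is made up. The only input is the `width` value printed in the plan line.

```python
BLCKSZ = 8192             # default block size in bytes
PAGE_HEADER = 24          # SizeOfPageHeaderData
HEAP_TUPLE_HEADER = 24    # 23-byte tuple header rounded up to an 8-byte multiple
ITEM_ID = 4               # sizeof(ItemIdData), the per-tuple line pointer
ASSUMED_PAGES = 10        # curpages forced to 10 for a never-analyzed small table


def fresh_table_estimate(data_width: int, pages: int = ASSUMED_PAGES) -> int:
    """Planner's tuple estimate for a table that has no stats yet."""
    tuple_width = data_width + HEAP_TUPLE_HEADER + ITEM_ID
    # Integer division on purpose, mirroring the C code.
    density = (BLCKSZ - PAGE_HEADER) // tuple_width
    return density * pages


# Widths taken from the three plan lines quoted in this thread.
for width in (32, 104, 132):
    print(width, fresh_table_estimate(width))
```

Under these assumptions the widths 32, 104 and 132 give 1360, 610 and 510, which are the three "unexpected" estimates reported earlier in the thread.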

rafatower on 15 Mar 2017

I created a couple of issues:

Create function CDB_EstimateRowCount(query text) #295 @nobuti can you please review the interface of the function to see if that fits properly into the frontend? (I guess it's much better than running the EXPLAIN and parsing the results from frontend but it's pretty important to agree on the interfaces).
[PG] Improve planner estimates for fresh empty tables #3176 This one is much harder to solve IMO and should be evaluated after implementing and integrating the first one. At the end of the day, it is really a corner case.

rafatower on 15 Mar 2017

@rafatower LGTM 💯

nobuti on 15 Mar 2017

Blocking until this is ready: https://github.com/CartoDB/cartodb-postgresql/issues/295

jorgesancha on 29 Mar 2017

https://github.com/CartoDB/cartodb-postgresql/issues/295 is done, so I guess we can unblock this ticket by starting to use CDB_EstimateRowCount(query text) from the FE code for estimations.
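A rough sketch of how a caller might wrap an analysis query into a call to that function. Only the name CDB_EstimateRowCount and its single text argument come from this thread; the helper names, the `row_count` alias, and the assumption that the function can be called unqualified are illustrative. The quoting assumes standard_conforming_strings is on, which is the PostgreSQL default.

```python
def quote_literal(text: str) -> str:
    """Quote a string as an SQL literal by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def estimate_statement(query: str) -> str:
    """Build the statement that asks for the row estimate of `query`.

    Hypothetical helper: the alias and unqualified function call are assumptions.
    """
    return f"SELECT CDB_EstimateRowCount({quote_literal(query)}) AS row_count"


print(estimate_statement("SELECT * FROM two_point_table WHERE name = 'a'"))
```

The point of going through the function is that the caller no longer runs EXPLAIN and parses the plan text itself.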

rafatower on 24 Apr 2017

🎉1

Works like a charm! Thanks @rafatower!

nobuti on 26 Apr 2017

All the merit should go to @jgoizueta :)

rafatower on 26 Apr 2017

🎉3

You both @rafatower and @jgoizueta rulez!

xavijam on 28 Apr 2017

Tested in RUIs and the issue seems to be fixed with the latest changes. \o/, as @nobuti mentioned.

xavijam on 3 May 2017

Deployed!

xavijam on 3 May 2017

OOOOH!

saleiva on 3 May 2017
