Cartodb: Big mismatch in actual vs estimated row count

Created on 13 Mar 2017  路  26Comments  路  Source: CartoDB/cartodb

Context

On the analysis quota warning, consumption estimation shows a huge mismatch between rows and credits.

Related to https://github.com/CartoDB/camshaft/issues/238

Steps to Reproduce

Please break down here below all the needed steps to reproduce the issue

  1. Open a two point layer
  2. Add AOI / Georeference analysis and select settings

Current Result

Quota notification shows a 510 credit consumption estimation

captura de pantalla 2017-03-13 a las 14 15 30

Expected result

2 credit consumption estimation (or at least something close)

.csv file

two_point_table.zip

Additional info

Please add any information of interest here below

Frontend bug

Most helpful comment

All the merit should go to @jgoizueta :)

All 26 comments

I can't reproduce it in production, under my team account, could you give me more details?

This seems to happen only in @noguerol's account. The explain query returns:

{"rows":[{"QUERY PLAN":"Seq Scan on untitled_table_4  (cost=0.00..11.53 rows=510 width=132)"}],"time":0.002,"fields":{"QUERY PLAN":{"type":"string"}},"total_rows":1}

with an isoline with only 1 tract.

any clue @rafatower?

@noguerol what's the username of the account you used?
edit never mind, I managed w/o it.

There's a difference between importing a table and populating an empty table with a couple of rows:

  • When importing,ANALYZE is run as part of the process.
  • When populating an empty table, ANALYZE is not run after inserting the rows. That's why you cannot reproduce the issue by importing a .carto file.

E.g: if you run this:

CREATE TABLE test_row_estimate_stats (my_attr text);
ANALYZE test_row_estimate_stats;
INSERT INTO test_row_estimate_stats (my_attr) VALUES ('foo'), ('bar');
EXPLAIN SELECT * FROM test_row_estimate_stats;

most likely you're going to get an unexpected number of rows:


                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on test_row_estimate_stats  (cost=0.00..11.83 rows=610 width=104)
(1 row)


From the top of my head, I see a few alternatives:

  • tweak configs to trigger autovacuum analyze operations more often. Namely the autovacuum_analyze_threshold. But this can have a negative effect on DB performance.
  • run ANALYZE "manually" upon insertion/deletion of features on a layer.
  • use actual counts instead of stats estimates.
  • assume this as a corner case and live with it.

Solutions two (to run the query after the geometry edition) and three (to make a direct count on an always simple layer) sound good to me.

I'm wondering how bad we could perform running the basic SELECT count(*) FROM whatever.

I ran some benchmarks here

There's something we have to keep in mind. The goal of PG stats is to help the query planner to make decisions.

There are a couple of corner cases in which EXPLAIN can return wrong estimates:

  • ANALYZE not run yet
  • empty table (no matter whether ANALYZE is run or not, it will not keep any actual stats in pg_stats)

This happens because PG tries to make some reasonable assumptions when true stats do not exist and the query planner needs them. Here is a thread about this matter. For the planner there is little difference if a particular table has 0 or 500 rows and many times it cannot accept 0 rows as an input.

A way to workaround several corner cases:

  1. Check if stats exist. If not, run ANALYZE on the table.
  2. Check the number of rows in stats. If estimated_rows < N return an actual count.
  3. Otherwise return estimated_rows.

N can be set to 1000 or 10000 based on the benchmarks. It's a judgment call between accuracy and speed.

@nobuti in our case, are we always requesting the row count estimate of a table or a query?

@rafatower yes, we always do request the estimation

I was not expecting a yes or no :sweat_smile:

Let me explain myself better: do we need to estimate a) tables or b) queries?

Sorry, always queries.

That's bad, cause it makes the problem much harder to solve :confused:

Just to confirm: is this query ultimately taken from the analysis model?

Yes, it comes from here.

For the record, that has some implications:

  • We cannot use actual counts for the general case (otherwise it could run against actual services). That option is definitely ruled out.
  • We'd have to parse the query to determine the affected tables (there's already some code for that)
  • We'd have to deal with the SELECT * FROM fresh_empty_table special case. To be honest I don't know how to do that at all :disappointed:

Regarding the "fresh empty table" stats issue, it can be reproduced like this:

tests=# CREATE TABLE fresh_empty_table (foo text);
CREATE TABLE
tests=# EXPLAIN SELECT * FROM fresh_empty_table;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Seq Scan on fresh_empty_table  (cost=0.00..14.08 rows=1360 width=32)

The root cause of it is in the function estimate_rel_size of file plancat.c (I reproduced it in 9.6 but is also present in 9.5 and in master):

I created a couple of issues:

@rafatower LGTM 馃挴

https://github.com/CartoDB/cartodb-postgresql/issues/295 is done so I guess we can unblock this ticket by starting using CDB_EstimateRowCount(query text) from the FE code for estimations.

Works like a charm! Thanks @rafatower!

All the merit should go to @jgoizueta :)

You both @rafatower and @jgoizueta rulez!

Tested in RUIs and the issue seems to be fixed with last changes. \o/, as @nobuti mentioned.

Deployed!

OOOOH!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ivanmalagon picture ivanmalagon  路  3Comments

atlefren picture atlefren  路  3Comments

javitonino picture javitonino  路  5Comments

noguerol picture noguerol  路  5Comments

makella picture makella  路  3Comments