Cartodb: Running two consecutive COUNT aggregations breaks the analysis

Created on 27 Oct 2016 · 16Comments · Source: CartoDB/cartodb

Example: Running Points to Polygon analysis (Which can at the moment only be aggregated by count), then intersecting that to obtain a count from a points dataset breaks the analysis & viz. Deleting broken analysis and refreshing page breaks builder.

Error: column "count_vals" specified more than once

double_count_bug

bug

Source

namessanti

Most helpful comment

@saleiva @xavijam In my opinion, this is a key workflow that should be solved. For instance, I have Madrid neighborhoods and a set of points layers such as hospitals, green areas, libraries and so on. I could intersect one by one with the neighborhoods layer, and then use the autostyle to show the variability both geographically and thematic.

ramiroaznar on 17 Feb 2017

👍2

All 16 comments

cc/ @javisantana

saleiva on 28 Oct 2016

Another example

1.3 - ATM Network Performance (on 2016-10-13 at 23.18.19).carto.zip

saleiva on 31 Oct 2016

ramiroaznar on 17 Feb 2017

👍2

Added next so we can attack this sooner

saleiva on 17 Feb 2017

+1, BCG also reporting this behavior.

csobier on 22 Aug 2017

👍1

I'm looking into this.

The problem is that both the aggregate-intersection/"Insersect second layer" and group-points/"Group points into polygons" (convex-hull, boundix-box, concave-hullnodes) analysis add columns named count_vals when agregating by count.

So, in our case, which chains two such analyses, camshaft tries to create an analysis cache table like this one (with two count_vals columns):

CREATE TABLE analysis_b194a8f896_dcb3c5cac4fbf0062380d58a0b3497ab0d531320 AS
SELECT * FROM (
  SELECT
    _cdb_analysis_source.cartodb_id,
    _cdb_analysis_source.category,
    _cdb_analysis_source.the_geom,
    _cdb_analysis_source.count_vals,                      -- << first count_vals column
    count(_cdb_analysis_target.cartodb_id) as count_vals, -- << second count_vals column
    count(_cdb_analysis_target.cartodb_id) / GREATEST(0.0000026, ST_Area((ST_Transform(_cdb_analysis_source.the_geom, 4326))::geography)) as count_vals_density
  FROM (
    SELECT
      row_number() over() as cartodb_id,
      distrito as category,
      ST_ConvexHull(ST_Collect(the_geom)) AS the_geom,
      count(1) as count_vals
    FROM (
      SELECT * FROM points_table
    ) _analysis_source
    GROUP BY distrito
  ) _cdb_analysis_source,
  (SELECT * FROM points_table) _cdb_analysis_target
  WHERE ST_Intersects(_cdb_analysis_source.the_geom, _cdb_analysis_target.the_geom)
  GROUP BY _cdb_analysis_source.cartodb_id, _cdb_analysis_source.category, _cdb_analysis_source.the_geom, _cdb_analysis_source.count_vals
) _analysis_create_table_query
LIMIT 0;

With other aggregates different from count it is also possible this kind of problem, if the second analysis uses the same aggregate function on a columns with the same name (in this case the doplicated column would be named {aggregate_function}_{base_column}. (here's an example; you can see the error activating DevTools and looking at the named map instantiation response)

I see no obvious way of preventing this kind of problem in general without introducing much uglier column names.

jgoizueta on 18 Sep 2017

I think this is a very important workflow for any client. Better to have ugly column names than useless analysis right?

namessanti on 18 Sep 2017

Yep, let me give it a thought

jgoizueta on 18 Sep 2017

What about count_vals_1, count_vals_vals_2 and so on?

ramiroaznar on 18 Sep 2017

👍1

It might not be trivial to apply such as naming-scheme (notice that each node's query is independently generated, and it should be deterministic based on the parameters used for hashing the node's id), but I think we can detect possible name collisions and do some renaming similar to that, given that we can know the source node's column names (with getColumns() method, thanks to @dgaubert for the info).

I will look for some simple way to handle this.

jgoizueta on 18 Sep 2017

I've just noticed we've had a ticket for this for some time: https://github.com/CartoDB/camshaft/issues/235

I've also open a new ticket for the solution I'm working on: https://github.com/CartoDB/camshaft/issues/330

jgoizueta on 20 Sep 2017

I intended to suffix the ambiguous column names with something meaningful to the user (the a0, a1, etc. identifiers that Builder uses), but I've realised that we cannot do that, so we'll have to just use numerical (_2, _3, ...) suffixes. I'm afraid it will be hard for the user to determine which column corresponds to which analysis, but I see no easy way to fix it.

jgoizueta on 21 Sep 2017

Hm, what about an identifier for what analysis it came from? For example centroid_count_1, centroid_count_2, etc

namessanti on 21 Sep 2017

That's feasible if we don't correlate the _N numbers with different analyses. Can get a little verbose for my taste, but easy to implement.

All in all I think it could be worth, since it would make clear where the column belongs in most cases.

jgoizueta on 21 Sep 2017

Fixed and deployed, thank you all!