Cartodb: [Dashboard] Performance Review

Created on 16 Jan 2019 · 20Comments · Source: CartoDB/cartodb

We have to make sure the New Dashboard performs 👌 in accounts with loads of maps and datasets. @andy-esch recently explained how the performance of the actual Dashboard is pretty bad in his account (~1000 datasets).

The goal of this issue is to assess what the performance of the New Dashboard is in accounts like @andy-esch 's and do a little investigation of what's causing the UI to be slow, so that we can decide how to fix some or all of them in the near future.

A more performant Dashboard will be add a lot of value to heavy CARTO users 💪

cc: @javitonino @ethervoid

Backend

Source

alonsogarciapablo

Most helpful comment

Applying this patch now we're doing it in 0.6s

ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#execute_and_clear(sql="              SELECT t.oid, t.typname, t.typelem, t.typdelim, t.typinput, r.rngsubtype, t.typtype, t.typbasetype\n              FROM pg_type as t\n              LEFT JOIN pg_range as r ON oid = rngtypid\n              LEFT JOIN pg_type as tt ON (tt.typtype = 'c' AND tt.typarray = t.oid AND tt.typinput = 'record_in(cstring,oid,integer)'::regprocedure)\n              WHERE\n                tt.oid is null\n                AND (t.typname IN ('int2', 'int4', 'int8', 'oid', 'float4', 'float8', 'text', 'varchar', 'char', 'name', 'bpchar', 'bool', 'bit', 'varbit', 'timestamptz', 'date', 'time', 'money', 'bytea', 'point', 'hstore', 'json', 'jsonb', 'cidr', 'inet', 'uuid', 'xml', 'tsvector', 'macaddr', 'citext', 'ltree', 'interval', 'path', 'line', 'polygon', 'circle', 'lseg', 'box', 'timestamp', 'numeric')\n                OR t.typtype IN ('r', 'e', 'd')\n                OR t.typinput = 'array_in(cstring,oid,integer)'::regprocedure\n                OR t.typelem != 0)\n", name="SCHEMA", binds=[]) <0.598322>

I've made some tests and when a new type is need it ask for it without problem.

ethervoid on 31 Jan 2019

🚀2

All 20 comments

Randomly came across this which should help: https://github.com/CartoDB/cartodb/pull/14621

javitonino on 17 Jan 2019

Some more info that might be relevant: https://github.com/CartoDB/cartodb/issues/13953

javitonino on 18 Jan 2019

Another place where with_presenter_cache could help:
https://github.com/CartoDB/cartodb/blob/32d30ba66b8006dca14be926fdc8ccfaa092e40b/app/controllers/carto/api/visualization_presenter.rb#L61

gonzaloriestra on 22 Jan 2019

@jesusbotella sent me this issue about the (poor) performance of the datasets view:
https://groups.google.com/forum/#!msg/cartodb/K7dG4p0Tpyw/NP5ncVshCQAJ
Similar to what @andy-esch was expressing. I see @javitonino already replied. If we can fix this in the New Dashboard, it will be a super-win for our users 💪

alonsogarciapablo on 24 Jan 2019

@ethervoid do you have any insights regarding this so far? Thanks!

alonsogarciapablo on 25 Jan 2019

@alonsogarciapablo I'm reading all the issues, understanding the measures taken an so on. I finished yesterday the Likes cleaning so I'm on it :)

ethervoid on 25 Jan 2019

Perfect! Thanks for the update! 💪

alonsogarciapablo on 25 Jan 2019

Update

I've been uploading and trying to reproduce the problem in my local environment but is not the same and I don't have the same latency problems that in production. Why?

The metadata and the user database are in the same machine
We don't use pgbouncer
We don't have hundreds/thousands of users in the same database

Then I tested it in staging cloud with similar results. It could be due to similar factors than the local environment.

Next step is to provision a new RUI, redirect @andy-esch user to that rui and start doing profiling and so on.

For example doing this request: https://team.carto.com/u/eschbacher/api/v1/viz?tag_name=&q=&page=1&type=derived&exclude_shared=false&per_page=1&locked=true&shared=no&only_liked=false&api_key=apikey the avg response time is 4 seconds so is a good point to start measuring

ethervoid on 29 Jan 2019

Thanks for the update @ethervoid! Keep 🕵️ and you will find something! 💪

alonsogarciapablo on 29 Jan 2019

Update

I've been doing benchmarking and here are my insights by now.

TL; DR; Looks like either the connection to the database or the size of the database is damaging the performance. We have to take into account the the dbd-team pg data disk is 2.7TB

I've added benchmarking in every visualization object to check where the time goes:

"timeviz": [
                {
                    "label": "to_poro",
                    "real": 0.0069547109997074585,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                },
                {
                    "label": "load_related",
                    "real": 0.000002701999619603157,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                },
                {
                    "label": "load_dependent",
                    "real": 0.0000027089990908280015,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                },
                {
                    "label": "liked",
                    "real": 0.001709437001409242,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0.009999999999999787,
                    "total": 0.009999999999999787
                },
                {
                    "label": "permission",
                    "real": 3.5094704250004725,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0.15999999999999992,
                    "utime": 2.4799999999999995,
                    "total": 2.6399999999999997
                },
                {
                    "label": "stats",
                    "real": 0.003290009999545873,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                },
                {
                    "label": "user_table",
                    "real": 0.00007167700096033514,
                    "cstime": 0,
                    "cutime": 0,
                    "stime": 0,
                    "utime": 0,
                    "total": 0
                }
            ]

As you can see a huge amount, 3.5s, goes to the permission part. Why? because permissions is the first time the user presenter is loaded, without cache, aaaaaand there we have the db_size part

So it looks like we still have the same problem. Measuring the time spent for @andy-esch 's user to retrieve the db size you can see the amount of time spent here (2,96s):

irb(main):002:0> Benchmark.measure { Carto::User.find_by_username('eschbacher').db_size_in_bytes }
[pool] Creating a new connection for 10.60.255.215:cartodb_user_6b304b59-2c86-4d7b-a8a2-dc98a5a35d48_db:6432:postgres:ar (1)
[pool] Found a connection for 10.60.255.215:cartodb_user_6b304b59-2c86-4d7b-a8a2-dc98a5a35d48_db:6432:postgres:ar (1)
=> #<Benchmark::Tms:0x0000000c8f7378 @label="", @real=3.636054564267397, @cstime=0.0, @cutime=0.0, @stime=0.15000000000000002, @utime=2.8100000000000005, @total=2.9600000000000004>

As an interesting part, @javitonino , was taking a look too and he notice that using the console the connection to the db takes ~3s:

Benchmark.measure { db = u.in_database(as: :superuser) } -> 3s
Benchmark.measure { u.db_size_in_bytes } -> 0.25s

Also @javitonino did other test: ActiveRecord::ConnectionAdapters::PostgreSQL::OID::TypeMapInitializer#run <4.620301> this command list all the different types in the DB and took 4.6s

Tomorrow I'll keep investigating :)

ethervoid on 29 Jan 2019

Update

Well, we finally have a winner here and a patch for it

ActiveRecord in Rails 4.0 is loading all the types from the database when a connection is made and in our team database is taking ~2.5 seconds because it loads 145k types.

I did a test using two different user models: one from Sequel and the other from AR to do the same operation:

  0.000000   0.000000   0.000000 (  0.235332) --> Sequel
  1.920000   0.150000   2.070000 (  2.689540) --> AR

You can see the problematic method is this one:

ActiveRecord::ConnectionAdapters::PostgreSQL::OID::TypeMapInitializer#run <4.615232>
                                                ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#execute_and_clear <5.136224>
                                              ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#load_additional_types <5.136321>
                                            ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#initialize_type_map <5.136989>
                                          ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#initialize <5.142473>
                                        Class#new <5.142497>
                                      ActiveRecord::ConnectionHandling#postgresql_connection <5.142617>
                                    ActiveRecord::ConnectionAdapters::ConnectionPool#new_connection <5.142634>
                                  ActiveRecord::ConnectionAdapters::ConnectionPool#checkout_new_connection <5.142647>
                                ActiveRecord::ConnectionAdapters::ConnectionPool#acquire_connection <5.142675>
                              MonitorMixin#mon_synchronize <5.143279>
                            ActiveRecord::ConnectionAdapters::ConnectionPool#checkout <5.143303>
                          MonitorMixin#mon_synchronize <5.143356>
                        ActiveRecord::ConnectionAdapters::ConnectionPool#connection <5.143382>
                      Carto::Db::Connection.get_database <5.144334>
                    CartoDB::ConnectionPool#create_new_connection <5.144612>
                  CartoDB::ConnectionPool#fetch <5.144657>
                Carto::Db::Connection.connect <5.144780>
              Carto::UserService#in_database <5.145184>
            Carto::UserService#cartodb_extension_version <5.146027>
          Carto::UserService#cartodb_extension_version_pre_mu? <5.146076>
        Carto::UserService#db_size_in_bytes <5.672939>
      Kernel#eval <5.833719>
    IRB::WorkSpace#evaluate <5.833742>
  IRB::Context#evaluate <5.833836>
IRB::Irb#signal_status <5.833899>

As you can see, the AR model is taking 2.6 seconds to do the same operation than Sequel. Why? Basically, AR when open a connection with a database get all the types from pg_type which in our database is 145k different types.

We can divide the problem into data gathering and processing:

To execute the query and retrieve all the data takes 1.3seconds

  1.240000   0.090000   1.330000 (  1.331408)

Process the data takes 0.5s

  0.530000   0.010000   0.540000 (  0.537235)

We can improve the data processing a bit but we're not going to deal with the real problem: the huge amount of types.

But why are we having that huge amount of types? Taking a look at the table we can see most of the types are related to tables recordset and arrays of those recordset. So we have decided to filter those in this query and in case our app needs it there is no problem because AR would ask for it doing a query

But...are you saying that every time AR asks for a type that we don't have in our cache it's going to do a query? Yes, but those types are not a common thing to ask for. Basically our app is dealing with the metadata database which has a few custom types (207) so we're, probably, going to notice that overhead.

Is not a perfect solution but is going to save us about ~2 seconds when dealing with huge user databases operations

ethervoid on 31 Jan 2019

👀1

Applying this patch now we're doing it in 0.6s

ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#execute_and_clear(sql="              SELECT t.oid, t.typname, t.typelem, t.typdelim, t.typinput, r.rngsubtype, t.typtype, t.typbasetype\n              FROM pg_type as t\n              LEFT JOIN pg_range as r ON oid = rngtypid\n              LEFT JOIN pg_type as tt ON (tt.typtype = 'c' AND tt.typarray = t.oid AND tt.typinput = 'record_in(cstring,oid,integer)'::regprocedure)\n              WHERE\n                tt.oid is null\n                AND (t.typname IN ('int2', 'int4', 'int8', 'oid', 'float4', 'float8', 'text', 'varchar', 'char', 'name', 'bpchar', 'bool', 'bit', 'varbit', 'timestamptz', 'date', 'time', 'money', 'bytea', 'point', 'hstore', 'json', 'jsonb', 'cidr', 'inet', 'uuid', 'xml', 'tsvector', 'macaddr', 'citext', 'ltree', 'interval', 'path', 'line', 'polygon', 'circle', 'lseg', 'box', 'timestamp', 'numeric')\n                OR t.typtype IN ('r', 'e', 'd')\n                OR t.typinput = 'array_in(cstring,oid,integer)'::regprocedure\n                OR t.typelem != 0)\n", name="SCHEMA", binds=[]) <0.598322>

I've made some tests and when a new type is need it ask for it without problem.

ethervoid on 31 Jan 2019

🚀2

Update

We've re-routed team users to our test platform in production and here are the results using the patch:

Without the patch activated

screenshot from 2019-02-04 11-28-52

With the patch activated

screenshot from 2019-02-04 11-39-58

As you can see we have improved the time for the dashboard:

/me endpoint have improved from 4.3s to 1.4s
/viz endpoints have improved from: about ~5s to ~3s

In summary, we have passed from ~26s to load the dashboard to ~12.8.

I'd let the patch active and the team users routing through it in order to test for undesired side-effects. If we don't have any issue In a couple of days (2-3), I'll put that patch into production.

Hope this improvement is good enough :)

ethervoid on 4 Feb 2019

🎉1

Now I'm going to start checking the order by size part. Which maybe could be improved in order to avoid making N queries for size data. Stay tunned

ethervoid on 4 Feb 2019

Great news @ethervoid! It's a great improvement!

alonsogarciapablo on 4 Feb 2019

@alonsogarciapablo asked about the performance with the new dashboard. I've changed @andy-esch user to the new dashboard and here you can see the results: