Cartodb: Rethink Limits model rather than add a new field in User model

Created on 16 Nov 2017 · 13Comments · Source: CartoDB/cartodb

TL;DR: users table keeps growing with every single small modification.

users table currently has 92 columns (!), and it keeps growing with every single configuration parameter. Following projects (check inbound links) are going to need more configuration parameters, and we should stop this mess before it explodes.

I'd like to open the debate about the need and the approach.

Possible approaches (not all of them are excluding):
I) Remove unused columns.
II) Create some user_XXX tables, grouping existing columns into meaningful sets.
III) Group meaningful sets into JSONB/HSTORE columns.
IV) Address new needs with any of the previous approaches, postpone migrating existing columns.

Any proposal should also take into account the impact on queries, especially at presenters. Currently, presenters present most of the columns because the cost is low. Nevertheless, we might want to change that to take advantage of the refactor.

Thoughts, @CartoDB/builder-backend ? Not urgent at all.

cc @matallo

Backend

Source

juanignaciosl

Most helpful comment

So, after spending some more time thinking/talking about this, a proposal:

Create a rate_limits table, with an id and a column per limit.
Create an account_types table, with fields account_type, rate_limits_id (will probably grow)
Add a rate_limits_id field to users, NULL by default

We use a common table for all limits, so we can share the same columns for account_types and users, instead of having two separate tables with almost the same schema.

Users will get the limits from rate_limits_id if set, or from the relation through the account type otherwise. As most users don't overwrite the limits, that table will probably be very small (one row per plan, plus a few users which need specializations). We minimize data duplication while still alowing overrides. As the table will be small, migrations should not pose any issue, so I feel confident using postgresql columns, so we can benefit from psql's not null/default.

Syncing the data will be a bit trickier, since we still have to duplicate the data in redis, so a single change of a limit associated to a plan will imply many users invalidations, which are unavoidable anyway. But it should not be too hard (just loop over affected users), and the design is nicer.

javitonino on 8 Feb 2018

👍3

All 13 comments

It could make sense to extract classes and use them as dependencies. Such as PersonalData, AccountInfo, PaymentInfo...

amartinj on 20 Nov 2017

It is time to rethink this question. We are starting with the first steps for Limits v2, covering mainly rate limits. On the one hand, it does not make sense overloading users table. On the other hand, thinking on limits with a broader view drives us to find a good solution in terms of data design, in the line you both suggest.

I add @oleurud to the debate, so he can explain here the specific needs we face with limits.

inigomedina on 10 Jan 2018

I need to think/define some things before talking about this. I am working on it

oleurud on 11 Jan 2018

👍1

We could be talking about 20 and 60 new values only for rate limiting (it is not sure yet)

oleurud on 11 Jan 2018

We are going to use 63 new values (integers) per user and this values must be accessible from Redis. I am not sure if this is the place to explain them, but I am going to do it.

These new values are for add a rate limit for each API 'endpoint' (really it is not a simple endpoint, we can think about it as a string key). For each 'endpoint', we will need to save 3 for all but one, which we will need 6 values. And ideally, we will need to get these 3 or 6 values with a unique Redis request (a Redis Hash, extracting together with HGETALL)

So, thinking about the db model, these 3 or 6 values have a relation between them and we always use them together.

I don't have all info to take a decision about the best db model approach, but I like to use a JSON or JSONB fields (better than new tables) in this cases.

oleurud on 15 Jan 2018

Let's meet or discuss in a feature doc to understand what's actually needed and what's the best approach for this, please.

juanignaciosl on 15 Jan 2018

Same table vs other table
Having everything in the same table makes it easier to implement a 1:1 relationship, but it has the issue that the table grows bigger. Also, given that we use an ORM, it will load all fields in all requests, which can be a problem. We have a mitigation with DEFAULT_SELECT but it's not good enough for the general case (it only works for certain associations, and it's problematic).

Splitting into different tables is harder to manage (ensure 1:1 relation), but it might be worth it for seldom requested information (like these limits). We avoid querying the table when we don't need the limits, but we penalize with an extra query when we do. Seems reasonable for instances where Rails won't use the data directly.

JSON vs fields
JSON is more flexible and easier to update (avoids DB migration). On the other hand, we have to handle those by hand, probably in Rails code, maybe by merging a DEFAULTs dictionary into the saved value in order to fill it.

Having lots of fields delegate all that work to the database, but we might run into issues with migrations, specially in the common cases (adding new fields with NOT NULL/defaults). On the other hand, it ensure the data is consistent and everything is filled, and we can always put the defaults in code to avoid migration issues.

javitonino on 1 Feb 2018

I'd consider another approach as well (brainstorming mode, I haven't collected data about this):

Normalized, limit-by-row tables

This involves two tables:

account_type_limits (account_type, limit_name, default_value) table with the defaults by account type. We've been reluctant to add pricing plan information to the box but, TBH, it's leaked. As we have account types, this might make sense.
user_custom_limits (user_id, limit_name, value) table that contains one row per limit that is not the default one for each user.

This complicates the logic a little bit but, assuming that only a few limits are not the default ones, the tables should be really small, which could lead to huge benefits: the DB size is much smaller and adding new limits is a matter of adding a few rows.

As a variation of this, if limits variance is always related to the plan or user (in other words, if most limits are different between plans and if a user that has a different limit has most limits different), would be the "other table" approach mentioned by javitonino (one column per limit), but with the account_type_limits table for the defaults.

juanignaciosl on 5 Feb 2018

I don't like the key/value type tables. It has disadvantages from both sides (no real schema, harder to query) and doesn't offer much (since we would always query all limits). In that case I'd rather user JSON. But you raise a fine point, it might be the time to leak priceplan level settings to the box, and override as needed.

Something we thought about, which might fit here is to create a limits table, and then reference entries from the users/plans table (add users.permissions_id). It helps keep the table small, and might make updates to limits easier if we tend to have the same three sets of limits for almost all users (at least at DB level, redis is another story).

javitonino on 5 Feb 2018

👍1

So, after spending some more time thinking/talking about this, a proposal:

Create a rate_limits table, with an id and a column per limit.
Create an account_types table, with fields account_type, rate_limits_id (will probably grow)
Add a rate_limits_id field to users, NULL by default

We use a common table for all limits, so we can share the same columns for account_types and users, instead of having two separate tables with almost the same schema.

javitonino on 8 Feb 2018

👍3

rate_limits would first have the new limits (so this doesn't block new development) and then current limit columns will be migrated, right?

👍 , of course! 🚀

juanignaciosl on 9 Feb 2018

Redis convention proposal

Database
As we have talked, it is a good idea to use another Redis db (not in use). I am using the 8 db in the rate limit development. Any problem?

Key
limits:rate:store:{user}:{app}:{endpointGroup}

Being:
user: the CARTO username
endpointGroup: the endpoint group name.
app: sql or maps

Maps API endpoints
anonymous
static
static_named
dataview
dataview_search
analysis
tile --> it has 6 values
attributes
named_list
named_create
named_get
named
named_update
named_delete
named_tiles

SQL API endpoints
query
query_format
job_create
job_get
job_delete

Data to save and persistence
First of all, we must think about the best way in performance terms. For each API request, we will exec a Lua script that makes 2 tasks:

get the limits values by user and endpoint
get the rate limit state

As we talked, we need to save 3 values (each limit needs: maxBurst, countPerPeriod and period) by user and endpoint, less in the tile case that we need to save 6 values (we will save 2 limits). And maybe, we will need to add/remove a limit in a special case.

Thinking about the best way to save the data with @javitonino: use a Redis list, with 3 values per limit. The order must be: maxBurst, countPerPeriod and period (this is important, we have an app writing the limits and another different consuming them)

So in case, we have to save 2 limits: 5 req/second and 20 req/minute: