Cockroach: Feature Request: Row-level replication control

Created on 11 Jul 2017 · 13 comments · Source: cockroachdb/cockroach

We'd like the ability to configure replication on a per-row basis, so that we can control which regions individual rows are replicated into. This is intended to reduce read and write latency on data that initially only needs to be accessed in one region, while allowing replication to other regions to be enabled per row as the application sees fit.

Use case as discussed on Gitter:

I'm designing a solution for our application that aims to give our application entity-level replication control; that is, the application can specify on a per entity (row) basis where that entity is located and where it's replicated to, with the intention of replacing Google Cloud Datastore with it
Right now the solution uses MySQL as the underlying data store, with multiple MySQL HA instances in various regions in Google Cloud
with an API layered on top that manages reading and writing data using XA transactions and specialised database / table layouts, as outlined here: https://docs.google.com/document/d/1z0x4OzMVa3iTH2iY0ZddfvqGz3hz8XEHeQnyihxz5Mg/edit?usp=sharing
I was looking at CockroachDB and it appears to have table level replication control; I was wondering whether a feature request for row level replication control is something that would be acceptable?
The underlying reason for this is that we have entities that don't exist for very long, and often don't need to be accessed from other regions around the world, so it doesn't make sense to incur read/write latency to replicate entities that don't really need to be replicated
...
@vivekmenezes We're working on a cloud gaming API for developers. In our scenario, you might have a game which is distributed globally, and you have players signing in from all around the world
but when players login and the server creates sessions and API keys, it's unlikely that e.g. the North American servers need to know about that player session and API key if the player logged in from Australia
likewise with Game Lobbies and Matchmaking - most likely you want to match up with data that's regionally close together, because that results in a better play experience for players anyway (lower latency to other players / servers)
that said, sometimes you do need to refer to data in other regions, like when players add friends to lobbies and those friends are overseas, and we want to do this transparently
so when users log in, we'd like to be able to have the session data stored regionally close so that we're not incurring like 200+ms latency in having transactions synchronise data globally - after all, there's no point replicating an Australian session to North America initially
but if that session starts to interact with data that's stored in North America, and the North American servers need to start accessing that session data, then we'd like to start having that replicated automatically
basically we want to control the trade-off: initially we want low latency reads and writes, at the expense of any non-regional data access taking longer
then if the data starts to be accessed from around the world, we want to favour low latency reads globally at the expense of higher-latency synchronous writes
but obviously this trade-off needs to be per-entity, some sessions will interact with players in other regions, some will not
and it will vary by tenant (customer/developer/game) as well
(we intend on using a database per tenant, with the tables standardized in layout)
as far as I can tell, setting up globally synchronous Redis wouldn't really help here - we'd just be preemptively putting everything in the "globally low latency reads at higher synchronous writes" category
it would still probably be better than Google Cloud Datastore, which is high latency pretty much everywhere it seems, but I'd rather have something designed to be scalable into the future
rather than potentially doing multiple data migrations as time goes on
(I know some games work around the latency issue by having explicit datacenters that accounts are associated with, e.g. North America and Australia being separate regions selectable within the game, but this often means that friends in different regions can't play together and it sucks as an experience for players)

Labels: A-partitioning, O-community

All 13 comments

The MySQL-based design in the linked document is based on using dedicated columns in the table for configuring replication - I would imagine the implementation in CockroachDB would be similar so that it's easy for client applications to modify or set it while inserting or updating data (as opposed to table or database replication control which requires configuration of zones from the command-line).

@hach-que Table partitioning is on our roadmap. See https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/sql_partitioning.md for the proposed design. Feedback is welcome.

@petermattis I'm having a read through the proposal, and it seems the current design is based on primary keys for determining partitions - that is, you'd have to change the identifier of the row in order to move the data. It's quite possible that I'm reading it wrong though.

So in the users / region example in the proposal, a query we'd want to run based on monitoring out-of-region writes is `UPDATE users SET region = 'AU' WHERE user_id = 123`, which would physically cause the authoritative row to be moved to the Australian region.

Another thing is that it looks like partitioning requires the region to be part of the primary key, i.e. lookups on `user_id` alone require a full-table scan. I don't know whether this also impacts uniqueness checks - if I add a new user in the Australia region, does it have to perform a table scan across all regions? Preferably we'd like a "user ID -> region" map replicated globally, so that the system can quickly determine which region it needs to perform the actual data lookup in (effectively turning a table scan into an index lookup in a specific region, and presumably making uniqueness checks faster if that map is read replicated to most geographic regions).
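To make the idea concrete, here's a rough sketch of the kind of "user ID -> region" map described above (a hypothetical schema for illustration, not an existing CockroachDB feature; table and column names are made up):

```sql
-- Hypothetical sketch: a small, globally replicated table recording which
-- region owns each row, so point lookups and uniqueness checks can avoid a
-- cross-region scan.
CREATE TABLE user_regions (
    user_id INT PRIMARY KEY,   -- globally unique, enforced in one place
    region  STRING NOT NULL    -- e.g. 'AU', 'NA', 'EU'
);

-- A lookup by ID first consults the map locally...
SELECT region FROM user_regions WHERE user_id = 123;
-- ...and then queries only the owning region:
SELECT * FROM users WHERE region = 'AU' AND user_id = 123;
```

This turns the full-table scan into two index lookups, at the cost of keeping the map replicated everywhere.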

In other cases (where we're not just looking up by ID) we're comfortable falling back to a table scan, though presumably a query like `SELECT * FROM users WHERE region = 'AU' AND ...` would only execute the table scan in the Australian region, and not touch nodes in other regions (just to ensure there's no latency incurred there).

The proposal doesn't mention read replication control, though this is something that's important to us as well - we not only want to be able to control where the authoritative version of the data is (i.e. the proposal), but where read replicas of that data are placed. I don't know whether CockroachDB already read-replicates data across zones, but I would assume it does in order to provide redundancy in multiple zones. We want to be able to control this too: specifying no read replication, read replication to specific geographic regions, or global read replication. Our application is aware of scenarios where data starts to be accessed more frequently from outside the region it's partitioned in, and we'll want the application to be able to designate that individual rows should be read replicated to other regions to decrease latency in those regions, at the expense of higher write latency globally.

Hopefully that helps :)

@hach-que Thanks for the feedback. I'll let @danhhz and/or @dt answer your questions as they authored the RFC I pointed to.

The proposal doesn't mention read replication control, though this is something that's important to us as well - we not only want to be able to control where the authoritative version of the data is (i.e. the proposal), but where read replicas of that data are placed.

The proposal discusses controlling where the replicas for a row are placed (via ZoneConfigs). Cockroach does not expose control of which replica is the leaseholder (I think this is what you mean by the authority). Instead, Cockroach automatically determines the placement of leaseholders based on a number of factors, including active load on the system. See the leases follow the sun RFC which is already implemented.

Cc @nstewart

@hach-que Yes, your reading is correct: the proposed design does require that the partitioning column(s) be in the primary key (and indeed, as drafted, further requires they be a prefix). And you're spot-on about uniqueness: to do a unique user id check without a scan, we'd need that `userid -> region` index you mentioned, and it should be unique and non-partitioned.
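As a rough illustration of that prefix requirement (the syntax follows the partitioning feature that eventually shipped; the table and partition names here are made up):

```sql
-- The partition columns must form a prefix of the primary key:
CREATE TABLE users (
    region  STRING NOT NULL,
    user_id INT NOT NULL,
    email   STRING,
    PRIMARY KEY (region, user_id)          -- region first, then user_id
) PARTITION BY LIST (region) (
    PARTITION australia     VALUES IN ('AU'),
    PARTITION north_america VALUES IN ('NA')
);
```

Because `region` leads the primary key, each partition maps to a contiguous key range that can be assigned its own replication zone; a lookup that supplies only `user_id` has no prefix to seek on, which is why it would otherwise scan.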

We don't currently support reading from anything other than the leader for a range, but when we do, you're also spot-on that replicating that index globally would be a big win. Atomic uniqueness checks that don't include the partition value will obviously still need to cross region boundaries, though.

Our index selection should already correctly handle limiting scans to a single partition (or set of partitions) if the partition columns are included in the query, so your users scan example should work out of the box when we add partitioning, provided it selects the partitioned index -- we'll have to take that into account in cases where there are multiple candidate indexes though, especially if/when we start including cardinality and cost approximations in that planning.

Team, I'd like to introduce myself. My name is Andy and I am a PM here at CRDB. I'm happy to share with you that we have heard you loud and clear and are working to support partitioning in 1.2. To that end, we would love to solicit your comments on our current RFC scope proposal found here. In particular, we would love to hear more about how you plan to use partitioning and if we are adequately thinking about your use case. How do you want to use partitioning? Please feel free to share in the comments below or directly at the RFC.

I think I've outlined our requirements above, but let me know if you need more information:

  • We need to be able to dynamically relocate the authoritative row, and control which zones the read replicas are stored in.
  • We need to be able to look up rows quickly, without being aware of their geographic location (i.e. we need a global index of "primary key" -> "geographic location"). In this case we expect that the index is replicated to all regions (this makes create and delete operations more expensive as it means waiting until the ID -> zone entry is replicated everywhere, but makes lookup and single update operations significantly faster as the data can be queried directly in the local region).

When an API server creates an entity in a given location, depending on the type we'll want to store that row in the local region. If that data starts to be accessed by API servers in another geographic location, our API servers need to dynamically add read replicas to other regions and to potentially move the authoritative location by updating the values for that particular row (where those values are for columns that dictate where the data is stored). For some types of data, we don't need backups or redundancy because the data is temporary and fine to discard in case of node failure.

@hach-que sorry for the radio silence but I wanted to mention that partitioning is available in the 2.0 betas. The partitioning docs are still in PR and are probably the best place to get started, but there's also a partitioning RFC if you prefer.

David addressed a few of your specific questions above, here's the rest:

We need to be able to look up rows quickly, without being aware of their geographic location (i.e. we need a global index of "primary key" -> "geographic location"). In this case we expect that the index is replicated to all regions (this makes create and delete operations more expensive as it means waiting until the ID -> zone entry is replicated everywhere, but makes lookup and single update operations significantly faster as the data can be queried directly in the local region).

With your use case in mind, we made sure it's possible to create multiple identical indexes on a table. You would make one per region and then configure them to live one in each region.
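Sketched roughly (the index names are made up, and the zone-configuration statements use the later `CONFIGURE ZONE` SQL form rather than the 2.0-era `cockroach zone` CLI):

```sql
-- One copy of the same secondary index per region...
CREATE INDEX users_idx_au ON users (user_id);
CREATE INDEX users_idx_na ON users (user_id);

-- ...each pinned to its region so lookups can be served locally:
ALTER INDEX users@users_idx_au CONFIGURE ZONE USING constraints = '[+region=au]';
ALTER INDEX users@users_idx_na CONFIGURE ZONE USING constraints = '[+region=na]';
```

A query in Australia can then use `users_idx_au` and resolve `user_id` without leaving the region; the trade-off is that every write must update all copies of the index.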

We need to be able to dynamically relocate the authoritative row, and control which zones the read replicas are stored in.

When an API server creates an entity in a given location, depending on the type we'll want to store that row in the local region. If that data starts to be accessed by API servers in another geographic location, our API servers need to dynamically add read replicas to other regions and to potentially move the authoritative location by updating the values for that particular row (where those values are for columns that dictate where the data is stored). For some types of data, we don't need backups or redundancy because the data is temporary and fine to discard in case of node failure.

CockroachDB has only one read leaseholder per data range. As mentioned above, it will automatically move to the replica that's closest to the load. The follower reads feature, targeted for 2.1, will allow for reading from the other replicas.

I'm going to close this since partitioning is ready, but please feel free to followup with any questions!

Very disappointed that after all this time waiting, this feature finally arrives and it's enterprise license only, effectively ensuring we can't actually use it.

It looks like we'll have to consider a different technology/approach to distribute our data globally.

@hach-que In case you haven't read it, please take a look at our blog post on How We're Building a Business to Last. We have to balance our desire to build open-source software with our need to build a sustainable business. I strongly encourage you to reach out to [email protected] before writing off CockroachDB.

My strong aversion to licensing lock-in for databases comes from what I've seen in the past, and the investment cost we'd need to make if we chose to go with CockroachDB. Roughly speaking:

  • I've worked at places where Microsoft SQL Server licensing made up over 60% of a six-figure monthly AWS bill. There's no ability to reduce the licensing cost in these scenarios, and no ability to migrate the data to another storage system because of lock-in. This severely limits a company's ability to reduce its ongoing infrastructure costs: even if you optimize and reduce the number of VMs you're using to e.g. serve front-end API requests, the DB licensing costs are so huge that they dwarf any meaningful savings you can get elsewhere.

    • Even if the price of Enterprise licensing is competitive and viable today, there's no guarantee the pricing won't change in the future, and as in the example above, once you're locked into a choice of database, these costs can have a significant impact on ongoing expenses.

  • For us, the total cost of CockroachDB has to compete with Google Cloud Datastore. Our Datastore bill last month was just $0.37 AUD. Migrating to CockroachDB on Kubernetes Engine instead of Datastore is already going to cost more than Datastore does today (in terms of compute resources), so adding an additional licensing cost on top of that is not a viable option.
  • If we chose to run the open source version today (where we're not running things in multiple regions yet), we can't design our database with geo-partitioning in mind because the feature is not accessible to us at all. When we expand our infrastructure to multiple regions, we then have to adjust the schema of our database to add geo-partitioning, and doing this when we have lots of live customers is much harder to do than designing for geo-partitioning from day 1.

Hopefully this outlines why Enterprise licensing is not something we're interested in pursuing.

A few points of clarification (for future readers):

  • The docs for partitioning are available for all to read, allowing open-source users to design their applications with partitioning in mind.
  • 30-day self-service enterprise trial licenses are available to allow you to try before you buy.
  • Follow-the-workload can improve performance of multi-region deployments even if you're not using partitioning.

@hach-que I'll reiterate my encouragement to talk to us before writing off enterprise licensing, though I understand if your mind is made up.
