Ghost: Swap auto-increment IDs for ObjectID

Created on 5 Oct 2016  ·  5Comments  ·  Source: TryGhost/Ghost

Auto-increment IDs are a feature of most databases. They're an easy way to create an id which is unique for each row in a relational database, so easy, it's not often thought about.

However, auto-increment IDs have several undesirable features:

  • Any replication system with multiple masters needs to have unique IDs, otherwise replication breaks.
  • Auto-increment IDs leak information, although this is not always a problem.
  • Auto-increment IDs don't globally identify resources.
  • This blog post covers several more small reasons why auto increments aren't great.

When Ghost was originally built, I already identified the need to be able to identify resources globally, and so all models in Ghost have both an auto-increment id and a uuid.

Far from solving the problem, this feels like the worst of both worlds:

  • We still have auto-increment ids and all the downsides they bring
  • We have an additional uuid using uuid.v4() that we rarely leverage and which is only "probably" unique, rather than being a guaranteed globally unique id.

The proposal for Ghost 1.0.0 is to move towards using a single ID for each resource, which is guaranteed to be globally unique. After some research into node modules which provide unique id's (there are many and this one was particularly promising), and also looking at what other platforms use, we settled on the idea of using ObjectID - the ID format that is used by MongoDB, via this npm module: https://github.com/williamkapke/bson-objectid.

Note, there are some small downsides as this could have a performance impact. One interesting potential performance issue is that innoDB uses auto-increment ids to optimise inserts. However, as ObjectID is still serial I don't believe this will be an issue.

Using ObjectId can also increase field and index sizes, I believe the impact for us will be minimal. At the moment, we do a poor job of adding indexes to Ghost, so even the fact that we intend to add an optimise them is going to be an improvement over the existing system. Further, we're reducing from having two potential IDs to need indexes on, to having one. There is plenty of content showing that the performance impact is minimal or a worthwhile tradeoff.

Finally, the one place where objectID might not be nice, is for post previews, we may need to revisit the implementation of these, ref #5097.

models / data

Most helpful comment

I haven't done the collision chance calculation but my gut feeling tells me it can't guarantee a lower collision rate then UUIDv4.

The idea is to at least guarantee no collision within a single system, like Ghost(Pro) where we can ensure that pids are unique etc.

If we really need to improve this further, we could always move to our own version of ObjectID that uses a proper machine fingerprint, but it seems to me if the current implementation is good enough for MongoDB, it's probably good enough for us.

The only real advantage I remember from working with ObjectIds is that since the time is the prefix you have a natural ordering by insert date.

This is a pretty big deal, especially as we're going to be focusing our efforts on optimising for MySQL (and InnoDB). This is the main reason for not using uuid.

All 5 comments

Already started with the implementation, but waiting for https://github.com/tgriesser/bookshelf/issues/1383 right now 👍

@ErisDS Shouldn't UUIDv4 be unique enough for this use case?

From reading the UUIDv4 Wikipedia page section about collisions:

https://en.wikipedia.org/wiki/Universally_unique_identifier#Random%5FUUID%5Fprobability%5Fof%5Fduplicates

Values are:

n = 68,719,476,736 (2^36), chance = 0.0000000000000004 (4×10^−16)
n = 2,199,023,255,552 (2^41), chance = 0.0000000000005 (5×10^−13)
n = 70,368,744,177,664 (2^46), chance = 0.0000000005 (5×10^−10)

As a global identifier this should be a pretty safe bet.

To limit collisions even more you can set the uuid field as the primary key or set a unique index and thus limit same-table collisions even though global collisions would still be possible in theory.

It's been a while since I've used MongoDB but by re-reading the documentation it would seem like ObjectIds are not much more random or unique than UUIDv4s.

  • objectid (12 bytes) = time (4 bytes) + machine identifier (3 bytes) + pid (2 bytes) + inc (3 bytes)
  • time = 4 bytes
    -> epoch with a precision up to seconds
    (https://github.com/williamkapke/bson-objectid/blob/2f8c1ca62280c3198131b02d243ea8341a04a364/objectid.js#L146)
  • machine identifier = 3 bytes
    -> 3 random bytes generated at runtime and for the JavaScript / Node.js driver it's using the non-cryptographicaly secure Math.random() function as a source of random bytes
    (https://github.com/williamkapke/bson-objectid/blob/2f8c1ca62280c3198131b02d243ea8341a04a364/objectid.js#L2)
  • pid = 2 bytes
    -> pids are not generated randomly and a lot of Docker containers will run the main app on pid 1 for example
    (https://github.com/williamkapke/bson-objectid/blob/2f8c1ca62280c3198131b02d243ea8341a04a364/objectid.js#L4)
  • inc = 3 bytes
    -> basically generated the same way as the machine identifier but with increments on top of it
    (https://github.com/williamkapke/bson-objectid/blob/2f8c1ca62280c3198131b02d243ea8341a04a364/objectid.js#L3)
    (https://github.com/williamkapke/bson-objectid/blob/2f8c1ca62280c3198131b02d243ea8341a04a364/objectid.js#L141)

I haven't done the collision chance calculation but my gut feeling tells me it can't guarantee a lower collision rate then UUIDv4.

The only real advantage I remember from working with ObjectIds is that since the time is the prefix you have a natural ordering by insert date.

I haven't done the collision chance calculation but my gut feeling tells me it can't guarantee a lower collision rate then UUIDv4.

The idea is to at least guarantee no collision within a single system, like Ghost(Pro) where we can ensure that pids are unique etc.

If we really need to improve this further, we could always move to our own version of ObjectID that uses a proper machine fingerprint, but it seems to me if the current implementation is good enough for MongoDB, it's probably good enough for us.

The only real advantage I remember from working with ObjectIds is that since the time is the prefix you have a natural ordering by insert date.

This is a pretty big deal, especially as we're going to be focusing our efforts on optimising for MySQL (and InnoDB). This is the main reason for not using uuid.

closed via #7495

Was this page helpful?
0 / 5 - 0 ratings