Azure-docs: Should a date part be included in the partition key?

Created on 25 May 2018 · 8 Comments · Source: MicrosoftDocs/azure-docs

Hi,

I have a question. For IoT-style data, does it make sense to include a date part (such as the year) in the partition key? For convenience, it's easier to build the partition key from the device ID alone, as you described. However, say a device sends data once per minute, which comes to about half a million records per year (our real case). My concern is that if we don't implement purging, the number of records per partition will keep growing.

Is there some point at which it should become a concern? Any guidelines on that?

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: 9da32a62-a60f-1013-bb2f-688c67a446ec
Version Independent ID: 7a2bf261-0a20-30f0-837d-cf02700034aa
Content: Partitioning and horizontal scaling in Azure Cosmos DB
Content Source: articles/cosmos-db/partition-data.md
Service: cosmos-db
Product: unspecified
GitHub Login: @SnehaGunda
Microsoft Alias: rimman

assigned-to-author cosmos-dsvc product-question triaged

maxal1917

All 8 comments

@maxal1917 Thanks for the feedback. We are actively investigating and will get back to you soon.

Mike-Ubezzi-MSFT on 26 May 2018

Yes, you can form a composite key by concatenating the device ID and a time component (for example, partition key = "device123-april-2017") to increase the amount of storage available per device.

You could then route queries to the appropriate partition key values using the IN clause, e.g. SELECT * FROM c WHERE c.partitionKey IN ("device123-april-2017", "device123-may-2017")
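
As a rough illustration of the composite-key idea above, here is a minimal Python sketch that builds monthly key values in the "device123-april-2017" format and assembles the matching IN query for a date range. The function names (partition_key, keys_for_range, build_query) and the @pk0, @pk1 parameter names are made up for this example and are not part of any SDK. The query is built in parameterized form rather than with inlined string literals, which is an assumption about how you would want to run it.

```python
from datetime import date

# Fixed English month names, so the key format does not depend on the locale.
MONTHS = ("january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december")


def partition_key(device_id: str, year: int, month: int) -> str:
    """Composite key in the thread's format, e.g. 'device123-april-2017'."""
    return f"{device_id}-{MONTHS[month - 1]}-{year}"


def keys_for_range(device_id: str, start: date, end: date) -> list:
    """All monthly partition key values touched by [start, end], inclusive."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(partition_key(device_id, year, month))
        # Step to the next calendar month, rolling over the year in December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return keys


def build_query(keys: list) -> tuple:
    """Return (query text, parameter list) for the IN query over the given keys."""
    names = [f"@pk{i}" for i in range(len(keys))]
    text = f"SELECT * FROM c WHERE c.partitionKey IN ({', '.join(names)})"
    params = [{"name": n, "value": k} for n, k in zip(names, keys)]
    return text, params


if __name__ == "__main__":
    keys = keys_for_range("device123", date(2017, 4, 15), date(2017, 5, 2))
    query, parameters = build_query(keys)
    print(query)
    print(parameters)
```

The writer and the reader both have to derive the key the same way, so it helps to keep that logic in one place, as partition_key does here.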

aliuy on 30 May 2018

👍2

@Mike-Ubezzi-MSFT, can you assign this issue to @aliuy?

SnehaGunda on 30 May 2018

👍1

@aliuy, I realize I can. My question is: what are the guidelines? This is obvious overhead, and I would rather not do this unless it's necessary. But there is no magic, and I understand that when a partition reaches a certain size, it might be better to start partitioning by size for performance and scalability improvements. So my question is: where is the threshold at which I should do it?

maxal1917 on 30 May 2018

Generally speaking, higher-cardinality partition keys with finer granularity are always preferred.

The extreme case would be to partition by id, which yields great storage and throughput distribution. The trade-off is that frequently run queries become harder to express without fanning out across partitions.

As for an upper bound: while a collection can scale to hold unlimited storage, the data within any single partition key value is bounded by 10 GB of storage and 10,000 RU/sec.
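
To put the 10 GB bound in the context of the original question, here is a back-of-the-envelope sketch in Python that estimates how long a device-only partition key would take to fill up at one record per minute. The average document sizes are purely assumed values, the 10 GB figure is treated as binary gigabytes, and index overhead is ignored, so treat the result as an order-of-magnitude estimate only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
# 10 GB per partition key value, as stated in the comment above
# (interpreted here as binary gigabytes, which is an assumption).
PARTITION_LIMIT_BYTES = 10 * 1024 ** 3


def records_per_year(interval_seconds: float) -> float:
    """Number of records a device writes per year at a fixed send interval."""
    return SECONDS_PER_YEAR / interval_seconds


def years_until_full(avg_doc_bytes: int, interval_seconds: float = 60) -> float:
    """Years until one partition key value reaches the storage bound.

    avg_doc_bytes is a hypothetical average stored document size;
    index overhead is not counted.
    """
    bytes_per_year = records_per_year(interval_seconds) * avg_doc_bytes
    return PARTITION_LIMIT_BYTES / bytes_per_year


if __name__ == "__main__":
    for size in (512, 1024, 4096):  # assumed document sizes in bytes
        print(f"{size:>5} B/doc: {years_until_full(size):.1f} years")
```

At one record per minute a device writes 525,600 records per year. With an assumed 1 KB per document, the estimate works out to roughly 20 years before a device-only key reaches the bound, while 4 KB documents bring that down to about 5 years. Under these assumptions, whether a date part is needed depends mostly on document size and retention period.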

aliuy on 21 Jun 2018

👍1

@maxal1917 Hope Andrew's response helps. We will now proceed with closing this issue.

please-close

SnehaGunda on 22 Jun 2018

@maxal1917 We will now proceed to close this thread. If there are further questions regarding this matter, please reopen it and we will gladly continue the discussion.

Mike-Ubezzi-MSFT on 22 Jun 2018

I also have a similar question about storing IoT device time series data in CosmosDB. Thanks for the discussion, but I'm hoping @aliuy can clarify a comment:

"Generally speaking - having higher cardinality partition keys and finer granularity is always preferred.

The extreme case would be to partition by id - which will yield great storage and throughput distribution"

I'm a bit confused by this statement. Which "id" are you referring to? If you mean the device ID, then I would think it is definitely NOT the extreme case, because each device will have an ever-growing number of time series records (in the millions), which will definitely hit the 10 GB per-partition limit (if you're going Fixed Size). Or do you mean "id" as in the individual document ID, which is unique per document? That seems more like the "extreme" case you're referring to, but using the generic term "id" when EVERYTHING has an ID (including the device in this case) is a bit unclear.