Hi,
I have a question. For IoT-style data, does it make sense to include a date part (such as the year) in the partition key? For convenience, it's easier to make the partition key just the device ID, as you described. However, say a device sends data once per minute, which comes to about half a million records per year (our real case). My concern is that if we don't implement purging, the number of records per partition will keep growing.
Is there some point at which it should become a concern? Any guidelines on that?
@maxal1917 Thanks for the feedback. We are actively investigating and will get back to you soon.
Yes, you can form a composite key by concatenating the device id and time (for example - partition key = "device123-april-2017") to increase the amount of storage per device.
You could then route queries to the appropriate partition key values using the IN clause e.g. SELECT * FROM c WHERE c.partitionKey IN ("device123-april-2017", "device123-may-2017")
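To make the composite-key idea above concrete, here is a minimal Python sketch. It assumes a monthly time bucket in the same "device123-april-2017" format; the helper names (partition_key, keys_for_range, build_query) are hypothetical and not part of any SDK. The query is built with named parameters in the name/value shape that Cosmos DB parameterized queries use, and the code only builds the query text and parameter list without calling any client library.

```python
from datetime import date

# Fixed English month names, so the key does not depend on the process locale.
MONTHS = (
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
)


def partition_key(device_id, year, month):
    """Build a composite key such as 'device123-april-2017' (hypothetical helper)."""
    return f"{device_id}-{MONTHS[month - 1]}-{year}"


def keys_for_range(device_id, start, end):
    """List the monthly partition keys covering start..end (both inclusive)."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(partition_key(device_id, year, month))
        month += 1
        if month > 12:  # roll over into the next year
            month = 1
            year += 1
    return keys


def build_query(keys):
    """Return (query_text, parameters) for an IN filter on c.partitionKey."""
    names = [f"@pk{i}" for i in range(len(keys))]
    query = f"SELECT * FROM c WHERE c.partitionKey IN ({', '.join(names)})"
    parameters = [{"name": n, "value": k} for n, k in zip(names, keys)]
    return query, parameters


keys = keys_for_range("device123", date(2017, 4, 10), date(2017, 5, 2))
query, parameters = build_query(keys)
print(keys)   # ['device123-april-2017', 'device123-may-2017']
print(query)  # SELECT * FROM c WHERE c.partitionKey IN (@pk0, @pk1)
```

A coarser or finer bucket (year, week, day) would work the same way; only partition_key and the stepping logic in keys_for_range would change.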
@Mike-Ubezzi-MSFT can you assign this issue to @aliuy
@aliuy, I realize I can. My question is: what are the guidelines? This is obvious overhead, and I would rather not do this unless it's necessary. But there is no magic, and I understand that when a partition reaches a certain size, it might be better to start partitioning by size for performance and scalability improvements. So my question is: where is the threshold at which I should do it?
Generally speaking - having higher cardinality partition keys and finer granularity is always preferred.
The extreme case would be to partition by id - which will yield great storage and throughput distribution. The trade-off is how easily frequently run queries can be expressed without having to fan out across partitions.
From an upper bound perspective... while a collection can scale to fit unlimited storage - data within a single given partition key value is bounded by 10GB of storage and 10,000 RU/sec.
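For a rough sense of when the 10GB bound would matter in the one-reading-per-minute case from the original question, here is a back-of-the-envelope Python sketch. The average document size is an assumption (the thread never states it), and the estimate ignores index and metadata overhead, so treat the result as an order of magnitude rather than a guideline.

```python
# One reading per minute, as in the original question.
READINGS_PER_YEAR = 60 * 24 * 365          # 525,600 documents per device per year

# Per-partition-key storage bound quoted in this thread.
PARTITION_LIMIT_BYTES = 10 * 1024 ** 3     # 10 GB


def years_until_full(doc_size_bytes,
                     readings_per_year=READINGS_PER_YEAR,
                     limit_bytes=PARTITION_LIMIT_BYTES):
    """Years until a single device-only partition key reaches the storage bound.

    doc_size_bytes is an ASSUMED average document size; index overhead is ignored.
    """
    return limit_bytes / (doc_size_bytes * readings_per_year)


# Assumed document sizes, for illustration only.
for size in (1024, 4 * 1024, 10 * 1024):
    print(f"{size} bytes/doc: about {years_until_full(size):.1f} years")
```

Under these assumptions, 1 KB documents take roughly 20 years to fill a device-only partition key, while 10 KB documents take roughly 2 years, which is where a time component in the key starts to look worthwhile.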
@maxal1917 hope Andrew's response helps, we will now proceed with closing this issue.
@maxal1917 We will now proceed to close this thread. If there are further questions regarding this matter, please reopen it and we will gladly continue the discussion.
I've also got a similar question about storing IoT device time series data in CosmosDB. Thanks for the discussion, but I'm hoping @aliuy can clarify a comment:
"Generally speaking - having higher cardinality partition keys and finer granularity is always preferred.
The extreme case would be to partition by id - which will yield great storage and throughput distribution"
I'm a bit confused by this statement. What "id" are you referring to? If you're saying this is the device ID, then I would think it is definitely NOT the extreme case, because each device is going to have an ever-growing number of time series records (in the millions), which will definitely hit the 10GB limit-per-partition (if you're going Fixed Size). Do you mean "id" as in the individual document ID, which is unique per document? That seems more like the "extreme" case you're referring to, but using the generic term "id" when EVERYTHING has an ID (including the device in this case) is a bit unclear.