We plan to leverage the Change Feed for tracking changes to mission-critical company data. It is unacceptable for any update to the underlying Cosmos collection to be lost.
I assume the Change Feed has the same SLA as Cosmos DB (99.999%) and I understand it is practically impossible for it to break, but what would happen in the hypothetical situation where the collection is updated but no change event is issued? Information on how the Change Feed works internally would also be great.
We plan to use Functions as Change Feed consumers. What is the recommended way to handle the case where the Change Feed emitted an update but the Function was down for some reason?
@crow-ua Thank you for the inquiry about this scenario. We are investigating and will update this issue when we have additional information to share.
@TheovanKraay Thank you for the detailed feedback.
This is being assigned to the content author to evaluate and update the documentation as appropriate.
I did some investigation on the ChangeFeedProcessor and found that feed polling is customizable (defaulting to 5 seconds), and that it is also possible to start from the beginning of the feed.
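For reference, a minimal sketch of what that configuration looks like with the Change Feed Processor builder in the Microsoft.Azure.Cosmos v3 .NET SDK; the container variables, the AccountDocument type, and the processor/instance names are placeholders, not anything from this thread:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class AccountDocument { public string id { get; set; } }   // placeholder document type

public static class ProcessorSetup
{
    public static async Task<ChangeFeedProcessor> StartAsync(Container monitoredContainer, Container leaseContainer)
    {
        ChangeFeedProcessor processor = monitoredContainer
            .GetChangeFeedProcessorBuilder<AccountDocument>(
                processorName: "accountsProcessor",
                onChangesDelegate: (IReadOnlyCollection<AccountDocument> changes, CancellationToken ct) =>
                {
                    Console.WriteLine($"Received {changes.Count} documents.");
                    return Task.CompletedTask;
                })
            .WithInstanceName("consumer1")
            .WithLeaseContainer(leaseContainer)
            .WithPollInterval(TimeSpan.FromSeconds(5))              // polling interval; 5 seconds is the default
            .WithStartTime(DateTime.MinValue.ToUniversalTime())     // effectively "start from the beginning of the feed"
            .Build();

        await processor.StartAsync();
        return processor;
    }
}
```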
This raises another question: how long are changes persisted in the feed? Do we really need Event Hub to store changed documents if we want to access an arbitrary change from the past within some defined window (say, 1 day)? Reading from the change feed at a designated offset looks faster and more cost-effective.
I also ran a small test with 5 instances of a document-generation app plus 5 Azure Functions consuming the change feed of the collection the generators write to (~1k requests, ~150 RU).
I caught this exception once:
[01/31/2020 10:52:26] [consumer3][2020-01-31T12:52:26.6885110+02:00] Received 3 documents.
[01/31/2020 10:52:49] The operation did not complete within the allocated time 00:00:02.9829250 for object cbs2.
[01/31/2020 10:52:49] Executed 'TrackAccountsChanges' (Succeeded, Id=47fd3c16-580c-414d-b768-b5080955ef33)
[01/31/2020 10:52:49] The operation did not complete within the allocated time 00:00:02.9829250 for object cbs2.
Does it indicate those updates were lost?
I ran a load test with dozens of clients that generated ~20k documents. About 30% of requests failed because they exceeded the RU limit, but the consumer Functions did not drop anything. However, even 20 minutes after the load test finished they were still receiving updates. I presume this is by design? Is there any way to track this progress?
I've also noticed that for every 1K RU, a single lease generates around 15-25 RU on top. I guess there is no way to project this number?
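On the question of tracking progress: one option (an assumption on my side, not something confirmed earlier in this thread) is the change feed estimator that ships with the same .NET SDK, which reports how many changes are still pending behind the leases. A minimal sketch with placeholder names; whether it can be pointed at the lease collection created by the Functions trigger depends on matching the lease prefix used there:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class LagMonitor
{
    public static async Task<ChangeFeedProcessor> StartEstimatorAsync(Container monitoredContainer, Container leaseContainer)
    {
        ChangeFeedProcessor estimator = monitoredContainer
            .GetChangeFeedEstimatorBuilder(
                processorName: "accountsProcessor",          // must match the processor that owns the leases
                estimationDelegate: (estimatedPendingChanges, cancellationToken) =>
                {
                    // Rough count of changes not yet handed to the consumer.
                    Console.WriteLine($"Estimated pending changes: {estimatedPendingChanges}");
                    return Task.CompletedTask;
                },
                estimationPeriod: TimeSpan.FromSeconds(30))
            .WithLeaseContainer(leaseContainer)
            .Build();

        await estimator.StartAsync();
        return estimator;
    }
}
```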
@TheovanKraay did you have a chance to look at this thread? Thanks in advance for any additional information.
@crow-ua the change feed log persists forever. Your instinct is correct: the change feed can indeed be used as an alternative to a separate event queue, particularly if Cosmos DB is your only source/target for event sourcing. For the error, which version of Azure Functions are you using? I'm aware that V1 has some issues supporting the change feed processor. Please ensure you use the latest version of Functions. Tagging @ealsur for his superior knowledge of Azure Functions.
@TheovanKraay Good to know. Could you point me to some examples of how to implement that? I have seen code for implementing custom checkpointing logic for the change feed, but not for actually accessing and navigating it (a sketch of the kind of access I mean is after this comment).
As for Functions - I'm using netcoreapp3.1 on Function Runtime Version 3.0.12930.0.
Side note - 35 minutes after the load test all my consumers finished receiving updates. The numbers of received documents do not match, e.g. 14844 documents for function #1, 21232 for #2, 21380 for #3, 21281 for #4, 21323 for #5. I will re-run this test a few more times, but this is a bit strange; I expected all change feed consumers to eventually receive the same number of document updates.
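Regarding the kind of navigation asked about above: a minimal sketch of reading the change feed directly with the pull-model iterator from the Microsoft.Azure.Cosmos SDK (GetChangeFeedIterator, which may require a recent SDK version). The names and the one-day window are illustrative assumptions; AccountDocument is the placeholder POCO from the earlier sketch:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class FeedReader
{
    // Reads changes made during the last day, starting from a wall-clock point in time.
    public static async Task ReadRecentChangesAsync(Container container)
    {
        FeedIterator<AccountDocument> iterator = container.GetChangeFeedIterator<AccountDocument>(
            ChangeFeedStartFrom.Time(DateTime.UtcNow.AddDays(-1)),
            ChangeFeedMode.Incremental);

        while (iterator.HasMoreResults)
        {
            FeedResponse<AccountDocument> page = await iterator.ReadNextAsync();
            if (page.StatusCode == HttpStatusCode.NotModified)
            {
                break;  // caught up: no further changes available right now
            }

            foreach (AccountDocument doc in page)
            {
                Console.WriteLine($"Changed document: {doc.id}");
            }
            // page.ContinuationToken can be persisted to resume from exactly this position later.
        }
    }
}
```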
Ok, there are multiple points here.
@ealsur Thanks for the valuable information. Today I re-ran my tests from scratch and all the consumers received the same number of document updates. Previously I was also doing only inserts and re-creating the leases before running the tests, so maybe something else skewed the results. I also did not manage to reproduce the Function error (the whole processing is just putting each received document into Event Hub).
Assuming we handle all the retries in our custom processor and consume the Change Feed directly in our services, are there any drawbacks other than the additional ~2% RU per consumer per collection? So far we have tried to implement the recommended architecture referenced in a bunch of samples, the Jet.com customer success story, etc., which uses Event Hub as intermediate storage for updated documents (Change Feed -> Azure Function -> Event Hub -> Consumer), but it does not look feasible now. We don't expect millions of requests per second; we just want updates in order and the ability to get an arbitrary update upon request.
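For context, a rough sketch of the pipeline we tested (Change Feed -> Azure Function -> Event Hub), assuming the Functions v3 Cosmos DB trigger and the Event Hubs output binding; the database, collection, hub, and connection-setting names are placeholders (only the function name comes from the logs above):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class TrackAccountsChanges
{
    [FunctionName("TrackAccountsChanges")]
    public static async Task Run(
        [CosmosDBTrigger(
            databaseName: "mydb",
            collectionName: "accounts",
            ConnectionStringSetting = "CosmosConnection",
            LeaseCollectionName = "leases",
            CreateLeaseCollectionIfNotExists = true)] IReadOnlyList<Document> changes,
        [EventHub("account-changes", Connection = "EventHubConnection")] IAsyncCollector<string> events,
        ILogger log)
    {
        foreach (Document document in changes)
        {
            // Forward the raw changed document to Event Hub for downstream consumers.
            await events.AddAsync(document.ToString());
        }
        log.LogInformation($"Received {changes.Count} documents.");
    }
}
```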
If I understand correctly from:
Assuming we handle all the retries in our custom processor and consume the Change Feed directly in our services, are there any drawbacks other than the additional ~2% RU per consumer per collection?
Does this mean you will use your own compute to run the Change Feed Processor instead of Functions? Drawbacks are really subjective; Functions uses the same Change Feed Processor, so the drawback is tied to billing and time. Using your own compute does give you a higher degree of customization at the expense of having to manage and maintain it.
I'm not sure where that ~2% RU per consumer comes from; each instance consuming the Change Feed will consume RUs based on the frequency of operations and the volume of data, so it totally depends on your data size, volume of operations, and data distribution across partitions.
Regarding using Event Hub, I'm not sure why it does not look feasible, care to elaborate?
@ealsur
Using your own compute does give you a higher degree of customization at the expense of having to manage and maintain it.
Right, we may need that for implementing complex error processing. Besides transient errors like network problems, we may have a whole class of data-bound issues since we are querying information from various sources. There is a case where a data consumer would need to access some "previous" update (one that was already marked as successfully processed). I'm still investigating this, but off the cuff that's not possible with Functions. To rephrase: get a single document from the CF instead of moving the feed cursor back in time and reprocessing everything after it.
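To illustrate the kind of access meant here, a sketch (an assumption about how it could be done with the pull-model iterator, not an existing Functions capability): persist the continuation token alongside each processed batch, then later re-read just that slice of the feed from the stored token instead of rewinding and reprocessing everything. AccountDocument is the placeholder POCO from the earlier sketches:

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ReplayReader
{
    // Re-reads the change feed from a previously stored continuation token and returns the
    // token for the next position in the feed.
    public static async Task<string> ReadFromCheckpointAsync(Container container, string storedToken)
    {
        FeedIterator<AccountDocument> iterator = container.GetChangeFeedIterator<AccountDocument>(
            storedToken == null
                ? ChangeFeedStartFrom.Beginning()
                : ChangeFeedStartFrom.ContinuationToken(storedToken),
            ChangeFeedMode.Incremental);

        FeedResponse<AccountDocument> page = await iterator.ReadNextAsync();
        if (page.StatusCode != HttpStatusCode.NotModified)
        {
            foreach (AccountDocument doc in page)
            {
                // Reprocess just this slice of the feed, e.g. the batch a dead-letter record points at.
            }
        }

        // Store this token with the processed batch so the same position can be revisited later.
        return page.ContinuationToken;
    }
}
```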
Regarding using Event Hub, I'm not sure why it does not look feasible, care to elaborate?
It adds complexity and cost. Initially we thought about it as a reliable event history source where we could potentially locate an arbitrary changed document. That actually works like a charm: I'm persisting just the document's Event Hub offset and partition info, and it's sufficient for dead-letter handling.
We were also thinking about Event Hubs Capture (which adds ~3x throughput costs) to store raw or processed documents. Now, if we can use the CF directly, that looks more robust. Also, if needed, we can still implement our own "event capture" via a Function in literally no time.
My RU numbers are indeed synthetic, but it does not look like there would be much cost saving with a single CF consumer writing to EH and everybody then consuming EH, versus every consumer consuming the CF directly. As @TheovanKraay absolutely nailed above, Cosmos is our only source for events. If we had multiple sources it would make more sense to use EH as aggregate storage.
From a consumer standpoint there is also not much difference between developing a consumer for the Change Feed vs. Event Hub.