Azure-docs: Cosmos DB Change Feed reliability/availability

Created on 29 Jan 2020 · 10 comments · Source: MicrosoftDocs/azure-docs

We plan to leverage the Change Feed for tracking changes in mission-critical company data. It is unacceptable for any updates to the underlying Cosmos collection to be lost.
I assume the Change Feed has the same SLA as Cosmos DB (99.999%), and I understand it is practically impossible for it to break, but what would happen in the hypothetical situation where the collection is updated but no change event is issued? Information on how the Change Feed works internally would also be great.
We plan to use Functions as Change Feed consumers. What is the recommended way to handle the case where the Change Feed sent an update but the Function was down for some reason?
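
For reference, a minimal sketch of such a Functions consumer, assuming the v3 Functions runtime with the Cosmos DB trigger extension (database, collection, and connection-setting names below are hypothetical). The lease collection stores a per-partition checkpoint, which is what lets the Function resume from where it left off once it comes back up:

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class TrackAccountsChanges
{
    [FunctionName("TrackAccountsChanges")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "mydb",                      // hypothetical database/collection names
            collectionName: "accounts",
            ConnectionStringSetting = "CosmosDBConnection",
            LeaseCollectionName = "leases",            // per-partition checkpoint state lives here
            CreateLeaseCollectionIfNotExists = true)]
        IReadOnlyList<Document> changes,
        ILogger log)
    {
        // Each invocation receives a batch of changed documents from the feed.
        foreach (Document doc in changes)
        {
            log.LogInformation("Change received for document {Id}", doc.Id);
        }
    }
}
```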



Pri1 assigned-to-author cosmos-dsvc product-question triaged

All 10 comments

@crow-ua Thank you for the inquiry about this scenario. We are investigating and will update this issue when we have additional information to share.

@TheovanKraay Thank you for the detailed feedback.
This is being assigned to the content author to evaluate and update the documentation as appropriate.

I did some investigation on the ChangeFeedProcessor and found that feed polling is customizable (it defaults to 5 seconds), and that it is also possible to start from the beginning of the feed.
This raises another question: how long are changes persisted in the feed? Do we really need Event Hub to store changed documents if we want to access an arbitrary change from the past within some defined window (say, 1 day)? Reading from the change feed at a designated offset looks faster and more cost-effective.
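
For reference, those two knobs (the poll interval and the starting position) map to options on the change feed processor builder; a minimal sketch assuming the v3 .NET SDK, with placeholder names and a no-op handler:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class Program
{
    public static async Task Main()
    {
        CosmosClient client = new CosmosClient("<connection-string>");   // placeholder
        Container monitored = client.GetContainer("mydb", "accounts");   // hypothetical names
        Container leases = client.GetContainer("mydb", "leases");

        ChangeFeedProcessor processor = monitored
            .GetChangeFeedProcessorBuilder<dynamic>("myProcessor",
                (IReadOnlyCollection<dynamic> changes, CancellationToken token) => Task.CompletedTask)
            .WithInstanceName("consumer1")
            .WithLeaseContainer(leases)
            .WithPollInterval(TimeSpan.FromSeconds(5))            // defaults to 5 seconds
            .WithStartTime(DateTime.MinValue.ToUniversalTime())   // read from the beginning of the feed
            .Build();

        await processor.StartAsync();
    }
}
```
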
I also ran a small test with 5 instances of a document-generation app plus 5 Azure Functions consuming the change feed of the collection the generators write to (~1k requests, ~150 RU).
I caught this exception once:
[01/31/2020 10:52:26] [consumer3][2020-01-31T12:52:26.6885110+02:00] Received 3 documents.
[01/31/2020 10:52:49] The operation did not complete within the allocated time 00:00:02.9829250 for object cbs2.
[01/31/2020 10:52:49] Executed 'TrackAccountsChanges' (Succeeded, Id=47fd3c16-580c-414d-b768-b5080955ef33)
[01/31/2020 10:52:49] The operation did not complete within the allocated time 00:00:02.9829250 for object cbs2.
Does it indicate those updates were lost?

I ran a load test with dozens of clients that generated ~20k documents. About 30% of the requests failed after exceeding the RU limit, but the consumer Functions did not drop anything. However, even 20 minutes after the load test finished they were still receiving updates. I presume this is by design? Is there any way to track this progress?
I've also noticed that for every 1K RU on the monitored collection, a single lease generates around 15-25 RU as well. I guess there is no way to project this number?
@TheovanKraay did you have a chance to look at this thread? Thanks in advance for any additional information.
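
On the question above about tracking progress: one option, sketched here under the assumption of the v3 .NET SDK and that the processor name and lease collection match the consumer being measured, is the change feed estimator, which periodically reports an estimate of the changes not yet processed:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ChangeFeedLagMonitor
{
    // Builds an estimator that polls the lease collection and reports how many
    // changes are still pending for the named processor. Database, container, and
    // processor names are placeholders and must match the consumer being measured.
    public static async Task<ChangeFeedProcessor> StartAsync(CosmosClient client)
    {
        Container monitored = client.GetContainer("mydb", "accounts");
        Container leases = client.GetContainer("mydb", "leases");

        ChangeFeedProcessor estimator = monitored
            .GetChangeFeedEstimatorBuilder(
                "myProcessor",
                (long pendingChanges, CancellationToken token) =>
                {
                    Console.WriteLine($"Estimated changes not yet processed: {pendingChanges}");
                    return Task.CompletedTask;
                },
                TimeSpan.FromSeconds(30))   // how often the estimate is reported
            .WithLeaseContainer(leases)
            .Build();

        await estimator.StartAsync();
        return estimator;
    }
}
```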

@crow-ua the change feed log persists forever. Your instinct is correct; the change feed can indeed be used as an alternative to a separate event queue, particularly if Cosmos DB is your only source/target for event sourcing. For the error, which version of Azure Functions are you using? I'm aware V1 has some issues in supporting the change feed processor. Please ensure you use the latest version of Functions. Tagging @ealsur for his superior knowledge of Azure Functions.

@TheovanKraay Good to know. Maybe you can point to some examples of how to implement that? I think I saw code showing how to implement custom checkpointing logic for the change feed, but not how to actually access and navigate it.
As for Functions - I'm using netcoreapp3.1 on Function Runtime Version: 3.0.12930.0.
Side note: 35 minutes after the load test, all my consumers finished receiving updates. The numbers of received documents do not match, e.g. 14844 documents for function #1, 21232 for #2, 21380 for #3, 21281 for #4, 21323 for #5. I will re-run this test a few more times, but this is a little strange. I expected all change feed consumers to eventually receive the same number of document updates.

Ok, there are multiple points here.

  1. See https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#components-of-the-change-feed-processor to understand how each component is used.
  2. The Functions Change Feed Trigger uses the Change Feed Processor underneath; not all of the APIs that the Change Feed Processor has are exposed in Functions.
  3. FeedPollDelay is only used when the Change Feed has no changes; see https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#processing-life-cycle
  4. Checkpointing means updating the stored state with the latest point in time (continuation) that was just successfully processed by your code.
  5. Regarding your sample logs: they indicate that your Function is taking too long to process the changes. See https://docs.microsoft.com/en-us/azure/azure-functions/functions-best-practices#avoid-long-running-functions. The Trigger already sent the changes to your Function; if you fail to process them, or do not correctly handle delivery, that is the Function's responsibility.
  6. If you have multiple Functions that should trigger from the same monitored collection, and each one does different things, see https://docs.microsoft.com/en-us/azure/cosmos-db/how-to-create-multiple-cosmos-db-triggers. If you want to scale out a single Function, you do not need to deploy multiple versions of it; you can either rely on Consumption Plan auto-scale or manually add/remove instances in the Function App Service Plan (depending on your hosting model).
  7. Changes do not get retried. See https://docs.microsoft.com/en-us/azure/cosmos-db/troubleshoot-changefeed-functions#some-changes-are-missing-in-my-trigger. Due to the Functions billing model and to avoid infinite retries/billing, the Trigger reports a batch of changes to your Function, and if your Function fails to process them (an unhandled exception, a bug in your code), the batch won't be retried. You can certainly reprocess changes from the beginning, though: https://docs.microsoft.com/en-us/azure/cosmos-db/troubleshoot-changefeed-functions#need-to-restart-and-re-process-all-the-items-in-my-container-from-the-beginning.
  8. You can always use the Change Feed Processor directly, on the compute of your choice (see the sketch after this list): https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#implementing-the-change-feed-processor. The biggest difference is that when there is an unhandled failure, the Change Feed Processor will retry the same batch of changes. Since in this architecture the billing is your own compute, infinite retries due to a non-transient error do not result in unbounded billing.
  9. The Functions Trigger, like the Change Feed Processor, provides an at-least-once delivery guarantee, with no SLA for the time it takes for those changes to be delivered. The reason is that the time it takes for changes to be picked up depends on user factors (processing speed for past changes, compute size, network latency and location, etc.).
  10. The Change Feed API does not contain ALL the intermediate changes a document receives (in an Update scenario):

    1. The Change Feed Processor / Change Feed Trigger reads changes at time A and checkpoints.

    2. A document is created.

    3. The Change Feed Processor / Change Feed Trigger reads changes at time B, picks up the document, and reports its creation.

    4. The document is updated to version 2.

    5. The document is updated to version 3.

    6. The Change Feed Processor / Change Feed Trigger reads changes at time C; it will pick up the document update, but only version 3.
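
To make points 4, 7, and 8 concrete, here is a hedged sketch (assuming the v3 .NET SDK; all names are placeholders) of hosting the Change Feed Processor on your own compute. If the handler throws, the lease is not checkpointed and the same batch of changes is retried on the next iteration, which is the key behavioral difference from the Functions Trigger:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class AccountsChangeFeedHost
{
    public static async Task<ChangeFeedProcessor> StartAsync(CosmosClient client)
    {
        Container monitored = client.GetContainer("mydb", "accounts");   // hypothetical names
        Container leases = client.GetContainer("mydb", "leases");

        ChangeFeedProcessor processor = monitored
            .GetChangeFeedProcessorBuilder<dynamic>("accountsProcessor", HandleChangesAsync)
            .WithInstanceName(Environment.MachineName)   // unique per host instance for scale-out
            .WithLeaseContainer(leases)
            .Build();

        await processor.StartAsync();
        return processor;
    }

    private static async Task HandleChangesAsync(
        IReadOnlyCollection<dynamic> changes, CancellationToken cancellationToken)
    {
        foreach (dynamic change in changes)
        {
            // Process the change here. An unhandled exception prevents the
            // checkpoint, so the whole batch is delivered again (at-least-once).
            await Task.CompletedTask;
        }
        // When this method returns successfully, the processor checkpoints the
        // continuation for this lease (point 4 above).
    }
}
```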

@ealsur Thanks for the valuable information. Today I re-ran my tests from scratch and all the consumers received the same number of document updates. Previously I was doing only inserts and re-creating leases before running the tests, so maybe something else skewed the results. Also, I did not manage to reproduce the Function error (the whole processing is just putting each received document into Event Hub).
Given the assumption that we'll handle all the retries in our custom processor and consume the Change Feed directly in our services, are there any drawbacks other than the additional ~2% RU added per consumer per collection? So far we have tried to follow the recommended architecture referenced in a bunch of samples, the Jet.com customer success story, etc., which uses Event Hub as intermediate storage for updated documents (Change Feed -> Azure Function -> Event Hub -> Consumer), but it does not look feasible now. We don't expect millions of requests per second; we just want updates in order and the ability to get an arbitrary update upon request.

If I understand correctly from:

Given the assumption that we'll handle all the retries in our custom processor and consume the Change Feed directly in our services, are there any drawbacks other than the additional ~2% RU added per consumer per collection?

This means you will use your own compute to run the Change Feed Processor instead of Functions? Drawbacks are really subjective; Functions uses the same Change Feed Processor, so the drawback is tied to billing and time. Using your own compute does give you a higher degree of customization at the expense of having to manage and maintain it.

I'm not sure where that ~2% RU per consumer comes from; each instance consuming the Change Feed will consume RUs based on the frequency of operations and the volume of data, so it totally depends on your data size, volume of operations, and data distribution across partitions.

Regarding using Event Hub, I'm not sure why it does not look feasible, care to elaborate?

@ealsur

Using your own compute does give you a higher degree of customization at the expense of having to manage and maintain it.

Right, we may need it for implementing complex error processing. Besides transient errors like network problems, we may have a whole class of data-bound issues, since we are querying information from various sources. There is a case where a data consumer would need to access some "previous" update (one that was already marked as successfully processed). I'm currently still investigating this, but off the cuff that's not possible with Functions. To rephrase: get a single document from the Change Feed instead of moving the feed cursor back in time and reprocessing everything after it.
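
One option to explore for that kind of point-in-time lookup is the change feed pull model added in later versions of the v3 .NET SDK (it was still in preview at the time of this thread). It reads the feed on demand from a chosen start time, independently of any lease-based cursor; a hedged sketch with placeholder names:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ChangeFeedLookup
{
    // Reads the change feed on demand, starting from a given point in time,
    // without affecting the leases used by the main processor or trigger.
    public static async Task ReadChangesSinceAsync(Container monitored, DateTime sinceUtc)
    {
        FeedIterator<dynamic> iterator = monitored.GetChangeFeedIterator<dynamic>(
            ChangeFeedStartFrom.Time(sinceUtc),
            ChangeFeedMode.Incremental);

        while (iterator.HasMoreResults)
        {
            FeedResponse<dynamic> response = await iterator.ReadNextAsync();
            if (response.StatusCode == HttpStatusCode.NotModified)
            {
                // Caught up with the feed; stop here (or wait and poll again).
                break;
            }

            foreach (dynamic item in response)
            {
                Console.WriteLine($"Changed document: {item}");
            }
        }
    }
}
```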

Regarding using Event Hub, I'm not sure why it does not look feasible, care to elaborate?

It adds complexity and cost. Initially we thought about it as a reliable event-history source where we could potentially locate an arbitrary changed document. That actually works like a charm; I'm persisting just the document's Event Hub offset and partition info, and that's sufficient for dead-letter handling.
We were also thinking about Event Hubs Capture (an additional ~3x throughput cost) to store raw or processed documents. Now, if we can use the Change Feed directly, it looks more robust. Also, if needed, we can still build our own "event capture" via a Function in literally no time.
My numbers on RUs are indeed synthetic, but it does not look like there would be much cost saving in having a single Change Feed consumer write to Event Hub and everybody consuming Event Hub vs. every consumer consuming the Change Feed directly. As @TheovanKraay absolutely nailed above, Cosmos is our only source for events. If we had multiple sources, it would make more sense to use Event Hub as aggregate storage.
From a consumer standpoint, there is also not much difference between developing a consumer for the Change Feed vs. Event Hub.
