Azure-functions-durable-extension: Question about durable functions performance and table storage latency

Created on 20 Aug 2019 · 13 Comments · Source: Azure/azure-functions-durable-extension

Hello there,

I'm currently in the process of migrating a bunch of Azure Functions to use the Durable extension for handling background processing, failures and retries (we originally used custom logic based on queues).

While studying the documentation I realized that Durable Functions makes heavy use of Azure Table storage to keep state. My concern is that, since there seems to be no upper bound on Table storage latency, should I expect to see high variability in performance?

How are people dealing with this?

thanks

question

Most helpful comment

Anecdotally, I'm running a Durable Function App that focuses on high throughput and low latency, and I've never noticed any wild differences other than the odd transient dip in performance (which is to be expected of any shared-tenancy storage technology).

Do your customers/stakeholders have an SLA on your product/system? If not, try to define what is acceptable and then work backwards from that.

What I've done is use App Insights to emit Custom Events before and after each Activity trigger. This gives me a way to observe the throughput and look for downward performance trends over time. It has been useful when diagnosing issues, letting me place blame on the vendor rather than my own code and trigger investigations. Example: https://github.com/Azure/azure-functions-durable-extension/issues/779
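A minimal sketch of the before/after instrumentation described above, using only the Python standard library. The `emit` callback is a stand-in for whatever telemetry client you use; the event names, the `timed_activity` helper and the `percentile` helper are hypothetical and not part of Durable Functions or App Insights.

```python
import math
import time
from contextlib import contextmanager
from typing import Callable, Dict, List

# Hypothetical event shape; a real app would forward this to its
# telemetry client instead of a plain callback.
Event = Dict[str, object]


@contextmanager
def timed_activity(name: str, emit: Callable[[Event], None]):
    """Emit a 'started' event, run the body, then emit a 'completed'
    event carrying the elapsed wall-clock time in milliseconds."""
    emit({"event": "ActivityStarted", "activity": name})
    start = time.perf_counter()
    succeeded = False
    try:
        yield
        succeeded = True
    finally:
        # Runs on success and on failure, so every start has a matching end.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        emit({
            "event": "ActivityCompleted",
            "activity": name,
            "succeeded": succeeded,
            "duration_ms": elapsed_ms,
        })


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, for comparing observed latency to a target."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]
```

Wrapping the body of an activity in `with timed_activity("MyActivity", emit):` would give paired events whose `duration_ms` values can be tracked over time and compared, via `percentile`, against whatever latency target you have agreed on.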

All 13 comments

Thanks for sharing your experience @olitomlinson - I'm working on an open source consumer facing product, so we have set our SLAs based on the kind of user experience we want to provide to our users (i.e. a fast and responsive service).

One of our fundamental choices was to rely on Cosmos DB for storing data instead of simpler solutions like Table storage - a choice made more for the guaranteed performance than for the added functionality.

I understand that Table storage is "usually" fast, and we haven't had any issue with it so far - my question, I suppose, is more about what could happen when, for instance, Table storage latency spikes to tens of seconds. How will Durable Functions behave? How will that impact the performance of the other functions running in the same App Service?

Honestly, it's more about peace of mind than anything else. Not having an "official" upper bound on Table storage latency concerns me a bit - but maybe it's just me 😸

Have you considered using Cosmos instead of Table storage to get bounded latency?

I haven't tried this, but it's an idea that came to mind while reading your post.

It's theoretically possible, though I'm not sure how you would specify two connection strings on the TaskHub: one for tables and one for queues.

@ElanHasson I haven't thought about this. Yes, in theory Cosmos provides a Table storage API, so it should be possible - it's probably not a big change to make. I'll look into the code, thanks for the idea!

Please do let us know how that turns out!

Using cosmos can also open up other opportunities for events around orchestrations and activities via the change feed.

For example, you could watch the change feed for completed or failed instances and run any retry logic, or even raise events that a particular instance ID has completed, instead of polling with the HTTP API.
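As a rough illustration of that idea, here is a stdlib-only sketch of a handler that receives a batch of changed documents and reacts to instances that have reached a terminal state. The field names `instanceId` and `runtimeStatus`, the status strings, and the `dispatch_changes` function are assumptions made for this example; the actual instance-table schema and the change feed wiring are not shown and may differ.

```python
from typing import Callable, Dict, Iterable, List

# Assumed terminal statuses; adjust to whatever the real records contain.
TERMINAL_STATUSES = {"Completed", "Failed"}


def dispatch_changes(
    changes: Iterable[Dict[str, str]],
    handlers: Dict[str, Callable[[str], None]],
) -> List[str]:
    """Call the handler registered for each terminal status and return the
    instance IDs that were dispatched. Non-terminal changes are ignored."""
    dispatched: List[str] = []
    for doc in changes:
        status = doc.get("runtimeStatus")
        instance_id = doc.get("instanceId")
        if status not in TERMINAL_STATUSES or instance_id is None:
            continue
        handler = handlers.get(status)
        if handler is not None:
            handler(instance_id)
            dispatched.append(instance_id)
    return dispatched
```

A "Failed" handler could enqueue a retry and a "Completed" handler could notify a waiting client, which is the push-style alternative to polling that the comment describes.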

I don't think it's as easy as you would hope. You would have to create a new Storage Provider in the DurableTask project to support Cosmos.

But then what about the Queues? Azure Storage Queues run off the same data storage technology as Azure Table. So if you are worried about latency of the Tables you also have to worry about the latency of the Queues as well. In which case you now need to build a ServiceBus provider (or something similar) to act as the broker, instead of Azure Storage Queues. More work for you.

If you go down this route, your dependencies are spread across two separate technologies (Cosmos DB and Service Bus). You need both to be up and alive, and your throughput/availability is always going to be limited by the weaker of the two.
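The point about depending on two services can be made concrete with a little arithmetic. This is a sketch under the assumption that the dependencies fail independently; the numbers in the usage note are illustrative only and are not published SLAs for any service.

```python
from typing import Sequence


def serial_availability(availabilities: Sequence[float]) -> float:
    """Availability of a system that needs every dependency up at the same
    time, assuming independent failures: the product of the parts."""
    result = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= a
    return result


def pipeline_throughput(rates_per_second: Sequence[float]) -> float:
    """Sustained throughput of stages in series: the slowest stage wins."""
    if not rates_per_second:
        raise ValueError("at least one stage is required")
    return min(rates_per_second)
```

For example, two dependencies that are each available 99.9% of the time give `serial_availability([0.999, 0.999])`, which is about 0.998, slightly worse than either one alone. With a single dependency there is nothing to multiply, which is the "all or nothing" argument for staying on one storage technology.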

Using Azure Storage is, in my opinion, the best option - with just one dependency, it's a simple all-or-nothing proposition.

Btw there is a Redis DurableTask provider, but it is still in early development and has many limitations.


My rationale for adopting Durable Functions was that buying into the Framework now will allow me to ship faster to my customers (greenfield project, with existing 35,000 customers migrating to it).

As our proposition matures and gets more traction over the next few years, we will benefit from the advancements of Azure Storage improvements and alternative Storage Providers as DF matures.


Also, the Service Fabric team are currently working on a ReliableCollections provider for DurableTask. So if you are after low-latency and dedicated VM level throughput assurances, you might want to track that work in progress. Obviously this isn't an exact like for like with the Durable Functions Framework, but it is something worth considering.

I'm not sure I agree with you on this. I'm proposing mitigation of the unbounded latency introduced by storage by moving the table aspect of storage to cosmos.

Am I missing something in that you can't just change the connection string for an azure tables implementation and point it to cosmos without any work?

Queues will have to remain in as-is. I'm only suggesting we should have a way to set the queue connection string independently of the table connection string.

While we don't have a "drop in" solution to queues, at least we can solve half of the latency equation
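To make the proposal concrete, here is a purely hypothetical sketch of what "a queue connection string independent of the table connection string" could look like. None of these names (`TaskHubConnections`, `table_connection`, `queue_connection`) are real Durable Functions or DurableTask settings; this only illustrates the fallback behaviour being suggested.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskHubConnections:
    """Hypothetical settings object, not an actual extension option."""
    default_connection: str
    table_connection: Optional[str] = None  # e.g. a Cosmos Table API endpoint
    queue_connection: Optional[str] = None  # e.g. a regular storage account

    def for_tables(self) -> str:
        # Fall back to the shared connection when no override is given.
        return self.table_connection or self.default_connection

    def for_queues(self) -> str:
        return self.queue_connection or self.default_connection
```

With no overrides, both accessors return the same connection, matching today's single-connection behaviour; setting only `table_connection` would move the tables elsewhere while queues stay as-is.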

Sorry, I didn't know you could use the existing Azure Storage SDK with CosmosDb! That's pretty neat!

If what you're saying is true, you will still have to do some work in DurableTask, to use a different connection string and then pull that package into a new build of Durable Functions Extension.

@cgillum is it that simple?

No worries @olitomlinson 😁

There is probably a little bit of work but it's much less than implementing a brand new provider.

Thanks for all the answers, I'll definitely try and look into it!

I really like this idea to use Cosmos DB with Service Bus.

The issue I faced with Storage Account:
We created a Durable Entity Function that was triggered ~1,000 times per minute. Each trigger updated the state of the Durable Entity, which resulted in a lot of transactions against the Storage Account. This was reflected in the bill we received.
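A back-of-the-envelope sketch of why that volume shows up on the bill. The number of storage transactions per trigger and the price per 10,000 transactions are parameters you would have to fill in from your own metrics and the current pricing page; the function names are made up for this example.

```python
def monthly_storage_transactions(
    triggers_per_minute: int,
    transactions_per_trigger: int,
    days: int = 30,
) -> int:
    """Total storage transactions in a month for a steady trigger rate.
    transactions_per_trigger is an assumption you must measure yourself."""
    minutes = 60 * 24 * days
    return triggers_per_minute * minutes * transactions_per_trigger


def monthly_cost(transactions: int, price_per_10k: float) -> float:
    """Cost for a given volume at an assumed price per 10,000 transactions."""
    return transactions / 10_000 * price_per_10k
```

At 1,000 triggers per minute, `monthly_storage_transactions(1000, 1)` is 43,200,000 triggers in a 30-day month before counting the several queue and table operations each one may cause, so the total scales linearly with whatever the real per-trigger figure turns out to be.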

With Cosmos DB I could have capped the bill by provisioning a sane number of RUs. In this solution we already had a Service Bus (Standard) namespace, so we could have reused it. We needed ordered delivery, which is not supported by Storage Account Queues.

Another benefit of having a Cosmos DB Table behind the scenes is that I could query the table not just by partition key and row key, but with more complex queries as well, possibly to supply a UI with data/statistics, etc.
Cosmos Table indexes other properties as well, in contrast to Table storage, which indexes only PartitionKey and RowKey.

So if I'm paying roughly $300 for a storage solution, I would expect to be able to query it without having to copy the data to some other storage solution.

> Btw there is a Redis DurableTask provider, but is still in early development and it has many limitations.

@olitomlinson Yes, I found it after coming to the idea on my own that a Redis provider would be ideal for improving latency/performance, especially with durable entity functions. Sadly, it was barely a husk and offers almost zero support for anything useful. I was also told there are no plans to finish it (/cc @cgillum). I guess Microsoft offering an effective Redis provider would make storage accounts redundant in many cases, cutting off a nice, relatively unbounded revenue stream, among other things.
