Azure-docs: Storage failover performance

Created on 21 Dec 2020 · 2 comments · Source: MicrosoftDocs/azure-docs

In reviewing the article https://docs.microsoft.com/en-us/azure/storage/common/storage-disaster-recovery-guidance, it states that failover can typically take up to an hour. Why is the storage account so slow to fail over once initiated? I would expect to be able to read/write to an RA-GRS account much faster, since all that needs to be done is to change DNS and make the secondary writable. I am assuming the time is driven by the time it takes to replicate storage to LRS; are there any plans to improve the failover time in the future? By the way, my testing validated the long time required: approximately 20 minutes for about 100 MB of blob storage.
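For context on the read side of that expectation: an RA-GRS account already exposes a read-only secondary endpoint with no failover at all; failover is only needed to make the secondary writable. The secondary endpoint is just the primary hostname with a "-secondary" suffix on the account name (sketch below; "myaccount" is a placeholder, not a real account):

```python
def secondary_endpoint(primary_url: str) -> str:
    """Derive the RA-GRS read-only secondary endpoint from the primary one.

    RA-GRS exposes reads at "<account>-secondary.blob.core.windows.net".
    """
    scheme, rest = primary_url.split("://", 1)
    account, domain = rest.split(".", 1)
    return f"{scheme}://{account}-secondary.{domain}"

print(secondary_endpoint("https://myaccount.blob.core.windows.net"))
# https://myaccount-secondary.blob.core.windows.net
```

A client pointed at that URL (e.g. a BlobServiceClient from azure-storage-blob) can keep serving reads during a primary-region outage, which covers the read half of the question without any failover.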

-Johnny Galindo


Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

commosubsvc cxp product-question storagsvc triaged

All 2 comments

Thanks for the feedback! We are currently investigating and will update you shortly.

@byte-master

An account failover usually involves some data loss. It's important to understand the implications of initiating an account failover.

Data loss can occur because replication is asynchronous, so any writes that haven’t been synchronized yet may be lost. This makes it critical not to initiate a failover too quickly. Short-term outages are often mitigated within minutes (if not seconds) and don’t involve data loss, since local transactions are synchronous and asynchronous replication is simply delayed, not lost. It’s only when the system is completely non-recoverable (e.g. if a meteor strike took out a data center) that we really want to invoke failover and accept the potential for data loss (in the meteor-strike case, the data is clearly lost). That’s why we quote an hour: if we switch too fast, we risk losing data that we otherwise wouldn’t lose.
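One way to gauge that exposure before deciding to fail over is to compare the account’s geo-replication Last Sync Time against the current time: writes committed after that timestamp may be lost. A minimal sketch (the timestamps below are made up for illustration; in practice the last sync time would come from the service, e.g. the blob service stats API):

```python
from datetime import datetime, timedelta, timezone

def data_loss_window(last_sync_time: datetime, now: datetime) -> timedelta:
    """Writes committed after last_sync_time may be lost if we fail over now."""
    return now - last_sync_time

# Example values (assumed); in practice, fetch the last sync time from the
# service, e.g. via BlobServiceClient.get_service_stats() in azure-storage-blob.
now = datetime(2020, 12, 21, 12, 5, tzinfo=timezone.utc)
last_sync = datetime(2020, 12, 21, 12, 0, tzinfo=timezone.utc)
print(data_loss_window(last_sync, now))  # 0:05:00
```

If this window is small and the outage looks transient, waiting is usually the better trade than failing over immediately.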

When a customer manually fails over, and there is no underlying Azure failure, the system may wait for transactions to drain out to minimize data loss.

Switching makes the account LRS (which doesn’t require copying any data) because we have to stop replicating data. Maintaining replication would require reversing the replication direction and, in the outage case, would mean trying to replicate to a storage stamp that is no longer available, so we wouldn’t be able to meet any redundancy guarantees (which is why re-establishing GRS is a manual operation). Applications that were counting on that redundancy, or using region-specific URLs, would break, and costs would be incurred in re-establishing GRS. Until GRS is re-established, applications would (at best) be operating in a degraded mode (without the GRS backup).
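The resulting flow can be summarized as a simple state transition (a rough sketch of the behavior described above, not the actual service logic; only the SKU names are real Azure values):

```python
def fail_over(sku: str) -> str:
    """Customer-initiated account failover leaves the account locally redundant."""
    geo_skus = {"Standard_GRS", "Standard_RAGRS"}
    if sku not in geo_skus:
        raise ValueError(f"account failover does not apply to {sku}")
    return "Standard_LRS"

def reenable_geo_redundancy(sku: str, target: str = "Standard_GRS") -> str:
    """Manual step after failover; kicks off a full copy to the new secondary."""
    if sku != "Standard_LRS":
        raise ValueError("account is not in the post-failover LRS state")
    return target

sku = fail_over("Standard_RAGRS")
print(sku)                           # Standard_LRS
print(reenable_geo_redundancy(sku))  # Standard_GRS
```

The second step is the expensive one: it is where the full copy to a new secondary region happens, which is why its cost scales with account size while the failover itself does not.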

The size of the storage account isn’t important to failover; the data is already replicated. It’s only the data in flight that’s a concern, and in a true, full outage, it’s either there or it’s not (which is why data loss is a risk). (Size of data is important when re-establishing GRS, since the data must be copied to a new region.)

All this can make GRS failover a costly (and even slightly risky) operation. It’s designed to minimize data loss and disruption to a business in the case of major outages. It’s not optimal for high availability except as a defense-in-depth strategy when paired with other technologies (e.g. GZRS, which adds synchronous zonal replication; Cosmos DB with multiple write regions; or other synchronization technologies).

Which, again, is why we don’t want to fail over too quickly: we don’t want to create extra work and expense for the customer, risk breaking applications, or put them into a non-replicating state if we don’t absolutely have to.
