While investigating another issue I stumbled upon a lot of these 409s coming from renewing the primary host lock. I did some investigation and I believe that, because we do a 'fire and forget' with the host disposal, it's possible that a new host comes up and acquires the lock just before the old host releases it. That means that the next attempt to renew fails, as the lock has been released.
The error looks like: Singleton lock renewal failed for blob 'locks/AppName/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2017-09-02T18:38:31.526Z (111006 milliseconds ago) with a duration of 57508 milliseconds. The lease period was 15000 milliseconds.
I think it's happening with something like below.
Note 1 -- this is all on a single instance and references to hosts are about the hosts being cycled via the ScriptHostManager.
Note 2 -- All of the host locks use the current machine id as the proposed lease ID, which allows multiple hosts on the same machine to modify the lease.
Host1 -> acquire lock
Host1 -> renew lock
something triggers host recycle and creates a new host while disposing of the old host
Host2 -> acquire lock
Host1 -> release lock
Host2 -> renew lock -> throw
It's actually possible for several other combinations to happen (I've seen a race where Host1 renews right as it releases, also causing a throw).
Ultimately, these errors seem benign, as things eventually recover and all will be well, but they cause a lot of errors in the logs (I see 100k+ 409 logs involving the host lock over the last week).
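To make that sequence concrete, here's a rough sketch of the shared-lease-ID behaviour, assuming the legacy Microsoft.WindowsAzure.Storage SDK and the local storage emulator. This is not the actual host code -- the container/blob names and the lease ID are illustrative:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class LeaseRaceSketch
{
    static async Task Main()
    {
        var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        var container = account.CreateCloudBlobClient().GetContainerReference("locks");
        await container.CreateIfNotExistsAsync();

        var blob = container.GetBlockBlobReference("AppName/host");
        if (!await blob.ExistsAsync())
        {
            await blob.UploadTextAsync(string.Empty);
        }

        // Both "hosts" use the same proposed lease ID (standing in for the machine/instance ID).
        string sharedLeaseId = Guid.NewGuid().ToString();

        // Host1 -> acquire lock, renew lock.
        await blob.AcquireLeaseAsync(TimeSpan.FromSeconds(15), sharedLeaseId);
        await blob.RenewLeaseAsync(AccessCondition.GenerateLeaseCondition(sharedLeaseId));

        // Host recycle: Host2 -> acquire lock. This succeeds because acquiring with the
        // same proposed lease ID as the active lease is treated as a renewal.
        await blob.AcquireLeaseAsync(TimeSpan.FromSeconds(15), sharedLeaseId);

        // Host1's fire-and-forget disposal -> release lock.
        await blob.ReleaseLeaseAsync(AccessCondition.GenerateLeaseCondition(sharedLeaseId));

        // Host2 -> renew lock -> throws 409 LeaseIdMismatchWithLeaseOperation,
        // because the lease was released out from under it.
        await blob.RenewLeaseAsync(AccessCondition.GenerateLeaseCondition(sharedLeaseId));
    }
}
```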
For now, the workaround is to add a settings.job file with
{ "is_singleton": true }
It will not scale, but at least it will not fail.
@brettsam You said "acquires the lock just before the old host releases it" - how can that happen? The lease can only be held by one host at a time. Perhaps what you're saying is something like:
However, PrimaryHostCoordinator.AcquireOrRenewLeaseAsync seems to already handle this case correctly - if it doesn't have the lease and fails to acquire it, no error is thrown.
@mathewc, the lease can be held by multiple hosts inside a single instance, as we use the same leaseId (instanceId) to acquire the lease. It's the flow @brettsam mentioned:
Host1 -> acquire lock
Host1 -> renew lock
something triggers host recycle and creates a new host while disposing of the old host
Host2 -> acquire lock
Host1 -> release lock
Host2 -> renew lock -> throw
I see, thanks for clarifying. If the problem is that the renewal fails because we already have the lock then can't we handle that condition by ignoring that specific error?
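Roughly what I have in mind, as an illustrative sketch only (not the actual PrimaryHostCoordinator code -- the method, blob, and leaseId names are placeholders), against the legacy Microsoft.WindowsAzure.Storage SDK:

```csharp
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class LeaseRenewalSketch
{
    public static async Task RenewIgnoringOwnLeaseMismatchAsync(CloudBlockBlob lockBlob, string leaseId)
    {
        try
        {
            await lockBlob.RenewLeaseAsync(AccessCondition.GenerateLeaseCondition(leaseId));
        }
        catch (StorageException ex)
            when (ex.RequestInformation?.HttpStatusCode == 409 &&
                  ex.RequestInformation?.ExtendedErrorInformation?.ErrorCode == "LeaseIdMismatchWithLeaseOperation")
        {
            // The lease is already held (or was just released) under the same proposed lease ID
            // by another host on this instance, so treat the renewal as a no-op instead of
            // logging an error; the next renewal/acquisition attempt will recover.
        }
    }
}
```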
Work has been merged in 1.x. Need to make sure this isn't an issue in 2.0.
Hi @mathewc @fabiocav, are there any plans to fix this issue on Function App v1, or is the only solution to move to v2?
It's not clear whether this issue exists in v2 or not; however, because this issue is so old and we haven't heard any complaints for v2, it likely doesn't. @itshemant09 are you seeing lots of these? The errors are benign, so they shouldn't cause functional issues for you - just log noise.
Same issue in v2 java function app.
Singleton lock renewal failed for blob 'xxx/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 0001-01-01T00:00:00Z (-2147483648 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.
Runtime version: 2.0.12507.0 (beta)
I am also getting this error in a v2 C# function app.
Singleton lock renewal failed for blob 'xxx/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2019-09-14T04:18:13.801Z (40380 milliseconds ago) with a duration of 7 milliseconds. The lease period was 15000 milliseconds.
Is there something we should be doing to prevent this? Or should we ignore this error?
I'm getting this error but it is causing my application to crash with an unhandled exception so I cannot ignore it.
We're also seeing this error. Any help would be great.
10/01/2019T08:21:52.000 -0500
Host lock lease acquired by instance ID 'a6e320715da18613c2cfb13182ae9d01'.
10/01/2019T08:22:00.000 -0500
Singleton lock renewal failed for blob 'aqinn1234-masterentitydata/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2019-10-01T13:21:35.637Z (24167 milliseconds ago) with a duration of 8 milliseconds. The lease period was 15000 milliseconds.
10/01/2019T08:22:00.000 -0500
Failed to renew host lock lease: Another host has acquired the lease. The last successful renewal completed at 2019-10-01T13:21:35.637Z (24449 milliseconds ago) with a duration of 8 milliseconds.
These logs are mostly noise and shouldn't be causing any crashes. It just means that the host lease was lost temporarily and it will be re-acquired shortly.
@hisuwh if you can share your exact error message with us, that would be helpful.
For anyone seeing this, if you could share your Function App name (either explicitly or privately), I can see if I find anything from the backend logs.
@brettsam This is the error I'm getting:
[09:44:18 ERR] Singleton lock renewal failed for blob '6aaf581a4d74d37580b1ebda440a1171/Company.Application.AppProcess.CleanupFunction.Cleanup.Listener' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2019-10-02T09:41:14.402Z (184439 milliseconds ago) with a duration of 26 milliseconds. The lease period was 60000 milliseconds.
Unhandled Exception: Microsoft.WindowsAzure.Storage.StorageException: The lease ID specified did not match the lease ID for the blob.
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
at Microsoft.Azure.WebJobs.Host.StorageBaseDistributedLockManager.SingletonLockHandle.RenewAsync(ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host.Storage\Singleton\BlobLeaseDistributedLockManager.cs:line 347
at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.ExecuteAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Singleton\SingletonManager.cs:line 332
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.RunAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\TaskSeriesTimer.cs:line 147
at System.Threading.Thread.ThreadMain_ThreadStart()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
I'm not actually running in Functions but using the Azure WebJobs SDK, running in Kubernetes. Still, as you say, I wouldn't expect this error to kill my process, but it is.
@hisuwh -- That's a different issue (note that the blob name is /...Listener, rather than /host). There's an old, long issue about that here: https://github.com/Azure/azure-webjobs-sdk/issues/822. For that issue, there's no real fix we can make without a significant architectural change to run some of our host operations in a separate process. The cause is almost always functions blocking threads, which prevents our lease renewal timers from firing, so the leases expire unexpectedly. My recommendation for that issue is to evaluate your function code for potential threading issues -- make sure it's using async everywhere possible and not blocking any threads.
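For illustration, this is the kind of pattern to look for -- a hypothetical timer-triggered function (the names, trigger schedule, and URL are made up), first with a blocking call that can starve the renewal timers, then rewritten to await:

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class CleanupFunctions
{
    private static readonly HttpClient HttpClient = new HttpClient();

    // Problematic: .Result blocks a thread-pool thread for the whole call, and enough of
    // these can keep the host's lease-renewal timers from firing on time.
    [FunctionName("CleanupBlocking")]
    public static void Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
    {
        string payload = HttpClient.GetStringAsync("https://example.com/cleanup").Result;
        log.LogInformation("Fetched {Length} characters", payload.Length);
    }

    // Better: awaiting frees the thread while the call is in flight.
    [FunctionName("CleanupAsync")]
    public static async Task RunAsync([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
    {
        string payload = await HttpClient.GetStringAsync("https://example.com/cleanup");
        log.LogInformation("Fetched {Length} characters", payload.Length);
    }
}
```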
@brettsam it's crashing at a point in time when none of the functions are actually firing, so I'm not sure how reviewing my code is going to help this issue.
I've pinged you on the other thread since your issue is not with the primary host lock. Let's continue there so we don't confuse this one.
I'm seeing this error with a long-running, scaled-up set of the latest 2.0 azure node functions runtime docker image using a blob trigger:
Singleton lock renewal failed for blob '***-functions/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2019-10-27T16:12:10.853Z (17048 milliseconds ago) with a duration of 4 milliseconds. The lease period was 15000 milliseconds.
Just after seeing this message, blob trigger processing of new blobs stops. Restarting 1 of the 8 functions containers did not cause processing to resume, though the restarted container did not issue any error messages, about locks or otherwise. Picking another 3 of the 8 functions containers, effectively at random, and restarting them caused new blob processing to resume.
Related to Azure/azure-webjobs-sdk-extensions#249 or #4784?
Anyone have any idea what may be causing this? We're still seeing it. It appears to occur pretty consistently and once we see the error, processing in our 6 concurrently running containers stops, usually for many hours.
I am also getting this error in a v1 C# function app.
Singleton lock renewal failed for blob 'locks/bingnrthawkeyealertingtest/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2019-11-18T02:35:14.847Z (12014 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.
I followed this troubleshooting guide, but the error is still there.
Is there something we should be doing to prevent this? Any insight you have would be greatly appreciated, thanks in advance for your help!
@XimenaWang -- Is this causing issues? The "primary host" should be able to switch between different instances and I don't believe you'll have any issues. See my comment above: https://github.com/Azure/azure-functions-host/issues/1864#issuecomment-537067586.
I believe @derekrprice is seeing something unrelated that is causing the polling for blobs to deadlock (we're following up elsewhere).
@brettsam
We are seeing the same error when using ServiceBus triggered AZ functions (v2) in the consumption tier (EAST US).
Singleton lock renewal failed for blob 'xx/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2019-11-18T21:32:58.201Z (11992 milliseconds ago) with a duration of 10 milliseconds. The lease period was 15000 milliseconds.
Not sure if this is of significance, but we see some signs that it occurs after the functions are done processing the messages on the queue.
Any lease renewals that end with /host are likely nothing to worry about. Instances can swap between "primary" hosts without affecting the running functions. Let me know if you see anything that may be leading to failures and I can investigate.
We've been seeing this error logged in our function apps (C# v2) fairly regularly. It doesn't appear to cause any issues, but assuming there is nothing to worry about it's just unhelpful noise. Are there any plans to 'fix' this or change the logging level so that it doesn't appear as an error?
Adding on to the issues people are seeing: I am still on v1 and just had a function crash because of it. The issue is not benign on v1.
Singleton lock renewal failed for blob 'xx/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-01-14T16:30:56.506Z (16435 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.
The remote server returned an error: (409) Conflict. The remote server returned an error: (409) Conflict.
An unhandled exception has occurred. Host is shutting down.
Any lease renewals that end with /host are likely nothing to worry about. Instances can swap between "primary" hosts without affecting the running functions. Let me know if you see anything that may be leading to failures and I can investigate.
Do you mean like "Singleton lock renewal failed for blob 'func-test-dev-001/host'"? I am seeing this after I added a deployment slot.
I also see this from time to time and would like to keep my errors to 0.
It doesn't happen often, but I have 46 function apps running, so across all of them it's constant. I think it happens during releases and slot swaps.
And one function app cannot be swapped from Azure DevOps for some reason and this error is happening most often for that function app. I am not sure about causality.
But I think there is cause for getting to the bottom of this error; even if it does seem benign, it could be causing unexpected side effects.
@brettsam
Any lease renewals that end with /host are likely nothing to worry about. Instances can swap between "primary" hosts without affecting the running functions. Let me know if you see anything that may be leading to failures and I can investigate.
2020-04-15 09:35:27,550 Host.Singleton RD00155DE250FA ERROR Singleton lock renewal failed for blob 'fr911prodadapterprocessorfunc/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-04-15T09:34:52.747Z (34714 milliseconds ago) with a duration of 140 milliseconds. The lease period was 15000 milliseconds.
Overnight I saw this error a number of times, but it caused enough of a delay (5-10mins!) in consumption of ServiceBus messages that it triggered several application-level errors for us as processing these messages in a timely fashion is critical. Currently these are v2 functions with ServiceBus triggers deployed to an AppService that is scaled out on 2-N instances.
Paul
Disregard my previous - after further investigation these errors were just coincidental to another network issue outside of Azure.
I'm also seeing this error. This is the most recent instance:
Singleton lock renewal failed for blob 'myfunctionhere/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-04-26T09:50:57.017Z (18690 milliseconds ago) with a duration of 4 milliseconds. The lease period was 15000 milliseconds
I'm running on Functions v2 and all of the functions inside this app are HTTP triggered and short living.
I agree with the other people in this thread. If this isn't an issue, it shouldn't be logged as an error. Errors are something that we need to look at and not simply ignore (IMO).
Idea! Maybe use structured logging properties for the 2020-04-26T09:50:57.017Z (18690 milliseconds ago) part of the message. This would make it a lot easier for people to ignore errors of this type 👍
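For example, a rough sketch of what that could look like with ILogger message templates (the method and variable names here are made up, not the host's actual logging code) -- the timestamp and age become properties rather than being baked into the message text:

```csharp
using System;
using Microsoft.Extensions.Logging;

public static class SingletonLogging
{
    public static void LogRenewalFailure(ILogger logger, string blobName, string errorCode,
        DateTimeOffset lastRenewal, double ageMs, double leasePeriodMs)
    {
        // The template stays constant, so every occurrence shares one message signature in
        // Application Insights; the changing values are attached as custom properties.
        logger.LogWarning(
            "Singleton lock renewal failed for blob {BlobName} with error code 409: {ErrorCode}. " +
            "The last successful renewal completed at {LastRenewal} ({AgeMs} ms ago). " +
            "The lease period was {LeasePeriodMs} ms.",
            blobName, errorCode, lastRenewal, ageMs, leasePeriodMs);
    }
}
```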
I have a web app running Azure Functions, and I am getting this every time I restart the web app or swap slots.
2020-09-15T10:01:12Z [Error] Singleton lock renewal failed for blob '.../host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-09-15T10:00:47.566Z (24432 milliseconds ago) with a duration of 3 milliseconds. The lease period was 15000 milliseconds.
@wpitallo any other side effect? This may happen if the primary host changes and the lease is stolen by another instance.
Not that I am aware of at this stage.
Same issue here with an Azure Functions (~3) timer-triggered function:
Singleton lock renewal failed for blob '.../host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 0001-01-01T00:00:00Z (-2147483648 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.
The last successful lock time on this is weird. It doesn't seem to cause problems, it just fills Application Insights with sev 3 errors.
Hi All
I am facing same issue in my Azure Function app.
It seems like the real problem is that in non-prod environments we use a single blob storage account shared by multiple Azure function apps. When one of the function apps performs the renew-lease operation, the sharing causes the "Singleton lock renewal failed for blob" error, which in turn leads to a restart of the job host, so the Azure function stops abruptly without any function error.
I base this conclusion on my investigation: our production environment doesn't share a single blob storage account across multiple function apps, and we have never seen the "Singleton lock renewal failed for blob" error in that environment.
So, as per my understanding, the solution to this error is to avoid sharing a single blob storage account across multiple Azure function apps.
TIA
I can confirm that this happens in environments with individual storage accounts for each function app as well. I always create a new storage account for every new function app I create. Sharing storage accounts introduces some other problems, which is why I wouldn't recommend sharing storage accounts between function apps, ever 👍
I'll be following this, since we also see this problem with a consumption plan Azure Function App