Azure-webjobs-sdk: Unhandled Exception: Microsoft.WindowsAzure.Storage.StorageException: The client could not finish the operation within

Created on 23 Jun 2016  路  19Comments  路  Source: Azure/azure-webjobs-sdk

my continuous web job with queue is able to complete multiple messages in when run in local, but is able to process only one message when deployed in azure from visual studio. if the queue receives multiple messages it fails with above error
could any one please help me
below are the logs
sampleResults.txt

Most helpful comment

The error of "The client could not finish the operation within specified timeout" comes from Azure Storage SDK.

// Microsoft.WindowsAzure.Storage.Core.Util.Exceptions
internal static StorageException GenerateTimeoutException(RequestResult res, Exception inner)
{
    if (res != null)
    {
        res.HttpStatusCode = 408;
    }
    TimeoutException timeoutEx = new TimeoutException("The client could not finish the operation within specified timeout.", inner);
    return new StorageException(res, timeoutEx.Message, timeoutEx)
    {
        IsRetryable = false
    };
}

Per Azure Storage client's design, if a storage operation cannot finished within 5 mins, this exception will be raised from its func code about checking timeout.

// Microsoft.WindowsAzure.Storage.Shared.Protocol.Constants
/// <summary>
/// Default client side timeout for all service clients.
/// </summary>
public static readonly TimeSpan DefaultClientSideTimeout = TimeSpan.FromMinutes(5.0);

While Microsoft.WindowsAzure.Storage is right the component by which Webjobs SDK communicates with Azure Storage.

Here is my repro as below.( @mathewc, I'd also mail you the whole project with my Access Key, by which you could simply repro the issue by just double clicking)

1. Create an Azure Storage account, and create 4 queues(say, testjob1,testjob2,testjob3,testjob4, and the corresponding poison queues)

2. Insert a huge number(I used 5000) of messages into each queue. (A must, to simulate the long-time full load)

3. Create an .net(c#) console application. Add the references to 
    * Microsoft.Azure.WebJobs SDK, ver 2.0.0.0
    * Microsoft.WindowsAzure.Storage, ver 8.1.1.0
   and set the Storage Access key (for AzureWebJobsDashboard,AzureWebJobsStorage) to the storage created in step 1. 

4. In Main(), set JobHostConfiguration.Queues.BatchSize = 4, and then JobHost.RunAndBlock().

5. Create 4 queue processing functions for each Queues. Keeping the function body empty is OK. 

    public async Task ProcessQueueMessage(
            [QueueTrigger("testjob1")] string jobConfig,
            [Queue("testjob1-poison")]ICollector<string> failedJobQueue)
        {

        }

    public async Task ProcessQueueMessage(
            [QueueTrigger("testjob2")] string jobConfig,
            [Queue("testjob2-poison")]ICollector<string> failedJobQueue)
        {

        }

    public async Task ProcessQueueMessage(
            [QueueTrigger("testjob3")] string jobConfig,
            [Queue("testjob3-poison")]ICollector<string> failedJobQueue)
        {

        }

    public async Task ProcessQueueMessage(
            [QueueTrigger("testjob4")] string jobConfig,
            [Queue("testjob4-poison")]ICollector<string> failedJobQueue)
        {

        }

6. Compile and run the ConsoleApplication on your box locally. No need to deploye to Azure App Service.  

7. Wait for approximate 10mins or Longer (if your testing box is a workstation edition Windows). If server edition, might need to wait even longe).

8. With a relative big probability(say >80%), the application will crash with the TimeoutException above.

After the TimeoutException occurred, I attached a debugger into the process, and saw there were dozens (60 in this debugging, sometimes 80+ or 100+ or even more) of Azure Storage APM objects at the moment of the crash

0:011>!mex.grep APM !dumpheap -stat
00007ff7e99c2730       60         2880 Microsoft.WindowsAzure.Storage.Core.Util.APMWithTimeout
0:011> !DumpHeap /d -mt 00007ff7e99c2730
         Address               MT     Size
00000224b4b7a8e8 00007ff7e99c2730       48     
00000224b4b7d2e0 00007ff7e99c2730       48     
00000224b4b7f2c8 00007ff7e99c2730       48     
00000224b4b83020 00007ff7e99c2730       48     
00000224b4b85e98 00007ff7e99c2730       48     
00000224b4b8b608 00007ff7e99c2730       48     
00000224b4b8e268 00007ff7e99c2730       48     
00000224b4b91bc8 00007ff7e99c2730       48     
00000224b4b938b0 00007ff7e99c2730       48     
00000224b4b97458 00007ff7e99c2730       48     
00000224b50863e8 00007ff7e99c2730       48    
00000224b5089aa0 00007ff7e99c2730       48          
00000224b508c5a0 00007ff7e99c2730       48     
00000224b508ee88 00007ff7e99c2730       48     
00000224b5091860 00007ff7e99c2730       48     
00000224b50f5c38 00007ff7e99c2730       48     
00000224b5110e48 00007ff7e99c2730       48     
00000224b5112800 00007ff7e99c2730       48     
00000224b5113bf8 00007ff7e99c2730       48     
00000224b51172d0 00007ff7e99c2730       48     
00000224b511ee30 00007ff7e99c2730       48     
00000224b512d360 00007ff7e99c2730       48     
00000224b5132fc8 00007ff7e99c2730       48     
00000224b5135c08 00007ff7e99c2730       48     
00000224b5175280 00007ff7e99c2730       48     
00000224b51756f0 00007ff7e99c2730       48     
00000224b517f008 00007ff7e99c2730       48     
00000224b51a6e38 00007ff7e99c2730       48     
00000224b51bdbd0 00007ff7e99c2730       48     
00000224b51c4a18 00007ff7e99c2730       48     
00000224b51c76b0 00007ff7e99c2730       48     
00000224b51cc678 00007ff7e99c2730       48     
00000224b51cfd60 00007ff7e99c2730       48     
00000224b51d4ca0 00007ff7e99c2730       48     
00000224b51f93a8 00007ff7e99c2730       48     
00000224b51fc078 00007ff7e99c2730       48     
00000224b521da70 00007ff7e99c2730       48     
00000224b52a0c00 00007ff7e99c2730       48     
00000224b52a3bb0 00007ff7e99c2730       48     
00000224b52a64d8 00007ff7e99c2730       48     
00000224b52a7d80 00007ff7e99c2730       48     
00000224b52df520 00007ff7e99c2730       48     
00000224b52e5048 00007ff7e99c2730       48     
00000224b52e71b0 00007ff7e99c2730       48     
00000224b52ec380 00007ff7e99c2730       48     
00000224b52ee5c0 00007ff7e99c2730       48     
00000224b52f2fd0 00007ff7e99c2730       48     
00000224b52f4960 00007ff7e99c2730       48     
00000224b5320820 00007ff7e99c2730       48     
00000224b5339850 00007ff7e99c2730       48     
00000224b534f300 00007ff7e99c2730       48     
00000224b53ee7a8 00007ff7e99c2730       48     
00000224b5581b30 00007ff7e99c2730       48     
00000224b568a0c8 00007ff7e99c2730       48     
00000224b5794470 00007ff7e99c2730       48     
00000224b579a180 00007ff7e99c2730       48     
00000224b57c4fe8 00007ff7e99c2730       48     
00000224b57edca0 00007ff7e99c2730       48     
00000224b583f8a8 00007ff7e99c2730       48     
00000224b5851ab8 00007ff7e99c2730       48

Which shows, there were lots of unfinished async requests to Azure Storage.

Checked further, and saw most of them pointing to blob, e.g.:

http://webapplogs.blob.core.chinacloudapi.cn:80/azure-webjobs-hosts/output-logs/927c30d35cf143689bef9103d6374d28.txt
http://webapplogs.blob.core.chinacloudapi.cn:80/azure-jobs-host-output/2519009635007243405_1b447ea596614a859ea4efa835479e80

By checking the code, Microsoft.WindowsAzure.Storage client accesses Azure Storage DC via System.Net.HttpWebRequest. As we know, for a Windows Desktop based .net application, if uses httpWebRequest, by default, there were only 2 http channels allowed to a host (or 10 for Windows Server edition). So, suppose the timeoutException came from: those 60+ async Azure Storage operations were waiting for http connection, but the only 2(or 10) http connections were too busy to handle all of them timely, finally, after one of the async storage request waited for 300s , it triggered that timeoutException.

As a test, I added the codes below into Main(), without changing anything other

ThreadPool.SetMinThreads(100, 100);
ServicePointManager.DefaultConnectionLimit = 100;

Then, the exception never raised.

Since users are expecting to simply use the Webjobs SDK, and those underlying behaviors are more or less out of users' control, personally, is it possible that, add some code into WebJobsSDK/or AzureStorageClient, which monitors the process or component level azure storage operations time-taken and dynamically set the Service Point connection limit?

All 19 comments

What version of the SDK are you using, and what version of Azure Storage SDK?

Many thanks for reply Mathew
below are the details
WebJob SDK version 1.1.2
Storage version 7.0.0

Having the same issue here. Same vresions of Webjob SDK and Storage lib. I consistently get this error, usually around 7 minutes into the execution of the webjob. Clarification, I consistently get this error for a particular queue message (one which generates a larger/longer to process load).

Have looked at various timeout app setting options (also various attributes such as "Timeout") but with no luck.

Any advice on how to stop this timeout?

Do you have these same issues if you downgrade your storage version to 6.2.0? Note that the Storage SDK version that the SDK targets (and is tested against) is 4.3 (a really old version). In the 1.1.2 release, we removed the version cap to allow people to move to later versions, however that is at your own discretion. We have run our full test passes successfully against 6.2.0, but there are known breaking changes in 7.0.0 on some codepaths.

Please reopen this if you are still experiencing issues on version 6.2.0.

Hi Mathew
I have changed the version and deployed it was working fine for 4 to 5 queue messages but was getting error if submitted more. below is the error
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at LPTWebJob.Functions.ProcessQueueMessage(String message, TextWriter log)

I also tried downgrading to 6.2.0... no luck then 4.3.0 still no luck. Same exception as before :/

Hey Ken, do you have3 any idea how to re-open this issue? If so please let me know so that we can re-open this am not sure if Mathew gets to know abt our issue

Can you please provide me with a constrained/simplified repro app that exhibits the behavior (e.g. a simple console project with job function and host that I can run)? If you could get a repro together and share it in a repo, that would help us figure this out as quickly as possible. Thanks :)

Hi Mathew,

I have shared the web job code and the details in mail. Please look into issue and help us. Along with the old logs I have shared , I was able to capture logs for when I have repro the issue. PFA
logs_12072016.txt

Was there ever a solution to this? I'm still getting this error message on my webjob. Seems to be a problem when a webjob run for more than 10 mins.

Same or very similar problem here
Microsoft.WindowsAzure.Storage.StorageException: Microsoft.WindowsAzure.Storage.StorageException:
The client could not finish the operation within specified timeout. --->
System.TimeoutException: The client could not finish the operation within specified timeout.
--- End of inner exception stack trace ---
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsyncT
in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
at Microsoft.WindowsAzure.Storage.Blob.CloudBlobContainer.EndCreateIfNotExists(IAsyncResult asyncResult)
in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Blob\CloudBlobContainer.cs:line 329
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass11.<CreateCallback>b__0(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 66 --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.Azure.WebJobs.Host.Protocols.PersistentQueueWriter1.d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.Loggers.CompositeFunctionInstanceLogger.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.d__13.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.d__10.MoveNext() Request Information RequestID: RequestDate: StatusMessage:

net: 461
jobs : 2.0.0
Storage 8.1.3

Is there a solution to this? Seeing the same issue myself.

Can confirm this is happening for us too. The WebJob is not logging _anything_ to the storage/log so this seems to be an internal SDK issue. We're not even using Queues so really have no idea where CloudQueue.cs comes in:

[07/17/2017 15:44:34 > e63840: SYS INFO] WebJob is still running
[07/17/2017 21:08:27 > e63840: ERR ] 
[07/17/2017 21:08:27 > e63840: ERR ] Unhandled Exception: Microsoft.WindowsAzure.Storage.StorageException: The client could not finish the operation within specified timeout. ---> System.TimeoutException: The client could not finish the operation within specified timeout.
[07/17/2017 21:08:27 > e63840: ERR ]    --- End of inner exception stack trace ---
[07/17/2017 21:08:27 > e63840: ERR ]    at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
[07/17/2017 21:08:27 > e63840: ERR ]    at Microsoft.WindowsAzure.Storage.Queue.CloudQueue.EndExists(IAsyncResult asyncResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Queue\CloudQueue.cs:line 1027
[07/17/2017 21:08:27 > e63840: ERR ]    at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass1`1.<CreateCallback>b__0(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 66
[07/17/2017 21:08:27 > e63840: ERR ] --- End of stack trace from previous location where exception was thrown ---
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
[07/17/2017 21:08:27 > e63840: ERR ]    at Microsoft.Azure.WebJobs.Host.Queues.Listeners.QueueListener.<ExecuteAsync>d__4.MoveNext()
[07/17/2017 21:08:27 > e63840: ERR ] --- End of stack trace from previous location where exception was thrown ---
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
[07/17/2017 21:08:27 > e63840: ERR ]    at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.<RunAsync>d__d.MoveNext()
[07/17/2017 21:08:27 > e63840: ERR ] --- End of stack trace from previous location where exception was thrown ---
[07/17/2017 21:08:27 > e63840: ERR ]    at Microsoft.Azure.WebJobs.Host.Timers.BackgroundExceptionDispatcher.<>c__DisplayClass1.<Throw>b__0()
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
[07/17/2017 21:08:27 > e63840: ERR ]    at System.Threading.ThreadHelper.ThreadStart()
[07/17/2017 21:08:27 > e63840: ERR ] 
[07/17/2017 21:08:28 > e63840: ERR ] Unhandled Exception: 
[07/17/2017 21:08:28 > e63840: SYS ERR ] Job failed due to exit code -532462766
[07/17/2017 21:08:28 > e63840: SYS INFO] Process went down, waiting for 0 seconds
[07/17/2017 21:08:28 > e63840: SYS INFO] Status changed to PendingRestart

EDIT: This thread seems to suggest that this should be fixed since 2.0.0 and we're on 1.x, will update and see if occurs again.

EDIT2: It is still occurring with version 2.0.0

The error of "The client could not finish the operation within specified timeout" comes from Azure Storage SDK.

// Microsoft.WindowsAzure.Storage.Core.Util.Exceptions
internal static StorageException GenerateTimeoutException(RequestResult res, Exception inner)
{
    if (res != null)
    {
        res.HttpStatusCode = 408;
    }
    TimeoutException timeoutEx = new TimeoutException("The client could not finish the operation within specified timeout.", inner);
    return new StorageException(res, timeoutEx.Message, timeoutEx)
    {
        IsRetryable = false
    };
}

Per Azure Storage client's design, if a storage operation cannot finished within 5 mins, this exception will be raised from its func code about checking timeout.

// Microsoft.WindowsAzure.Storage.Shared.Protocol.Constants
/// <summary>
/// Default client side timeout for all service clients.
/// </summary>
public static readonly TimeSpan DefaultClientSideTimeout = TimeSpan.FromMinutes(5.0);

While Microsoft.WindowsAzure.Storage is right the component by which Webjobs SDK communicates with Azure Storage.

Here is my repro as below.( @mathewc, I'd also mail you the whole project with my Access Key, by which you could simply repro the issue by just double clicking)

1. Create an Azure Storage account, and create 4 queues(say, testjob1,testjob2,testjob3,testjob4, and the corresponding poison queues)

2. Insert a huge number(I used 5000) of messages into each queue. (A must, to simulate the long-time full load)

3. Create an .net(c#) console application. Add the references to 
    * Microsoft.Azure.WebJobs SDK, ver 2.0.0.0
    * Microsoft.WindowsAzure.Storage, ver 8.1.1.0
   and set the Storage Access key (for AzureWebJobsDashboard,AzureWebJobsStorage) to the storage created in step 1. 

4. In Main(), set JobHostConfiguration.Queues.BatchSize = 4, and then JobHost.RunAndBlock().

5. Create 4 queue processing functions for each Queues. Keeping the function body empty is OK. 

    public async Task ProcessQueueMessage(
            [QueueTrigger("testjob1")] string jobConfig,
            [Queue("testjob1-poison")]ICollector<string> failedJobQueue)
        {

        }

    public async Task ProcessQueueMessage(
            [QueueTrigger("testjob2")] string jobConfig,
            [Queue("testjob2-poison")]ICollector<string> failedJobQueue)
        {

        }

    public async Task ProcessQueueMessage(
            [QueueTrigger("testjob3")] string jobConfig,
            [Queue("testjob3-poison")]ICollector<string> failedJobQueue)
        {

        }

    public async Task ProcessQueueMessage(
            [QueueTrigger("testjob4")] string jobConfig,
            [Queue("testjob4-poison")]ICollector<string> failedJobQueue)
        {

        }

6. Compile and run the ConsoleApplication on your box locally. No need to deploye to Azure App Service.  

7. Wait for approximate 10mins or Longer (if your testing box is a workstation edition Windows). If server edition, might need to wait even longe).

8. With a relative big probability(say >80%), the application will crash with the TimeoutException above.

After the TimeoutException occurred, I attached a debugger into the process, and saw there were dozens (60 in this debugging, sometimes 80+ or 100+ or even more) of Azure Storage APM objects at the moment of the crash

0:011>!mex.grep APM !dumpheap -stat
00007ff7e99c2730       60         2880 Microsoft.WindowsAzure.Storage.Core.Util.APMWithTimeout
0:011> !DumpHeap /d -mt 00007ff7e99c2730
         Address               MT     Size
00000224b4b7a8e8 00007ff7e99c2730       48     
00000224b4b7d2e0 00007ff7e99c2730       48     
00000224b4b7f2c8 00007ff7e99c2730       48     
00000224b4b83020 00007ff7e99c2730       48     
00000224b4b85e98 00007ff7e99c2730       48     
00000224b4b8b608 00007ff7e99c2730       48     
00000224b4b8e268 00007ff7e99c2730       48     
00000224b4b91bc8 00007ff7e99c2730       48     
00000224b4b938b0 00007ff7e99c2730       48     
00000224b4b97458 00007ff7e99c2730       48     
00000224b50863e8 00007ff7e99c2730       48    
00000224b5089aa0 00007ff7e99c2730       48          
00000224b508c5a0 00007ff7e99c2730       48     
00000224b508ee88 00007ff7e99c2730       48     
00000224b5091860 00007ff7e99c2730       48     
00000224b50f5c38 00007ff7e99c2730       48     
00000224b5110e48 00007ff7e99c2730       48     
00000224b5112800 00007ff7e99c2730       48     
00000224b5113bf8 00007ff7e99c2730       48     
00000224b51172d0 00007ff7e99c2730       48     
00000224b511ee30 00007ff7e99c2730       48     
00000224b512d360 00007ff7e99c2730       48     
00000224b5132fc8 00007ff7e99c2730       48     
00000224b5135c08 00007ff7e99c2730       48     
00000224b5175280 00007ff7e99c2730       48     
00000224b51756f0 00007ff7e99c2730       48     
00000224b517f008 00007ff7e99c2730       48     
00000224b51a6e38 00007ff7e99c2730       48     
00000224b51bdbd0 00007ff7e99c2730       48     
00000224b51c4a18 00007ff7e99c2730       48     
00000224b51c76b0 00007ff7e99c2730       48     
00000224b51cc678 00007ff7e99c2730       48     
00000224b51cfd60 00007ff7e99c2730       48     
00000224b51d4ca0 00007ff7e99c2730       48     
00000224b51f93a8 00007ff7e99c2730       48     
00000224b51fc078 00007ff7e99c2730       48     
00000224b521da70 00007ff7e99c2730       48     
00000224b52a0c00 00007ff7e99c2730       48     
00000224b52a3bb0 00007ff7e99c2730       48     
00000224b52a64d8 00007ff7e99c2730       48     
00000224b52a7d80 00007ff7e99c2730       48     
00000224b52df520 00007ff7e99c2730       48     
00000224b52e5048 00007ff7e99c2730       48     
00000224b52e71b0 00007ff7e99c2730       48     
00000224b52ec380 00007ff7e99c2730       48     
00000224b52ee5c0 00007ff7e99c2730       48     
00000224b52f2fd0 00007ff7e99c2730       48     
00000224b52f4960 00007ff7e99c2730       48     
00000224b5320820 00007ff7e99c2730       48     
00000224b5339850 00007ff7e99c2730       48     
00000224b534f300 00007ff7e99c2730       48     
00000224b53ee7a8 00007ff7e99c2730       48     
00000224b5581b30 00007ff7e99c2730       48     
00000224b568a0c8 00007ff7e99c2730       48     
00000224b5794470 00007ff7e99c2730       48     
00000224b579a180 00007ff7e99c2730       48     
00000224b57c4fe8 00007ff7e99c2730       48     
00000224b57edca0 00007ff7e99c2730       48     
00000224b583f8a8 00007ff7e99c2730       48     
00000224b5851ab8 00007ff7e99c2730       48

Which shows, there were lots of unfinished async requests to Azure Storage.

Checked further, and saw most of them pointing to blob, e.g.:

http://webapplogs.blob.core.chinacloudapi.cn:80/azure-webjobs-hosts/output-logs/927c30d35cf143689bef9103d6374d28.txt
http://webapplogs.blob.core.chinacloudapi.cn:80/azure-jobs-host-output/2519009635007243405_1b447ea596614a859ea4efa835479e80

By checking the code, Microsoft.WindowsAzure.Storage client accesses Azure Storage DC via System.Net.HttpWebRequest. As we know, for a Windows Desktop based .net application, if uses httpWebRequest, by default, there were only 2 http channels allowed to a host (or 10 for Windows Server edition). So, suppose the timeoutException came from: those 60+ async Azure Storage operations were waiting for http connection, but the only 2(or 10) http connections were too busy to handle all of them timely, finally, after one of the async storage request waited for 300s , it triggered that timeoutException.

As a test, I added the codes below into Main(), without changing anything other

ThreadPool.SetMinThreads(100, 100);
ServicePointManager.DefaultConnectionLimit = 100;

Then, the exception never raised.

Since users are expecting to simply use the Webjobs SDK, and those underlying behaviors are more or less out of users' control, personally, is it possible that, add some code into WebJobsSDK/or AzureStorageClient, which monitors the process or component level azure storage operations time-taken and dynamically set the Service Point connection limit?

@bz0108 Respect to your research and solution, worked great and improved the job execution speed 10x!

We are encountering this issue as well. We are currently using Microsoft.Azure.WebJobs v2.0.0.
Following the suggestion to increase the minimum number of threads the thread pool creates, and the default connection limit did not work for us. Would changing our Web Job to an Azure Function App help us in any way? Or will there be a fix for this coming in the near future?

@bz0108 Thanks for the detailed analysis. We actually identified this ourselves in the context of Azure Functions (built on this SDK) and put in place code in that runtime to configure min threads automatically (see code here). That makes sense to do in Azure Functions because its a managed environment in the sense that the customer doesn't own host startup/management. However, in WebJobs scenarios the customer owns the host process and that's the place where such configuration should take place.

All that said, I agree with you that if this is a common thing that needs configuration in order to run successfully at scale, we should consider automatically configuring a default.

Assigning back to triage for discussion.

So in the context of azure functions, we as the service owners own main() and it makes sense that we would change these values. But for webjobs, the user still owns main(). We're just libraries. Its a bad pattern for the libraries to transparently set these values that apply for the entire process.

I updated my main function with below code

// I have set DefaultConnectionLimit = Int32.MaxValue based on recommendation from below link
// https://docs.microsoft.com/en-us/azure/app-service/webjobs-sdk-how-to#jobhost-servicepointmanager-settings

System.Net.ServicePointManager.DefaultConnectionLimit = Int32.MaxValue;
System.Threading.ThreadPool.SetMinThreads(100, 100);

and got this error

Microsoft.WindowsAzure.Storage.StorageException: There were not enough free threads in the ThreadPool to complete the operation.
---> System.InvalidOperationException: There were not enough free threads in the ThreadPool to complete the operation.
at System.Net.HttpWebRequest.BeginGetResponse(AsyncCallback callback, Object state)
at Microsoft.WindowsAzure.Storage.Core.Util.APMWithTimeout.Begin(Func3 beginMethod, AsyncCallback callback, Object state, TimeSpan timeout) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\APMWithTimeout.cs:line 60 at Microsoft.WindowsAzure.Storage.Core.Util.APMWithTimeout.RunWithTimeout(Func3 beginMethod, AsyncCallback callback, TimerCallback timeoutCallback, Object state, TimeSpan timeout) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\APMWithTimeout.cs:line 35
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.BeginGetResponseT in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 243

--- End of inner exception stack trace ---
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsyncT in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
at Microsoft.WindowsAzure.Storage.Queue.CloudQueue.EndUpdateMessage(IAsyncResult asyncResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\QueueCloudQueue.cs:line 1474
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.b__3(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 114

--- End of stack trace from previous location where exception was thrown


at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.Queues.Listeners.UpdateQueueMessageVisibilityCommand.d__5.MoveNext()

--- End of stack trace from previous location where exception was thrown


at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.d__14.MoveNext()

--- End of stack trace from previous location where exception was thrown


at Microsoft.Azure.WebJobs.Host.Timers.WebJobsExceptionHandler.<>c__DisplayClass3_0.b__0()
at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart() Request Information RequestID: RequestDate: StatusMessage: System.InvalidOperationException: There were not enough free threads in the ThreadPool to complete the operation.
at System.Net.HttpWebRequest.BeginGetResponse(AsyncCallback callback, Object state)
at Microsoft.WindowsAzure.Storage.Core.Util.APMWithTimeout.Begin(Func3 beginMethod, AsyncCallback callback, Object state, TimeSpan timeout) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\APMWithTimeout.cs:line 60 at Microsoft.WindowsAzure.Storage.Core.Util.APMWithTimeout.RunWithTimeout(Func3 beginMethod, AsyncCallback callback, TimerCallback timeoutCallback, Object state, TimeSpan timeout) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\APMWithTimeout.cs:line 35
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.BeginGetResponseT in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 243

Was this page helpful?
0 / 5 - 0 ratings