Azure-webjobs-sdk: WebJob processing of Service Bus queue crashes with high MaxConcurrentCalls

Created on 3 May 2018  Â·  15Comments  Â·  Source: Azure/azure-webjobs-sdk

Simple .NET Core 2.0 WebJob to consume a Service Bus queue, using WebJobs SDK 3.0.0-beta4.
Leaving MaxConcurrentCalls to its default value of 16, when running the WebJob in Azure, on an S3 app service (i.e. 4 core, 7 GB RAM), processing starts, completes a few items off the queue, but then stops/crashes. Running the same program on my local workstation (Xeon E5-1650 @ 3.5GHz, 32 GB RAM) no such problem occurs.
If I reduce MaxConcurrentCalls down to 3 or less, the program will happy continue to process until the queue is empty.

Expected behavior

I would expect the WebJob to continue working even if MaxConcurrentCalls is high.
Whether increasing MaxConcurrentCalls would increase throughput or not is beside the point.

Actual behavior

WebJob appears to stop/crash after processing the first few messages.

Known workarounds

Set MaxConcurrentCalls to 3, 2 or 1.

Related information

Version 3.0.0-beta4.
I've tried to use beta5 but it is missing ErrorTrigger and the WebJob doesn't start up at all.

All 15 comments

After further investigation it appears this is the same problem reported in #1530

Some questions:

  • What are you doing in the function itself?
  • If it's a no-op function, does it work okay with 16 functions running at once?
  • What error do you see when it crashes?

@brettsam to answer your questions:

  1. Nothing - I have a simple repro VS solution where the WebJob function body is empty. Would you like me to provide this repro?
  2. I don't understand. Do you mean setting MaxConcurrentCalls to 16 or having 16 separate named functions?
  3. According to the diagnostic logs I can find in Azure, the process simply exits with a non-zero error code. There is no error message logged in the output.
[05/14/2018 22:23:28 > e8106a: SYS INFO] Run script 'run.cmd' with script host - 'WindowsScriptHost'
[05/14/2018 22:23:28 > e8106a: SYS INFO] Status changed to Running
[05/14/2018 22:23:29 > e8106a: INFO] 
[05/14/2018 22:23:29 > e8106a: INFO] D:\local\Temp\jobs\continuous\Consumer\mhn1hqgu.pbl>dotnet Consumer.dll 
[05/14/2018 22:23:29 > e8106a: INFO] Listening
[05/14/2018 22:23:33 > e8106a: SYS ERR ] Job failed due to exit code -1073610685
[05/14/2018 22:23:33 > e8106a: SYS INFO] Process went down, waiting for 60 seconds
[05/14/2018 22:23:33 > e8106a: SYS INFO] Status changed to PendingRestart
[05/14/2018 22:24:33 > e8106a: SYS INFO] Run script 'run.cmd' with script host - 'WindowsScriptHost'
[05/14/2018 22:24:33 > e8106a: SYS INFO] Status changed to Running
[05/14/2018 22:24:33 > e8106a: INFO] 
[05/14/2018 22:24:33 > e8106a: INFO] D:\local\Temp\jobs\continuous\Consumer\mhn1hqgu.pbl>dotnet Consumer.dll 
[05/14/2018 22:24:34 > e8106a: INFO] Listening
[05/14/2018 22:24:39 > e8106a: SYS ERR ] Job failed due to exit code -1073610685
[05/14/2018 22:24:39 > e8106a: SYS INFO] Process went down, waiting for 60 seconds
[05/14/2018 22:24:39 > e8106a: SYS INFO] Status changed to PendingRestart
[05/14/2018 22:25:40 > e8106a: SYS INFO] Run script 'run.cmd' with script host - 'WindowsScriptHost'
[05/14/2018 22:25:40 > e8106a: SYS INFO] Status changed to Running
[05/14/2018 22:25:40 > e8106a: INFO] 
[05/14/2018 22:25:40 > e8106a: INFO] D:\local\Temp\jobs\continuous\Consumer\mhn1hqgu.pbl>dotnet Consumer.dll 
[05/14/2018 22:25:40 > e8106a: INFO] Listening

Sorry, by "16 functions running at once", I meant having MaxConcurrentCalls at 16. That was a confusing way for me to say it :-)

If you can share the code that repros this, that'd be very useful.

OK — I believe 16 is the default value, which that is what I've tested with.

Repro VS solution:

  • https://github.com/mattbrooks2010/azure-webjobs-sdk-issue-1686
  • The Consumer project is a .NET Core console app implementing a continuous Azure WebJob using Microsoft.Azure.WebJobs.ServiceBus.3.0.0-beta5
  • The Consumer.Function.Run method intentionally does nothing — not even log output. You may wish to include a call to Console.WriteLine(message) or similar if you want to see progress messages in the WebJob log file. It is also intentionally non-async.
  • The Producer project is a .NET Core console app which you can use to generate queue messages for the WebJob above

You'll need:

  • An Azure Service Bus namespace resource with a queue name of testqueue
  • An Azure App Service resource with connection strings for AzureWebJobsDashboard, AzureWebJobsStorage and AzureWebJobsServiceBus*
  • To publish the Consumer app using dotnet publish -c release
  • To deploy the published Consumer app to the App Service above as a continuous WebJob
  • To generate 100 queue messages using dotnet run -c release "Endpoint=sb://my-service-bus.servicebus.windows.net/;SharedAccessKeyName=AppSharedAccess Key;SharedAccessKey=xyz" 100
  • To observe the Azure WebJob log file, e.g. /data/jobs/continuous/Consumer/job_log.txt

_* Note: I needed to use the manage Service Bus connection string, rather than the shared connection string. When I used the latter, the WebJob would sit there and not process any queue messages. I think this is a separate bug._

The current 3.0 build has no logging by default (this is changing soon). For now, try adding this and re-deploy. I've done this and things seem to be running for me:

cfg.LoggerFactory = new LoggerFactory(); cfg.LoggerFactory.AddConsole();

You'll need to add the Microsoft.Extensions.Logging.Console nuget.

Then I enabled Application Logging (Filesystem) on my site, and I could see the streaming logs. The WebJobs dashboard also showed everything running as expected.

It's possible there's some race or other error that I'm just not hitting -- if you add that logging, do you see anything interesting?

Thanks for taking a look at my repro and for the tip about wiring up the console logger. There is nothing unexpected in the output from my point of view. I'll include it below anyway.

It is curious that you're not experiencing the same process termination condition. Please could we try and troubleshoot possible differences?

  • Runtime TFM netcoreapp2.0
  • Workstation .NET Core SDK v2.1.200
  • App Service .NET Core SDK v2.1.300-rc1-008673
  • App Service .NET Core runtime v2.1.0-rc1
  • App Service location UK West
  • App Service plan Standard: 1 Large
  • Service Bus messaging tier Standard
  • How many messages did you queue - could you try 1000?
  • Anything else that could be relevant?
[05/15/2018 18:50:28 > e8106a: SYS INFO] Job directory change detected: Job file 'run.cmd' timestamp differs between source and working directories.
[05/15/2018 18:50:29 > e8106a: SYS INFO] Run script 'run.cmd' with script host - 'WindowsScriptHost'
[05/15/2018 18:50:29 > e8106a: SYS INFO] Status changed to Running
[05/15/2018 18:50:29 > e8106a: INFO] 
[05/15/2018 18:50:29 > e8106a: INFO] D:\local\Temp\jobs\continuous\Consumer\zuhxwl2i.imy>dotnet Consumer.dll 
[05/15/2018 18:50:29 > e8106a: INFO] Listening
[05/15/2018 18:50:32 > e8106a: INFO] info: Host.Startup[0]
[05/15/2018 18:50:32 > e8106a: INFO]       Found the following functions:
[05/15/2018 18:50:32 > e8106a: INFO]       Consumer.Function.Run
[05/15/2018 18:50:32 > e8106a: INFO]       
[05/15/2018 18:50:32 > e8106a: INFO] info: Host.Startup[0]
[05/15/2018 18:50:32 > e8106a: INFO]       Job host started
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=bae411c7-1a25-4492-805e-e820be58aaaf)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=9a6b35ac-3c51-469d-8660-e2c7dd9d4d07)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=974fb94e-92d7-4d48-b146-357a24fa8939)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=587d533c-79c3-4cde-ac1f-80cc9d9c8e46)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=3086ca88-687d-4e72-b607-72604b039738)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=be90685c-e121-4777-b342-ad107f2d6077)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=17aac30c-3259-4ade-9e46-7a6e1ff4d769)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=20e1ee6d-c475-4d82-a2a2-37e642fc41cd)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=a6fd0add-7aaa-48e8-8c4a-7e0875b63012)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=02b06f3b-bc53-4ff6-ad07-1b70814e2d17)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=6caf27c2-e3c4-4da5-bf8a-d3a4d5709f5d)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=c211910f-70aa-4672-8d04-2b31050e4522)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=6abb949f-9631-4d57-8f52-76557112255f)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO]       Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=6772e4cd-f0ce-4893-a886-95ece853d98e)
[05/15/2018 18:51:43 > e8106a: SYS ERR ] Job failed due to exit code -1073610685
[05/15/2018 18:51:43 > e8106a: SYS INFO] Process went down, waiting for 60 seconds
[05/15/2018 18:51:43 > e8106a: SYS INFO] Status changed to PendingRestart

I'm seeing it now. That exit code appears to map to RPC_NT_INTERNAL_ERROR (assuming I'm looking this up correctly: https://errorcodelookup.com/?type=ntstatus&code=C0020043).

The WebJob will be running, and then I notice the process has restarted in the middle of nowhere. I'm not sure where the error is coming from. I'm going to follow up with others.

Thanks for the update — I'm glad you've been able to reproduce this now. Please keep me posted.

Related to #1530.

What if you explicitly turn off dashboard logging with cfg.DashboardConnectionString = null;? I'm suspecting that it's some race or unhandled exception in the dashboard logging code.

Yes! After setting cfg.DashboardConnectionString = null I'm now seeing 1000s of successfully executed functions / messages being processed per second without the process terminating.

Thanks for the confirmation -- now I need to track down where this is happening! I'll update this thread with what I find.

It is probably worth pointing out for the benefit of others, that the documentation for Get started with the WebJobs SDK > Enable console logging recommends the approach of disabling dashboard logging.

The dashboard is a legacy monitoring tool, and dashboard logging is not recommended for high-throughput production scenarios.

Update -- we've identified it as an issue involving the App Service Sandbox. It appears that during cancellation of a blob upload (and possibly in other cancelled HTTP scenarios), there's a chance that some internal API call will be rejected by the sandbox, resulting in an uncatchable exception that brings down the process. It's being investigated.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

nibras85 picture nibras85  Â·  3Comments

fwd079 picture fwd079  Â·  5Comments

scramsby picture scramsby  Â·  3Comments

jansoren picture jansoren  Â·  5Comments

kamranayub picture kamranayub  Â·  4Comments