Simple .NET Core 2.0 WebJob to consume a Service Bus queue, using WebJobs SDK 3.0.0-beta4.
Leaving MaxConcurrentCalls to its default value of 16, when running the WebJob in Azure, on an S3 app service (i.e. 4 core, 7 GB RAM), processing starts, completes a few items off the queue, but then stops/crashes. Running the same program on my local workstation (Xeon E5-1650 @ 3.5GHz, 32 GB RAM) no such problem occurs.
If I reduce MaxConcurrentCalls down to 3 or less, the program will happy continue to process until the queue is empty.
I would expect the WebJob to continue working even if MaxConcurrentCalls is high.
Whether increasing MaxConcurrentCalls would increase throughput or not is beside the point.
WebJob appears to stop/crash after processing the first few messages.
Set MaxConcurrentCalls to 3, 2 or 1.
Version 3.0.0-beta4.
I've tried to use beta5 but it is missing ErrorTrigger and the WebJob doesn't start up at all.
After further investigation it appears this is the same problem reported in #1530
Some questions:
@brettsam to answer your questions:
MaxConcurrentCalls to 16 or having 16 separate named functions?[05/14/2018 22:23:28 > e8106a: SYS INFO] Run script 'run.cmd' with script host - 'WindowsScriptHost'
[05/14/2018 22:23:28 > e8106a: SYS INFO] Status changed to Running
[05/14/2018 22:23:29 > e8106a: INFO]
[05/14/2018 22:23:29 > e8106a: INFO] D:\local\Temp\jobs\continuous\Consumer\mhn1hqgu.pbl>dotnet Consumer.dll
[05/14/2018 22:23:29 > e8106a: INFO] Listening
[05/14/2018 22:23:33 > e8106a: SYS ERR ] Job failed due to exit code -1073610685
[05/14/2018 22:23:33 > e8106a: SYS INFO] Process went down, waiting for 60 seconds
[05/14/2018 22:23:33 > e8106a: SYS INFO] Status changed to PendingRestart
[05/14/2018 22:24:33 > e8106a: SYS INFO] Run script 'run.cmd' with script host - 'WindowsScriptHost'
[05/14/2018 22:24:33 > e8106a: SYS INFO] Status changed to Running
[05/14/2018 22:24:33 > e8106a: INFO]
[05/14/2018 22:24:33 > e8106a: INFO] D:\local\Temp\jobs\continuous\Consumer\mhn1hqgu.pbl>dotnet Consumer.dll
[05/14/2018 22:24:34 > e8106a: INFO] Listening
[05/14/2018 22:24:39 > e8106a: SYS ERR ] Job failed due to exit code -1073610685
[05/14/2018 22:24:39 > e8106a: SYS INFO] Process went down, waiting for 60 seconds
[05/14/2018 22:24:39 > e8106a: SYS INFO] Status changed to PendingRestart
[05/14/2018 22:25:40 > e8106a: SYS INFO] Run script 'run.cmd' with script host - 'WindowsScriptHost'
[05/14/2018 22:25:40 > e8106a: SYS INFO] Status changed to Running
[05/14/2018 22:25:40 > e8106a: INFO]
[05/14/2018 22:25:40 > e8106a: INFO] D:\local\Temp\jobs\continuous\Consumer\mhn1hqgu.pbl>dotnet Consumer.dll
[05/14/2018 22:25:40 > e8106a: INFO] Listening
Sorry, by "16 functions running at once", I meant having MaxConcurrentCalls at 16. That was a confusing way for me to say it :-)
If you can share the code that repros this, that'd be very useful.
OK — I believe 16 is the default value, which that is what I've tested with.
Repro VS solution:
Consumer project is a .NET Core console app implementing a continuous Azure WebJob using Microsoft.Azure.WebJobs.ServiceBus.3.0.0-beta5Consumer.Function.Run method intentionally does nothing — not even log output. You may wish to include a call to Console.WriteLine(message) or similar if you want to see progress messages in the WebJob log file. It is also intentionally non-async. Producer project is a .NET Core console app which you can use to generate queue messages for the WebJob aboveYou'll need:
testqueueAzureWebJobsDashboard, AzureWebJobsStorage and AzureWebJobsServiceBus*Consumer app using dotnet publish -c releaseConsumer app to the App Service above as a continuous WebJobdotnet run -c release "Endpoint=sb://my-service-bus.servicebus.windows.net/;SharedAccessKeyName=AppSharedAccess
Key;SharedAccessKey=xyz" 100/data/jobs/continuous/Consumer/job_log.txt_* Note: I needed to use the manage Service Bus connection string, rather than the shared connection string. When I used the latter, the WebJob would sit there and not process any queue messages. I think this is a separate bug._
The current 3.0 build has no logging by default (this is changing soon). For now, try adding this and re-deploy. I've done this and things seem to be running for me:
cfg.LoggerFactory = new LoggerFactory();
cfg.LoggerFactory.AddConsole();
You'll need to add the Microsoft.Extensions.Logging.Console nuget.
Then I enabled Application Logging (Filesystem) on my site, and I could see the streaming logs. The WebJobs dashboard also showed everything running as expected.
It's possible there's some race or other error that I'm just not hitting -- if you add that logging, do you see anything interesting?
Thanks for taking a look at my repro and for the tip about wiring up the console logger. There is nothing unexpected in the output from my point of view. I'll include it below anyway.
It is curious that you're not experiencing the same process termination condition. Please could we try and troubleshoot possible differences?
netcoreapp2.0[05/15/2018 18:50:28 > e8106a: SYS INFO] Job directory change detected: Job file 'run.cmd' timestamp differs between source and working directories.
[05/15/2018 18:50:29 > e8106a: SYS INFO] Run script 'run.cmd' with script host - 'WindowsScriptHost'
[05/15/2018 18:50:29 > e8106a: SYS INFO] Status changed to Running
[05/15/2018 18:50:29 > e8106a: INFO]
[05/15/2018 18:50:29 > e8106a: INFO] D:\local\Temp\jobs\continuous\Consumer\zuhxwl2i.imy>dotnet Consumer.dll
[05/15/2018 18:50:29 > e8106a: INFO] Listening
[05/15/2018 18:50:32 > e8106a: INFO] info: Host.Startup[0]
[05/15/2018 18:50:32 > e8106a: INFO] Found the following functions:
[05/15/2018 18:50:32 > e8106a: INFO] Consumer.Function.Run
[05/15/2018 18:50:32 > e8106a: INFO]
[05/15/2018 18:50:32 > e8106a: INFO] info: Host.Startup[0]
[05/15/2018 18:50:32 > e8106a: INFO] Job host started
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=bae411c7-1a25-4492-805e-e820be58aaaf)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=9a6b35ac-3c51-469d-8660-e2c7dd9d4d07)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=974fb94e-92d7-4d48-b146-357a24fa8939)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=587d533c-79c3-4cde-ac1f-80cc9d9c8e46)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=3086ca88-687d-4e72-b607-72604b039738)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=be90685c-e121-4777-b342-ad107f2d6077)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=17aac30c-3259-4ade-9e46-7a6e1ff4d769)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=20e1ee6d-c475-4d82-a2a2-37e642fc41cd)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=a6fd0add-7aaa-48e8-8c4a-7e0875b63012)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=02b06f3b-bc53-4ff6-ad07-1b70814e2d17)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=6caf27c2-e3c4-4da5-bf8a-d3a4d5709f5d)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=c211910f-70aa-4672-8d04-2b31050e4522)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=6abb949f-9631-4d57-8f52-76557112255f)
[05/15/2018 18:51:43 > e8106a: INFO] info: Function.Run[0]
[05/15/2018 18:51:43 > e8106a: INFO] Executing 'Function.Run' (Reason='New ServiceBus message detected on 'testqueue'.', Id=6772e4cd-f0ce-4893-a886-95ece853d98e)
[05/15/2018 18:51:43 > e8106a: SYS ERR ] Job failed due to exit code -1073610685
[05/15/2018 18:51:43 > e8106a: SYS INFO] Process went down, waiting for 60 seconds
[05/15/2018 18:51:43 > e8106a: SYS INFO] Status changed to PendingRestart
I'm seeing it now. That exit code appears to map to RPC_NT_INTERNAL_ERROR (assuming I'm looking this up correctly: https://errorcodelookup.com/?type=ntstatus&code=C0020043).
The WebJob will be running, and then I notice the process has restarted in the middle of nowhere. I'm not sure where the error is coming from. I'm going to follow up with others.
Thanks for the update — I'm glad you've been able to reproduce this now. Please keep me posted.
Related to #1530.
What if you explicitly turn off dashboard logging with cfg.DashboardConnectionString = null;? I'm suspecting that it's some race or unhandled exception in the dashboard logging code.
Yes! After setting cfg.DashboardConnectionString = null I'm now seeing 1000s of successfully executed functions / messages being processed per second without the process terminating.
Thanks for the confirmation -- now I need to track down where this is happening! I'll update this thread with what I find.
It is probably worth pointing out for the benefit of others, that the documentation for Get started with the WebJobs SDK > Enable console logging recommends the approach of disabling dashboard logging.
The dashboard is a legacy monitoring tool, and dashboard logging is not recommended for high-throughput production scenarios.
Update -- we've identified it as an issue involving the App Service Sandbox. It appears that during cancellation of a blob upload (and possibly in other cancelled HTTP scenarios), there's a chance that some internal API call will be rejected by the sandbox, resulting in an uncatchable exception that brings down the process. It's being investigated.