Here is the scenario.
We have one WebJob with 2 functions using QueueTrigger
The first function outputs a message to the mail queue using as shown below.
public async Task ProcessMailGeneratorQueue([QueueTrigger("%AzureSettings:MailGeneratorQueueName%")] string message, ILogger logger, ExecutionContext context, [Queue("%AzureSettings:MailQueueName%")] CloudQueue outputQueue)
{
MailNotificationModel model = JsonConvert.DeserializeObject<MailNotificationModel>(message, new MailNotificationModelConverter());
logger.LogInformation($"Generating {model.ReferenceKey}");
var email = await _mailGenerator.PrepareMailMessage(model);
CloudQueueMessage queueMessage = new CloudQueueMessage(JsonConvert.SerializeObject(email));
await outputQueue.AddMessageAsync(queueMessage);
}
This has been working well for several months. However in the past ~5 weeks ago, we have had several occurrences when the 2nd function is not triggering anymore while the queue contains many messages.
If we restart the web job from the portal (i.e. Stop, Wait for it to stop and Start), the function works properly again.
At this stage, we are not sure how we can further troubleshoot the problem as we have no error traces (we are using AppInsights). I am therefore looking for some advice as how we can further troubleshoot this issue. Is there something we should look after? or something we should add to our logging?
Any help or idea is welcome as this is really annoying and makes the whole solution useless if we cannot rely on it.
Our volume for this WebJob remain quite "small". We are not talking about thousands of messages, nor even hundreds. At peak time, we could have couple of dozens of messages per minute.
We are using the following package versions
<PackageReference Include="Microsoft.Azure.WebJobs" Version="3.0.3" />
<PackageReference Include="Microsoft.Azure.WebJobs.Extensions" Version="3.0.1" />
<PackageReference Include="Microsoft.Azure.WebJobs.Extensions.Storage" Version="3.0.2" />
<PackageReference Include="Microsoft.Azure.WebJobs.Logging.ApplicationInsights" Version="3.0.3" />
<PackageReference Include="Microsoft.Extensions.Caching.Memory" Version="2.2.0" />
<PackageReference Include="Microsoft.Extensions.Configuration" Version="2.2.0" />
<PackageReference Include="Microsoft.Extensions.Configuration.CommandLine" Version="2.2.0" />
<PackageReference Include="Microsoft.Extensions.DependencyInjection" Version="2.2.0" />
<PackageReference Include="Microsoft.Extensions.Logging.Console" Version="2.2.0" />
<PackageReference Include="WindowsAzure.Storage" Version="9.3.3" />
At this point, only workaround is to stop/start the WebJob in order for it to dequeue again and we have created a function that retrieves the number of messages in queue and alert us when the number of messages reaches a certain threshold (25 in the last 5 minutes)
Possibly similar issues #2072 #1187
We keep having the issue once a week or so. Any idea how we could collect additional information as what could happen? At this point any idea would be great.
@mathewc @fabiocav @brettsam
@superjulius if you are still seeing this then please let us know and gather the Storage account logs of the Queue involved and a dump file of the Dotnet.exe hosting your Web Job. https://blogs.msdn.microsoft.com/kaushal/2017/05/04/azure-app-service-manually-collect-memory-dumps/
Thanks @FinVamp1 ! We have not seen the issue for few weeks now...but this had also happened in the past.
We will try gathering more information in case the issue pops up. I am just a bit unclear as what logs should be provided. I have checked the storage logs but did not find anything that seems helpful.
Is it safe to upgrade to storageV2 today and set some retention policy to clean content in those blobs?
Note that we've also just added more logs to queue triggers (#2277) -- when this is released to nuget (which should be fairly soon), please give it a try to see if the logs it generates are helpful. You'll need to enable Debug-level logs to get them to show up, but the logs will write every time we poll the queue and every time we delay a poll.
As for what storage logs are useful -- you can enable diagnostic logs to be written to the $logs blob container on the same storage account. This will tell us everytime we attempted to read/write/delete a queue message. You can find those settings here (note the "Logging" section):

I'm going to close this as I believe #2277 will help diagnose this problem. If not, or if you get more details, please feel free to re-open and we can take another look.
Most helpful comment
Note that we've also just added more logs to queue triggers (#2277) -- when this is released to nuget (which should be fairly soon), please give it a try to see if the logs it generates are helpful. You'll need to enable Debug-level logs to get them to show up, but the logs will write every time we poll the queue and every time we delay a poll.
As for what storage logs are useful -- you can enable diagnostic logs to be written to the
$logsblob container on the same storage account. This will tell us everytime we attempted to read/write/delete a queue message. You can find those settings here (note the "Logging" section):