Hi,
I have a continuous Web Job that executes with a QueueTrigger. Normally, if there is an exception or any other problem, the job fails, the queued message goes back into the queue, and the job tries to reprocess it (until it finishes or fails 3 times and goes into the poison queue).
However, I noticed on 10/26/2015 that no messages had been processed in the past day or so. I investigated on the Azure Portal and saw that the webjob still had a "running" status. I clicked into the web job and discovered that the current execution was still going, and had been executing for the past 2 days. For some reason, the job did not time out or quit, and there were no further QueueTriggers even though there were multiple messages backed up in the queue.
There were also no logs or exceptions/errors thrown (I have a decent amount of logging and exception handling in the method).
I aborted the current job execution via the Azure Portal, and once that happened, all of the backed up queue messages began processing immediately.
I can provide account details via email if needed ([email protected]).
A bit more requested info:
Web Jobs SDK: using Web Jobs SDK 1.0.6
JobHost setup:
var config = new JobHostConfiguration();
config.StorageConnectionString = config.DashboardConnectionString = ConfigurationManager.AppSettings["AzureStorageConnection"];
config.Queues.MaxPollingInterval = TimeSpan.FromSeconds(60);
config.Queues.BatchSize = 1;
config.Queues.MaxDequeueCount = 5;
var host = new JobHost(config);
host.RunAndBlock();
Processing Code:
public async static Task ProcessFromQueue([QueueTrigger("alertqueue")] string queuedMessage, CancellationToken token)
{
    if (token.IsCancellationRequested)
    {
        logger.Error("Cancellation requested");
        return;
    }
    logger.Info("Processing message:" + queuedMessage);
    try
    {
        var worker = IocContainer.Resolve<IQueueWorker>();
        await worker.DoWork(queuedMessage);
    }
    catch (Exception ex)
    {
        logger.Error("Error processing queued message:" + queuedMessage, ex);
        throw;
    }
    logger.Info("Finished processing message:" + queuedMessage);
}
What is "logger" and how does it log? That's a bit of unknown code - it might be that there was no error, and your logger didn't write out the message. Where does it log to? Also, how do you guarantee timeouts occur in worker.DoWork?
I strongly suspect that somewhere AFTER we invoke your job function it is hanging/never returning. The SDK currently makes no assumptions about how long your function may need to run, so it doesn't enforce any timeout. So if your code hangs, the job hangs indefinitely.
I'm considering adding a TimeoutAttribute (e.g. [Timeout("1:00:00")] to time out after 1 hour) that allows you to opt in to this behavior. We'd also have a global knob on JobHostConfiguration that you can set.
@mathewc - "logger" is a DI instance of NLog, it outputs to console as well as an integration with Raygun (an online error tracking system). The odd thing is that not even the initial log of "logger.Info("Processing message:" + queuedMessage);" was in the logs, which indicates to me that perhaps there was an error before the function could even fire?
Inside DoWork, any async calls are with RestSharp, which has a default 30 second timeout.
Having the TimeoutAttribute sounds like a good addition.
@rustd @ThreeScreenStudios Ok, I've implemented TimeoutAttribute. Here's an example function that would hang for a day if TimeoutAttribute was not used:
[Timeout("00:00:10")]
public static async Task ProcessMessage(
    [QueueTrigger("samples-input")] string message,
    TextWriter log,
    CancellationToken cancellationToken)
{
    log.WriteLine("Begin ProcessMessage");
    await Task.Delay(TimeSpan.FromDays(1), cancellationToken);
    log.WriteLine("ProcessMessage complete");
}
Notes:
JobHostConfiguration.FunctionTimeout is a global value. It's null by default, but you can set it. This value will be used for all functions, unless those functions override it via class/method level TimeoutAttributes.
For the CancellationToken, you just have to periodically check it for cancellation, and if cancelled, stop your work and call cancellationToken.ThrowIfCancellationRequested().
@ThreeScreenStudios I'll also point out that the reason your function hung and wouldn't process any more messages is that you have JobHostQueuesConfiguration.BatchSize set to 1. That means that only a single message is pulled per batch, and we won't pull another batch until that one is complete. Why do you have it set to 1? If you allowed multiple (as is the default), the one message might have hung, but others would continue to process.
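The periodic check described in the notes above can be sketched without the SDK. This is a minimal illustration only: ProcessItems and the Thread.Sleep unit of work are hypothetical stand-ins, and CancelAfter is assumed here to play the role of the timeout that the host would otherwise apply to the token.

```csharp
using System;
using System.Threading;

public static class CooperativeCancellationSketch
{
    // Hypothetical work loop; not part of the WebJobs SDK.
    public static int ProcessItems(int itemCount, CancellationToken cancellationToken)
    {
        int processed = 0;
        for (int i = 0; i < itemCount; i++)
        {
            // Check between units of work; throws OperationCanceledException once cancelled.
            cancellationToken.ThrowIfCancellationRequested();
            Thread.Sleep(50); // stand-in for one unit of real work
            processed++;
        }
        return processed;
    }

    public static void Main()
    {
        using (var cts = new CancellationTokenSource())
        {
            // Assumption: this simulates the timeout firing after 200 ms.
            cts.CancelAfter(TimeSpan.FromMilliseconds(200));
            try
            {
                ProcessItems(1000, cts.Token);
                Console.WriteLine("Completed");
            }
            catch (OperationCanceledException)
            {
                Console.WriteLine("Cancelled before completion");
            }
        }
    }
}
```

The check only runs between units of work, so a single unit that blocks forever would still never observe the cancellation.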
I had the same problem several times in the last few days, where triggered functions get stuck forever in the code below. I had a BatchSize of 32 and all of them got stuck after a while.
The Timeout attribute is a great solution for that, exactly what I am looking for.
Is there already a new build available? Or do I have to compile the sources myself?
public static async Task FtpToBlob(
    [QueueTrigger("ftp-download-file")] FtpToAzureBlobArgs ftpToAzureBlobArgs,
    string Filename,
    string FtpFolder,
    string SomeId,
    string CloudDir,
    [Blob("mycontainer/{CloudDir}/{SomeId}/{FileName}")] ICloudBlob output,
    TextWriter log)
{
    try
    {
        var uri = new Uri($"ftp://ftp.example.com/Foo/Bar/{FtpFolder}/{Filename}");
        FtpWebRequest request = (FtpWebRequest)WebRequest.Create(uri);
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        request.Credentials = new NetworkCredential(FtpUser, FtpPass);
        FtpWebResponse response = (FtpWebResponse)request.GetResponse();
        await output.UploadFromStreamAsync(response.GetResponseStream());
        await log.WriteLineAsync("Downloaded: " + uri.ToString());
    }
    catch (Exception ex)
    {
        await log.WriteLineAsync(ex.StackTrace);
    }
    await log.WriteLineAsync("Finished");
}
@mathewc - ah, thanks for pointing out the batch size issue. Is there any guidance on how to choose an optimal batch size?
Also, thanks for putting in the TimeoutAttribute; I think it will be quite helpful for many folks.
@agnauck If all of your functions are getting stuck after a while, that indicates a problem in your code. To use the new TimeoutAttribute, you'll update your method signature to take the CancellationToken, and you should then pass that token to the other async operations you initiate. No, there isn't a build out yet - I'll get one out today (on our myget feed) and let you guys know.
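Passing the token on to the async operations a function starts might look like the sketch below. It uses only plain .NET types: DownloadAsync is a hypothetical stand-in for a slow network call, and the CancellationTokenSource created in Main is assumed to take the place of the token the host would hand to the function.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TokenPropagationSketch
{
    // Hypothetical stand-in for a slow network call that honors the token.
    private static async Task<string> DownloadAsync(CancellationToken cancellationToken)
    {
        await Task.Delay(TimeSpan.FromSeconds(30), cancellationToken);
        return "payload";
    }

    // Forwards the token to every async call it makes.
    public static async Task<bool> ProcessAsync(CancellationToken cancellationToken)
    {
        try
        {
            string data = await DownloadAsync(cancellationToken);
            Console.WriteLine("Downloaded: " + data);
            return true;
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Cancelled while downloading");
            return false;
        }
    }

    public static void Main()
    {
        // Assumption: the 100 ms source simulates a timeout-driven token.
        using (var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(100)))
        {
            bool completed = ProcessAsync(cts.Token).GetAwaiter().GetResult();
            Console.WriteLine("Completed: " + completed);
        }
    }
}
```

In a real job function it may be preferable to rethrow instead of returning false, so that the invocation is recorded as failed and the message can be retried.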
@ThreeScreenStudios Well, the defaults are designed to be optimal (default is 16, max is 32). I was wondering why you dialed it back to 1.
@mathewc the code posted above is all the code I have in this WebJob. I will add the CancellationToken as suggested.
Regarding BatchSize: this limit applies separately to each function that has a QueueTrigger attribute. If you don't want parallel execution for messages received on one queue, set the batch size to 1.
For more information read this https://azure.microsoft.com/en-us/documentation/articles/websites-dotnet-webjobs-sdk-storage-queues-how-to/
Ok, the TimeoutAttribute feature is in. Please see the release notes for details, and for a link to a sample.
@agnauck @ThreeScreenStudios Can you guys please give this a try and verify that it meets your needs? Thanks. You can pull the latest bits from our myget feed (instructions here). Version 1.1.0-beta1-10149 includes the changes.
Works perfectly. Thanks, this is a great new feature and very helpful for us.
@agnauck @ThreeScreenStudios @mathewc Hi guys. I have much the same situation, where the web job hangs on a single process for hours and, even with extensive logging, nothing gets logged. There are no exceptions or errors either. It's as if the thread never reaches the code itself and hangs indefinitely. I am running a single-instance continuous web job with a restart time of 2 seconds. I also have similar continuous web jobs that are running fine. I have tried to restart it, rename it, delete it, and redeploy it, but nothing fixes the issue. I have rechecked the code multiple times, and it runs fine locally without any issues. What are all the possible reasons for this to happen? Can anyone help with this?
Hi,
I also faced the same issue. My web job reads messages from a topic. What I observed is that my web job suddenly stopped processing messages from the topic even though the topic kept receiving messages from the sender. I opened the logs of my web job and saw that it had been processing one message for 2 hours with a status of "running". My web job has a custom error logging mechanism, but no errors were found. It seems the web job is not really processing that message but is only showing it as processing. After waiting 1 more hour, I aborted the message manually using the Azure web job logs web page, and the remaining messages then started processing. How can I resolve this issue?
I got the same issue two days ago: the WebJob got stuck processing messages from the queue, with the status message: Never Finished. The underlying code makes database calls and other API calls, but it had been unchanged for a few months, and the bad thing is that this happened in production without any notifications, warnings, or failures.
For now I see that the Timeout attribute will solve this issue and can throw an exception if needed, but I'm wondering what can cause such issues?
BatchSize = 16, MaxDequeueCount = 2, MaxPollingInterval = 3 seconds.
TimeoutAttribute seems perfect for most of the problems discussed, but it has to be handled in many places in the job, just as for any CancellationToken.
My concern is that the "hung" part of the job might be a single unit that won't throw an exception when the CancellationToken is triggered.
foreach(var item in items){
    cancelToken.ThrowIfCancellationRequested();
    SingleUnitJobThatHungsForever();
}
How can we stop the webjob process execution after a specific timeout, in the same way we can from the Azure portal? Is there a way to kill the processing of a specific queue message without using TimeoutAttribute?
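One possible way to stop waiting on a unit that ignores the token is to race it against a delay, as in the rough sketch below. This is not an SDK feature: TryRunWithTimeoutAsync is a hypothetical helper, and it only abandons the hung work. It cannot kill it, so the blocked thread stays occupied until the work returns or the process exits.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HangGuardSketch
{
    // Hypothetical helper: runs blocking work on a thread-pool thread and
    // stops waiting once the timeout elapses. The work itself is not killed.
    public static async Task<bool> TryRunWithTimeoutAsync(Action work, TimeSpan timeout)
    {
        Task workTask = Task.Run(work);
        Task finished = await Task.WhenAny(workTask, Task.Delay(timeout));
        if (finished != workTask)
        {
            return false; // timed out; the work is still running in the background
        }
        await workTask; // propagate any exception thrown by the work
        return true;
    }

    public static void Main()
    {
        bool quick = TryRunWithTimeoutAsync(() => { }, TimeSpan.FromSeconds(5))
            .GetAwaiter().GetResult();
        // Thread.Sleep(Timeout.Infinite) stands in for a unit that never returns.
        bool hung = TryRunWithTimeoutAsync(
                () => Thread.Sleep(Timeout.Infinite),
                TimeSpan.FromMilliseconds(200))
            .GetAwaiter().GetResult();
        Console.WriteLine("Quick work finished: " + quick);
        Console.WriteLine("Hung work finished: " + hung);
    }
}
```

When the helper reports a timeout, the caller could throw so that the message is retried. If abandoned threads accumulate, ending the process so that the continuous job restarts may be the only real cleanup, at the cost of interrupting any other messages in flight.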