With V8.0.1 of WindowsAzure.Storage, V1.1.2 of WebJobs fails to remove poison messages from the queue. It copies the message to the poison queue but leaves the original message in the source queue. The bad message is processed again after the sleep timeout and generates another poison message, and this repeats indefinitely.
This does not happen with WindowsAzure.Storage <= 7.1.2 and V1.1.2 of WebJobs.
Create a WebJob project using V1.1.2 of WebJobs and V7.1.2 of WindowsAzure.Storage
Create a queue handling function that raises an error when processing a message:
```c#
public static async Task ProcessQueueMessage1([QueueTrigger("queue1")] string message,
    [Queue("queue2")] IAsyncCollector<string> queue2,
    TextWriter log)
{
    try
    {
        // Simulate some work before failing.
        await Task.Delay(10000);
        log.WriteLine($"Queue1: {message}");
        await queue2.AddAsync("Message for queue2");
        // Always fail so the message eventually becomes a poison message.
        throw new InvalidOperationException("Queue1 error.");
    }
    catch (Exception ex)
    {
        log.WriteLine(ex.Message);
        throw;
    }
}
```
Add any message to a queue named queue1.
The message will be added to the queue1-poison queue and removed from queue1.
Update the WindowsAzure.Storage package to 8.0.1.
Republish the WebJob project.
Add a message to queue1.
The message will be added to the queue1-poison queue but will remain, hidden, in queue1.
After 10 minutes (the sleep timeout) a new poison message will be added and the message will still remain in queue1. This goes on indefinitely.
Expected behavior: the poison message should be removed from queue1 once it has been added to the queue1-poison queue.
Actual behavior: the poison message is never removed from queue1.
Known workaround: use version 7.1.2 or lower of WindowsAzure.Storage.
@ms-aprima - We don't currently test against that version, so this isn't supported yet.
Is there a reason you need 8.x of the Storage SDK?
We are having the exact same issue @ms-aprima described. WindowsAzure.Storage v7.2.1 also works fine, but v8.0.1 does not.
@christopheranderson we needed some of the large blob upload handling features in 8.x. One of the reasons we are using WebJobs is because we need to handle large blobs. We can put the large blob handling code in a separate project for the time being. This issue came up when we upgraded to 8 for the blob features.
We do plan to move forward to the latest 8.x version of Storage before our 2.0 release is complete; we just haven't done that yet.
@ms-aprima or @demirag -- when you guys hit this issue, were you building your project against the WebJobs nugets, or were you rebuilding the WebJobs source with the updated references?
I tried your repro with the PR #1010 and am not seeing the issue, but I want to understand the scenario.
I'm able to repro this if I:
I'm not able to repro it when I actually update the WebJobs sources to use 8.0.1. My guess is that some part of the internal Storage class hierarchy changed so our package, built on the earlier builds, doesn't work. I'm trying to find the actual error that's being thrown now.
I misspoke above. I can repro it in 8.0.1 directly, and I think I know why. Looking at the changelog (https://github.com/Azure/azure-storage-net/blob/master/changelog.txt) I see this line:
Queues: Add Message now modifies the PopReceipt, Id, NextVisibleTime, InsertionTime, and ExpirationTime properties of its CloudQueueMessage parameter. The message can then be passed to the UpdateMessage and DeleteMessage APIs.
What happens:
queue.AddMessage() on that same CloudQueueMessage instance.
With the change above in 8.0.1, I can see that the Id of the CloudQueueMessage changes after we add it to the poison queue. That means when we go to delete it from the original queue, we get a 404, which we ignore.
We should be able to fix this without moving to 8.0.1 by adding a new CloudQueueMessage to the poison queue, rather than reusing the existing one. We also need to add a test to make sure this scenario properly fails before the fix.
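To make the mechanism concrete, here is a minimal in-memory sketch. `FakeMessage` and `FakeQueue` are hypothetical stand-ins invented for illustration, not the real SDK types. The only behavior they imitate is the 8.0.1 change quoted above, where adding a message overwrites the Id and PopReceipt on the caller's instance. It is not the actual WebJobs code path.

```c#
using System;
using System.Collections.Generic;

// Hypothetical stand-in for CloudQueueMessage (assumption, not the SDK type).
public class FakeMessage
{
    public string Id { get; set; }
    public string PopReceipt { get; set; }
    public string Content { get; set; }
}

// Hypothetical stand-in for CloudQueue (assumption, not the SDK type).
public class FakeQueue
{
    private readonly Dictionary<string, FakeMessage> _messages = new Dictionary<string, FakeMessage>();

    public int Count { get { return _messages.Count; } }

    // Imitates the 8.0.1 behavior: the caller's instance gets a new Id and PopReceipt.
    public void AddMessage(FakeMessage message)
    {
        message.Id = Guid.NewGuid().ToString();
        message.PopReceipt = Guid.NewGuid().ToString();
        _messages[message.Id] = new FakeMessage
        {
            Id = message.Id, PopReceipt = message.PopReceipt, Content = message.Content
        };
    }

    // Returns a copy of one stored message, or null if the queue is empty.
    public FakeMessage GetMessage()
    {
        foreach (var m in _messages.Values)
        {
            return new FakeMessage { Id = m.Id, PopReceipt = m.PopReceipt, Content = m.Content };
        }
        return null;
    }

    // Returns false for an unknown Id or wrong PopReceipt (standing in for the 404).
    public bool DeleteMessage(string id, string popReceipt)
    {
        FakeMessage stored;
        if (!_messages.TryGetValue(id, out stored) || stored.PopReceipt != popReceipt)
        {
            return false;
        }
        return _messages.Remove(id);
    }
}

public static class Demo
{
    public static void Main()
    {
        var source = new FakeQueue();
        var poison = new FakeQueue();
        source.AddMessage(new FakeMessage { Content = "bad" });

        // Buggy path: reuse the dequeued instance when copying to the poison queue.
        var dequeued = source.GetMessage();
        poison.AddMessage(dequeued); // overwrites dequeued.Id and dequeued.PopReceipt
        Console.WriteLine(source.DeleteMessage(dequeued.Id, dequeued.PopReceipt)); // False

        // Fixed path: add a fresh message, so the dequeued instance stays untouched.
        dequeued = source.GetMessage();
        poison.AddMessage(new FakeMessage { Content = dequeued.Content });
        Console.WriteLine(source.DeleteMessage(dequeued.Id, dequeued.PopReceipt)); // True
    }
}
```

In the buggy path the delete targets an Id that only exists in the poison queue, so the original message stays in the source queue and is picked up again later.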
This brings up another interesting scenario in 8.0.1. If we pass someone a CloudQueueMessage directly to their function and they add it to another queue, we'll lose track of it. We'll need to beef up some tests and try these scenarios out.
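A rough sketch of that scenario from the function author's side, assuming a function bound to `CloudQueueMessage` and `CloudQueue` (the function name `ForwardMessage` is made up for illustration). Enqueuing a copy rather than the triggering instance should avoid the Id and PopReceipt being overwritten. Note that `AsString` assumes the payload is text.

```c#
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage.Queue;

public static class Functions
{
    // Hypothetical function: forwards the triggering message to queue2.
    public static async Task ForwardMessage(
        [QueueTrigger("queue1")] CloudQueueMessage message,
        [Queue("queue2")] CloudQueue queue2)
    {
        // Risky with Storage 8.0.1: queue2.AddMessageAsync(message) would overwrite
        // Id and PopReceipt on the instance the host later uses for the delete.

        // Safer: enqueue a new message that carries the same content.
        await queue2.AddMessageAsync(new CloudQueueMessage(message.AsString));
    }
}
```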
@brettsam We are building our project against the WebJobs NuGet packages. The WebJobs version (V1.1.2 or v2.0.0-beta2) does not matter; our issue is only related to the WindowsAzure.Storage NuGet package. The maximum version that runs without this problem is v7.2.1. Whenever this package is updated to v8.0.1, the bug is there.
@brettsam We have the same conditions as @demirag describes above.
@brettsam is this something you can fix for the 2.0 release that we're wrapping up now? If so, please move that to the milestone https://github.com/Azure/azure-webjobs-sdk/issues?q=is%3Aopen+is%3Aissue+milestone%3A2.0.0-release
For those (like me) who cannot wait for the next release to get the WebJobs SDK working with the latest releases of Azure Storage: based on the explanation from @brettsam, you can simply write a custom queue processor factory (CustomQueueProcessorFactory below) that creates a new CloudQueueMessage in CopyMessageToPoisonQueueAsync.
```c#
namespace ConsoleApplication1
{
    using Microsoft.Azure.WebJobs.Host.Queues;
    using Microsoft.WindowsAzure.Storage.Queue;
    using System.Threading;
    using System.Threading.Tasks;

    public class CustomQueueProcessorFactory : IQueueProcessorFactory
    {
        public QueueProcessor Create(QueueProcessorFactoryContext context)
        {
            return new CustomQueueProcessor(context);
        }

        private class CustomQueueProcessor : QueueProcessor
        {
            public CustomQueueProcessor(QueueProcessorFactoryContext context)
                : base(context)
            {
            }

            protected override Task CopyMessageToPoisonQueueAsync(CloudQueueMessage message, CloudQueue poisonQueue, CancellationToken cancellationToken)
            {
                // Copy into a new instance so the original keeps its Id and PopReceipt.
                var newMessage = new CloudQueueMessage(message.Id, message.PopReceipt);
                newMessage.SetMessageContent(message.AsBytes);
                return base.CopyMessageToPoisonQueueAsync(newMessage, poisonQueue, cancellationToken);
            }
        }
    }
}
```
Then in your Main, you just have to set the custom queue processor factory in the job host configuration:
```c#
var config = new JobHostConfiguration();
config.Queues.QueueProcessorFactory = new CustomQueueProcessorFactory();
```
I got it to work with WindowsAzure.Storage 8.1.1 and Microsoft.Azure.WebJobs 2.0.0.
Hope that helps!
I've been looking at this today and it looks like a bigger work item than I'd originally anticipated. Effectively, we can no longer trust the Id or PopReceipt of a message after we've passed it on to a method that we may not control. Someone may insert that into a new queue, thus changing both properties.
On top of that, CloudQueueMessage has a bunch of internal setters for its properties, meaning we can't easily clone one before passing it to the various methods.
I'm working on a change that stores the Id and PopReceipt and passes those to each method, along with the actual CloudQueueMessage. However, it feels awkward to explain that you have to trust the passed-in Id and PopReceipt and not use the ones from the message. I'm still trying to work out whether there is a better fix.
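A minimal sketch of that idea, not the actual SDK change: capture the Id and PopReceipt before the message is handed to code we don't control, then delete using the captured values. `ProcessAndDeleteAsync` and the `handler` delegate are hypothetical names for illustration. The only storage call is `CloudQueue.DeleteMessageAsync(string, string)`.

```c#
using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Queue;

public static class QueueHelpers
{
    // Hypothetical helper: runs a handler, then deletes the message from its source queue.
    public static async Task ProcessAndDeleteAsync(
        CloudQueue queue,
        CloudQueueMessage message,
        Func<CloudQueueMessage, Task> handler)
    {
        // Capture the identifiers up front; the handler may add the message
        // to another queue, which would overwrite them on the instance.
        string id = message.Id;
        string popReceipt = message.PopReceipt;

        await handler(message);

        // Delete with the captured values rather than message.Id / message.PopReceipt.
        await queue.DeleteMessageAsync(id, popReceipt);
    }
}
```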
I'm experiencing this bug with 2.0.0 of the WebJobs SDK and 8.5.0 of WindowsAzure.Storage. What version(s) was this fixed in?
I'm also still seeing this with 2.0.0 of the WebJobs SDK and 8.6.0 of WindowsAzure.Storage. Also interested to know what versions this is fixed in.
@xt0rted @juliankay
It should be fixed as of 2.1.0-beta1, but there is currently no released version of 2.1.0. The latest version that includes the fix is 2.1.0-beta4. I am also waiting for this fix before updating the WindowsAzure.Storage NuGet package, but I don't know when 2.1.0 will finally be released.
You could install the prerelease/beta version (if you want to and can) to get rid of the issue. Installing a beta version is not an option for me in production, so I will have to wait for 2.1.0 to be released.