(This was initially opened with the Storage SDK as https://github.com/Azure/azure-storage-net/issues/539, I have adapted the title to fit the real cause of the issue)
I was asked to open it here for cross-referencing purposes.
ServiceBusTrigger and AutoComplete=true can lead to data loss.It looks like when a WebJob that writes to blob storage is restarted, there is a race condition
where topic messages can be marked as completed (We use AutoComplete = true) but the blob isn't actually present in the storage.
This seems to happen as soon as the WebJob needs to restart, so all the following lead to repro:
This doesn't seem to happen when restarting the WebApp that is hosting the WebJob. I believe this is normal because the WebJob is running on the Service Plan and not on the WebApp itself
(Correct me if I'm wrong.)
I wrote a sample application that does away with all of our business logic and narrows it down to one of the Azure SDKs (or my ability to C#) for code review or if someone would like to try to reproduce the issue on their end.
The best way to reproduce this issue is to play around with the application settings, scale and to manually restart the webjob while it is processing messages.
For an average test run, I see roughly 2000-5000 blobs out of ~100K that would simply not be in blob storage, but also not be in any dead letter queue or elsewhere. This varies depending on how many times I trigger web job restarts as mentioned above.
Reproducing can be a little time consuming, so what I usually do is setup an S3 with 10 instances of the webjob and change the application settings/restart a few times while I watch the topic queue length go down in ServiceBusExplorer. When the message count is low enough, I'll scale down to S1 1 instance until the queues hit 0 and stop the job before running the validation script (see pseudocode below)
Storage Account
StandardDisabledLRSWebJob
truetruetruetrue3264120 secondsOnWebJob
numbers.1, 2, ..., N in the topic topic/sub. N is the expected number of blobs.m in topic/sub,m with arbitrary content in the numbers container.await b.UploadTextAsync(m.ToString(), 'true'))m as complete (<auto-complete>)Validation
numbers, call the collection blobs.blobs.Count == N.expected be the collection of numbers 1..N.expected.Except(blobs).Count == 0.expected.Except(blobs).(Assertions fail and a list of blobs is printed.)
CloudBlockContainer.UploadTextAsync().async calls to their sync version.AutoComplete is the issueHttpClient to try and rule out the Storage SDK.Function (I would rather not resort to this in our production deployments)As you would guess, it's a bit unnerving to deploy to production or even enable autoscale / change settings in a production environment when it carries a risk of losing production data in the process.
As mentioned in the linked issue, we've found that AutoComplete = false prevents this issue from happening, and we've resorted to using that right now.
Messages are marked as completed before actual processing succeeds, so a shutdown will result in lost messages
The messages should only be completed if the handler executes successfully (returns and did not throw.)
As I mentioned in the referenced issue:
This makes me think that either the AutoComplete documentation is wrong or there is a bug in the WebJobs.ServiceBus library (emphasis mine):
Gets or sets a value that indicates whether the message-pump should call
CompleteAsync(String)orCompleteAsync(String)on messages after the callback has completed processing.
trueto complete the message processing automatically on successful execution of the operation; otherwise, false.
(Notice that one sentence reads "successful execution", the other simply says "completed processing")
To summarize, this is clearly not a Storage SDK bug (Sorry!), but might be a bug/documentation typo in the WebJobs.ServiceBus SDK.
Maybe the documentation should clarify what the purpose of AutoComplete is and if this is the expected behavior, clearly mention so.
Cheers,
Alex
@alxbl Thank you for the extremely thorough bug report. While I'm not a service bus expert, I do not think this is a documentation issue.
We're in the middle of looking at a few different service bus bugs that seem to relate to completion. I can't promise we'll fix this one immediately but we're at least going to do a first-pass investigation.
No worries, we've rewritten the code to always explicitly complete, so there's no time pressure on this on our part. Thanks for looking into it :)
The other issue Paul was mentioning is https://github.com/Azure/azure-webjobs-sdk/issues/1605. It might be related given that during the restart the operation cancel exceptions will be causing your in flight messages to fail (but they should be retried).
If I get a build up on myget with the fix for #1605 would you be willing to try it with your repro?
@mathewc I can give it a shot sometime this weekend. Let me know when the build is available.
Instructions for our myget feed are here.
Build 2.2.0-beta1-11248 of Microsoft.Azure.WebJobs.ServiceBus has the fix.
Good morning,
I had a bit of time to try and repro this morning. I did a run with 100K messages and a S3/10 instances. I did about 10 restarts and scaled down to S3/1 for the last 2K messages.
The run had 160 messages missing, which feels like an improvement from last time... but it's been many months since I last worked on this particular project, and have since moved to a different team so it's difficult to be certain.
> JobFeeder.exe validate
Establishing connection... ok
Validating... failed!
160 blobs are missing
4727, 4728, 4729, 4730, 4731, 4732, 4733, 4734, 4735, 4736, 4737, 4738, 4739, 4740, 4741, 4742,
4743, 4744, 4745, 5266, 5267, 5268, 5269, 5270, 5271, 5272, 5273, 5274, 5275, 5276, 5277, 5278,
5279, 5280, 5281, 5282, 5283, 5284, 5285, 5286, 5287, 5288, 5289, 5290, 5291, 5292, 5293, 5294,
5295, 5296, 5297, 5298, 5299, 5300, 5301, 5302, 5303, 5304, 5305, 5306, 5307, 5308, 5309, 5310,
5311, 5312, 5313, 5314, 5315, 5316, 5317, 5318, 5319, 5320, 5321, 5322, 5323, 5324, 5325, 5326,
5327, 5328, 5329, 5330, 5526, 5527, 5528, 5529, 5530, 5531, 5532, 5533, 5534, 5535, 5536, 5537,
5538, 5539, 5540, 5541, 5542, 5543, 5544, 5545, 5546, 5547, 5548, 5549, 5550, 5551, 5552, 5553,
5554, 5555, 5556, 5557, 5558, 5559, 5560, 5561, 5562, 5563, 5564, 5565, 5566, 5567, 5568, 5569,
5570, 5571, 5572, 5573, 5574, 5575, 5576, 5577, 5578, 5579, 5580, 5581, 5582, 5583, 5584, 16907,
16908, 16911, 16913, 16918, 16920, 16924, 16925, 16926, 16927, 16928, 16929, 16930, 16931,
16932, 16933, 16934
Unfortunately, with my current repro code I don't have much more information than that to give. I've attached my packages.config for the WebJob project, in hope that it might prove helpful. In particular:
<package id="Microsoft.Azure.WebJobs" version="2.2.0-beta1-11248" targetFramework="net462" />
<package id="Microsoft.Azure.WebJobs.Core" version="2.2.0-beta1-11248" targetFramework="net462" />
<package id="Microsoft.Azure.WebJobs.ServiceBus" version="2.2.0-beta1-11248" targetFramework="net462" />
The code I use to repro is available in the issue description, all that is missing would be a powershell script that automates the start/restart of the webjob in order to fully automate the test+validation run.
Let me know if I can do anything else to help.
Whats the current status of this Issue?
As far as I'm aware it should still be possible to reproduce this issue following the repro steps provided above with the sample code. Unfortunately, I am no longer working in an Azure environment and do not have the resources to re-test. We had ended up using the workaround I described (manually completing all messages and disabling auto-complete) and it seemed to have fixed the problems. I no longer work there though so I'm not sure if that is still the case.
Cheers,