Azure-sdk-for-net: [Azure Monitor] Expected errors are reported in Application Insights for Blob storage operations

Created on 11 Feb 2020 · 13 comments · Source: Azure/azure-sdk-for-net

Describe the bug
I'm using Azure.Messaging.EventHubs.Processor (5.0.1) with an Event Hub that has 32 partitions. Every partition gets checkpointed every 10 seconds (if new data has arrived). I started to notice in Application Insights that some of the checkpointing calls to Blob storage fail with a 412 error code.

[screenshot: failed dependency calls in Application Insights]

````
Azure.RequestFailedException: The condition specified using HTTP conditional header(s) is not met.
RequestId:2130b973-701e-0130-5c14-e1c499000000
Time:2020-02-11T19:47:37.1509796Z
Status: 412 (The condition specified using HTTP conditional header(s) is not met.)

ErrorCode: ConditionNotMet

Headers:
Server: Windows-Azure-Blob/1.0,Microsoft-HTTPAPI/2.0
x-ms-request-id: 2130b973-701e-0130-5c14-e1c499000000
x-ms-client-request-id: f01bd21c-fda0-429a-b88f-2eb41a153efb
x-ms-version: 2019-02-02
x-ms-error-code: ConditionNotMet
Date: Tue, 11 Feb 2020 19:47:36 GMT
Content-Length: 252
Content-Type: application/xml

at Azure.Storage.Blobs.BlobRestClient.Blob.SetMetadataAsync_CreateResponse(Response response)
at Azure.Storage.Blobs.BlobRestClient.Blob.SetMetadataAsync(ClientDiagnostics clientDiagnostics, HttpPipeline pipeline, Uri resourceUri, Nullable`1 timeout, IDictionary`2 metadata, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, Nullable`1 ifMatch, Nullable`1 ifNoneMatch, String requestId, Boolean async, String operationName, CancellationToken cancellationToken)
````
[screenshot: Blob storage metrics showing client errors]

I can also see the errors as "Client Error" in my Blob storage metrics. Most of the calls seem to work fine, but some produce the error. This looks like something inside the SDK to me, not directly related to my code.

Expected behavior
Should run without errors.

To Reproduce
Hard to tell. The errors also appear when I have almost zero load on the Event Hub (only a handful of messages).
Happy to jump on a screen share if that helps.

Environment:

  • Name and version of the Library package used: Azure.Messaging.EventHubs.Processor (5.0.1)
  • Hosting platform or OS and .NET runtime version: .NET Core 3.1 in a Linux container running on AKS.
Labels: Monitor - ApplicationInsights, Service Attention, customer-reported, needs-team-attention, question

All 13 comments

//fyi @jsquire

@kinelski: Can you take a look and see if this is because of the conditional access that we're making with ownership requests? I'm trying to determine if these are expected and either Event Hubs or Storage is surfacing something that it shouldn't as an error.

@sebader Thank you for reporting this issue.

Could you tell us how many Event Processor Client instances are being used in your scenario? This might help us understand the nature of the problem.


It's auto-scaling between 4 and 32 instances on AKS, depending on the load.

@sebader: Apologies for the difficulties and thank you for bringing this to our attention. While we certainly agree that having these errors appear is confusing and not the experience that we want to offer, it is, unfortunately, by-design in the current implementation.

For context, the diagnostics emitting these errors come from the Storage client that Event Hubs uses, and are based entirely on the response. Because it is a 4xx-series status, it is automatically interpreted by the diagnostics framework as an error.

The Event Processor makes a conditional request when trying to claim ownership of a partition for processing. Because processor instances compete for ownership, it is expected that many of these requests do not succeed due to another instance having already claimed it. Within Event Hubs, this is treated as a normal code path. However, at that point, the Storage client has already registered the failure with its diagnostics.
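The competing conditional request described above can be sketched with the Azure.Storage.Blobs client (a minimal illustration only, not the processor's actual internals; the connection string, container, and blob names are hypothetical placeholders). Two processor instances race to update the same blob with an `If-Match` ETag condition; the loser receives exactly the 412 `ConditionNotMet` response shown in the stack trace:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

class ConditionalUpdateSketch
{
    static async Task Main()
    {
        // Hypothetical checkpoint blob; names are placeholders.
        var blob = new BlobClient("<connection-string>", "checkpoints", "ownership/partition-0");

        // Read the blob's current ETag.
        BlobProperties props = await blob.GetPropertiesAsync();
        ETag etag = props.ETag;

        var metadata = new Dictionary<string, string> { ["ownerid"] = "processor-A" };

        try
        {
            // Succeed only if nobody else has modified the blob since we read it.
            await blob.SetMetadataAsync(
                metadata,
                new BlobRequestConditions { IfMatch = etag });
        }
        catch (RequestFailedException ex) when (ex.Status == 412)
        {
            // Another processor instance won the race. For the Event Processor
            // this is a normal code path, but the Storage client's diagnostics
            // have already recorded the 412 as a failure.
            Console.WriteLine($"Ownership claim lost: {ex.ErrorCode}");
        }
    }
}
```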

I've opened #9934 as a feature request for exposing the ability to treat service responses that are expected and normal for the consuming application as non-failures.

I've left a comment here https://github.com/Azure/azure-sdk-for-net/issues/9934#issuecomment-585476448; basically, marking 4xx responses as failures is the Azure Monitor (Application Insights) approach.

If we attempt to change this from the Azure SDK side, it will become inconsistent with the rest of Azure Monitor's logic for handling 4xx status codes on incoming and outgoing requests.

One approach we provide is to add some custom logic in your code to mark such failures as non-failures:
https://stackoverflow.com/questions/37533431/how-to-tell-application-insights-to-ignore-404-responses
https://docs.microsoft.com/en-us/azure/azure-monitor/app/api-filtering-sampling

This is a bit involved, but allows you to customize almost everything.
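Following the filtering approach from those links, a custom `ITelemetryProcessor` can mark the expected 412s from the checkpoint store as successful. This is a sketch, assuming the classic Application Insights SDK; the matching rules (the dependency `Type` string and result code) are assumptions you would tune against your own telemetry:

```csharp
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

// Marks 412 ConditionNotMet responses from Blob storage as non-failures,
// since the Event Processor's competing ownership claims make them expected.
public class ExpectedBlobConditionFilter : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public ExpectedBlobConditionFilter(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        if (item is DependencyTelemetry dep
            && dep.Type == "Azure blob"       // assumed Type value; verify in your telemetry
            && dep.ResultCode == "412")
        {
            // Expected outcome of a conditional request; don't count it as a failure.
            dep.Success = true;
        }

        _next.Process(item);
    }
}
```

In ASP.NET Core this could be registered with `services.AddApplicationInsightsTelemetryProcessor<ExpectedBlobConditionFilter>();`. Alternatively, you could drop the item entirely (by not calling `_next.Process`) instead of marking it successful, at the cost of losing visibility into these calls.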

From the Azure Monitor side, I believe we should do a better job of helping you isolate such calls and tell that they are noise. App Map, for example, has a filter to remove 4xx failures, and I think we should do more.

@sebader can you please help me understand the issue a bit better?

  • Are you expecting to see only calls made directly by your code? That is, is what happens under the checkpointer a concern for you, and would you want to know about the underlying storage operations?
  • What problems do these failures introduce from a service-monitoring perspective? Do they hide anything? Are you able to separate them from real issues? Are they just noise/confusion?

@jsquire thanks for looking into it and for the thorough explanation!

@lmolkova First of all, I had no idea where the error came from; I just saw it popping up in my monitoring. From a user perspective, you naturally get concerned when there are unexplained errors. A user does not know that, in this case, they represent "works as expected". If I, as a user, see errors that seem related to checkpointing, I get concerned that my checkpoints might not be written properly and that I will run into issues. So I would say, no, they are not just noise. Without making it clear to the user (and I don't really know what that could look like here), it raises concern.

I also would expect to see errors from underlying SDKs in my monitoring - if they represent actual errors that I, as the app owner, need to take care of. When that's not the case, I would expect the SDK to hide them, or at least clearly mark them as noise in the monitoring, without me having to dig through GitHub, docs, etc. to find out what's going on and then manually build a filter.
And yes, from the issue that @jsquire created, I understand that this might not be so easy to do.

Does this make sense from a user perspective?

@lmolkova: I'm not quite sure what the next steps would be here; there does not seem to be an action that the Event Hubs client library can directly take to influence the behavior, and there are legitimate considerations raised against the proposal in #9934.

Should we open an issue somewhere for consideration or is this something that is considered by-design and that we aren't able to influence?

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azmonapplicationinsights.

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!


What is the expected feedback from me here? Was any new change made?

There shouldn't be anything needed from you at this point, @sebader. Actions are needed from the Azure Monitor team. The bot was reacting to tags, but I don't believe those tags were accurate.
