Azure-webjobs-sdk: EventHub integration offset value errors

Created on 18 Jan 2019 · 26 Comments · Source: Azure/azure-webjobs-sdk

Repro steps

Create an EventHub trigger integration.
Pause or delete the integration for a period that exceeds the retention of the EventHub.
Resume the integration / unpause the trigger.

Expected behavior

Upon resumption of the trigger, the stored offsets will be invalid. The EventHub trigger should compensate for this and be able to reset the offset.

In addition, upon the deletion of an input trigger, the corresponding blob data for offset checkpointing should be deleted from the storage account.

Actual behavior

Any partition with an invalid offset constantly produces errors from the AMQP consumer. The trigger never fixes these offsets, and the error does not appear in the function app logs. It is only visible in Application Insights, so Microsoft support cannot identify the problem either.

e.g.

System.ArgumentException: The supplied offset '55838201792' is invalid. The last offset in the system is '30089580512' TrackingId:<redacted>_B14, SystemTracker:<redacted>:eventhub:<redacted>~12287, Timestamp:2019-01-18T04:56:57 Reference:<redacted>, TrackingId:<redacted>_B14, SystemTracker:<redacted>:eventhub:<redacted>~12287|$default, Timestamp:2019-01-18T04:56:57 TrackingId:<redacted>_G6, SystemTracker:gateway5, Timestamp:2019-01-18T04:56:57
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.PartitionReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.ReceivePumpAsync(CancellationToken cancellationToken, Boolean invokeWhenNoEvents)



System.OperationCanceledException: The AMQP object session36857 is aborted.
   at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
   at Microsoft.Azure.Amqp.AmqpObject.OpenAsyncResult.End(IAsyncResult result)
   at Microsoft.Azure.Amqp.AmqpObject.EndOpen(IAsyncResult result)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.CreateLinkAsync(TimeSpan timeout)
   at Microsoft.Azure.Amqp.FaultTolerantAmqpObject`1.OnCreateAsync(TimeSpan timeout)
   at Microsoft.Azure.Amqp.Singleton`1.CreateValue(TaskCompletionSource`1 tcs, TimeSpan timeout)
   at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.PartitionReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.ReceivePumpAsync(CancellationToken cancellationToken, Boolean invokeWhenNoEvents)
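
Since these exceptions only surface in Application Insights, it can help to pull the two offsets out of the message text when scanning exported exceptions. The following is a minimal sketch, assuming the message keeps the wording shown in the first trace above; the function name `parse_invalid_offset` is made up for illustration and is not part of any SDK.

```python
import re
from typing import Optional, Tuple

# Pattern based on the wording of the ArgumentException quoted above.
_INVALID_OFFSET = re.compile(
    r"The supplied offset '(\d+)' is invalid\. "
    r"The last offset in the system is '(\d+)'"
)


def parse_invalid_offset(message: str) -> Optional[Tuple[int, int]]:
    """Return (supplied_offset, last_offset), or None if the message does not match."""
    match = _INVALID_OFFSET.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A supplied offset larger than the last offset in the system, as in the trace above, means the stored checkpoint points past the end of the partition.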

Known workarounds

We believe that deleting the blobs with the bad offsets will resolve the problem by causing the blob to be recreated.

Additional information

The Azure EventHub and Functions integration should do two things:

Upon detecting an offset error, it needs to decide what to do: reset the offset checkpoint and then either recapture from the earliest data in that partition (probably the safest option) or capture from the latest data. There might be value in making this user-configurable.
When an EventHub trigger is deleted, the corresponding offset data should be deleted from the storage account.
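
The first point could look roughly like the sketch below. This is not how the trigger behaves today; `ResetPolicy` and `choose_start` are hypothetical names, and the validity check simply mirrors the condition in the error message (a stored offset larger than the last offset in the partition).

```python
from enum import Enum
from typing import Optional


class ResetPolicy(Enum):
    """Hypothetical user-configurable choice for recovering from a bad checkpoint."""
    EARLIEST = "earliest"  # replay everything still retained (safest, may duplicate)
    LATEST = "latest"      # skip to new events only (may lose data)


def choose_start(checkpoint_offset: Optional[int],
                 last_offset: int,
                 policy: ResetPolicy = ResetPolicy.EARLIEST) -> str:
    """Return "resume" for a usable checkpoint, otherwise the policy's fallback."""
    if checkpoint_offset is not None and 0 <= checkpoint_offset <= last_offset:
        return "resume"
    return policy.value
```

Defaulting to `EARLIEST` follows the suggestion above that replaying from the start of the partition is probably the safest choice.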

Bonus: it would be nice if the user could see these errors in the logs of the functions app, but they do not appear there.

For the details:

The EventHub integration keeps offset data in a path in the storage account at:
azure-webjobs-eventhub/<namespace>.servicebus.windows.net/<eventhub name>/<consumer group>/

In here, there is a file for each partition. The contents of each file are structured as shown below:

{"Offset":"<offset count>","SequenceNumber":<number>,"PartitionId":"0","Owner":"<uuid>","Token":"<uuid>","Epoch":<number>}
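
To check by hand which partitions are affected, the blob contents can be compared against each partition's last offset. A minimal sketch, assuming the JSON layout shown above; `Checkpoint`, `parse_checkpoint` and `stale_partitions` are illustrative names, and the last-offset mapping has to be obtained separately (for example from the error messages).

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Checkpoint:
    partition_id: str
    offset: Optional[int]  # None when the blob holds no offset yet
    sequence_number: Optional[int]


def parse_checkpoint(blob_text: str) -> Checkpoint:
    """Parse one per-partition checkpoint blob in the layout shown above."""
    doc = json.loads(blob_text)
    raw_offset = doc.get("Offset")
    raw_seq = doc.get("SequenceNumber")
    return Checkpoint(
        partition_id=str(doc["PartitionId"]),
        offset=int(raw_offset) if raw_offset not in (None, "") else None,
        sequence_number=int(raw_seq) if raw_seq is not None else None,
    )


def stale_partitions(blob_texts: Iterable[str],
                     last_offsets: Dict[str, int]) -> List[str]:
    """Return IDs of partitions whose stored offset exceeds the known last offset."""
    stale = []
    for text in blob_texts:
        cp = parse_checkpoint(text)
        last = last_offsets.get(cp.partition_id)
        if cp.offset is not None and last is not None and cp.offset > last:
            stale.append(cp.partition_id)
    return stale
```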

P2 needs-investigation pending-customer-response

Source

mbrancato

👍8

All 26 comments

Is this just causing noise, or is the behavior after this happens incorrect?

fabiocav on 23 Jan 2019

Noise? No. This prevents messages from being ingested and processed by the function from any of the affected partitions until the offsets are fixed.

mbrancato on 23 Jan 2019

👍4

I reviewed the case here and understand what is happening. Sometimes the offset we have stored in the storage account (via the event processor host) is no longer valid, for example because the message was removed from the Event Hub due to the retention policy. This puts your app in a corrupt state where you really have to delete the storage partition data, or create a new function / name so that it creates new storage partition data.

Ideally we'd have some code that could catch this exception and potentially automatically clear the offset data?

jeffhollan on 15 Mar 2019

We are starting to see this issue a lot as well through standard eventhub usage (nothing fancy, just the tutorial here). Deleting the blobs has thus far not worked for fixing whatever is causing this and it's happening every day now. And as is said above, this is only captured and viewable with app insights, but it doesn't specify which partition is at fault so it's difficult to track down where this even originates from.

ththiem on 1 Apr 2019

@ththiem are you removing and recreating the Event Hub and using the same name? I chatted with the event hubs team and they mentioned that can cause the invalid offset exception as well.

jeffhollan on 2 Apr 2019

👀1

Our workaround involves stopping the integration, deleting the blobs manually, then restarting the integration. There is no need to delete the EventHub because EH, unlike Kafka, does not track offsets for the consumer.
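
The blob-deletion step of this workaround can be scripted. The sketch below is an illustration under stated assumptions, not an official tool: it takes any object with `list_blobs(name_starts_with=...)` and `delete_blob(name)` methods, which the `ContainerClient` class in the `azure-storage-blob` package provides, and the function names are made up. Run it only while the function app is stopped.

```python
from typing import Any, List


def checkpoint_prefix(namespace: str, event_hub: str, consumer_group: str) -> str:
    """Build the blob prefix described earlier in this issue."""
    return f"{namespace}.servicebus.windows.net/{event_hub}/{consumer_group}/"


def delete_checkpoints(container: Any,
                       namespace: str,
                       event_hub: str,
                       consumer_group: str,
                       dry_run: bool = True) -> List[str]:
    """List (and, unless dry_run, delete) the per-partition checkpoint blobs.

    `container` is assumed to be a client for the azure-webjobs-eventhub
    container in the function's storage account.
    """
    prefix = checkpoint_prefix(namespace, event_hub, consumer_group)
    names = [blob.name for blob in container.list_blobs(name_starts_with=prefix)]
    if not dry_run:
        for name in names:
            container.delete_blob(name)
    return names
```

The `dry_run` default is deliberate: it lets you see which blobs would be removed before deleting anything. Blobs that still hold an active lease may refuse deletion until the lease is released or broken.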

mbrancato on 9 Apr 2019

Our workaround involves stopping the integration, deleting the blobs manually, then restarting the integration. There is no need to delete the EventHub because EH, unlike Kafka, does not track offsets for the consumer.

Unfortunately, your workaround didn't work for me. I did the following steps:

Stopped the function app
Deleted the "azure-webjobs-eventhub/.servicebus.windows.net///" in the function's storage blob
Started the function app

Did I miss anything?

fabianmeyer on 17 Apr 2019

If you want to dig in more, you'll need to read Exceptions from Application Insights for that function. The error messages should indicate which partition and offset value is causing the problem. If you're not getting Exceptions or they indicate a different problem, this may not solve the issue.

Under the path azure-webjobs-eventhub/<namespace>.servicebus.windows.net/<eventhub name>/<consumer group>/ you should see individual partitions, IIRC. These represent the data keeping track of offsets per partition.

mbrancato on 18 Apr 2019

We were testing our ARM templates and had the folly to redeploy our event hub. Next thing you know, all our consuming functions stop working. Then I arrive here. Crap.

oising on 14 Nov 2019

FYI, stopping the function hosts and deleting the blobs in the consumer group ($Default usually) folder under the azure-event-hubs container in the associated storage accounts did the trick.

oising on 14 Nov 2019

Hmm, I spoke too soon. I thought it fixed it, but it did not.

oising on 15 Nov 2019

We are running into the same issue. We recreated our event hub and now we see this issue with offsets. Shouldn't EH Processor Host handle this on its own? What is the right solution or design here? @jeffhollan (ping shreyagr internally if you need more details)

shreyagarwal1991 on 26 Nov 2019

We just ran into this issue as well. This was solved by deleting the whole azure-webjobs-eventhub blob in the storage account associated with the function.

brunhil on 31 Jan 2020

Would be good to discover if there is a common root cause here? @brunhil do you know how or why the corruption was hit in the first place? Was it recreating event hubs with same name after delete?

jeffhollan on 31 Jan 2020

@jeffhollan - I completely agree. We were deleting and recreating an event hub under the exact same name.

Let me know how I can be of assistance in identifying a root cause. I plan on looking at function logs to see if there’s any more details on why the corruption happened.

brunhil on 31 Jan 2020

Ok that's great - in this case I think I know what is happening and not sure if there is much we can do on the product side directly except maybe making it easier to get out of this state. We rely on the "Event Processor Host" SDK that Event Hubs provides which uses Azure Storage to create and store checkpoints for how far the stream has progressed. The naming convention is something like we create a container called azure-webjobs-eventhub, and then in it the EPH does naming of:

{namespace}/{event-hub-name}/{consumer-group} and tracks progress there.

The problem here is if you use the same AzureWebJobsStorage account, and the same names for namespace / event hub / consumer group, when the function starts it follows this naming, sees there were already some checkpoints created, and starts to try to resume processing. The problem is, even though it's named the same, this is a blank event hub, so it starts throwing errors like "I can't start from offset XXXX because I don't even have that offset."

So for now the recommendation would be to either delete one of the storage artifacts when deleting the Event Hub, name the event hub differently, or use a different AzureWebJobsStorage account to mitigate.

jeffhollan on 31 Jan 2020

@jeffhollan - I think this is a reasonable approach. It would be great to have this documented in the following page for Troubleshooting Event Hubs.

I think the only way to resolve this would be to add the ClientId of the EventHub to the message and then compare that against the messages as you receive them. Then you could tell if the EventHubs are different and handle accordingly
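
One way to picture this kind of identity check is sketched below. It swaps in a different signal than the ClientId idea above: it compares a creation timestamp recorded next to the checkpoints with the hub's current creation timestamp, assuming that value can be read from the hub's runtime properties. The function name and the idea of storing the timestamp are hypothetical.

```python
from datetime import datetime
from typing import Optional


def checkpoints_are_stale(recorded_created_at: Optional[datetime],
                          current_created_at: datetime) -> bool:
    """Return True when checkpoints were written for a different hub instance.

    recorded_created_at: creation time saved alongside the checkpoints
                         (hypothetical; nothing stores this today).
    current_created_at:  creation time of the hub being connected to now.
    """
    if recorded_created_at is None:
        # Nothing was recorded, so there is no basis for calling them stale.
        return False
    return recorded_created_at != current_created_at
```

A hub that was deleted and recreated under the same name would report a newer creation time, so the old checkpoints could be discarded before the receiver tries to resume from them.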

brunhil on 3 Feb 2020

Thanks for sharing this, guys. +1 to add the workaround in the Troubleshooting Event Hubs page.

thdotnet on 3 Mar 2020

Is there any special way / permission to delete this blob? It's disabled on my Azure portal...

thdotnet on 3 Mar 2020

Found it: there was just a lease on the blobs.

thdotnet on 3 Mar 2020

Hi @jeffhollan - my original problem was not due to deleting the EventHub. It was because the EventHub consumer was paused longer than the retention period. I just want to make clear that deleting storage, etc. were attempts to fix the problem, not the cause. That said, messing with storage, etc. can land you in the same state.

I think EventHub just needs to detect when the offset is invalid and clean up the storage.

mbrancato on 4 Mar 2020

👍3

// @ShubhaVijayasarathy from event hubs

jeffhollan on 4 Mar 2020

Ping. I just found that a trigger of ours has been broken for the last 12 days because of this issue. Just sitting there broken ... not good.
+1 please document at https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-messaging-exceptions
+1 please automate the recovery

ishepherd on 27 May 2020

At a minimum, some cmdlets in the Az module (or an extension in azure-cli) would be nice to have to detect/correct the issue. Faffing around in storage explorer is unnecessarily cumbersome.

oising on 31 May 2020

Ok that's great - in this case I think I know what is happening and not sure if there is much we can do on the product side directly except maybe making it easier to get out of this state. We rely on the "Event Processor Host" SDK that Event Hubs provides which uses Azure Storage to create and store checkpoints for how far the stream has progressed. The naming convention is something like we create a container called azure-webjobs-eventhub, and then in it the EPH does naming of:

{namespace}/{event-hub-name}/{consumer-group} and tracks progress there.

The problem here is if you use the same AzureWebJobsStorage account, and the same names for namespace / event hub / consumer group, when the function starts it follows this naming, sees there were already some checkpoints created, and starts to try to resume processing. The problem is, even though it's named the same, this is a blank event hub, so it starts throwing errors like "I can't start from offset XXXX because I don't even have that offset."

So for now the recommendation would be to either delete one of the storage artifacts when deleting the Event Hub, name the event hub differently, or use a different AzureWebJobsStorage account to mitigate.

Is this the recommendation when an EventHub failover is done? Do a failover, delete the storage (and have logic in error eventhandler to keep restarting at intervals)?

rnarayana on 14 Oct 2020

We just ran into this as well. Deleting the blobs fixed the issue, but we also had to deal with the super cumbersome limitation that Azure Functions can't specify a starting offset, which we needed in order to avoid duplicate processing. We manually worked around that, and luckily a fix is done and waiting on deployment, which should help with future issues like this: https://github.com/Azure/azure-functions-eventhubs-extension/issues/64