Provide the steps required to reproduce the problem
Create an EventHub trigger integration.
Pause or delete integration for a period that exceeds the retention of the EventHub
Resume the integration / unpause the trigger.
Upon resumption of the trigger, the stored offsets will be invalid. The EventHub trigger should compensate for this and be able to reset the offset.
In addition, upon the deletion of an input trigger, the corresponding blob data for offset checkpointing should be deleted from the storage account.
Any partitions with invalid offsets will constantly produce errors from the AMQP consumer. The trigger never fixes these offsets, and the error is not visible in the function app logs. It is only visible in Application Insights, so Microsoft support cannot identify the problem either.
e.g.
System.ArgumentException: The supplied offset '55838201792' is invalid. The last offset in the system is '30089580512' TrackingId:<redacted>_B14, SystemTracker:<redacted>:eventhub:<redacted>~12287, Timestamp:2019-01-18T04:56:57 Reference:<redacted>, TrackingId:<redacted>_B14, SystemTracker:<redacted>:eventhub:<redacted>~12287|$default, Timestamp:2019-01-18T04:56:57 TrackingId:<redacted>_G6, SystemTracker:gateway5, Timestamp:2019-01-18T04:56:57
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.PartitionReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.ReceivePumpAsync(CancellationToken cancellationToken, Boolean invokeWhenNoEvents)
System.OperationCanceledException: The AMQP object session36857 is aborted.
at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
at Microsoft.Azure.Amqp.AmqpObject.OpenAsyncResult.End(IAsyncResult result)
at Microsoft.Azure.Amqp.AmqpObject.EndOpen(IAsyncResult result)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.CreateLinkAsync(TimeSpan timeout)
at Microsoft.Azure.Amqp.FaultTolerantAmqpObject`1.OnCreateAsync(TimeSpan timeout)
at Microsoft.Azure.Amqp.Singleton`1.CreateValue(TaskCompletionSource`1 tcs, TimeSpan timeout)
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.PartitionReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.ReceivePumpAsync(CancellationToken cancellationToken, Boolean invokeWhenNoEvents)
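For anyone sifting through these exceptions in Application Insights, here is a minimal sketch that pulls the two offsets out of the message text. The pattern is based only on the exception quoted above, and the function name is made up for illustration; other SDK versions may word the message differently.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the ArgumentException text quoted in this issue.
# Assumption: the wording is stable; adjust it if your SDK version differs.
INVALID_OFFSET = re.compile(
    r"The supplied offset '(?P<supplied>\d+)' is invalid\. "
    r"The last offset in the system is '(?P<last>\d+)'"
)


def parse_invalid_offset(message: str) -> Optional[Tuple[int, int]]:
    """Return (supplied_offset, last_offset), or None if the message does not match."""
    match = INVALID_OFFSET.search(message)
    if match is None:
        return None
    return int(match.group("supplied")), int(match.group("last"))
```

A supplied offset larger than the last offset in the system, as in the trace above, means the stored checkpoint points past the end of the stream.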
We believe that deleting the blobs with the bad offsets will resolve the problem by causing the blob to be recreated.
The Azure EventHub and Functions integration should do two things:
Bonus: it would be nice if the user could see these errors in the logs of the functions app, but they do not appear there.
For the details:
The EventHub integration keeps offset data in a path in the storage account at:
azure-webjobs-eventhub/<namespace>.servicebus.windows.net/<eventhub name>/<consumer group>/
In here, there is a file for each partition. The contents of each file are structured as shown below:
{"Offset":"<offset count>","SequenceNumber":<number>,"PartitionId":"0","Owner":"<uuid>","Token":"<uuid>","Epoch":<number>}
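To make the layout concrete, here is a rough sketch that builds the blob prefix described above and parses one checkpoint file. The helper names and the `Checkpoint` class are hypothetical, and it assumes `Offset` always holds a numeric string, as in the sample.

```python
import json
from dataclasses import dataclass

CONTAINER = "azure-webjobs-eventhub"  # container name as described in this thread


def checkpoint_prefix(namespace: str, event_hub: str, consumer_group: str) -> str:
    """Build the blob prefix that holds one checkpoint file per partition."""
    return f"{namespace}.servicebus.windows.net/{event_hub}/{consumer_group}/"


@dataclass
class Checkpoint:
    partition_id: str
    offset: int
    sequence_number: int
    owner: str
    epoch: int


def parse_checkpoint(raw: str) -> Checkpoint:
    """Parse the JSON body of a single partition checkpoint blob."""
    data = json.loads(raw)
    return Checkpoint(
        partition_id=data["PartitionId"],
        offset=int(data["Offset"]),  # stored as a string in the blob
        sequence_number=data["SequenceNumber"],
        owner=data["Owner"],
        epoch=data["Epoch"],
    )
```

With the parsed `offset` in hand, you can compare it against the offsets reported in the exception for that partition.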
Is this just causing noise, or is the behavior after this happens incorrect?
Noise? No. This prevents messages from being ingested and processed by the function from any of the affected partitions until the offsets are fixed.
I reviewed the case here and understand what is happening. Sometimes the offset we have stored in the storage account (via the event processor host) may point to an offset that is no longer valid, for example if the message was removed from the Event Hub due to the retention policy. This puts your app in a corrupt state where you really have to delete the stored partition data, or create a new function / name so that it creates new partition data in storage.
Ideally we'd have some code that could catch this exception and potentially automatically clear the offset data?
We are starting to see this issue a lot as well through standard eventhub usage (nothing fancy, just the tutorial here). Deleting the blobs has thus far not worked for fixing whatever is causing this and it's happening every day now. And as is said above, this is only captured and viewable with app insights, but it doesn't specify which partition is at fault so it's difficult to track down where this even originates from.
@ththiem are you removing and recreating the Event Hub and using the same name? I chatted with the event hubs team and they mentioned that can cause the invalid offset exception as well.
Our workaround involves stopping the integration, deleting the blobs manually, then restarting the integration. There is no need to delete the EventHub because EH, unlike Kafka, does not track offsets for the consumer.
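The "delete the blobs manually" step could be scripted roughly as follows. This is only a sketch: it assumes a container client exposing `list_blobs(name_starts_with=...)` and `delete_blob(name)`, which is how I understand `ContainerClient` in the `azure-storage-blob` Python package to work. `InMemoryContainer` is a made-up stand-in so the logic can be dry-run without touching a real storage account.

```python
from types import SimpleNamespace
from typing import List


def delete_checkpoint_blobs(container_client, prefix: str) -> List[str]:
    """Delete every blob under `prefix` and return the deleted names."""
    # List first, then delete, so we never mutate while iterating.
    names = [blob.name for blob in container_client.list_blobs(name_starts_with=prefix)]
    for name in names:
        container_client.delete_blob(name)
    return names


class InMemoryContainer:
    """Hypothetical stand-in for a real container client (dry runs only)."""

    def __init__(self, names):
        self.names = set(names)

    def list_blobs(self, name_starts_with=None):
        prefix = name_starts_with or ""
        return [SimpleNamespace(name=n) for n in sorted(self.names) if n.startswith(prefix)]

    def delete_blob(self, name):
        self.names.remove(name)
```

Stop the function host before running anything like this. While the host is running it holds leases on the checkpoint blobs, and a leased blob may refuse deletion until the lease is released or broken.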
Unfortunately, your workaround didn't work for me. I did the following steps:
Did I miss anything?
If you want to dig in more, you'll need to read Exceptions from Application Insights for that function. The error messages should indicate which partition and offset value is causing the problem. If you're not getting Exceptions or they indicate a different problem, this may not solve the issue.
Under the path azure-webjobs-eventhub/<namespace>.servicebus.windows.net/<eventhub name>/<consumer group>/ you should see individual partitions, IIRC. These represent the data keeping track of offsets per partition.
We were testing our ARM templates and had the folly to redeploy our event hub. Next thing you know, all our consuming functions stop working. Then I arrive here. Crap.
FYI, stopping the function hosts and deleting the blobs in the consumer group ($Default usually) folder under the azure-event-hubs container in the associated storage accounts did the trick.
Hmm, I spoke too soon. I thought it fixed it, but it did not.
We are running into the same issue. We recreated our event hub and now we see this issue with offsets. Shouldn't EH Processor Host handle this on its own? What is the right solution or design here? @jeffhollan (ping shreyagr internally if you need more details)
We just ran into this issue as well. This was solved by deleting the whole azure-webjobs-eventhub blob in the storage account associated with the function.
Would be good to discover if there is a common root cause here? @brunhil do you know how or why the corruption was hit in the first place? Was it recreating event hubs with same name after delete?
@jeffhollan - I completely agree. We were deleting and recreating an event hub under the exact same name.
Let me know how I can be of assistance in identifying a root cause. I plan on looking at function logs to see if there's any more details on why the corruption happened.
Ok that's great - in this case I think I know what is happening and not sure if there is much we can do on the product side directly except maybe making it easier to get out of this state. We rely on the "Event Processor Host" SDK that Event Hubs provides which uses Azure Storage to create and store checkpoints for how far the stream has progressed. The naming convention is something like we create a container called azure-webjobs-eventhub, and then in it the EPH does naming of:
{namespace}/{event-hub-name}/{consumer-group} and tracks progress there.
The problem here is if you use the same AzureWebJobsStorage account, and the same names for namespace / event hub / consumer group, when the function starts it follows this naming, sees there was already some checkpoints created, and starts to try to resume processing. The problem is, even though it's named the same, this is a blank event hub, so it starts throwing errors like "I can't start from offset XXXX because I don't even have that offset."
So for now the recommendation would be to delete the storage artifacts when deleting the Event Hub, name the event hub differently, or use a different AzureWebJobsStorage account to mitigate.
@jeffhollan - I think this is a reasonable approach. It would be great to have this documented in the following page for Troubleshooting Event Hubs.
I think the only way to resolve this would be to add the ClientId of the EventHub to the message and then compare that against the messages as you receive them. Then you could tell if the EventHubs are different and handle accordingly
Thanks for sharing this guys. +1 to add the workaround in the Troubleshooting Event Hubs page.
Is there any special way / permission to delete this blob? It's disabled on my Azure portal...
found it ... was just lease on the blobs..
Hi @jeffhollan - my original problem was not due to deleting the EventHub. It was because the EventHub consumer was paused longer than the retention period. I just want to make clear that deleting storage, etc. were attempts to fix the problem, not the cause. That said, messing with storage etc. can land you in the same state.
I think EventHub just needs to detect when the offset is invalid and clean up the storage.
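Something along these lines is what I have in mind for the detection step. It is a sketch, not how the trigger actually works: the function name is invented, and it assumes you can obtain the last offset of each partition from somewhere, for instance partition runtime information from an Event Hubs client.

```python
from typing import Dict, List


def partitions_with_invalid_offsets(
    stored_offsets: Dict[str, int],
    last_offsets: Dict[str, int],
) -> List[str]:
    """Return partition ids whose checkpoint points past the end of the stream.

    stored_offsets: partition id -> offset read from the checkpoint blob.
    last_offsets:   partition id -> last offset the Event Hub reports.
    Partitions with no known last offset are skipped rather than flagged.
    """
    invalid = []
    for partition_id, stored in stored_offsets.items():
        last = last_offsets.get(partition_id)
        if last is not None and stored > last:
            invalid.append(partition_id)
    # Partition ids are numeric strings, so sort them numerically.
    return sorted(invalid, key=int)
```

Checkpoints for the returned partitions are the ones that would need to be deleted or reset. This only covers the "offset beyond the last offset" case from the exception in this issue; an offset that has aged out of retention would need a similar check against the earliest available offset.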
// @ShubhaVijayasarathy from event hubs
Ping. I just found that a trigger of ours has been broken for the last 12 days because of this issue. Just sitting there broken ... not good.
+1 please document at https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-messaging-exceptions
+1 please automate the recovery
At a minimum, some cmdlets in the Az module (or an extension in azure-cli) would be nice to have to detect/correct the issue. Faffing around in storage explorer is unnecessarily cumbersome.
Is this the recommendation when an EventHub failover is done? Do a failover, delete the storage (and have logic in error eventhandler to keep restarting at intervals)?
We just ran into this also, deleting the blobs fixed the issue, but we had to deal with the super cumbersome issue where Azure Functions can't specify a starting offset so we could avoid dupe processing. We manually worked around that, but luckily this is fixed and waiting on deployment: https://github.com/Azure/azure-functions-eventhubs-extension/issues/64 for future issues like this.