Azure-sdk-for-js: Event hub sometimes stops delivering messages, without errors

Created on 30 Jun 2019 · 12 Comments · Source: Azure/azure-sdk-for-js

Client Event Hubs customer-reported

All 12 comments

Thanks @jtjk for reporting this issue and testing against the new SDK. We have triaged the issue and will look into it.

@jtjk There have been a lot of changes in the Event Hubs SDK since the issue you linked was logged. Can you please provide us with some more information that we can work with?

  • Name and version of the package being used
  • Sample code for receiving messages, that is as close as possible to your code where you are seeing this issue
  • More details around your scenario just like others have described in https://github.com/Azure/azure-event-hubs-node/issues/30
  • Is this consistently reproducible, or does it happen only rarely? If you enable logs, we could benefit from looking at them to figure out what is going on

@ramya-rao-a
I can confirm that this issue is still present. After a random amount of time (could range from minutes to days, rarely weeks), the hub subscription will permanently stop delivering messages. There are no calls to error callbacks, and nothing gets logged (with default logging options).

Yes, it is consistently reproducible. Given enough time (usually 1-3 days), it will happen.

Dockerfile
FROM node:lts-alpine

package.json
"@azure/event-hubs": "2.1.1",

Code with unnecessary parts (e.g. docs, comments, logging output) removed:

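const { EventHubClient, EventPosition } = require('@azure/event-hubs');

// Starts one receiver per partition, reading only events that arrive after subscribing.
async function subscribe(connectionString, consumerGroup, messageHandler, errorHandler) {
  const client = await EventHubClient.createFromIotHubConnectionString(connectionString);
  const partitions = await client.getPartitionIds();
  // The handlers are shared by all partitions; errors are reported through errorHandler.
  partitions.forEach((partition) => client.receive(partition, messageHandler, errorHandler, {
    consumerGroup,
    eventPosition: EventPosition.fromEnd(),
    name: 'REDACTED'
  }));
}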

Functions given as messageHandler (onEventData) and errorHandler (onErrorDuringReceive):

  onEventData(eventData) {
    // logger.silly(`Message from IoT Hub: ${JSON.stringify(eventData)}`);
    this.onMessage(eventData.body);
  }

  onMessage(message) {
     // examine message data and potentially make a REST call which will callback to handle its results
  }

  onErrorDuringReceive(error) {
    logger.error(`Could not receive REDACTED: ${JSON.stringify(error)}`);
  }

Note: the docker container that runs the code briefly outlined above is running on an Azure Kubernetes Service instance, if that is of any importance.
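A side note on the error handler above: `JSON.stringify` serializes only own enumerable properties, so a plain `Error` comes out as `{}`. Whether that affects the errors this SDK passes to the handler is an assumption, since they may carry enumerable fields of their own. A minimal sketch of a logging helper that keeps the standard fields (`describeError` is a hypothetical name, not part of `@azure/event-hubs`):

```javascript
// Hypothetical helper (not part of @azure/event-hubs): flatten an error into a
// plain object so that JSON.stringify keeps its standard fields.
function describeError(error) {
  if (!(error instanceof Error)) {
    // Non-Error values (strings, plain objects) are wrapped unchanged.
    return { value: error };
  }
  return {
    ...error,              // copy any own enumerable properties first
    name: error.name,      // inherited from the prototype, so normally skipped
    message: error.message, // non-enumerable, so normally skipped
    stack: error.stack
  };
}

const sample = new Error('receiver link closed');
console.log(JSON.stringify(sample));                // standard fields are dropped
console.log(JSON.stringify(describeError(sample))); // name, message and stack are kept
```

In the handler this would read `logger.error(`Could not receive REDACTED: ${JSON.stringify(describeError(error))}`)`.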

@samerdokas Since this is consistently reproducible for you, can you please enable the logs so that we can get a clearer picture of the sequence of events?

At a minimum, you can set the env variable DEBUG to azure:event-hubs:error
For more verbose logging you can set it to azure:event-hubs*

@ramya-rao-a Confirming that the suggested debugging features were enabled as of today, in the two containers that were most affected by this issue; one had the DEBUG env var set to azure:event-hubs:error, the other to azure:event-hubs*.

I'll post an update when the issue manifests again.

Unsure if it's related, but it seems that we experienced something similar to this in the Event Hub in Azure itself today... After a Spring App restart, the Event hub responded briefly, but then fully stopped all message egress. Only switching to a different event hub made the flow run again. Azure metrics show continual message ingress, but zero egress.

@bhoogter Please log a support ticket with Azure Event Hubs. From their server side logging, they should be able to determine what went wrong if you can provide a timeframe and details of your event hub instance.

We are facing a similar issue with event-processor-host 2.1.0, which uses event-hubs 2.1.1 as a sub-dependency.
We were able to reproduce the error with debug logs enabled:
event-hubs-error-log.pdf. There we can see that there was a connection problem (OperationTimeoutError) to the event hub on 11/22/19 17:12:53.619. From that moment on, we never received a message from partition 0 again. There was no retry attempt in the following days. In addition, the error was swallowed silently: we did not receive any error that we could handle in our service. There was no other consumer in the same consumer group during the whole period of observation. Only after a manual service restart did connection-2 (partition 0) connect again, and we then received all missing events that were still within our event hub retention time.

My expected behavior would be that the event-hubs lib retries establishing the AMQP connection on a TimeoutError. If it is not able to reconnect after a few attempts, it should throw an error.
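The behavior asked for here, a bounded number of reconnection attempts followed by an error the caller can handle, can be sketched generically as below. This is an illustration only, not how the SDK implements its retries; `withRetry`, `maxAttempts` and `delayMs` are hypothetical names.

```javascript
// Hypothetical helper: run an async operation up to maxAttempts times and
// rethrow the last error if every attempt fails.
async function withRetry(operation, { maxAttempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // The attempt number is passed along so the caller can log it.
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Wait before the next attempt; the delay doubles each time.
        const waitMs = delayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, waitMs));
      }
    }
  }
  // All attempts failed: surface the error instead of swallowing it.
  throw lastError;
}
```

An application could wrap its own connect-and-subscribe step in such a helper, for example `withRetry(() => subscribe(connectionString, consumerGroup, onEventData, onErrorDuringReceive))`, so that a persistent failure reaches the caller as a rejected promise.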

Thanks for the logs @moritz-tr

While we look at the logs, please consider using version 2.1.3 of the @azure/event-hubs library.
We have made a few improvements around the error handling scenarios which should help.

Thanks for the feedback @ramya-rao-a. We have now installed the latest @azure/event-hubs and will wait until the next OperationTimeoutError occurs. We'll keep you posted!

We've released @azure/event-hubs 2.1.4, and @azure/event-processor-host 2.1.1 (which sets its minimum version of event-hubs to the one above). This update allows the SDK to detect when the connection has gone idle (no data or heartbeat received from the service for 60 seconds) so that it can attempt to reconnect.
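For anyone who cannot upgrade yet, the idea of idle detection can be approximated at the application level. The sketch below is an assumption-laden illustration, not the SDK's mechanism: `createIdleWatchdog` is a hypothetical helper, and because an application only sees messages (not AMQP heartbeats), a hub with no traffic would trigger it as well.

```javascript
// Hypothetical helper: call onIdle once if touch() has not been called for idleMs.
function createIdleWatchdog(idleMs, onIdle) {
  let timer = null;

  // (Re)start the countdown; any pending countdown is cancelled first.
  function arm() {
    clearTimeout(timer);
    timer = setTimeout(onIdle, idleMs);
  }

  arm();
  return {
    touch: arm, // call on every received message
    stop() {    // call when shutting down
      clearTimeout(timer);
      timer = null;
    }
  };
}
```

The message handler would call `watchdog.touch()` for every event, and `onIdle` would log the stall and re-create the receivers. After `onIdle` fires, the watchdog stays quiet until `touch()` is called again.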

I'm going to close this issue due to the improvements made from versions 2.1.1-2.1.4 of event hubs. If you see any problems please open a new issue.

Thanks for working with Microsoft on GitHub! Tell us how you feel about your experience using the reactions on this comment.
