Azure-sdk-for-js: [EPH] Uncaught error thrown on disconnect

Created on 1 Jul 2019 · 13 Comments · Source: Azure/azure-sdk-for-js

  • Package Name: @azure/event-processor-host
  • Package Version: 1.0.6
  • Operating system:
  • [x] nodejs

    • version: 8.9.4

  • [x] typescript

    • version: 3.4.1

Describe the bug
While listening to a partition of an IoT Hub instance, the package sometimes throws an uncaught error.

The first issue is that the error can only be caught with process.on('uncaughtException', () => {...}).
The second is that the raised exception carries the string ([object Object]) as its message, which provides no information.
The last and most important issue is that, whether or not the Node.js application restarts after that error (we tried both scenarios), the EPH no longer consumes messages from the partition it was consuming unless the application is restarted manually after a while (unfortunately, I could not determine anything more precise).

Additional context below is important here.

Here is the stack trace of the uncaught error:

/!\ Unhandled Exception /!\ Error: Unhandled "error" event. ([object Object])
Error: Unhandled "error" event. ([object Object])
Error   at Connection.emit (events.js:186:19)
Error   at emit (/home/nodejs/app/node_modules/rhea-promise/dist/lib/util/utils.js:129:24)
Error   at Object.emitEvent (/home/nodejs/app/node_modules/rhea-promise/dist/lib/util/utils.js:140:9)
Error   at Connection._connection.on (/home/nodejs/app/node_modules/rhea-promise/dist/lib/connection.js:378:25)
Error   at emitOne (events.js:116:13)
Error   at Connection.emit (events.js:211:7)
Error   at Connection.dispatch (/home/nodejs/app/node_modules/rhea/lib/connection.js:242:37)
Error   at Connection.input (/home/nodejs/app/node_modules/rhea/lib/connection.js:518:18)
Error   at emitOne (events.js:116:13)
Error   at TLSSocket.emit (events.js:211:7)
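
The trace above matches the standard Node.js EventEmitter behavior: emitting an "error" event on an emitter with no "error" listener makes emit() throw, and on Node.js 8 a non-Error payload is stringified into the message, which is where ([object Object]) comes from. The following is a minimal sketch of that mechanism only. The emitter is a plain EventEmitter standing in for the connection, and the payload shape is an assumption for illustration, not what rhea actually sends.

```typescript
import { EventEmitter } from "events";

// Hypothetical stand-in for the AMQP connection; this is NOT the rhea API.
const connection = new EventEmitter();

interface EmitResult {
  threw: boolean;
  message?: string;
}

// Emit an "error" event and report whether emit() itself threw.
function emitError(emitter: EventEmitter, payload: unknown): EmitResult {
  try {
    emitter.emit("error", payload);
    return { threw: false };
  } catch (err) {
    return { threw: true, message: err instanceof Error ? err.message : String(err) };
  }
}

// Illustrative payload: a plain object rather than an Error instance.
const payload = { condition: "amqp:connection:forced" };

// No "error" listener registered yet: emit() throws.
const withoutListener = emitError(connection, payload);

// Once a listener exists, the same emit is delivered to it and nothing is thrown.
const received: unknown[] = [];
connection.on("error", (context: unknown) => {
  received.push(context);
});
const withListener = emitError(connection, payload);

console.log(withoutListener.threw, withListener.threw); // true false
```

In the real application the throw happens inside a socket callback, so there is no surrounding try/catch and the error surfaces as an uncaught exception.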

To Reproduce
Steps to reproduce the behavior:

  1. Listen to a partition with EPH
  2. Wait for an IoT Hub update

Expected behavior
The error should be raised properly, either via the onEphError function so that re-connection can be handled, or via a new onEphDisconnect function.

Additional context
We sent a ticket to the IoT Hub support, and they replied that an update was done at the time of the bug and that our Node.js application should handle the disconnection. Nevertheless, the fact that the error is not thrown properly and carries no real information makes it hard to handle. Even if it is caught, we have no clue about what to do to make sure EPH will start consuming normally again. We tried to make the application restart on the error but it did not work. Restarting the application after a while (we don't know how long) does make consumption start properly again, though.

Here is their full reply:

After further research it appears, based on the logs from our side, that we had an upgrade in process at that time. During the upgrade, there might be cases where the clients will get disconnected due to our backend service moving resources around. This error should be transient. If the client retries the operation, it should succeed. It appears your application is not configured to handle a disconnect; therefore, in order to prevent this from happening in the future, you are responsible for implementing a retry method within your code to be able to handle the exception and retry the operation. If you have any other questions or concerns please let me know. Otherwise, we can move forward and archive this case.

Note that this bug has only occurred twice in two months for one Node.js application, most probably linked each time to some kind of update in IoT Hub. However, each time the error occurred on all of our deployed instances at around the same time (within a few minutes).
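
The retry that the support reply asks for could look roughly like the sketch below: a generic retry helper with exponential backoff. Every name in it (retryWithBackoff, RetryOptions, sleep) is hypothetical and not part of @azure/event-processor-host. It also assumes the failure reaches the caller as a rejected promise, which is exactly what did not happen in this issue, so it illustrates the suggestion rather than a fix.

```typescript
// Hypothetical options for the sketch; not an EPH or Event Hubs type.
interface RetryOptions {
  retries: number;     // number of retries after the first attempt
  baseDelayMs: number; // delay before the first retry, doubled on each retry
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `operation`, retrying on rejection with exponential backoff.
// Rethrows the last error once all retries are used up.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= options.retries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < options.retries) {
        await sleep(options.baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap whatever operation starts or restarts consumption, for example retryWithBackoff(() => startHost(), { retries: 5, baseDelayMs: 1000 }), where startHost is a placeholder for the application's own start-up function.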

Client Event Hubs Iot customer-reported

All 13 comments

@Zell211 - thanks for giving the additional context on why this is hard to handle. It sounds like you already have established communication with support, I would recommend giving them this feedback as well.
We will route this feedback to the IotHub team to see if they can improve the feature.

@mozehgir This appears to be an issue with the EPH library @azure/event-processor-host. Changing labels appropriately.

@Zell211 Thanks for reporting, we will look into replicating a connection disconnect scenario and see how EPH behaves.

@ShivangiReja Can you run our EPH sample with simulation of a connection disconnect event like we did with Event Hubs a few months ago and see if EPH resumes receiving messages or not?

@ShivangiReja has confirmed that on simulating network loss EPH is able to resume receiving events when the network is back up. In hindsight, this is because network loss results in rhea firing the disconnected event, which we listen for and respond to appropriately.

From the call stack shared above, the event being fired by rhea is error. The relevant line of code from rhea is https://github.com/amqp/rhea/blob/1.0.4/lib/connection.js#L518

rhea does track this as one of the ConnectionEvents, and rhea-promise does emit the same on its connection, but Event Hubs does not listen for this event. Event Hubs only listens for the connection_open and disconnected events.

@amarzavery Any idea what scenarios would make rhea fire the error event but not follow it up with a disconnected event? What is the expectation from the client here? Log/report the error and carry on?

Since the socket is closed right after dispatching the error event, we might want to try to bring back the connection regardless of whether the error is retryable. But if the error event is followed by the disconnected event, then we only need to log the error.

Next steps:

  • Update @azure/event-hubs to add an event handler for the error event on the connection and log the error
  • Since we are not aware of what error is actually being thrown, we will work with IotHub folks to simulate the update scenario
  • Based on how the above goes, we may decide to add the "bring back the connection" logic to the error event handler or be happy with just listening for the event and logging it.
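
The two options in the next steps above can be sketched as follows: always log the error event, and only trigger a reconnect if no disconnected event follows within a grace period. This is an illustration under stated assumptions, not the change that was made to @azure/event-hubs. Only the event names "error" and "disconnected" come from this thread; watchConnection, its parameters, the grace period, and the plain EventEmitter standing in for the connection are all hypothetical.

```typescript
import { EventEmitter } from "events";
import { inspect } from "util";

// Hypothetical helper: logs every "error" event and calls `reconnect`
// only when no "disconnected" event arrives within `graceMs`.
function watchConnection(
  connection: EventEmitter,
  reconnect: () => void,
  graceMs: number,
  log: (message: string) => void = console.error
): void {
  let pending: ReturnType<typeof setTimeout> | undefined;

  connection.on("error", (context: unknown) => {
    // Registering this listener is what prevents the uncaught exception.
    log(`connection error: ${inspect(context, { depth: 1 })}`);
    if (pending === undefined) {
      pending = setTimeout(() => {
        pending = undefined;
        reconnect(); // no "disconnected" followed, so recover here
      }, graceMs);
    }
  });

  connection.on("disconnected", () => {
    // Assumption: existing "disconnected" handling already reconnects,
    // so the fallback timer is simply cancelled.
    if (pending !== undefined) {
      clearTimeout(pending);
      pending = undefined;
    }
  });
}
```

The timer is a design choice for the sketch: it avoids reconnecting twice when both events fire, at the cost of a short delay when only the error event fires.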

@ramya-rao-a thank you for your help.

I found that the problem occurred on an instance on which we activated debug logs. I don't know if this can be of any help but here is an extract around the issue that occurs at _2019-06-27T10:51:04.506Z_ (line 48).

extract-2019-07-03_08-43-52.txt

Thanks for the logs @Zell211, they will certainly help. Can you provide the logs for duration of say 2 minutes prior to what you have shared?

Hello,
Were the logs I gave you enough? Is there anything new on this issue?

@Zell211 We could see 2 separate things happening in the logs

  • The unhandled error as you mentioned in the issue description here
  • Inability to re-connect the connection with the error InvalidOperationError: A link to connection '...' $cbs node has already been opened

For the first issue, we plan to add an event handler that will just log the error. That should take care of the unhandled exception. We have other code already in place that will take care of re-connecting

We have seen the second issue in the Event Hubs library (the one EPH depends on) and have fixed it in versions that are higher than the one used by EPH. Since that fix came after a breaking change, the major version for Event Hubs has changed and so you won't get that fix for free yet. We will be releasing Event Hubs with an older version number with the fix backported.

In conclusion,

  • Please wait for us to release a version of Event Hubs with the relevant fixes.
  • After this, you will need to re-deploy your code so that the above version of Event Hubs will get installed and used.

@Zell211 Can you share which of the various static helpers you are using to create the Event Processor Host in your application?

@ramya-rao-a We use createFromIotHubConnectionString.

After reading your release notes for 2.0, I updated the package to 2.1.0 in a test environment, since it should not introduce any breaking changes for us. The first tests seem conclusive.

Thanks a lot for the information and efforts.

@Zell211 @azure/event-processor-host version 2.0.0 has been released with relevant fixes. Please use this version and let us know how it goes.

For more details on the release, please refer to the changelog.

