Azure-sdk-for-js: [Service Bus] Errors on lost connection for send operations do not bubble up in time

Created on 4 Jul 2019 · 7Comments · Source: Azure/azure-sdk-for-js

Is your feature request related to a problem? Please describe.
I request a way (or advise on how) to detect if the connection to the service bus is lost.

Currently, after a successful connection there is no easy way to detect that the connection is broken.

Consider this snippet:

const { ServiceBusClient } = require('@azure/service-bus')

const client = ServiceBusClient.createFromConnectionString(process.env.CONN_STR)

const [, ,topic, interval] = process.argv;

const sender = client.createTopicClient(topic).createSender()

let count = 0

setInterval(async () => {
  const timestamp = Date.now()
  const messageNumber = count++

  try {
    console.log("Send: ", messageNumber, timestamp)
    await sender.send({ body: { timestamp, messageNumber } })
    console.log("Sent: ", messageNumber, timestamp, `${(Date.now() - timestamp)/1000}s`)
  } catch(error) {
    console.log("Error: ", messageNumber, timestamp, `${(Date.now() - timestamp)/1000}s`)
    console.error(error.message)
  }
}, interval)

Run like this:

CONN_STR=<instert> node index.js some-topic 1000

Provided that the connection was successful, the topic exists and the connection string gives us sufficient rights. This will start sending a message to some-topic every second:

Send:  0 1562248189379
Sent:  0 1562248189379 0.403s
Send:  1 1562248190379
Sent:  1 1562248190379 0.018s
Send:  2 1562248191382
Sent:  2 1562248191382 0.018s
Send:  3 1562248192383
Sent:  3 1562248192383 0.018s

Then, pull your internet connection, for example by disabling the WIFI. After this, it will simply attempt to send message unsuccessfully for some time:

Send:  13 1562248202392
Send:  14 1562248203393
Send:  15 1562248204393
Send:  16 1562248205393
Send:  17 1562248206393
Send:  18 1562248207393

Wait for ~_3 minutes_, and it will start outputting timeout errors for the messages:

Error:  13 1562248202392 198.007s
[connection-1] Sender "lenkan-lenkpad-heartbeat-fe08909a-216b-7b48-8dd5-d40d7985bbf2" with address "lenkan-lenkpad-heartbeat", was not able to send the message right now, due to operation timeout.

Describe the solution you'd like

I expected that the message sending would simply fail instantly or at least much quicker than it did. What is the recommended approach here? Is the operation timeout configurable?

Also, if an application that is using a ServiceBusClient is not itself initiating any messages, it would be completely oblivious to the lost connection. Because there is no way to detect this without sending a message and wait for ~3 minutes.

I _really_ like the minimal and clean interface that @azure/service-bus is providing, but I would like to be able to gain more insight into what happens in error cases.

Client Service Bus customer-reported feature-request

Source

lenkan

👍4

Most helpful comment

My apologies @lenkan and @jfggdl, this issue is indeed with the client and was wrongly classified as service related.

When there is a connection issue, we try to recover the connection behind the scenes for quite a while before reporting the problem. While this is preferable on the receiving side (in case of the streaming receiver), I see that it is not ideal when sending.

The promise returned by send() method should be rejected in such cases.
We will look into this and get back with potential options.

ramya-rao-a on 7 Dec 2019

👍2

All 7 comments

Thanks for this feedback. I have routed to the appropriate team for follow-up.

maggiepint on 9 Jul 2019

👍1

@maggiepint, would you please provide any information as to why you think the service-side code would need to change to support what @lenkan is describing. It looks like a SDK-only concern.

jfggdl on 30 Jul 2019

My apologies @lenkan and @jfggdl, this issue is indeed with the client and was wrongly classified as service related.

The promise returned by send() method should be rejected in such cases.
We will look into this and get back with potential options.

ramya-rao-a on 7 Dec 2019

👍2

I think to support detecting timeouts earlier, we should do something similar to what we did in event-hubs v5 (https://github.com/Azure/azure-sdk-for-js/issues/6564).

Basically, instead of retrying up to 150 times behind the scenes when a network is disconnected, we would honor the retry policy specified by the user and error out if we exhaust our retries.

6737 includes the code changes to support this for sending and receiving in event hubs and can be used as a reference.

chradek on 16 Jan 2020

@chradek Doing what we did in #6564 may not help with the 3 minute delay here.

My guess for the cause for the 3 min delay is the retries for the send operation itself that we have in place. There are a total of 3 attempts, with each having a timeout of 60 seconds, resulting in the 3 min delay in the error bubbling up to the user.

But, the above is true only for the send() operations that were already fired by the time the network was disconnected. The send() operations that follow the network disconnection should actually report back an error in 60 seconds as the init() would timeout and we don't have retries around this.

ramya-rao-a on 10 Apr 2020

@richardpark-msft, @chradek,

The send() operations that follow the network disconnection should actually report back an error in 60 seconds as the init() would timeout and we don't have retries around this.

I found the reason for the above behavior. After we disconnect the network manually,

The isOpen() on the sender link returns true until rhea disconnects the connection object. This takes about a minute from the time of us removing the network connection.
- Need to check why rhea's isOpen() doesnt return false here
- Need to check why rhea takes a minute to fire the disconnect event
So, for the first minute, each new send() operation is getting queued up thinking the link is just fine. Each of these will go through the 3 retries, with 60 sec timeout on each attempt and 15 sec delay between attempts. Therefore, the final timeout error bubbles up to the user after about 200 seconds. This is what @lenkan is seeing

But, I then found another disturbing issue where the send() operations after the first minute take much much longer to have their errors bubbled up

When rhea fires disconnect event on the connection, we go around closing open links. So, the isOpen() on the sender link no longer returns true, resulting in subsequent send() operations trying to take a lock in order to initialize the sender link.
But, the disconnect event on the connection also results in us trying to bring up the sender link ourselves, so we take a lock for over 3 hours!! (150 retries with 60 second timeout on each attempt and 15 second delay between attempts)
This results in the send() operations fired after the first minute of network going down bubbling the error up only after 3 hours

A side effect of the above is that if the network does come back up in those 3 hours, then the queued up send operations do get to continue, but not the ones that were fired in the first minute after the network being disconnected

ramya-rao-a on 14 Apr 2020

The problem described in the previous comment on the the errors for send operations being bubbled up only after 3 hours in a worse case scenario is now fixed in the version 1.1.6 that we just released for @azure/service-bus

The original issue that was reported on having to wait for 3 minutes to have the error reported is expected behavior. We have retries set up with a total of 3 attempts, each having a timeout of 60 seconds which results in the 3 minute wait time.

In the next major version update for the @azure/service-bus library, we will allow the user to customize the retry policy where you can tweak the number of attempts, timeout value and delay between attempts.

Thanks for reporting @lenkan!

ramya-rao-a on 24 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings