In a session enabled queue, completing a message right after session lock expiry results in the error `This message has been already settled.` from https://github.com/Azure/azure-sdk-for-js/blob/0f05661cced8b0ed9cd036b7a13030499d185aa1/packages/%40azure/servicebus/data-plane/lib/serviceBusMessage.ts#L891
Whereas completing the message a little while after the session lock expiry results in the error `Cannot find the receiver with name ...` from https://github.com/Azure/azure-sdk-for-js/blob/0f05661cced8b0ed9cd036b7a13030499d185aa1/packages/%40azure/servicebus/data-plane/lib/serviceBusMessage.ts#L895
Neither is very helpful to the user.
These errors are thrown by the SDK even before the message settle request is made to Service Bus.
We need to understand 2 things here:

1. Understand how the `remote_settled` property got set to `true`

One possible fix is to check the `sessionLockedUntilUtc` before either of these errors is thrown. If the current time is greater than `sessionLockedUntilUtc`, then we can throw a better error message.

To see this behavior, try the test case at https://github.com/Azure/azure-sdk-for-js/blob/master/packages/%40azure/servicebus/data-plane/test/renewLockSessions.spec.ts#L90

See the delay at https://github.com/Azure/azure-sdk-for-js/blob/master/packages/%40azure/servicebus/data-plane/test/renewLockSessions.spec.ts#L593-L597

If you remove `lockDurationInMilliseconds` from that delay, then you will see the `This message has been already settled.` error. Else you will see the `Cannot find the receiver with name...` error.
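One way to sketch the proposed check (the helper name is hypothetical; it assumes a `sessionLockedUntilUtc` `Date` is still reachable at the settlement call site):

```typescript
// Hypothetical guard for the settlement path, assuming a
// sessionLockedUntilUtc Date is still reachable at that point.
function assertSessionLockNotExpired(sessionLockedUntilUtc: Date | undefined): void {
  if (sessionLockedUntilUtc && Date.now() > sessionLockedUntilUtc.getTime()) {
    throw new Error(
      "The session lock has expired. Settlement is no longer possible; " +
        "accept a new MessageSession to continue."
    );
  }
}
```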
2. Understand how there is no more receiver

We were not providing enough time in the delay here https://github.com/Azure/azure-sdk-for-js/blob/master/packages/%40azure/servicebus/data-plane/test/renewLockSessions.spec.ts#L614
before closing the session client at https://github.com/Azure/azure-sdk-for-js/blob/master/packages/%40azure/servicebus/data-plane/test/renewLockSessions.spec.ts#L615. This results in the receiver getting cleared out before the message is completed.
Current observation:
If we provide enough time inside the delay (enough for the session to expire and the code inside the callback to complete), completing the message should result in an error. But instead of giving an error, it completes the message successfully and removes it from the queue.
Sample test code:

```typescript
async function testAutoLockRenewalConfigBehavior(
  senderClient: QueueClient | TopicClient,
  receiverClient: QueueClient | SubscriptionClient,
  options: AutoLockRenewalTestOptions
): Promise<void> {
  let numOfMessagesReceived = 0;
  console.log("Test Started.............");
  await senderClient.getSender().send(testMessage);
  const sessionClient = await receiverClient.getSessionReceiver({
    sessionId: testSessionId,
    maxSessionAutoRenewLockDurationInSeconds: options.maxSessionAutoRenewLockDurationInSeconds
  });
  await sessionClient.receive(
    async (brokeredMessage: ServiceBusMessage) => {
      if (numOfMessagesReceived < 1) {
        numOfMessagesReceived++;
        should.equal(brokeredMessage.body, testMessage.body);
        should.equal(brokeredMessage.messageId, testMessage.messageId);
        // Sleep long enough for the session lock to expire before settling
        await delay(
          options.delayBeforeAttemptingToCompleteMessageInSeconds * 1000 +
            lockDurationInMilliseconds +
            1000
        );
        console.log("calling complete.............");
        await brokeredMessage.complete();
      }
    },
    onError,
    {
      autoComplete: false
    }
  );
  await delay(
    options.delayBeforeAttemptingToCompleteMessageInSeconds * 1000 +
      lockDurationInMilliseconds +
      1000 +
      4000
  );
  console.log("calling close............................");
  await sessionClient.close();
  if (uncaughtErrorFromHandlers) {
    chai.assert.fail(uncaughtErrorFromHandlers.message);
  }
  should.equal(numOfMessagesReceived, 1, "Mismatch in number of messages received");
}
```
So, looks like Service Bus is allowing us to complete a message even after the session has expired.
But we don't see that happening in `testBatchReceiverManualLockRenewalErrorOnLockExpiry`, which gives the `Cannot find the receiver with name` error.
When debugging `testBatchReceiverManualLockRenewalErrorOnLockExpiry`, I found that when the session expired, a receiver link close event was fired and `_onAmqpClose` got called, which then closes the receiver object in the SDK, resulting in the `Cannot find the receiver with name` error.
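That close path can be mimicked in a self-contained sketch; only `_onAmqpClose` is an SDK name here, while the receiver cache shape and the `settle` helper are assumptions for illustration:

```typescript
// Illustrative sketch (not SDK internals): a cache of open receivers that
// gets cleared when the link close event fires on session lock expiry.
const receivers = new Map<string, { name: string }>();

// Receiver registered when the session is accepted.
receivers.set("receiver-1", { name: "receiver-1" });

// Fired when Service Bus detaches the link on session lock expiry.
function _onAmqpClose(receiverName: string): void {
  // The SDK clears its bookkeeping for the closed link...
  receivers.delete(receiverName);
}

// ...so a later settle attempt can no longer find the receiver.
function settle(receiverName: string): void {
  if (!receivers.has(receiverName)) {
    throw new Error(`Cannot find the receiver with name '${receiverName}'.`);
  }
}
```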
In conclusion, on session lock expiry, the batching and streaming receivers behave differently. This needs more digging from our side to see what the difference between the 2 cases is.
For now, I have removed the tests that were meant to test that complete() fails when lock expires in the streaming cases. See https://github.com/Azure/azure-sdk-for-js/pull/1070/commits/729b84cf9b4c1650c2c886f85838b2dedc56e509
cc @HarshaNalluru, @ramya0820, @ShivangiReja
On enabling debug logs, I see the following error appear in the logs, which seems to be returned to us by Service Bus:
```
rhea:frames [connection-2]:3 <- detach#16 {"closed":true,"error":{"condition":"com.microsoft:session-lock-lost","description":"The session lock has expired on the MessageSession. Accept a new MessageSession. TrackingId:e5f257dd0000d66e000377895c54f867_G22_B40, SystemTracker:standardfruitsservivebus:Queue:unpartitioned-sessions-queue, Timestamp:2019-02-02T01:55:17"}}
azure:service-bus:error [connection-2] An error occurred for Receiver 'unpartitioned-sessions-queue-b9e14f9a-86da-b641-941f-a982c2f6e770': { SessionLockLostError: The session lock has expired on the MessageSession. Accept a new MessageSession. TrackingId:e5f257dd0000d66e00037ad45c54fc29_G38_B40, SystemTracker:standardfruitsservivebus:Queue:unpartitioned-sessions-queue, Timestamp:2019-02-02T02:11:19
    at Object.translate (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\@azure\amqp-common\dist\lib\errors.js:527:17)
    at Receiver.MessageSession._onAmqpError (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\test-dist\index.js:13415:44)
    at Receiver.emit (events.js:182:13)
    at emit (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea-promise\dist\lib\util\utils.js:129:24)
    at Object.emitEvent (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea-promise\dist\lib\util\utils.js:140:9)
    at Receiver._link.on (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea-promise\dist\lib\link.js:249:25)
    at Receiver.emit (events.js:182:13)
    at Receiver.link.dispatch (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\link.js:59:37)
    at Receiver.link.on_detach (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\link.js:159:32)
    at Session.on_detach (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\session.js:718:27)
    at Connection.(anonymous function) [as on_detach] (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\connection.js:737:30)
    at c.dispatch (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\types.js:909:33)
    at Transport.read (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\transport.js:108:36)
    at SaslClient.read (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\sasl.js:299:26)
    at Connection.input (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\connection.js:491:35)
    at TLSSocket.emit (events.js:182:13)
    at addChunk (_stream_readable.js:283:12)
    at readableAddChunk (_stream_readable.js:264:11)
    at TLSSocket.Readable.push (_stream_readable.js:219:10)
    at TLSWrap.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
  name: 'SessionLockLostError',
  translated: true,
  retryable: false,
  info: null,
  condition: 'com.microsoft:session-lock-lost' }.
```
However, this appears only when running the test

```typescript
it.only(` (First) Receive a msg using Batch Receiver, wait until its lock expires, completing it now results in error`, async function(): Promise<
  void
> {
  await testBatchReceiverManualLockRenewalErrorOnLockExpiry(senderClient, receiverClient);
});
```
first, followed by
```typescript
it.only(" (Second) Receives a message using Streaming Receiver renewLock() resets lock duration each time.", async function(): Promise<
  void
> {
  await testAutoLockRenewalConfigBehavior(senderClient, receiverClient, {
    maxSessionAutoRenewLockDurationInSeconds: 0,
    delayBeforeAttemptingToCompleteMessageInSeconds: 31
  });
  // Service bus completes the message even when the session lock expires.
});
```
The tests would pass, but the logs reveal the error, and only in the case of the streaming receiver test.
Running the streaming receiver test alone did not surface the error.
Overall, we need to take a closer look at what caused Service Bus to return the `SessionLockLostError`, and when and why.
cc @HarshaNalluru, @ramya0820, @ShivangiReja
@ramya0820 If I understood this correctly, you are saying that the `SessionLockLostError` error is seen only when the test that attempts to complete the message after waiting for the lock to expire using the batching receiver is followed by the test that does the same with the streaming receiver.
From what I have seen, the same error occurs when running the test for the batching receiver independently as well.
The order in which the tests are run isn't the issue. The issue is that on lock expiry, the underlying session receiver gets closed after encountering the `SessionLockLostError` error. But this only happens when using the batching receiver and not the streaming receiver.
Progress since last update:
Got the streaming receiver to trigger the `SessionLockLostError` event by trialing changes that generally mimic the batching receiver's flow.
With https://github.com/Azure/azure-sdk-for-js/pull/1128, it currently surfaces the `SessionLockLostError` up to the user's custom `onError` handler; this is triggered asynchronously, not when invoking `complete()` on the message.
The program exits, and the message handler / `.complete()` gets executed after that, eventually erroring out with a receiver-not-found error in the background.
Testing done - `streamingReceiverSessions.spec.ts` and `sessionTests.spec.ts` pass 100%.
(Update: as of the initial two commits, the `SessionLockLostError` was not getting thrown reliably yet - investigating.)
Overall, following is the list of issues/concerns to be addressed (looking at them in the given order):
- The batching receiver's `onError` handler is invoked, but that may need a re-visit, as its current error handling mechanism does not surface the appropriate error.
By the time the error occurs, the promise returned by `receiveBatch` is resolved. Therefore, there is literally no way to notify the user as soon as the session lock lost error occurs. Which is fine. As long as any further user engagement with the session receiver returns this error, we should be good.
We can probably do this by keeping track of all session ids for which the session lock lost error is seen, and then have the SDK throw the `SessionLockLostError` error when the user tries to settle a message from such sessions.
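A rough sketch of that bookkeeping (the names here are illustrative, not the SDK's actual internals):

```typescript
// Illustrative sketch: remember sessions whose lock was lost, and fail
// fast on any later settle attempt against those sessions.
const lockLostSessions = new Set<string>();

// Called from the AMQP error path when com.microsoft:session-lock-lost arrives.
function markSessionLockLost(sessionId: string): void {
  lockLostSessions.add(sessionId);
}

// Guard to run before any settle operation (complete/abandon/defer/deadletter).
function assertSessionUsable(sessionId: string): void {
  if (lockLostSessions.has(sessionId)) {
    throw new Error(
      "SessionLockLostError: The session lock has expired on the MessageSession. " +
        "Accept a new MessageSession."
    );
  }
}
```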
- Re-visit error handling in the streaming receiver with sessions, as the try/catch block around `onMessage` is a no-op/not useful during asynchronous `onError` event triggers.
The try/catch around `onMessage` is only useful when there is an incoming message. The session lock lost error cannot surface here. But it should surface from `_onAmqpError`, which should call `notifyError`, which should call the user's error handler.
@ramya-rao-a
The `receiveBatch()` default timeout for the first message is currently 60 seconds, so it seems reasonable to expect the promise to get rejected at the 30 second mark or so if auto lock renewal is disabled?
About throwing the error on message settling, agreed that seems reasonable... or maybe we could make use of the `sessionLockedUntilUtc` property.
> The try/catch around `onMessage` is only useful when there is an incoming message. The session lock lost error cannot surface here. But it should surface from `_onAmqpError`, which should call `notifyError`, which should call the user's error handler.
Yeah, I think the streaming receiver implementation is good for most part actually.
> receiveBatch() default timeout for first message is currently 60 seconds, so it seems reasonable to expect promise get rejected at 30 second mark or so if auto lock renewal is disabled?
Yes, but that will only happen if the expiry happens within those 60 seconds. I believe our test cases run the `receiveBatch()` call before the expiry and then call `complete()` on the message after expiry.
> or maybe we could make use of the sessionLockedUntilUtc property.
This has 2 issues:
1. To read the `sessionLockedUntilUtc` property, you need to have access to the receiver. But our problem is that the receiver is already closed and nullified by the time `complete()` is called after session expiry.
2. Even if we keep track of the `sessionLockedUntilUtc` value for each session, this might not be accurate data. Apparently, if you try to settle a message before session expiry, Service Bus automatically renews the lock on the session. When this happens, the data structure for the receiver on our side has no way of updating the `sessionLockedUntilUtc`.

Oh, the second issue seems new, good to know!
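The staleness problem can be illustrated with fixed timestamps (illustrative only, not SDK code):

```typescript
// Illustrative only: a cached sessionLockedUntilUtc diverges from the
// server after the service silently renews the lock on a settle attempt.
const t0 = 0;
let serverLockedUntil = t0 + 30_000; // lock valid for 30s
const cachedLockedUntil = serverLockedUntil; // the client's snapshot

// At t = 25s the user settles a message; the service renews the lock.
function settleOnServer(now: number): void {
  if (now <= serverLockedUntil) {
    serverLockedUntil = now + 30_000; // renewed to t = 55s
    // No notification reaches the client, so cachedLockedUntil stays at 30s.
  }
}

settleOnServer(t0 + 25_000);

// At t = 40s: the cache claims the session expired, the server disagrees.
const now = t0 + 40_000;
const cacheSaysExpired = now > cachedLockedUntil; // true
const actuallyExpired = now > serverLockedUntil; // false
```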
I was also thinking about what the consequences of losing session lock would actually mean to the end user.
Both theoretically and practically, it doesn't seem to be related to messages or settling of messages.
I would expect the session lock to affect interactions with the session. So while it makes sense that the session gets closed and becomes inaccessible once it expires (and hence settling fails), I would expect tests to center around that criterion (closing of the session link?) rather than purely on settling of messages.
Would like to see more user stories / close-to-real-world scenarios on what losing the session lock would mean, and maybe a look into how other SDKs handle it.
(Technically, settling should succeed if attempted via a new session. That makes me rethink a little our renaming of `acceptSession` to `getSessionReceiver`, because it's intended to be used to get a (new) session itself...)
Update - with https://github.com/Azure/azure-sdk-for-js/pull/1128:
Pending items:
`credit_window` is being passed as 0, but it's not clear if that truly disables the credit management.
Looks like it does disable any incoming messages until `addCredit` or `setCreditWindow` is called.
Error handling in the SDK can be made uniform.
Yes, agreed on all counts. Logged #1145 to track this.
Error handling in the streaming receiver - `onError` can be invoked asynchronously and does not stop execution of `onMessage`.
That's an interesting scenario.
In error scenarios, the receiver link gets closed after notifying the user, so there will be no more new messages incoming.
The interesting case is if the `onMessage` handler is called, and the code there is async and waiting for something, during which time the error is notified. In such a case, if the `onMessage` handler has only user code (i.e. not interacting with the receiver), then it will complete running (as it should). If it has any code interacting with the receiver, like settling a message, then that will throw an error. What happens to this error? Is the `onError` handler called again?
That's right, the user's `onError` handler would be invoked in such cases, and that needs to be understood by the user maybe. Unlike, say, a regular promise's reject and resolve callbacks, our callbacks are not mutually exclusive. That may need to change, or perhaps just be documented?
Yes, whenever there's an error, the `onError` handler gets invoked one way or the other. Only if `onMessage` itself throws an error would `onMessage` stop executing.
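The race discussed above can be simulated in a bare-bones sketch (all names here are illustrative, not the SDK's actual internals):

```typescript
// Illustrative simulation (not SDK internals): the error is notified while
// an async message handler is still awaiting; the later settle then fails
// and is surfaced through onError a second time.
async function simulateRace(): Promise<string[]> {
  const events: string[] = [];
  let receiverOpen = true;

  const onError = (e: Error) => events.push(`onError: ${e.message}`);

  // The async onMessage callback: it awaits, then tries to settle.
  const onMessage = (async () => {
    events.push("onMessage start");
    await new Promise((resolve) => setTimeout(resolve, 20));
    if (!receiverOpen) {
      // Settling against a closed receiver throws; routed to onError again.
      onError(new Error("receiver is closed"));
    } else {
      events.push("settled");
    }
  })();

  // Meanwhile the link detaches: the user is notified, the receiver closes.
  await new Promise((resolve) => setTimeout(resolve, 5));
  receiverOpen = false;
  onError(new Error("SessionLockLostError"));

  await onMessage;
  return events;
}
```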
Update - with https://github.com/Azure/azure-sdk-for-js/pull/1128:
Pending items:
Done with #1180