In a session enabled queue, completing a message right after session lock expiry results in the error `This message has been already settled.` from https://github.com/Azure/azure-sdk-for-js/blob/0f05661cced8b0ed9cd036b7a13030499d185aa1/packages/%40azure/servicebus/data-plane/lib/serviceBusMessage.ts#L891
Whereas completing the message a little while after the session lock expiry results in the error `Cannot find the receiver with name ...` from https://github.com/Azure/azure-sdk-for-js/blob/0f05661cced8b0ed9cd036b7a13030499d185aa1/packages/%40azure/servicebus/data-plane/lib/serviceBusMessage.ts#L895
Neither is very helpful to the user.
These errors are thrown by the SDK even before the message settle request is made to Service Bus.
We need to understand 2 things here:

1. Understand how the `remote_settled` property got set to `true`

One possible fix is to check the `sessionLockedUntilUtc` before either of these errors is thrown. If the current time is greater than `sessionLockedUntilUtc`, then we can throw a better error message.

To see this behavior, try the test case at https://github.com/Azure/azure-sdk-for-js/blob/master/packages/%40azure/servicebus/data-plane/test/renewLockSessions.spec.ts#L90

See the delay at https://github.com/Azure/azure-sdk-for-js/blob/master/packages/%40azure/servicebus/data-plane/test/renewLockSessions.spec.ts#L593-L597

If you remove `lockDurationInMilliseconds` from that delay, then you will see the `This message has been already settled.` error. Else you will see the `Cannot find the receiver with name...` error.
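One way to sketch the proposed check (the helper name is hypothetical; it assumes a `sessionLockedUntilUtc` `Date` is still reachable at the settlement call site):

```typescript
// Hypothetical guard for the settlement path, assuming a
// sessionLockedUntilUtc Date is still reachable at that point.
function assertSessionLockNotExpired(sessionLockedUntilUtc: Date | undefined): void {
  if (sessionLockedUntilUtc && Date.now() > sessionLockedUntilUtc.getTime()) {
    throw new Error(
      "The session lock has expired. Settlement is no longer possible; " +
        "accept a new MessageSession to continue."
    );
  }
}
```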
2. Understand how there is no more receiver

We were not providing enough time in the delay here https://github.com/Azure/azure-sdk-for-js/blob/master/packages/%40azure/servicebus/data-plane/test/renewLockSessions.spec.ts#L614
before closing the session client at https://github.com/Azure/azure-sdk-for-js/blob/master/packages/%40azure/servicebus/data-plane/test/renewLockSessions.spec.ts#L615. This results in the receiver getting cleared out before the message is completed.
Current observation:
If we provide enough time inside the delay (enough for the session to expire and the code inside the callback to complete), completing the message should result in an error. But instead of giving an error, it completes the message successfully and removes it from the queue.
Sample test code:

```typescript
async function testAutoLockRenewalConfigBehavior(
  senderClient: QueueClient | TopicClient,
  receiverClient: QueueClient | SubscriptionClient,
  options: AutoLockRenewalTestOptions
): Promise<void> {
  let numOfMessagesReceived = 0;
  console.log("Test Started.............");
  await senderClient.getSender().send(testMessage);
  const sessionClient = await receiverClient.getSessionReceiver({
    sessionId: testSessionId,
    maxSessionAutoRenewLockDurationInSeconds: options.maxSessionAutoRenewLockDurationInSeconds
  });
  await sessionClient.receive(
    async (brokeredMessage: ServiceBusMessage) => {
      if (numOfMessagesReceived < 1) {
        numOfMessagesReceived++;
        should.equal(brokeredMessage.body, testMessage.body);
        should.equal(brokeredMessage.messageId, testMessage.messageId);
        // Sleep long enough for the session lock to expire before settling
        await delay(
          options.delayBeforeAttemptingToCompleteMessageInSeconds * 1000 +
            lockDurationInMilliseconds +
            1000
        );
        console.log("calling complete.............");
        await brokeredMessage.complete();
      }
    },
    onError,
    {
      autoComplete: false
    }
  );
  await delay(
    options.delayBeforeAttemptingToCompleteMessageInSeconds * 1000 +
      lockDurationInMilliseconds +
      1000 +
      4000
  );
  console.log("calling close............................");
  await sessionClient.close();
  if (uncaughtErrorFromHandlers) {
    chai.assert.fail(uncaughtErrorFromHandlers.message);
  }
  should.equal(numOfMessagesReceived, 1, "Mismatch in number of messages received");
}
```
So, looks like Service Bus is allowing us to complete a message even after the session has expired.
But we don't see that happening in `testBatchReceiverManualLockRenewalErrorOnLockExpiry`, which gives the `Cannot find the receiver with name` error.
When debugging `testBatchReceiverManualLockRenewalErrorOnLockExpiry`, I found that when the session expired, a receiver link close event was fired and `_onAmqpClose` got called, which then closes the receiver object in the SDK, resulting in the `Cannot find the receiver with name` error.
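That close path can be mimicked in a self-contained sketch; only `_onAmqpClose` is an SDK name here, while the receiver cache shape and the `settle` helper are assumptions for illustration:

```typescript
// Illustrative sketch (not SDK internals): a cache of open receivers that
// gets cleared when the link close event fires on session lock expiry.
const receivers = new Map<string, { name: string }>();

// Receiver registered when the session is accepted.
receivers.set("receiver-1", { name: "receiver-1" });

// Fired when Service Bus detaches the link on session lock expiry.
function _onAmqpClose(receiverName: string): void {
  // The SDK clears its bookkeeping for the closed link...
  receivers.delete(receiverName);
}

// ...so a later settle attempt can no longer find the receiver.
function settle(receiverName: string): void {
  if (!receivers.has(receiverName)) {
    throw new Error(`Cannot find the receiver with name '${receiverName}'.`);
  }
}
```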
In conclusion, on session lock expiry, the batching and streaming receivers behave differently. This needs more digging from our side to see what the difference between the 2 cases is.
For now, I have removed the tests that were meant to test that complete() fails when lock expires in the streaming cases. See https://github.com/Azure/azure-sdk-for-js/pull/1070/commits/729b84cf9b4c1650c2c886f85838b2dedc56e509
cc @HarshaNalluru, @ramya0820, @ShivangiReja
On enabling debug logs, I see the following error appear in the logs, which seems to be returned to us by Service Bus:
```
rhea:frames [connection-2]:3 <- detach#16 {"closed":true,"error":{"condition":"com.microsoft:session-lock-lost","description":"The session lock has expired on the MessageSession. Accept a new MessageSession. TrackingId:e5f257dd0000d66e000377895c54f867_G22_B40, SystemTracker:standardfruitsservivebus:Queue:unpartitioned-sessions-queue, Timestamp:2019-02-02T01:55:17"}}
azure:service-bus:error [connection-2] An error occurred for Receiver 'unpartitioned-sessions-queue-b9e14f9a-86da-b641-941f-a982c2f6e770': { SessionLockLostError: The session lock has expired on the MessageSession. Accept a new MessageSession. TrackingId:e5f257dd0000d66e00037ad45c54fc29_G38_B40, SystemTracker:standardfruitsservivebus:Queue:unpartitioned-sessions-queue, Timestamp:2019-02-02T02:11:19
    at Object.translate (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\@azure\amqp-common\dist\lib\errors.js:527:17)
    at Receiver.MessageSession._onAmqpError (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\test-dist\index.js:13415:44)
    at Receiver.emit (events.js:182:13)
    at emit (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea-promise\dist\lib\util\utils.js:129:24)
    at Object.emitEvent (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea-promise\dist\lib\util\utils.js:140:9)
    at Receiver._link.on (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea-promise\dist\lib\link.js:249:25)
    at Receiver.emit (events.js:182:13)
    at Receiver.link.dispatch (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\link.js:59:37)
    at Receiver.link.on_detach (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\link.js:159:32)
    at Session.on_detach (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\session.js:718:27)
    at Connection.(anonymous function) [as on_detach] (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\connection.js:737:30)
    at c.dispatch (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\types.js:909:33)
    at Transport.read (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\transport.js:108:36)
    at SaslClient.read (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\sasl.js:299:26)
    at Connection.input (c:\workspace\newrepo2\azure-sdk-for-js\packages\@azure\servicebus\data-plane\node_modules\rhea\lib\connection.js:491:35)
    at TLSSocket.emit (events.js:182:13)
    at addChunk (_stream_readable.js:283:12)
    at readableAddChunk (_stream_readable.js:264:11)
    at TLSSocket.Readable.push (_stream_readable.js:219:10)
    at TLSWrap.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
  name: 'SessionLockLostError',
  translated: true,
  retryable: false,
  info: null,
  condition: 'com.microsoft:session-lock-lost' }.
```
However, this appears only when running the test

```typescript
it.only(` (First) Receive a msg using Batch Receiver, wait until its lock expires, completing it now results in error`, async function(): Promise<
  void
> {
  await testBatchReceiverManualLockRenewalErrorOnLockExpiry(senderClient, receiverClient);
});
```
first, followed by
```typescript
it.only(" (Second) Receives a message using Streaming Receiver renewLock() resets lock duration each time.", async function(): Promise<
  void
> {
  await testAutoLockRenewalConfigBehavior(senderClient, receiverClient, {
    maxSessionAutoRenewLockDurationInSeconds: 0,
    delayBeforeAttemptingToCompleteMessageInSeconds: 31
  });
  // Service bus completes the message even when the session lock expires.
});
```
The tests would pass, but the logs reveal the error, and only in the case of the streaming receiver test.
Running the streaming receiver test alone did not surface the error.
Overall, we need to take a closer look at what caused Service Bus to return the `SessionLockLostError`, and when and why.
cc @HarshaNalluru, @ramya0820, @ShivangiReja
@ramya0820 If I understood this correctly, you are saying that the `SessionLockLostError` error is seen only when the test that attempts to complete the message after waiting for the lock to expire using the batching receiver is followed by the test that does the same with the streaming receiver.
From what I have seen, the same error occurs when running the test for the batching receiver independently as well.
The order in which the tests are run isn't the issue. The issue is that on lock expiry, the underlying session receiver gets closed after encountering the `SessionLockLostError` error. But this only happens when using the batching receiver and not the streaming receiver.
Progress since last update:
Got the streaming receiver to trigger the `SessionLockLostError` event by trialing changes that generally mimic the batching receiver's flow.
With https://github.com/Azure/azure-sdk-for-js/pull/1128, it currently surfaces the `SessionLockLostError` up to the user's custom `onError` handler; this is triggered asynchronously, not when invoking `complete()` on the message.
The program exits, and the message handler / `.complete()` gets executed after that, eventually erroring out with a receiver-not-found error in the background.
Testing done - `streamingReceiverSessions.spec.ts` and `sessionTests.spec.ts` pass 100%.
(Update: as of the initial two commits, the `SessionLockLostError` was not getting thrown reliably yet - investigating.)
Overall, following is the list of issues/concerns to be addressed (looking at them in the given order):
- The batching receiver's `onError` handler is invoked, but that may need a re-visit, as its current error handling mechanism does not surface the appropriate error.
By the time the error occurs, the promise returned by `receiveBatch` is resolved. Therefore, there is literally no way to notify the user as soon as the session lock lost error occurs. Which is fine. As long as any further user engagement with the session receiver returns this error, we should be good.
We can probably do this by keeping track of all session ids for which the session lock lost error is seen, and then have the SDK throw the `SessionLockLostError` error when the user tries to settle a message from such sessions.
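A rough sketch of that bookkeeping (the names here are illustrative, not the SDK's actual internals):

```typescript
// Illustrative sketch: remember sessions whose lock was lost, and fail
// fast on any later settle attempt against those sessions.
const lockLostSessions = new Set<string>();

// Called from the AMQP error path when com.microsoft:session-lock-lost arrives.
function markSessionLockLost(sessionId: string): void {
  lockLostSessions.add(sessionId);
}

// Guard to run before any settle operation (complete/abandon/defer/deadletter).
function assertSessionUsable(sessionId: string): void {
  if (lockLostSessions.has(sessionId)) {
    throw new Error(
      "SessionLockLostError: The session lock has expired on the MessageSession. " +
        "Accept a new MessageSession."
    );
  }
}
```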
- Re-visit error handling in the streaming receiver with sessions, as the try/catch block around `onMessage` is a no-op/not useful during asynchronous `onError` event triggers.
The try/catch around `onMessage` is only useful when there is an incoming message. The session lock lost error cannot surface here. But it should surface from `_onAmqpError`, which should call `notifyError`, which should call the user's error handler.
@ramya-rao-a
The `receiveBatch()` default timeout for the first message is currently 60 seconds, so it seems reasonable to expect the promise to get rejected at the 30 second mark or so if auto lock renewal is disabled?
About throwing the error on message settling, agreed that seems reasonable... or maybe we could make use of the `sessionLockedUntilUtc` property.
> The try/catch around `onMessage` is only useful when there is an incoming message. The session lock lost error cannot surface here. But it should surface from `_onAmqpError`, which should call `notifyError`, which should call the user's error handler.
Yeah, I think the streaming receiver implementation is good for most part actually.
> receiveBatch() default timeout for first message is currently 60 seconds, so it seems reasonable to expect promise get rejected at 30 second mark or so if auto lock renewal is disabled?
Yes, but that will only happen if the expiry happens within those 60 seconds. I believe our test cases run the `receiveBatch()` call before the expiry and then call `complete()` on the message after expiry.
> or maybe we could make use of the sessionLockedUntilUtc property.
This has 2 issues:
1. To read the `sessionLockedUntilUtc` property, you need to have access to the receiver. But our problem is that the receiver is already closed and nullified by the time `complete()` is called after session expiry.
2. Even if we keep track of the `sessionLockedUntilUtc` value for each session, this might not be accurate data. Apparently, if you try to settle a message before session expiry, Service Bus automatically renews the lock on the session. When this happens, the data structure for the receiver on our side has no way of updating the `sessionLockedUntilUtc`.

Oh, the second issue seems new, good to know!
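The staleness problem can be illustrated with fixed timestamps (illustrative only, not SDK code):

```typescript
// Illustrative only: a cached sessionLockedUntilUtc diverges from the
// server after the service silently renews the lock on a settle attempt.
const t0 = 0;
let serverLockedUntil = t0 + 30_000; // lock valid for 30s
const cachedLockedUntil = serverLockedUntil; // the client's snapshot

// At t = 25s the user settles a message; the service renews the lock.
function settleOnServer(now: number): void {
  if (now <= serverLockedUntil) {
    serverLockedUntil = now + 30_000; // renewed to t = 55s
    // No notification reaches the client, so cachedLockedUntil stays at 30s.
  }
}

settleOnServer(t0 + 25_000);

// At t = 40s: the cache claims the session expired, the server disagrees.
const now = t0 + 40_000;
const cacheSaysExpired = now > cachedLockedUntil; // true
const actuallyExpired = now > serverLockedUntil; // false
```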
I was also thinking about what the consequences of losing session lock would actually mean to the end user.
Both theoretically and practically, it doesn't seem to be related to messages or settling of messages.
I would expect the session lock to affect interactions with the session. So while it makes sense that the session gets closed and becomes inaccessible once it expires (and hence settling fails), I would expect tests to center around that criterion (closing of the session link?) rather than purely on settling of messages.
Would like to see more user stories / close-to-real-world scenarios on what losing the session lock would mean, and maybe a look into how other SDKs handle it.
(Technically, settling should succeed if attempted via a new session. That makes me rethink a little our renaming of `acceptSession` to `getSessionReceiver`, because it's intended to be used to get a (new) session itself...)
Update - with https://github.com/Azure/azure-sdk-for-js/pull/1128:
Pending items:
`credit_window` is being passed as 0, but it's not clear if that truly disables the credit management.
Looks like it does disable any incoming messages until `addCredit` or `setCreditWindow` is called.
Error handling in the SDK can be made uniform.
Yes, agreed on all counts. Logged #1145 to track this.
Error handling in the streaming receiver - `onError` can be invoked asynchronously and does not stop execution of `onMessage`.
That's an interesting scenario.
In error scenarios, the receiver link gets closed after notifying the user, so there will be no more new messages incoming.
The interesting case is if the `onMessage` handler is called, and the code there is async and waiting for something, during which time the error is notified. In such a case, if the `onMessage` handler has only user code (i.e. not interacting with the receiver), then it will complete running (as it should). If it has any code interacting with the receiver, like settling a message, then that will throw an error. What happens to this error? Is the `onError` handler called again?
That's right, the user's `onError` handler would be invoked in such cases, and that needs to be understood by the user maybe. Unlike, say, a regular promise's reject and resolve callbacks, our callbacks are not mutually exclusive. That may need to change, or perhaps just be documented?
Yes, whenever there's an error, the `onError` handler gets invoked one way or the other. Only if `onMessage` itself throws an error would `onMessage` stop executing.
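The race discussed above can be simulated in a bare-bones sketch (all names here are illustrative, not the SDK's actual internals):

```typescript
// Illustrative simulation (not SDK internals): the error is notified while
// an async message handler is still awaiting; the later settle then fails
// and is surfaced through onError a second time.
async function simulateRace(): Promise<string[]> {
  const events: string[] = [];
  let receiverOpen = true;

  const onError = (e: Error) => events.push(`onError: ${e.message}`);

  // The async onMessage callback: it awaits, then tries to settle.
  const onMessage = (async () => {
    events.push("onMessage start");
    await new Promise((resolve) => setTimeout(resolve, 20));
    if (!receiverOpen) {
      // Settling against a closed receiver throws; routed to onError again.
      onError(new Error("receiver is closed"));
    } else {
      events.push("settled");
    }
  })();

  // Meanwhile the link detaches: the user is notified, the receiver closes.
  await new Promise((resolve) => setTimeout(resolve, 5));
  receiverOpen = false;
  onError(new Error("SessionLockLostError"));

  await onMessage;
  return events;
}
```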
Update - with https://github.com/Azure/azure-sdk-for-js/pull/1128:
Pending items:
Done with #1180