The following tests in misc.spec.ts are failing:
Both tests fail because receive() yields 0 events (as opposed to 5). The Node counterpart tests pass frequently, and occasionally fail by receiving just 4 events instead of 5.
The browser tests, however, fail consistently by receiving 0 events.
Expected behavior: Both the Node and browser tests should pass consistently, i.e., reflect reliable and consistent receiving of events in batch.
Steps to reproduce: Check out the latest commit from https://github.com/Azure/azure-sdk-for-js/pull/3445 if not already available on master.
The issue of Node tests sometimes failing by receiving just 4 events instead of 5 is being tracked in https://github.com/Azure/azure-sdk-for-js/issues/1751. Therefore, please use this issue to investigate why we are getting 0 events in browser mode. The two may or may not be related. If we can get the browser test to receive a non-zero number of events, that is a good enough resolution of this issue, and the 4 vs 5 events question can continue to be investigated as part of https://github.com/Azure/azure-sdk-for-js/issues/1751.
Findings:
On enabling and analyzing rhea logs, found that credits are managed differently in the browser vs Node.
Since the tests include an intermediary check of whether any events (up to 5) are received before sending, 5 credits get added onto the link.
In Node, those credits remain unused until the next (main) receive() call.
In the browser, however, the link gets an 'accepted' event after the messages are sent. While the messages are being sent, the link immediately uses the leftover credits and receives the messages (events) before the subsequent receive() call.
Hence, no messages are available for that call and the count always ends up being 0.
Tested this theory by altering the credits and verifying in the logs that the corresponding differences in credit numbers show up.
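For reference, here is a simplified sketch of the test flow being described (not the actual misc.spec.ts code; the connection string, event hub name, and partition id are placeholders), assuming the track 1 @azure/event-hubs batching API:

```typescript
import { EventHubClient, EventData } from "@azure/event-hubs";

async function reproFlow(connectionString: string, hubName: string, partitionId: string): Promise<void> {
  const client = EventHubClient.createFromConnectionString(connectionString, hubName);

  // Intermediary check: expect 0 pre-existing events. Per the findings above, this call
  // leaves 5 unused credits on the receiver link once it times out.
  const before = await client.receiveBatch(partitionId, 5, 10);
  console.log("events before send:", before.length);

  // Send 5 events. Per the findings, in the browser the leftover credits get consumed
  // while the sends settle, so the events are delivered before the next receiveBatch() call.
  const events: EventData[] = [1, 2, 3, 4, 5].map(i => ({ body: `event-${i}` }));
  await client.sendBatch(events, partitionId);

  // Main receive: expected 5 events, but 0 in the browser because they were already
  // consumed against the leftover credits.
  const after = await client.receiveBatch(partitionId, 5, 10);
  console.log("events after send:", after.length);

  await client.close();
}
```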
Workaround for tests
However, we'll need to investigate the negative scenario where multiple receive() calls are made within the same execution window. More analysis is required as to why rhea credit management works differently in Node vs the browser. @bterlson @ramya-rao-a @amarzavery are there any differences we know of off the bat about why rhea might work differently with WebSockets? Is there a different service-side implementation for WebSockets?
cc @ramya-rao-a @bterlson @amarzavery @AlexGhiondea
Below screenshots show the differences observed so far - browser logs on the left, Node on the right.
Browser run receives 'accepted' event on link that Node doesn't:
Browser run receives messages immediately after sending. This doesn't happen in Node:
(For this run, the test was modified so that the intermediary check tries to receive 2 messages; these 2 get received in the browser as shown, and the remaining 3 show up in the subsequent call.)
Also, the root cause has been zeroed-in on: the 'accepted' event occurring on the link early on in browsers results in credits getting used up while the send is in progress, as opposed to remaining unused/undrained.
Maybe this still works in Node because the send operation is blocking there, while that is not the case in the browser? If so, then maybe the solution is to update the code so we drain credits at the end of each receive call.
@ramya-rao-a @bterlson @amarzavery @nemakam any thoughts/opinions to share on this approach?
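For concreteness, a rough sketch of what draining leftover credits at the end of a batched receive could look like using rhea-promise primitives. This mirrors the flow frame with drain=true seen in the Service Bus logs further below; it is a sketch under those assumptions, not the actual SDK change:

```typescript
import { Receiver, ReceiverEvents } from "rhea-promise";

// Ask the service to consume any credits left on the link and wait for the drained event.
function drainLeftoverCredits(receiver: Receiver): Promise<void> {
  return new Promise<void>(resolve => {
    if (receiver.credit <= 0) {
      return resolve();
    }
    const onDrained = () => {
      receiver.removeListener(ReceiverEvents.receiverDrained, onDrained);
      receiver.drain = false;
      resolve();
    };
    receiver.on(ReceiverEvents.receiverDrained, onDrained);
    // Setting drain and adding a credit makes rhea send a flow frame with drain=true.
    receiver.drain = true;
    receiver.addCredit(1);
  });
}
```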
Update:
Tried adding code to drain leftover credits at the end of each receive call.
However, it did not work for Event Hubs, as the service did not seem to be draining the credits. For comparison, the following is the sequence of events noted in Service Bus when a drain occurs:
00:50:13.585Z azure:service-bus:receiverbatching [connection-1] Receiver 'unpartitioned-queue-5406ac3b-512d-c64c-a6ae-8c01e8b17878': Draining leftover credits(3).
2019-06-06T00:50:13.585Z rhea:frames [connection-1]:3 -> flow#13 {"incoming_window":2048,"outgoing_window":4294967295,"link_credit":4,"drain":true}
2019-06-06T00:50:13.585Z rhea:raw [connection-1] SENT: 38 0000002602000003005313d000000016000000094370000008004370ffffffff434352044041
2019-06-06T00:50:13.662Z rhea:io [connection-1] read 35 bytes
2019-06-06T00:50:13.662Z rhea:io [connection-1] got frame of size 35
2019-06-06T00:50:13.662Z rhea:raw [connection-1] RECV: 35 0000002302000003005313c0160b437000001388520170000007ff4352044343414240
2019-06-06T00:50:13.662Z rhea:frames [connection-1]:3 <- flow#13 {"incoming_window":5000,"next_outgoing_id":1,"outgoing_window":2047,"delivery_count":4,"drain":true}
2019-06-06T00:50:13.662Z rhea:events [connection-1] Link got event: receiver_flow
2019-06-06T00:50:13.662Z rhea-promise:receiver [connection-1] receiver got event: 'receiver_flow'. Re-emitting the translated context.
2019-06-06T00:50:13.662Z rhea-promise:translate [connection-1] Translating the context for event: 'receiver_flow'.
2019-06-06T00:50:13.662Z rhea:events [connection-1] Link got event: receiver_drained
2019-06-06T00:50:13.662Z rhea-promise:receiver [connection-1] receiver got event: 'receiver_drained'. Re-emitting the translated context.
2019-06-06T00:50:13.663Z rhea-promise:translate [connection-1] Translating the context for event: 'receiver_drained'.
2019-06-06T00:50:13.663Z azure:service-bus:receiverbatching [connection-1] Receiver 'unpartitioned-queue-5406ac3b-512d-c64c-a6ae-8c01e8b17878' drained. Resolving receiveMessages() with 0 messages.
However, none of these events occur with Event Hubs, i.e., no 'receiver_flow' or 'receiver_drained' events are raised on the link:
azure:event-hubs:receiverbatching [connection-1] Receiver 'ff6dc6e2-7260-4434-a70b-802cd930ed69': Draining leftover credits(5).
2019-06-06T01:02:19.231Z rhea:frames [connection-1]:2 -> flow#13 {"incoming_window":2048,"outgoing_window":4294967295,"link_credit":6,"drain":true}
2019-06-06T01:02:19.231Z rhea:raw [connection-1] SENT: 38 0000002602000002005313d000000016000000094370000008004370ffffffff434352064041
2019-06-06T01:02:19.232Z azure:event-hubs:error [connection-1] Sender 'ecc85fe6-318d-4bf8-afc4-cd77d7a09fe3' with address 'eventhubtest/Partitions/0' is open? -> undefined
2019-06-06T01:02:19.232Z azure:event-hubs:sender Acquiring lock sender-87276ec4-2a5e-4f40-9252-d4dd10920cc9 for initializing the session, sender and possibly the connection.
2019-06-06T01:02:19.232Z azure:event-hubs:error [connection-1] Sender 'ecc85fe6-318d-4bf8-afc4-cd77d7a09fe3' with address 'eventhubtest/Partitions/0' is open? -> undefined
2019-06-06T01:02:19.232Z azure:event-hubs:error [connection-1] The sender 'ecc85fe6-318d-4bf8-afc4-cd77d7a09fe3' with address 'eventhubtest/Partitions/0' is not open and is not currently establishing itself. Hence let's try to connect.
2019-06-06T01:02:19.232Z azure:event-hubs:linkEntity [connection-1] Acquiring cbs lock: 'negotiateCbs-a02fb3d2-616a-e64a-8bf2-8eed3ee963b7' for creating the cbs session while creating the EventHubSender: 'ecc85fe6-318d-4bf8-afc4-cd77d7a09fe3' with address: 'eventhubtest/Partitions/0'.
So the issue still remains unresolved as the drain isn't working. The changes made to enable draining of leftover credits can be reviewed at https://github.com/Azure/azure-sdk-for-js/pull/3495/files.
Also tried disabling 'autoaccept'; that did not make a difference either, as the browser run was still receiving an 'accepted' event where Node doesn't:
azure:event-hubs:linkEntity [connection-1] Negotiated claim for EventHubSender '178c4309-147f-4426-9385-f66dba786d57' with with address: eventhubtest/Partitions/0 +81ms
index.js:12464 azure:event-hubs:error [connection-1] Trying to create sender '178c4309-147f-4426-9385-f66dba786d57'... +84ms
index.js:12464 azure:event-hubs:sender Creating sender with options: Object +84ms
index.js:12464 rhea-promise:eventhandler [connection-1] rhea-promise 'session' object is listening for events: (8) ["session_open", "session_error", "session_close", "settled", "sender_error", "sender_close", "receiver_error", "receiver_close"] emitted by rhea's 'session' object. +10s
index.js:12464 rhea-promise:session [connection-1] Calling amqp session.begin(). +10s
index.js:12464 rhea:events [connection-1] Link got event: accepted +3ms
index.js:12464 rhea-promise:sender [connection-1] sender got event: 'accepted'. Re-emitting the translated context. +10s
index.js:12464 rhea-promise:translate [connection-1] Translating the context for event: 'accepted'. +3ms
index.js:12464 rhea:events [connection-1] Link got event: settled +0ms
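For reference, a rough sketch of how autoaccept can be disabled on a rhea-promise receiver, assuming the experiment above disabled the receiver's autoaccept option (it defaults to true):

```typescript
import { Connection, Receiver, ReceiverOptions } from "rhea-promise";

async function createManualSettleReceiver(connection: Connection, address: string): Promise<Receiver> {
  const receiverOptions: ReceiverOptions = {
    name: "receiver-manual-settle",
    source: { address },
    autoaccept: false, // received messages are no longer accepted automatically
    credit_window: 0   // credits are managed manually via addCredit()
  };
  return connection.createReceiver(receiverOptions);
}
```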
Update:
Overall, it looks like we need to engage the Event Hubs service team, as the events being raised on the link are controlled by the service.
cc @AlexGhiondea @ramya-rao-a
Great job on the investigations @ramya0820
One question, have you tried running the same test for Service Bus and comparing Node vs browser behavior? This will help us narrow down whether this browser vs Node difference is due to rhea or the service. While doing this, remove the logic we have for drain in Service Bus to get the right comparison.
The fact that the service is not respecting the drain request could be because they don't support it yet. This doesn't affect our customers yet, because they don't use the `receive()` function on the `BatchingReceiver` directly. Each `receiveBatch()` request from the user creates a whole new batching receiver and closes the same at the end of the operation.
Also, the links shared in https://github.com/Azure/azure-sdk-for-js/issues/3471#issuecomment-499297101 seem to be broken. Can you share the logs for browser vs node for the time between https://github.com/Azure/azure-sdk-for-js/blob/%40azure/service-bus_1.0.2/sdk/eventhub/event-hubs/test/misc.spec.ts#L168-L171 ?
DEBUG=rhea*,-rhea:io,-rhea:raw,azure:event-hubs
Also, to rule out the case that the WebSocket end on the service side could be the culprit, we can run the Node tests using WebSockets.
Refer to the useProxy.ts sample where we force rhea to use WebSockets when talking to Event Hubs (skip all the details related to the proxy; just import the ws package in our test code and pass the WebSocket in the options when creating the EventHubClient in the test code).
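A rough sketch of what that would look like, adapted from the useProxy.ts sample with the proxy details removed (the environment variable names here are placeholders, not necessarily what the tests use):

```typescript
import { EventHubClient } from "@azure/event-hubs";
import WebSocket from "ws";

const connectionString = process.env.EVENTHUB_CONNECTION_STRING || "";
const eventHubName = process.env.EVENTHUB_NAME || "";

// Passing the ws implementation makes rhea tunnel AMQP over WebSockets from Node,
// which is what the browser does by default.
const client = EventHubClient.createFromConnectionString(connectionString, eventHubName, {
  webSocket: WebSocket
});
```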
> Great job on the investigations @ramya0820
> One question, have you tried running the same test for Service Bus and compare node vs browser behavior? This will help us narrow down on whether this browser vs node difference is due to rhea or the service. While doing this remove the logic we have for drain in Service Bus to get the right comparison.
> The fact that the service is not respecting the drain request could be because they don't support it yet. This doesn't affect our customers yet, because they don't use the `receive()` function on the `BatchingReceiver` directly. Each `receiveBatch()` request from the user creates a whole new batching receiver and closes the same at the end of the operation.
Yes, on running the browser tests for Service Bus, found that the 'accepted' event gets raised for it as well. So likely, this isn't the problem. It looks like draining of credits is the solution that we need to have in place, as that would prevent messages from being received prematurely.
> Also, the links shared in #3471 (comment) seem to be broken. Can you share the logs for browser vs node for the time between https://github.com/Azure/azure-sdk-for-js/blob/%40azure/service-bus_1.0.2/sdk/eventhub/event-hubs/test/misc.spec.ts#L168-L171 ?
> DEBUG=rhea*,-rhea:io,-rhea:raw,azure:event-hubs
Updated the image references, are you able to see them now?
> Also, to rule out the case that the Websocket end on the service side could be the culprit, we can run the Node tests using the Websockets
Got it, just tried this and verified the logs - the tests pass, and Node using WebSockets does not hit the issue. So we can rule this one out.
> Yes, on running browser tests for Service Bus, found that 'accepted' event gets raised for it as well. So likely, this isn't the problem. It looks like draining of credits is the solution that we need to have in place as that would prevent messages from being received.
What about the same test, but in Node for Service Bus. Does 'accepted' event get raised then?
> What about the same test, but in Node for Service Bus. Does 'accepted' event get raised then?
Nope, that seems to be happening just in browser (for Service Bus as well).
Great, ok now to summarize:
- In browser mode, the `accepted` event is fired when there are pending credits on the receiver link and the service gets new events/messages. This doesn't happen in Node.
- Use of WebSockets is not the problem, as we don't see a different behaviour in Node with and without WebSockets.
- This behaviour is common to both Service Bus and Event Hubs, and so the difference is most likely in rhea.

Therefore, our next stop is the rhea land :)
So the `accepted` event is common to both and has to do with rhea. But it isn't creating a problem in Service Bus, so it doesn't seem like that event itself is the problem. In Service Bus we drain credits after receiveBatch(), but not in Event Hubs. Is there any reason why we might not want to? (It seems like this would most definitely solve the problem, and maybe other corner cases too.)
Other than that, yes, looking into rhea and rhea-promise would likely help get to why Node is not receiving messages while sending (even though it has credits), and why the browser does. (fun. :s )
Yes, the reason Service Bus doesn't have this problem is because it drains credits.
The ideal solution is to implement draining in Event Hubs as well, but I am guessing that the service doesn't support the draining feature. We will definitely pick that up with the service team.
The following sample reproduces the Event Hubs send and receive batched messages scenario (i.e., the failing test case scenario referenced in the description) using the rhea-promise samples, for both browser and Node.
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the Apache License. See License in the project root for license information.
import {
Connection,
Sender,
EventContext,
Message,
ConnectionOptions,
Delivery,
SenderOptions,
ReceiverOptions,
delay,
ReceiverEvents,
types,
Receiver
} from "../lib";
import * as dotenv from "dotenv"; // Optional for loading environment configuration from a .env (config) file
dotenv.config();
// <your-namespace>.servicebus.windows.net
const host = "";
// SharedAccessKeyName (usually "RootManageSharedAccessKey")
const username = "RootManageSharedAccessKey";
// SharedAccessKey
const password = "";
const port = parseInt("5671");
const senderAddress = "<event-hub-name>/Partitions/0";
const receiverAddress = "<event-hub-name>/ConsumerGroups/$default/Partitions/0";
async function main(): Promise<void> {
const connectionOptions: ConnectionOptions = {
transport: "tls",
host: host,
hostname: host,
username: username,
password: password,
port: port,
reconnect: false
/* webSocketOptions: {
webSocket: WebSocket,
protocol: ["AMQPWSB10"],
url:
"wss://<your-namespace>.servicebus.windows.net:443/$servicebus/websocket",
options: {}
} */
};
const connection: Connection = new Connection(connectionOptions);
console.log("waiting");
await delay(10000);
console.log("done waiting");
await connection.open();
// Create a single receiver once
const receiverName = "receiver-1";
// receive messages from the past one hour
const filterClause = `amqp.annotation.x-opt-enqueued-time > '${Date.now() -
3600 * 1000}'`;
const receiverOptions: ReceiverOptions = {
name: receiverName,
source: {
address: receiverAddress,
filter: {
"apache.org:selector-filter:string": types.wrap_described(
filterClause,
0x468c00000004
)
}
},
onSessionError: (context: EventContext) => {
const sessionError = context.session && context.session.error;
if (sessionError) {
console.log(
">>>>> [%s] An error occurred for session of receiver '%s': %O.",
connection.id,
receiverName,
sessionError
);
}
}
};
// Disable rhea's automatic credit window so credits are only added manually in receive()
receiverOptions.credit_window = 0;
const receiver: Receiver = await connection.createReceiver(receiverOptions);
// Simulate targeted scenario of receiving once before sending and once after
await receive(connection, 5, receiver);
await send(connection, "1234512345");
await send(connection, "6789067890");
await receive(connection, 5, receiver);
await receiver.close();
await connection.close();
}
async function send(connection: Connection, messageId: string): Promise<void> {
const senderName = "sender-1";
const senderOptions: SenderOptions = {
name: senderName,
target: {
address: senderAddress
},
onError: (context: EventContext) => {
const senderError = context.sender && context.sender.error;
if (senderError) {
console.log(
">>>>> [%s] An error occurred for sender '%s': %O.",
connection.id,
senderName,
senderError
);
}
},
onSessionError: (context: EventContext) => {
const sessionError = context.session && context.session.error;
if (sessionError) {
console.log(
">>>>> [%s] An error occurred for session of sender '%s': %O.",
connection.id,
senderName,
sessionError
);
}
}
};
const sender: Sender = await connection.createSender(senderOptions);
const message: Message = {
body: "Hello World!!",
message_id: messageId
};
const delivery: Delivery = await sender.send(message);
console.log(">>>>>[%s] Delivery id: ", connection.id, delivery.id); // delivery.id); // message);
await sender.close();
}
async function receive(
connection: Connection,
credits: number,
receiver: Receiver
): Promise<void> {
receiver.addCredit(credits);
receiver.on(ReceiverEvents.message, (context: EventContext) => {
console.log("Received message: %O", context.message);
});
receiver.on(ReceiverEvents.receiverError, (context: EventContext) => {
const receiverError = context.receiver && context.receiver.error;
if (receiverError) {
console.log(
">>>>> [%s] An error occurred for receiver '%s': %O.",
connection.id,
receiver.name,
receiverError
);
}
});
// sleeping for 10 seconds to let the receiver receive messages (the receiver is closed later in main())
await delay(10000);
}
main().catch(err => console.log(err));
For the browser, the WebSocket-related code is to be uncommented, and localStorage.debug='azure*,rhea*'; can be set to see the logs.
The findings are that:
cc @ramya-rao-a
Discussed offline - one of the key findings from previous work on investigating the maxConcurrentCalls parameter usage was that, depending on where we add the addCredit() line in the code, the behavior differs and events can end up not getting raised. (The description of https://github.com/Azure/azure-sdk-for-js/pull/1170#issue-251646550 has more context on this.)
A similar root cause analysis will need to be done for this when we pick it up next.
The general fix moving forward would be to better account for how the Node event loop is used (and this would likely result in changes across the SDKs).
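To illustrate the kind of ordering in question, here is a hypothetical sketch (not the SDK code): per the finding referenced above, where addCredit() is called relative to attaching handlers can change whether events end up getting raised, so attaching the message handler before flowing credit is the conservative ordering.

```typescript
import { Receiver, ReceiverEvents, EventContext } from "rhea-promise";

function receiveBatchSketch(receiver: Receiver, count: number): void {
  // Attach the message handler first...
  receiver.on(ReceiverEvents.message, (context: EventContext) => {
    console.log("Received message:", context.message);
  });
  // ...and only then flow credit, so deliveries made against this credit
  // always have a listener in place.
  receiver.addCredit(count);
}
```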
cc @ramya-rao-a @AlexGhiondea
Discussed offline with @ramya-rao-a
Current changes comment out some tests that need fixing.
Closing this issue; the pending work for the long-term fix will go in as part of Track 2, and related issues are being looked into as part of re-designing the receiver implementation.
Clarifying a bit:
This issue was originally to track the tests in misc.spec.ts which were failing in browser mode due to the drain feature not being implemented on the service side. The tests were refactored to avoid the drain issue, and hence the decision to close this issue.
The commented-out test was in the receiver.spec.ts file, which is not what this issue was tracking.