Azure-sdk-for-java: [QUERY] EventHub Throttle and TU relationship / Latency details / Partition and TU relationship

Created on 11 May 2020 · 21 Comments · Source: Azure/azure-sdk-for-java

Query/Question

I've multiple queries:

  • 1 TU is 1 MB / sec or 1000 msgs / sec (I have also seen it written as 1000 API calls / sec), whichever happens first. Suppose my messages are 500 bytes each. I could then create an EventDataBatch containing ~2000 msgs (slightly fewer) and send it to EventHub with a single call, eventProducerClient.send(eventDataBatch). In this case, the size of eventDataBatch is ~1 MB (slightly less) and I am making 1 API call (but sending ~2000 msgs in that call). Will my request be throttled?

    Or, put another way: if I know that my per-message size is < 1 KB, should I still limit the eventDataBatch to only 1000 messages (and utilize only half of the 1 MB / sec)?

    And if the requests are being throttled, how is the application supposed to know about this? There is only a WARNING log. I raised the relevant BUG here: https://github.com/Azure/azure-sdk-for-java/issues/11003

  • Is there a way to know the time EventHub SDK takes to push my Event (or EventDataBatch) to EventHub? I currently have no latency information from SDK. I am calculating it in my own code right now like this:

    try (com.codahale.metrics.Timer.Context ignored = latency.time()) { // dependency io.dropwizard.metrics:metrics-graphite:4.1.7
        eventHubProducerClient.send(eventDataBatch);
    }
    

    Is this how this is supposed to be done? Also, what is the expected latency while pushing data (one Event / EventDataBatch) to EH?

  • I am trying out the EventHub SDK Consumer and Producer in a sample application where I consume from EventHub A (32 partitions, loads of data available, reading from EventPosition.earliest() and not storing checkpoints) and push the messages unmodified to another EventHub B with 5 partitions. Since each partition can only be maxed out with 1 TU, it should be pointless to have more than 5 TU on EventHub B. However, if I enable Auto-Inflate (with max TU allowed at 20) and keep my Consumer and Producer running, it inflates my EventHub to 20 TU and I see a significant gain in performance (more than double compared with keeping TU at 5).

    I am not able to understand this because no partition can utilize the 15 extra TU that are being allocated by Auto-Inflate feature. Just to point it out EventHub B is the only EventHub in that namespace. So the producer EventHub namespace overall only has 5 partitions.

Why is this not a Bug or a feature Request?
I couldn't categorize it as bug / feature request because I might be missing a few details in my understanding.

Information Checklist
Kindly make sure that you have added all the following information above and check off the required fields; otherwise we will treat the issue as an incomplete report

  • [x] Query Added
  • [x] Setup information Added
Labels: Client, Event Hubs, Service Attention, customer-reported, question

All 21 comments

@srnagar @conniey PTAL

Is there a way to know the time EventHub SDK takes to push my Event (or EventDataBatch) to EventHub?

Event Hubs has support for OpenTelemetry tracing. For more details, you can take a look at Azure Core Tracing library. Here's a sample of enabling tracing for publishing an event.
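If tracing is more than is needed, the timing approach from the question can also be done without a metrics dependency. The following is a minimal sketch, not an SDK feature: `SendLatencyRecorder` and its methods are hypothetical names, and it only measures the wall-clock time of whatever blocking call is passed in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical helper (not part of the Azure SDK): records how long blocking calls take. */
final class SendLatencyRecorder {
    private final LongSupplier nanoClock;
    private final List<Long> samplesNanos = new ArrayList<>();

    /** Uses System.nanoTime, which is monotonic and suited to measuring elapsed time. */
    SendLatencyRecorder() {
        this(System::nanoTime);
    }

    /** Accepts a custom clock so the recorder can be exercised deterministically. */
    SendLatencyRecorder(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs the action and records its duration, even when the action throws. */
    void time(Runnable action) {
        long start = nanoClock.getAsLong();
        try {
            action.run();
        } finally {
            record(nanoClock.getAsLong() - start);
        }
    }

    // Only the bookkeeping is synchronized, so concurrent sends are not serialized.
    private synchronized void record(long elapsedNanos) {
        samplesNanos.add(elapsedNanos);
    }

    synchronized int count() {
        return samplesNanos.size();
    }

    synchronized long maxNanos() {
        long max = 0L;
        for (long sample : samplesNanos) {
            max = Math.max(max, sample);
        }
        return max;
    }

    /** Mean latency in milliseconds, or 0.0 when nothing has been recorded yet. */
    synchronized double meanMillis() {
        if (samplesNanos.isEmpty()) {
            return 0.0;
        }
        long sum = 0L;
        for (long sample : samplesNanos) {
            sum += sample;
        }
        return (sum / (double) samplesNanos.size()) / 1_000_000.0;
    }
}
```

Assuming the `eventHubProducerClient` and `eventDataBatch` from the question, a call would look like `recorder.time(() -> eventHubProducerClient.send(eventDataBatch));`. Since the synchronous `send` blocks until the call completes or fails, the recorded value should cover the whole round trip as seen by the application rather than the service-side processing time alone.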

@serkantkaraca and @JamesBirdsall could you please take a look at the other two questions @shubhambhattar has posted above?

Regarding the batch throttling question: the service doesn't throttle the first message, so you can send 2000 messages in a batch with 1 TU just fine. The next send attempt, however, will be throttled as expected.

Regarding the per-partition throttling question: the service doesn't enforce throttling per partition. 1 MB/sec per partition is just a design recommendation. Depending on various factors, such as network latency and speed and the state of service-side and client-side resources, clients can send more than 1 MB/sec of traffic to each partition just fine.

@serkantkaraca
For the first question: The limitation of 1000 msgs seems odd then, because I'll never be able to fully utilize my TU and will always have to provision more if the message size is less than 1 KB (which it is in my case).

For the second question: So, it means that the behavior of #TU > #partitions (across the namespace) will vary and I just got lucky that in my case, performance improved?
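The arithmetic behind the first point can be sketched as follows. This is only a model of the quota as quoted in the question (1 MB / sec or 1000 msgs / sec per TU, whichever is reached first), not of how the service actually decides to throttle. `ThroughputEstimate` is a hypothetical name, and treating 1 MB as 1,000,000 bytes is an assumption.

```java
/** Hypothetical helper: estimates which per-TU ingress limit is reached first. */
final class ThroughputEstimate {
    // Ingress quota per throughput unit as quoted in the question.
    // Assumption: 1 MB is counted as 1,000,000 bytes; the service may count differently.
    static final long BYTES_PER_TU_PER_SEC = 1_000_000L;
    static final long EVENTS_PER_TU_PER_SEC = 1_000L;

    private ThroughputEstimate() {
    }

    /** Events per second allowed once both the byte limit and the event-count limit apply. */
    static long maxEventsPerSecond(int throughputUnits, int eventSizeBytes) {
        if (throughputUnits <= 0 || eventSizeBytes <= 0) {
            throw new IllegalArgumentException("throughputUnits and eventSizeBytes must be positive");
        }
        long limitedByBytes = throughputUnits * BYTES_PER_TU_PER_SEC / eventSizeBytes;
        long limitedByCount = throughputUnits * EVENTS_PER_TU_PER_SEC;
        return Math.min(limitedByBytes, limitedByCount);
    }

    /** Fraction of the byte quota that is usable at the rate computed above. */
    static double byteUtilization(int throughputUnits, int eventSizeBytes) {
        long events = maxEventsPerSecond(throughputUnits, eventSizeBytes);
        return (events * (double) eventSizeBytes) / (throughputUnits * BYTES_PER_TU_PER_SEC);
    }
}
```

Under these assumptions, `ThroughputEstimate.maxEventsPerSecond(1, 500)` is bounded by the event count at 1000 events per second, and `byteUtilization(1, 500)` is 0.5, which is the "only half of 1 MB / sec" situation described in the question. For events larger than 1000 bytes the byte limit becomes the binding one instead.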

@srnagar Thanks for letting me know about the Tracing library. I'll check that.

@serkantkaraca @JamesBirdsall Continuing my comment above, I also found this in the FAQ section for Event Hubs: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-faq#what-are-event-hubs-throughput-units

where it states that:

Throughput in Event Hubs defines the amount of data in mega bytes or the number (in thousands) of 1-KB events that ingress and egress through Event Hubs.

Does this mean that the 1000 message limit is for those messages which are 1 KB in size?

Also, I don't know if this is the desired behavior but EventDataBatch can actually store more than 1000 events.

Any more clarification on this would be appreciated.

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl.

https://imgur.com/a/2Ztbakw

@serkantkaraca I also noticed an interesting trend: if I leave the application running for a long time, the number of messages being pushed decreases and throttling drops to 0. The new numbers still don't fit the constraints (I am pushing data into an EventHub with 16 partitions and 20 TU across the namespace, and the namespace has < 20 partitions), yet I am still not able to push more than 13K messages / sec.

Sorry, seems I didn't get notifications for new answers in this thread and just saw them now.

With 20 TU, you should be able to push at minimum 20K messages per second. This can be a client-side issue that needs investigation. Can you try a couple of things?

  • See if you can reproduce with some other SDK like .NET.
  • Try increasing the number of clients. This will spread the traffic across more connections.
  • Check the bytes-in and messages-in metrics on the namespace dashboard in the Azure portal. See if the metrics show any anomalies.

@serkantkaraca

  • Can only work with Java SDK :(
  • Increased the number of producers to 2. The same trend is visible.
  • The same trend is visible there also. Please check this: https://imgur.com/a/GN34UCi

Can you measure with a single client first and see if its performance degrades over time? It seems the publisher traffic drops instantly rather than trending down over time. I also wonder if publishers get stuck and stop sending completely. Are you able to chart traffic per client? See if any of the clients stopped sending completely.

@serkantkaraca That comment was actually for a single client, and the performance did degrade over time. Yes, the traffic drops suddenly, but if you look at the graph, throttling also stops at exactly the same point. And everything more or less stabilizes at around 13K (which should be at least >= 16K, as I have 16 partitions and 20 TU).

I didn't find any trace of the publisher getting stuck in general; the whole application just keeps sending less and less traffic as time passes.

I didn't find an instance where application stops sending these days.

@serkantkaraca I can still see the trend. I restarted my client (only a single client this time) and let it run for 2 days and the graph keeps going down as the days pass.

But I believe this is the cause of this issue, as I'm consuming from one EventHub and pushing to another.

I'll hold this for some time, until the linked issue is resolved and start my experiment again after that.

Sorry for the late response. I seem to have missed notifications in my inbox regarding your replies.

Since this is still being investigated, can you try a couple more things which can help to point out where the slowdown is happening?

  1. Add a monitor to your client to calculate the events-per-minute rate. When the rate drops below a certain point, recreate the EH client in the same process and switch to this new client. See if throughput improves.

  2. Try running with a new namespace in some other region if possible.

  3. Try sending with Service Bus Explorer which uses .NET client. See if you can reproduce same slowdown there.
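Suggestion (1) could be sketched roughly as below. This is a minimal, dependency-free illustration under stated assumptions: `EventRateMonitor` and its parameters are hypothetical names, the window length and threshold are up to the application, and the callback is where the application would recreate its producer client and switch to it.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: counts events per fixed window and fires a callback after a slow window. */
final class EventRateMonitor {
    private final long windowMillis;
    private final long minEventsPerWindow;
    private final LongSupplier clockMillis;
    private final Runnable onSlowWindow;
    private long windowStart;
    private long eventsInWindow;

    EventRateMonitor(long windowMillis, long minEventsPerWindow,
                     LongSupplier clockMillis, Runnable onSlowWindow) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
        this.minEventsPerWindow = minEventsPerWindow;
        this.clockMillis = clockMillis;
        this.onSlowWindow = onSlowWindow;
        this.windowStart = clockMillis.getAsLong();
    }

    /** Call after each successful send with the number of events in the batch. */
    synchronized void recordSent(int eventCount) {
        check();
        eventsInWindow += eventCount;
    }

    /**
     * Closes the current window if it has elapsed and fires the callback when too few
     * events were counted. Can also be called from a timer so that a publisher that has
     * stopped sending entirely is still detected. Several elapsed windows are treated as one.
     */
    synchronized void check() {
        long now = clockMillis.getAsLong();
        if (now - windowStart < windowMillis) {
            return;
        }
        if (eventsInWindow < minEventsPerWindow) {
            onSlowWindow.run();
        }
        windowStart = now;
        eventsInWindow = 0;
    }
}
```

A producer loop would call `recordSent(batchSize)` after each send, with `System::currentTimeMillis` as the clock and a 60,000 ms window for a per-minute rate. The callback runs while the monitor's lock is held, so in this sketch it is best kept short, for example by only setting a flag that the sending thread checks before rebuilding its client.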

@shubhambhattar Have you tried out above suggestions from @serkantkaraca ?

@srnagar @serkantkaraca Sorry for the delay in response. This testing has been on and off lately. I can give you the current updates.

I have test code running in the EastUS region, and the TU set on the namespace is 20. There's only one EH with 32 partitions, and some test data is being pushed to it continuously.

https://imgur.com/a/wJcH0F9

Regarding point (1), details are available in the above image. The WARN and ERROR logs are below:

2020-08-15 01:08:08,971 [single-1] WARN  c.a.c.a.i.h.ConnectionHandler - onTransportError hostname[some-namespace.servicebus.windows.net:5671], connectionId[MF_e9cc2e_1597253463109], error[Connection reset by peer]
2020-08-15 01:08:08,972 [single-1] ERROR c.a.c.a.i.ReactorConnection - connectionId[MF_e9cc2e_1597253463109] Error occurred in connection handler.
Connection reset by peer, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]
2020-08-15 01:09:08,977 [single-1] WARN  c.a.m.e.i.EventHubConnectionProcessor - Retry #1. Transient error occurred. Retrying after 4511 ms.
Connection reset by peer, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]
2020-08-22 20:19:52,216 [single-1] WARN  c.a.c.a.i.ReactorSender - entityPath[dummy], linkName[dummy], deliveryTag[9080090bed6d44f4a27ac0369654e4e2]: Delivery rejected. [Rejected{error=Error{condition=amqp:internal-error, description='The service was unable to process the request; please retry the operation. For more information on exception types and proper exception handling, please refer to http://go.microsoft.com/fwlink/?LinkId=761101 Reference:09c23d8c-1dc9-414b-ba56-3c4a91c43500, TrackingId:6581a5f100009ae5000055315f3735bc_G6_B3, SystemTracker:some-namespace:eventhub:dummy~9215, Timestamp:2020-08-22T20:19:52', info=null}}]

In the above case, messages are being pushed in batches where each batch's maximum size is 1000000 bytes. A random byte[] array is created and added to the batch, repeatedly, until the batch reaches the maximum size. The graph above accounts for all the messages in the batch (it shows the number of messages pushed / sec, not the number of batches pushed / sec).

(2) and (3) couldn't be done.

@shubhambhattar, thanks for providing new test data. Can you send me your test namespace so I can check service side metrics and failures? You can reach me from [email protected]

Service-side metrics are also showing 15K events/sec ingress. The failures should be intermittent, and you can ignore them for now. It is better if we focus on your performance concerns. In which part of the testing time frame did you observe degraded performance?

@serkantkaraca With the newer SDK, I didn't observe any significant degradation in performance (the producer is almost constantly sending at 15K events / sec, each event being 300 bytes and pushed in batches). I did observe degradation in the consumer rather than the producer, and for that I've opened https://github.com/Azure/azure-sdk-for-java/issues/14652.

@shubhambhattar, so we are good to close this issue and track the new issue only?
