Confluent-kafka-dotnet: Latency in producing and consuming

Created on 6 Mar 2017  路  30Comments  路  Source: confluentinc/confluent-kafka-dotnet

Hi.
We were experiencing high end to end latency.
I was trying to do something like the kafka.tools.EndToEndLatency does, creating a producer and a consumer, and StopWatching the time it takes 1 item to be consumed.

Although the _kafka.tools.EndToEndLatency_ measures only 3 ms latency in my cluster, with the .NET library I get over 350 ms (with no load and nothing else interfering).

I have tried to change the settings as described here , and since I am running on Windows I tried also the _socket.blocking.max.ms_ (by this answer), but there was no improvement.

I was wondering if there is anything else we can do to reduce the latency?

Thanks

HIGH bug librdkafka

Most helpful comment

After playing around with the configs that you have provided we came to the conclusion that setting the linger.ms to anything else than 0 causes a rtt of ~15ms.

Here are some snippets from the logs (We have a TestClient and a TestServer which are sending messages back and forth between each other. We redirected the debug protocol output to our own tracing library

Configuration for both logs and both Client and Server:
Producer:
debug = protocol
socket.blocking.max.ms value: 1

Consumer:
debug = protocol
fetch.wait.max.ms value: 1
fetch.error.backoff.ms value: 0
socket.blocking.max.ms value: 1

First log snippet with linger.ms = 0
LingerMs_0_snd_to_rcv.txt

Second log snippet with linger.ms = 1
LingerMs_1_snd_to_rcv.txt

As you can see we are also logging the current timestamp. The time between the send and the receive in the second logs is 12ms, but libRdKafka is anyway reporting a rtt of 15ms.
That caused us to think that this is some time resolution issue on windows, so we took a look into the libRdKafka src.

We found out that in rdtime.h line 76 there is a call to GetTickCount64(), which has a resolution of about 10-16 ms (https://msdn.microsoft.com/de-de/library/windows/desktop/ms724411(v=vs.85).aspx)
We replaced it with QueryPerformanceCounter() which is more accurate and the latency issues with linger.ms > 0 were gone :) (unfortunately we don't have logs for this yet)

The problem is that this replacement was a quick and dirty fix. The QueryPerformanceCounter() function requires the result of QueryPerformanceFrequency(), which is a costly function and should be called only once at startup, but we just don't know where we should do this since we are not very familiar with the libRdKafka implementation.

All 30 comments

Hello,

which version of librdkafka are you using?
What did you set socket.blocking.max.ms to?

If you set the "debug" config property to "protocol" and look for the rtt (round-trip-time) printouts of ProduceRequest (producer) and FetchRequest (consumer), what is the minimum value you see for them?
This helps us identify how much of those 350ms are due to external (network & broker) latencyinternal or external.

Hi, and thanks for the quick reply.

  1. We are using the latest version from NuGet (0.9.4).
  2. We set the socket.blocking.max.ms to 1, and also tried with 10.
  3. I ran the "debug=protocol" as you instructed, and saw a few interesting things:

    • the rtt of the ProduceResponse is arround 31ms (sometimes returns after the FetchResponse has completed).
    • after I enqueue a message, it takes almost a second until the ProduceRequest is issued.
    • after the ProduceRequest is issued, the FetchResponse returns within a single ms, and the message immediately arrives to my code.

This leads me to think that there is some buffering configuration in the producer that I'm missing.

>2017-03-07 12:27:15.626 | Sending message
7|2017-03-07 12:27:15.626|rdkafka#producer-1|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received ProduceResponse (v1, 32 bytes, CorrId 196, rtt 31.00ms)
7|2017-03-07 12:27:15.626|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2191)
7|2017-03-07 12:27:15.766|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2191, rtt 141.00ms)
7|2017-03-07 12:27:15.772|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2192)
7|2017-03-07 12:27:15.788|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent HeartbeatRequest (v0, 97 bytes @ 0, CorrId 2193)
7|2017-03-07 12:27:15.873|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2192, rtt 93.00ms)
7|2017-03-07 12:27:15.873|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2194)
7|2017-03-07 12:27:15.873|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received HeartbeatResponse (v0, 2 bytes, CorrId 2193, rtt 78.00ms)
7|2017-03-07 12:27:16.015|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2194, rtt 141.00ms)
7|2017-03-07 12:27:16.015|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2195)
7|2017-03-07 12:27:16.116|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2195, rtt 109.00ms)
7|2017-03-07 12:27:16.117|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2196)
7|2017-03-07 12:27:16.217|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2196, rtt 94.00ms)
7|2017-03-07 12:27:16.218|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2197)
7|2017-03-07 12:27:16.319|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2197, rtt 109.00ms)
7|2017-03-07 12:27:16.319|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2198)
7|2017-03-07 12:27:16.421|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2198, rtt 94.00ms)
7|2017-03-07 12:27:16.421|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2199)
7|2017-03-07 12:27:16.522|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2199, rtt 110.00ms)
7|2017-03-07 12:27:16.523|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2200)
7|2017-03-07 12:27:16.616|rdkafka#producer-1|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent ProduceRequest (v1, 127 bytes @ 0, CorrId 197)
7|2017-03-07 12:27:16.623|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 36 bytes, CorrId 2200, rtt 93.00ms)
7|2017-03-07 12:27:16.623|rdkafka#consumer-2|SEND| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Sent FetchRequest (v1, 63 bytes @ 0, CorrId 2201)
7|2017-03-07 12:27:16.657|rdkafka#consumer-2|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received FetchResponse (v1, 114 bytes, CorrId 2201, rtt 32.00ms)
7|2017-03-07 12:27:16.657|rdkafka#producer-1|RECV| [thrd:kfkpoc-03:9092/3]: kfkpoc-03:9092/3: Received ProduceResponse (v1, 32 bytes, CorrId 197, rtt 32.00ms)
2017-03-07 12:27:16.657 | message received after 1031ms

The one second delay on the producer is probably queue.buffering.max.ms which defaults to 1000.ms

Try decreasing that to your liking (and socket.blocking.max.ms must be kept <= than this value on Windows)

Strange -

  1. Setting also the _queue.buffering.max.ms_ to 1 reduces the delay to 45ms
  2. Setting also the _fetch.wait.max.ms_ to 1 pushes it back up to 95ms

fetch.wait.max.ms won't do much, it only affects the time the broker waits with responding to a Fetch request if there is no data, and if there is no data you don't need to worry about latency.
A low fetch.wait.max.ms will increase network and CPU usage at no gain.

OK.
Now, after setting both _queue.buffering.max.ms_ and _socket.blocking.max.ms_ to 1 it's pretty consistent: it takes 45-49 ms.
Reading the protocol log, I can see that now the message is sent within 4ms, and that the Fetch and ProduceResponse have a 40 ms delay.

What other configuration can we tune to get to the 3ms latency in the Benchmark tool?

12:52:14.807 | Sending message
12:52:14.80#consumer-2|SEND| Sent FetchRequest (v1, 63 bytes @ 0, CorrId 381)
12:52:14.813|#producer-1|SEND| Sent ProduceRequest (v1, 127 bytes @ 0, CorrId 341)
12:52:14.853|#producer-1|RECV| Received ProduceResponse (v1, 32 bytes, CorrId 341, rtt 31.00ms)
12:52:14.853|#consumer-2|RECV| Received FetchResponse (v1, 114 bytes, CorrId 381, rtt 47.00ms)
12:52:14.854 | message received after 47ms

Can you provide your full client configuration?

Also; if you ping the broker from the host you are running the clients on, what is latency?

Ping to the server is less than 1ms

All configurations are default besides these -

Producer:
{"debug", "protocol" },
{"socket.blocking.max.ms",1},
{"queue.buffering.max.ms",1 }

Consumer:
{"debug", "protocol" },
{"socket.blocking.max.ms",1},
{"fetch.error.backoff.ms",1 },
{"enable.auto.commit", false },
{"statistics.interval.ms", 500000},

It may be helpful to rule in/out broker performance here, to do so could you capture the following JMX metrics for review?

kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=ThrottleTimeMs,request=Produce

Each Mbean has a few metrics, of which we should be concerned with the 75th, 95th, 99th and 999th percentiles

If you find that a large amount of time is spent on the request/response queues the broker may be under some fairly high load. The following two metrics offer a fairly good indication of just how loaded your broker is:

kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent

Values for 75/95/99/999:

kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce =>1/1/2/12
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce =>0/0/0/0
kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce =>0/0/1/10
kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request=Produce =>0/0/1/5
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce =>0/0/1/1
kafka.network:type=RequestMetrics,name=ThrottleTimeMs,request=Produce =>0/0/1/1

My broker is not under _any_ kind of load - I'm publishing the messages one at a time.

Thank you, those stats look good.

Are you running the broker on Windows as well, or just the client?

Are you using SASL or SSL?

The broker is running on a Linux machine. The client is on Windows 8.1

Okay, it sounds like this is a problem with the interaction between librdkafka and the Windows TCP stack.
I'll reproduce and try to find the cause.

Thanks a lot.
If there is any additional information I can give you - I will be happy to.

@mohoch1 Could you try disabling Nagle by setting the socket.nagle.disable config property to true?

Well - that helped a lot.
The latency is now down to 15ms

Interesting!
It is still a bit much but we're getting there, maybe there are more TCP stack options to tweak

I encountered the same problem, looking forward to finding a solution.

I am hitting the same issue I think, 60ms

{"bootstrap.servers", brokerList },
{"security.protocol", "SSL"},
{"ssl.ca.location", @"C:\certs\certfile.pem"},
{"acks", 1},
//{"debug","protocol"},
{"socket.nagle.disable",true},
{"socket.blocking.max.ms",1},
{"queue.buffering.max.ms",1 }

Logs:
7|2018-01-29 16:02:24.826|rdkafka#producer-1|CONNECTED| [thrd:ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap]: ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap: Connected (#1)
7|2018-01-29 16:02:24.833|rdkafka#producer-1|FEATURE| [thrd:ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap]: ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap: Updated enabled protocol features +ApiVersion to ApiVersion
7|2018-01-29 16:02:24.834|rdkafka#producer-1|SEND| [thrd:ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap]: ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap: Sent ApiVersionRequest (v0, 25 bytes @ 0, CorrId 1)
7|2018-01-29 16:02:24.900|rdkafka#producer-1|RECV| [thrd:ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap]: ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap: Received ApiVersionResponse (v0, 132 bytes, CorrId 1, rtt 78.00ms)
7|2018-01-29 16:02:24.901|rdkafka#producer-1|SEND| [thrd:ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap]: ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap: Sent MetadataRequest (v2, 46 bytes @ 0, CorrId 2)
7|2018-01-29 16:02:25.014|rdkafka#producer-1|RECV| [thrd:ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap]: ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap: Received MetadataResponse (v2, 374 bytes, CorrId 2, rtt 110.00ms)
7|2018-01-29 16:02:25.229|rdkafka#producer-1|CONNECTED| [thrd:ssl://172.19.69.38:9093/5]: ssl://172.19.69.38:9093/5: Connected (#1)
7|2018-01-29 16:02:25.229|rdkafka#producer-1|FEATURE| [thrd:ssl://172.19.69.38:9093/5]: ssl://172.19.69.38:9093/5: Updated enabled protocol features +ApiVersion to ApiVersion
7|2018-01-29 16:02:25.230|rdkafka#producer-1|SEND| [thrd:ssl://172.19.69.38:9093/5]: ssl://172.19.69.38:9093/5: Sent ApiVersionRequest (v0, 25 bytes @ 0, CorrId 1)
7|2018-01-29 16:02:25.235|rdkafka#producer-1|CONNECTED| [thrd:ssl://172.19.67.119:9093/4]: ssl://172.19.67.119:9093/4: Connected (#1)
7|2018-01-29 16:02:25.235|rdkafka#producer-1|FEATURE| [thrd:ssl://172.19.67.119:9093/4]: ssl://172.19.67.119:9093/4: Updated enabled protocol features +ApiVersion to ApiVersion
7|2018-01-29 16:02:25.236|rdkafka#producer-1|SEND| [thrd:ssl://172.19.67.119:9093/4]: ssl://172.19.67.119:9093/4: Sent ApiVersionRequest (v0, 25 bytes @ 0, CorrId 1)
7|2018-01-29 16:02:25.238|rdkafka#producer-1|CONNECTED| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: Connected (#1)
7|2018-01-29 16:02:25.239|rdkafka#producer-1|FEATURE| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: Updated enabled protocol features +ApiVersion to ApiVersion
7|2018-01-29 16:02:25.240|rdkafka#producer-1|SEND| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: Sent ApiVersionRequest (v0, 25 bytes @ 0, CorrId 1)
7|2018-01-29 16:02:25.243|rdkafka#producer-1|CONNECTED| [thrd:ssl://172.19.67.172:9093/1]: ssl://172.19.67.172:9093/1: Connected (#1)
7|2018-01-29 16:02:25.243|rdkafka#producer-1|FEATURE| [thrd:ssl://172.19.67.172:9093/1]: ssl://172.19.67.172:9093/1: Updated enabled protocol features +ApiVersion to ApiVersion
7|2018-01-29 16:02:25.244|rdkafka#producer-1|SEND| [thrd:ssl://172.19.67.172:9093/1]: ssl://172.19.67.172:9093/1: Sent ApiVersionRequest (v0, 25 bytes @ 0, CorrId 1)
7|2018-01-29 16:02:25.250|rdkafka#producer-1|CONNECTED| [thrd:ssl://172.19.71.56:9093/3]: ssl://172.19.71.56:9093/3: Connected (#1)
7|2018-01-29 16:02:25.250|rdkafka#producer-1|FEATURE| [thrd:ssl://172.19.71.56:9093/3]: ssl://172.19.71.56:9093/3: Updated enabled protocol features +ApiVersion to ApiVersion
7|2018-01-29 16:02:25.251|rdkafka#producer-1|SEND| [thrd:ssl://172.19.71.56:9093/3]: ssl://172.19.71.56:9093/3: Sent ApiVersionRequest (v0, 25 bytes @ 0, CorrId 1)
7|2018-01-29 16:02:25.257|rdkafka#producer-1|CONNECTED| [thrd:ssl://172.19.68.16:9093/2]: ssl://172.19.68.16:9093/2: Connected (#1)
7|2018-01-29 16:02:25.257|rdkafka#producer-1|FEATURE| [thrd:ssl://172.19.68.16:9093/2]: ssl://172.19.68.16:9093/2: Updated enabled protocol features +ApiVersion to ApiVersion
7|2018-01-29 16:02:25.258|rdkafka#producer-1|SEND| [thrd:ssl://172.19.68.16:9093/2]: ssl://172.19.68.16:9093/2: Sent ApiVersionRequest (v0, 25 bytes @ 0, CorrId 1)
7|2018-01-29 16:02:25.295|rdkafka#producer-1|RECV| [thrd:ssl://172.19.69.38:9093/5]: ssl://172.19.69.38:9093/5: Received ApiVersionResponse (v0, 132 bytes, CorrId 1, rtt 62.00ms)
7|2018-01-29 16:02:25.298|rdkafka#producer-1|RECV| [thrd:ssl://172.19.67.119:9093/4]: ssl://172.19.67.119:9093/4: Received ApiVersionResponse (v0, 132 bytes, CorrId 1, rtt 62.00ms)
7|2018-01-29 16:02:25.304|rdkafka#producer-1|RECV| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: Received ApiVersionResponse (v0, 132 bytes, CorrId 1, rtt 63.00ms)
7|2018-01-29 16:02:25.305|rdkafka#producer-1|SEND| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: Sent ProduceRequest (v2, 1657 bytes @ 0, CorrId 2)
7|2018-01-29 16:02:25.310|rdkafka#producer-1|RECV| [thrd:ssl://172.19.67.172:9093/1]: ssl://172.19.67.172:9093/1: Received ApiVersionResponse (v0, 132 bytes, CorrId 1, rtt 63.00ms)
7|2018-01-29 16:02:25.321|rdkafka#producer-1|RECV| [thrd:ssl://172.19.71.56:9093/3]: ssl://172.19.71.56:9093/3: Received ApiVersionResponse (v0, 132 bytes, CorrId 1, rtt 78.00ms)
7|2018-01-29 16:02:25.327|rdkafka#producer-1|RECV| [thrd:ssl://172.19.68.16:9093/2]: ssl://172.19.68.16:9093/2: Received ApiVersionResponse (v0, 132 bytes, CorrId 1, rtt 62.00ms)
7|2018-01-29 16:02:25.371|rdkafka#producer-1|RECV| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: Received ProduceResponse (v2, 55 bytes, CorrId 2, rtt 62.00ms)
Partition: 2, Offset: 41402
7|2018-01-29 16:02:25.396|rdkafka#producer-1|SEND| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: Sent ProduceRequest (v2, 1657 bytes @ 0, CorrId 3)
7|2018-01-29 16:02:25.460|rdkafka#producer-1|RECV| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: Received ProduceResponse (v2, 55 bytes, CorrId 3, rtt 63.00ms)
Partition: 2, Offset: 41403
7|2018-01-29 16:02:25.505|rdkafka#producer-1|SEND| [thrd:ssl://172.19.71.56:9093/3]: ssl://172.19.71.56:9093/3: Sent ProduceRequest (v2, 1657 bytes @ 0, CorrId 2)
7|2018-01-29 16:02:25.578|rdkafka#producer-1|RECV| [thrd:ssl://172.19.71.56:9093/3]: ssl://172.19.71.56:9093/3: Received ProduceResponse (v2, 55 bytes, CorrId 2, rtt 62.00ms)
ElapsedTime:992
Partition: 0, Offset: 44995
7|2018-01-29 16:02:25.662|rdkafka#producer-1|DESTROY| [thrd:app]: Terminating instance
7|2018-01-29 16:02:25.662|rdkafka#producer-1|DESTROY| [thrd:main]: Destroy internal
7|2018-01-29 16:02:25.666|rdkafka#producer-1|DESTROY| [thrd:main]: Removing all topics
7|2018-01-29 16:02:25.668|rdkafka#producer-1|BROKERFAIL| [thrd:ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap]: ssl://kafka.nonprod.proton.r53.srvtest.net:9093/bootstrap: failed: err: Local: Broker handle destroyed: (errno: No error)
7|2018-01-29 16:02:25.668|rdkafka#producer-1|BROKERFAIL| [thrd::0/internal]: :0/internal: failed: err: Local: Broker handle destroyed: (errno: No error)
7|2018-01-29 16:02:25.768|rdkafka#producer-1|BROKERFAIL| [thrd:ssl://172.19.67.119:9093/4]: ssl://172.19.67.119:9093/4: failed: err: Local: Broker handle destroyed: (errno: No error)
7|2018-01-29 16:02:25.768|rdkafka#producer-1|BROKERFAIL| [thrd:ssl://172.19.68.16:9093/2]: ssl://172.19.68.16:9093/2: failed: err: Local: Broker handle destroyed: (errno: No error)
7|2018-01-29 16:02:25.769|rdkafka#producer-1|BROKERFAIL| [thrd:ssl://172.19.70.12:9093/6]: ssl://172.19.70.12:9093/6: failed: err: Local: Broker handle destroyed: (errno: No error)
7|2018-01-29 16:02:25.769|rdkafka#producer-1|BROKERFAIL| [thrd:ssl://172.19.69.38:9093/5]: ssl://172.19.69.38:9093/5: failed: err: Local: Broker handle destroyed: (errno: No error)
7|2018-01-29 16:02:25.769|rdkafka#producer-1|BROKERFAIL| [thrd:ssl://172.19.71.56:9093/3]: ssl://172.19.71.56:9093/3: failed: err: Local: Broker handle destroyed: (errno: No error)
7|2018-01-29 16:02:25.772|rdkafka#producer-1|BROKERFAIL| [thrd:ssl://172.19.67.172:9093/1]: ssl://172.19.67.172:9093/1: failed: err: Local: Broker handle destroyed: (errno: No error)
Press any key to continue . . .

@nagarjuna119

  • What broker version?
  • What librdkafka version?
  • What is the network latency between the client and the broker? (use ping <broker> on the client host)
  • What is the broker load?

Seeing how all requests are at least about 60ms this looks more like a network issue.

Yes..looks like network

Broker: 0.10.1
librdkafka: 0.11.3.0
C:\WINDOWS\system32>nmap -Pn 172.19.67.172 -p 9093

Starting Nmap 7.60 ( https://nmap.org ) at 2018-01-30 12:09 Pacific Standard Time
Nmap scan report for 172.19.67.172
Host is up (0.066s latency).

Hi
We are also experiencing a rtt of 15-16 ms with libRdKafka. Broker, Zookeeper and client are all running on the same machine (we have tried out win7 and win10).
Kafka version: 1.0.0
libRdKafka version: 0.11.3

To be sure that it's not Apache Kafka itself, we wrote a small measurement program in java. It produces a small message (only carrying the current timestamp before produce is called as a long) and measures the time until it is consumed. The average there is about 1,4 ms.

Are there maybe any updates on this issue? Is it going to be fixed in the foreseeable future?

@Paul048124 Start with https://github.com/edenhill/librdkafka/wiki/How-to-decrease-message-latency

Set the debug config property to protocol and check for ..Response rtt printouts where rtt is the round-trip time latency.

After playing around with the configs that you have provided we came to the conclusion that setting the linger.ms to anything else than 0 causes a rtt of ~15ms.

Here are some snippets from the logs (We have a TestClient and a TestServer which are sending messages back and forth between each other. We redirected the debug protocol output to our own tracing library

Configuration for both logs and both Client and Server:
Producer:
debug = protocol
socket.blocking.max.ms value: 1

Consumer:
debug = protocol
fetch.wait.max.ms value: 1
fetch.error.backoff.ms value: 0
socket.blocking.max.ms value: 1

First log snippet with linger.ms = 0
LingerMs_0_snd_to_rcv.txt

Second log snippet with linger.ms = 1
LingerMs_1_snd_to_rcv.txt

As you can see we are also logging the current timestamp. The time between the send and the receive in the second logs is 12ms, but libRdKafka is anyway reporting a rtt of 15ms.
That caused us to think that this is some time resolution issue on windows, so we took a look into the libRdKafka src.

We found out that in rdtime.h line 76 there is a call to GetTickCount64(), which has a resolution of about 10-16 ms (https://msdn.microsoft.com/de-de/library/windows/desktop/ms724411(v=vs.85).aspx)
We replaced it with QueryPerformanceCounter() which is more accurate and the latency issues with linger.ms > 0 were gone :) (unfortunately we don't have logs for this yet)

The problem is that this replacement was a quick and dirty fix. The QueryPerformanceCounter() function requires the result of QueryPerformanceFrequency(), which is a costly function and should be called only once at startup, but we just don't know where we should do this since we are not very familiar with the libRdKafka implementation.

We replaced it with QueryPerformanceCounter() which is more accurate and the latency issues with linger.ms > 0 were gone :)

That is fantastic news!
Don't worry about the code, now that you've found the underlying issue we'll make the fix.

Thanks a lot for digging into this 馃槏

Hi,

we have experienced a side effect of the QueryPerformanceCounter change: with the QPC version of RdKafka our applications using kafka and also the kafka server seem to be more busy, they are using more cpu with the same throughput.
It is clear that they'll be more busy if we configure kafka for less latency, however, in our tests we use 15 ms for linger.ms / fetch.wait.max.ms / fetch.error.backoff.ms
Our expectation was that with these settings and with the same throughput the cpu usage of our applications shall be about the same with and without the GetTickCount64 -> QueryPerformanceCounter change.

One obvious reason for the increased cpu usage is that I did not configure the 15 ms for some timer which is affected by the more accurate resolution of the QPC, causing that that timer is triggered more frequently now with the QPC version of RdKafka.
I am using the following config:

producer:
linger.ms = 15
socket.blocking.max.ms = 5

consumer:
fetch.wait.max.ms = 15
fetch.error.backoff.ms = 15
socket.blocking.max.ms = 5

Our test setup:
apache kafka 1.1, rdkafka 0.11.4 - snapshot including the QPC change from Apr 18 (13befa7053c43593fd6f2effdffbd83c3ea52eb7) and the version immediately before (with GetTickCount64).
2 instances of a test service application are subscribed to a topic (myservice_request) and they are sharing the load. 2 test clients are running, each writes ~3000 messages per sec into this topic.
The test service writes a response message into a separate topic (myservice_response) for each processed request. The test clients are reading the response topic using low level consumer.
zookeeper, kafka server and all test services and clients are running on the same machine (windows, dual core with hyperthreading)

The ~3000 messages per sec is not a high load scenario, with the GetTickCount64 version of RdKafka we can handle this load with very low cpu usage. Cpu usage of each test service and test client is ~1%, kafka server cpu has some peaks, especially in the beginning of the test but then is stabilizes at ~1%

With the QPC version of RdKafka and with the same load and same settings, each test service is using ~10% cpu, each test client is using ~5% and the kafka server never drops below 15-20%

I have run this test with debug=all with both the GetTickCount64 and the QPC versions, in both cases about the same number of FetchRequests and the same number of ProduceRequests were sent, with identical request / response sizes.

However, the following events come from the QPC version much more frequently, see logs:
RdKafka_dbgAll_GetTickCount64_vs_QPC.zip

Kafka event: LOG: 7 TOPPAR: [thrd:myhost.mydomain:9092/0]: myhost.mydomain:9092/0: myservice_request [0] 100 message(s) in xmit queue (0 added from partition queue)
Kafka event: LOG: 7 TOPPAR: [thrd:myhost.mydomain:9092/0]: myhost.mydomain:9092/0: myservice_response [0] 100 message(s) in xmit queue (0 added from partition queue)

Best regards,
Joe

thanks for the detailed report @runave. we need to investigate this @edenhill.

tracking @runave's report in #607

Was this page helpful?
0 / 5 - 0 ratings

Related issues

zoeysaurusrex picture zoeysaurusrex  路  4Comments

keggster101020 picture keggster101020  路  4Comments

alfhv picture alfhv  路  3Comments

Marusyk picture Marusyk  路  3Comments

maximecaron picture maximecaron  路  3Comments