I am trying to troubleshoot a high CPU utilization issue.
I have 3 brokers in our environment. I have noticed that when any one of the broker is bounced, the CPU on client machine spikes to around 40% and then stays at elevated level until the client process is restarted.
The topic I am testing has 12 partitions. I have 12 consumers with single group for that topic.
Because I have multiple processes running on that server. When each process starts consuming higher CPU, the server is becoming unresponsive.
I would like to know if re-balancing activity and High CPU utilization by client is a known issue and is there any work around?
Take one of the broker down and bring it back up. It will result in Client side process CPU utilization to spike. And the High CPU utilization will stay at the elevated level until the client process is restarted.
Please provide the following information:
We previously had an issue with high CPU on Windows: (see #87). This problem smells similar to me, it would be interesting to know if it's reproducible on linux. We weren't previously aware of this issue.
I forgot to mention, I am also using SASL/PLAIN for authentication.
Here is the config
{"sasl.mechanisms", "SCRAM-SHA-256"},
{"security.protocol", "SASL_PLAINTEXT"},
{"sasl.username", "testuser"},
{"sasl.password", "testpwd"},
{"api.version.request", "true"},
Authentication is working fine. But just want to mention about it.
I do not have a Linux client environment. Kafka broker is running on Linux.
I am assuming when you said test in Linux environment, you also meant me to use .NET Core instead of .NET 4.6.1 for client. I will see if I can test it in a Linux client environment.
I think there is probably a thread synchronization issue in librdkafka (that is only a problem on Windows, not linux). The same native build of librdkafka is used for both .net core and net46 so i expect the problem would still be apparent after switching to .net core on windows.
Probably not related to auth.
I did one more test using 0.9.5 client version (downgraded both Confluent.Kafka and librdkafka) and the CPU spike issue is NOT occurring. It appears like latest version 0.11.0 may have introduced this issue.
I switched to v0.11.0 to take advantage of security, but this CPU issue is making me think to go back to 0.9.5 for time being.
interesting - thanks for the info. it would be interesting to know if you get the CPU problem using the v0.9.5 librdkafka.dll with v0.11 of Confluent.Kafka. If you put the v.0.9.5 librdkafka.dll in the same directory as Confluent.Kafka.dll it should load that I believe. You can double check what version is being loaded using Library.Version.
If this works, we know it's a regression in librdkafka.
I tested following cases
A. Confluent Kafka v0.9.5 with librdkafka v0.9.5, no authentication- everything appears good
B. Confluent Kafka v0.9.5 with librdkafka v0.11.0, no authentication- everything appears good
C. Confluent Kafka v0.11.0 with librdkafka v0.11.0, no authentication- everything appears good
D. Confluent Kafka v0.11.0 with librdkafka v0.9.5, no authentication- everything appears good
D. Confluent Kafka v0.11.0 with librdkafka v0.11.0, with authentication (SASL/PlAINTEXT)- CPU spikes and stays at elevated level when one of the broker is bounced while client process is up an running.
Another thing I have observed is that if I use an invalid user name by mistake (typo etc..) for authentication. Then CPU spikes and stays at elevated level from the beginning. As soon as I start the client process, CPU spikes and it stays high. Not sure if these two issues are related, but just want to point it out.
Thanks very much for that investigation. cc: @edenhill - seems to be a problem with the new SASL/PLAINTEXT functionality on Windows.
@vinodres - note that magnus, who implemented this, is currently on vacation and won't respond immediately. However we will prioritize this to investigate / resolve.
There seems to be an issue on thread synchronization and the openssl library used in Windows - do you have some logs/errors with "sasl handshake fail" on clients?
I keep seeing this type of error on a continuous basis on client side when SASL/PLAINTEXT authentication is used.
Code [Local_Authentication], IsLocalError [True], IsBrokerError [False], Reason [sasl_plaintext://mycorpmessagebroker03:9092/3: Failed to initialize SASL authentication: SASL Handshake not supported by broker (required by mechanism SCRAM-SHA-256)]
This error was captured using consumer.OnError += (_, error) event. I have submitted different issue about this error https://github.com/confluentinc/confluent-kafka-dotnet/issues/277
This error starts after consumer instance is created. Not immediately thou, error starts after a small duration, like a min or two later and then continuously occurs. While this error is getting pumped, the CPU still appears to be normal.
Only time I am seeing and am able to reproduce the CPU spike is when we bounce any one of the broker. When one of the broker goes down, things still appear to be normal. When the broker that went down comes back up, that is when the client side CPU spikes and then stays at elevated levels.
Let me see if I can capture additional logs with debug="security".
I have attached additional logs I captured with debug=security. The captured log file has data from both debug=security output along with some of the test app log entries. Time stamp format will help you differentiate between both.
Test was launched @2017-08-04 14:21 with 6 consumer clientids and same group
Broker 1 was taken down @2017-08-04 1423
Broker 1 came back up @2017-08-04 14:25 and this is when CPU spiked.
thanks @vinodres, we'll take a look.
One other error I am seeing on somewhat continuous basis is
2017-08-11 130732.PM Error: Code [Local_Transport] HasError [True] IsBrokerError [False] IsLocalError [True] Reason [mycorpmessagebroker03:9092/3: Connection closed]
Appears like this error is occurring at consumer end when there are no more messages available to consume. But if a new message is published in the mist of above error, consumer is able to receive it. Does not appear like this error might be related to this issue, but just posting it anyway.
@edenhill : I am hoping you had a great vacation. Just want to check with you on this new high CPU usage issue. Just wondering if you had a chance to look into it.
Code [Local_Authentication], IsLocalError [True], IsBrokerError [False], Reason [sasl_plaintext://mycorpmessagebroker03:9092/3: Failed to initialize SASL authentication: SASL Handshake not supported by broker (required by mechanism SCRAM-SHA-256)]
This issue is most likely due to timeout of the ApiVersionRequest (10s) which makes it revert to the 0.9 broker feature set that does not have SASL Handshake support.
The workaround for that is to set:
api.version.fallback.ms=0 and broker.version.fallback=0.10.0
The random disconnects are most likely due to the idle connection reaper, see here:
https://github.com/edenhill/librdkafka/wiki/FAQ#why-am-i-seeing-receive-failed-disconnected
If you are using SSL, there was also a bug in librdkafka on Windows which caused connection failures. This has been fixed on librdkafka master, see #61
You seem to encounter actual network or kafka server problems though since the connection is refused:
2017-08-04 142340.PM Error: Code [Local_Transport] HasError [True] IsBrokerError [False] IsLocalError [True] Reason [sasl_plaintext://mycorpmessagebroker01:9092/1: Connect to ipv4#172.22.166.94:9092 failed: No connection could be made because the target machine actively refused it...]
I have not yet started using SSL.
I am only using SASL/Plaintext for authentication at this time.
I think the reason for this above error is due to 1 (mycorpmessagebroker01) of the 3 Kafka broker servers went down.
Actual issue for me is not the above error, but when the broker that went down comes back up, I see a CPU spike on consumer side and then the CPU stays at the spiked level. It does not go back to normal level. I am just wondering what could be causing the CPU to stay at spiked level. This behavior is only happening when I am using SASL/PLAINTEXT authentication on consumer side.
The SASL SCRAM handshake is a very CPU-intenstive operation consisting of thousands (depending on iteration count) of HMAC calculations, this would explain a shorter CPU spike as the client is reconnecting to the brokers.
But this CPU spike should go back to normal as soon as the client is has connected to all known brokers and passed the SASL SCRAM Handshake phase.
Ok, thank you for your input. Let me execute few more test runs and I will let you know the results.
I ran few rounds of testing and found that CPU utilization stays at elevated level. Even after 10 mins, the CPU stayed at elevated level. The trigger point for CPU spike is exactly when the broker that went down comes back up online. This is how I executed the test.
Repeated the above test 3 times and same result all 3 times.
Good stuff, very structured testing :+1:
Can you reproduce this with debug=fetch,protocol,broker,security enabled and provide us the logs?
Here is the log file with debug=fetch,protocol,broker,security enabled.
SASL_PLAINTEXT_CPUSPIKE_LOG.txt
Thank you very much.
I don't see anything in the logs that explains the increased CPU usage, what stands out though is that the SASL handshake frame from the client goes unanswered by the broker until you decide to take down the client a minute or later:
7|2017-09-06 10:57:52.364|Vinod_Test_PRP_Client.1#consumer-1|SASL| [thrd:sasl_plaintext://mycorpmessagebroker01:9092/bootstrap]: sasl_plaintext://mycorpmessagebroker01:9092/1: Send SASL frame to broker (62 bytes)
...
7|2017-09-06 10:58:50.787|Vinod_Test_PRP_Client.1#consumer-1|TERMINATE| [thrd:sasl_plaintext://mycorpmessagebroker01:9092/bootstrap]: sasl_plaintext://mycorpmessagebroker01:9092/1: Handle is terminating: failed 0 request(s) in retry+outbuf
Is there anything interesting in the broker logs during this minute?
Let me capture brokers logs as well to see if anything appears there.
I did 6 rounds of testing. 5 of them resulted in CPU spike and only once I did not see CPU spike.
For the 5 tests that failed, when one of the broker went down and came back up, in client side logs, I noticed
Send SASL frame to broker
But for the one test that passed (no CPU spike occurred), I noticed following two log entries
Send SASL frame to broker
Received SASL frame from broker
So not sure what caused that one test to be successful.
But again nothing interesting appeared on server side logs. I filtered the server side logs with username (testuser) or group name (Vinod_Test_PRP_Client.Group.1) and here are the log entries
kafkaServer.out:[2017-09-07 15:49:05,993] INFO Removing Produce quota for user testuser (kafka.server.ClientQuotaManager)
kafkaServer.out:[2017-09-07 15:49:05,993] INFO Removing Fetch quota for user testuser (kafka.server.ClientQuotaManager)
server.log:[2017-09-07 15:49:05,993] INFO Removing Produce quota for user testuser (kafka.server.ClientQuotaManager)
server.log:[2017-09-07 15:49:05,993] INFO Removing Fetch quota for user testuser (kafka.server.ClientQuotaManager)
server.log.2017-09-07-13:[2017-09-07 13:12:30,618] INFO Removing Produce quota for user testuser (kafka.server.ClientQuotaManager)
server.log.2017-09-07-13:[2017-09-07 13:12:30,618] INFO Removing Fetch quota for user testuser (kafka.server.ClientQuotaManager)
server.log.2017-09-07-14:[2017-09-07 14:00:52,940] INFO Removing Produce quota for user testuser (kafka.server.ClientQuotaManager)
server.log.2017-09-07-14:[2017-09-07 14:00:52,940] INFO Removing Fetch quota for user testuser (kafka.server.ClientQuotaManager)kafkaServer.out:[2017-09-07 15:45:30,509] INFO [GroupCoordinator 1]: Loading group metadata for Vinod_Test_PRP_Client.Group.1 with generation 24 (kafka.coordinator.GroupCoordinator)
kafkaServer.out:[2017-09-07 15:48:15,936] INFO [GroupCoordinator 1]: Preparing to restabilize group Vinod_Test_PRP_Client.Group.1 with old generation 24 (kafka.coordinator.GroupCoordinator)
kafkaServer.out:[2017-09-07 15:48:15,937] INFO [GroupCoordinator 1]: Stabilized group Vinod_Test_PRP_Client.Group.1 generation 25 (kafka.coordinator.GroupCoordinator)
kafkaServer.out:[2017-09-07 15:48:16,035] INFO [GroupCoordinator 1]: Preparing to restabilize group Vinod_Test_PRP_Client.Group.1 with old generation 25 (kafka.coordinator.GroupCoordinator)
kafkaServer.out:[2017-09-07 15:48:19,028] INFO [GroupCoordinator 1]: Stabilized group Vinod_Test_PRP_Client.Group.1 generation 26 (kafka.coordinator.GroupCoordinator)
kafkaServer.out:[2017-09-07 15:48:21,084] INFO [GroupCoordinator 1]: Assignment received from leader for group Vinod_Test_PRP_Client.Group.1 for generation 26 (kafka.coordinator.GroupCoordinator)
kafkaServer.out:[2017-09-07 15:49:20,880] INFO [GroupCoordinator 1]: Preparing to restabilize group Vinod_Test_PRP_Client.Group.1 with old generation 26 (kafka.coordinator.GroupCoordinator)
kafkaServer.out:[2017-09-07 15:49:22,934] INFO [GroupCoordinator 1]: Group Vinod_Test_PRP_Client.Group.1 with generation 27 is now empty (kafka.coordinator.GroupCoordinator)
Question: Is there a way to enable SASL related logging on server side?
@vinodres set debug=true in the jaas config file, something like:
org.apache.kafka.common.security.PlainLoginModule required debug=true {
user_adam=alsing;
};
OK @edenhill , I will try that and get back to you.
@edenhill : By the way, we have ScramLoginModule and I have asked to enabled debug for it.
Not much information captured after enabling debug, I see following few DEBUG messages on server side
[2017-09-27 11:22:35,542] DEBUG [IsrChangeNotificationListener on Controller 2]: ISR change notification listener fired (kafka.controller.IsrChangeNotificationListener)
[2017-09-27 11:22:37,230] DEBUG [IsrChangeNotificationListener on Controller 2]: ISR change notification listener fired (kafka.controller.IsrChangeNotificationListener)
Just as previously, I see following entry in client side log with no corresponding response for it.
Send SASL frame to broker
If I restart the client, then everything goes back to normal state, cpu stays normal until one of the broker goes down and comes back up.
@vinodres
[apologies. I'm slightly divulging from the actual issue here]
I just read your comment on august 3rd that you are using sasl/plain authentication on windows.
you also mention kafka broker is running on Linux.
can you briefly describe how is your set up and post sample code on how you testing it?
we have the same set up. added with cross domain. but couldn't connect using librdkafka .net wrapper so had to use java
can you kindly shed some light on this?
https://github.com/edenhill/librdkafka/issues/1259 is the one I had posted.
Kavya
@kavyashivakumar In my case the producer app is running on Windows Server 2016. Kafka brokers are running on Linux. I am able to successfully authenticate.
Please refer to confluentinc/confluent-kafka-dotnet#169 to see the config I am using for producer.
@vinodres,
I posted
https://github.com/confluentinc/confluent-kafka-dotnet/issues/331
as my issue. if you notice, i pretty much copied your description of the problem. except my mechanism is default gssapi.
i still can't seem to connect. would be really helpful if you can take a look at it. (have also mentioned yours Matt's and treziac names there.
(is there a chat I could open with you somewhere? just so I could get your help sorting this)
@vinodres is your set up windows to windows or windows to Linux? where kafka brokers are on Linux?
if so I have same config as yours but with default sasl mechanism.
can you or someone please shed some light on this? I keep getting authentication failure.
@edenhill @vinodres any update on resolving this issue? I'm having the same issue using version 0.11.4.
@justinamiller ... unfortunately I did not get a chance to take it any further. As a workaround for now I have reduced the number of consumer threads and to compensate that I have increased background tasks per consumer thread that does actual processing by taking the payload from consumer thread. For my use cases this did the trick, but may not be ideal solution for all use cases.
Even now I see CPU spikes during occasional broker bounces, but because I reduced number of consumer threads per topic, the CPU spike does not bring the server to its knees.
My goal is to try run the client on Linux server to see if the issue is isolated to Windows server.
Just wanted to let you know that I am affected by this issue and that I am able to reproduce it quite easily. I'm looking into the issue, trying to to diagnose it further and will add my findings to this issue. If there is anything I should try, I'll be happy to do so.
Current setup: 3 node cluster with SASLPLAIN/SCRAM on Linux, using rdkafka (via the .NET wrapper) on windows as a client with api.version.fallback.ms = 0 and broker.version.fallback = "0.10.0". The CPU is consumed in one of the non-managed threads. Problem starts immediately as a node, that previously was taken down, comes up again.
Just a few confirmations of things mentioned before with some additions:
I ran a tcpdump on the broker with some interesting results. I've annotated the trace with the entries from the debug=all client log (and some comments from my side)
TL;DR: Connection is closed by the broker, the client then waits for a response it will never get and seems to be busy looping while doing so.
// tcpdump started after broker is down - client is trying to connect but is rejected since service is down (TCP reset)
22:03:39.624256 IP 192.168.72.1.50512 > 192.168.72.144.9092: Flags [S], seq 538442745, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
22:03:39.624290 IP 192.168.72.144.9092 > 192.168.72.1.50512: Flags [R.], seq 0, ack 1, win 0, length 0
22:03:40.133240 IP 192.168.72.1.50512 > 192.168.72.144.9092: Flags [S], seq 538442745, win 8192, options [mss 1460,nop,nop,sackOK], length 0
22:03:40.133266 IP 192.168.72.144.9092 > 192.168.72.1.50512: Flags [R.], seq 0, ack 1, win 0, length 0
// broker is up - connect succeeds
22:03:42.122585 IP 192.168.72.1.50513 > 192.168.72.144.9092: Flags [S], seq 2539147520, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
22:03:42.122647 IP 192.168.72.144.9092 > 192.168.72.1.50513: Flags [S.], seq 852774782, ack 2539147521, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 1], length 0
22:03:42.123687 IP 192.168.72.1.50513 > 192.168.72.144.9092: Flags [.], ack 1, win 256, length 0
// API version is requested: [thrd:sasl_plaintext://kafka3:9092/3]: sasl_plaintext://kafka3:9092/3: Sent ApiVersionRequest (v0, 25 bytes @ 0, CorrId 209)
22:03:42.123731 IP 192.168.72.1.50513 > 192.168.72.144.9092: Flags [P.], seq 1:26, ack 1, win 256, length 25
22:03:42.123738 IP 192.168.72.144.9092 > 192.168.72.1.50513: Flags [.], ack 26, win 14600, length 0
// API version response: [thrd:sasl_plaintext://kafka3:9092/3]: sasl_plaintext://kafka3:9092/3: Received ApiVersionResponse (v0, 264 bytes, CorrId 209, rtt 437.00ms)
22:03:42.128403 IP 192.168.72.144.9092 > 192.168.72.1.50513: Flags [P.], seq 1:273, ack 26, win 14600, length 272
// [thrd:sasl_plaintext://kafka3:9092/3]: sasl_plaintext://kafka3:9092/3: Sent SaslHandshakeRequest (v0, 36 bytes @ 0, CorrId 210)
22:03:42.129543 IP 192.168.72.1.50513 > 192.168.72.144.9092: Flags [P.], seq 26:62, ack 273, win 255, length 36
// [thrd:sasl_plaintext://kafka3:9092/3]: sasl_plaintext://kafka3:9092/3: Received SaslHandshakeResponse (v0, 21 bytes, CorrId 210, rtt 16.00ms)
22:03:42.130275 IP 192.168.72.144.9092 > 192.168.72.1.50513: Flags [P.], seq 273:302, ack 62, win 14600, length 29
// not sure what this is...
22:03:42.131266 IP 192.168.72.1.50513 > 192.168.72.144.9092: Flags [P.], seq 62:66, ack 302, win 255, length 4
22:03:42.175559 IP 192.168.72.144.9092 > 192.168.72.1.50513: Flags [.], ack 66, win 14600, length 0
// this seems to be the sasl frame from the client to the broker [thrd:sasl_plaintext://kafka3:9092/3]: sasl_plaintext://kafka3:9092/3: Send SASL frame to broker (45 bytes)
22:03:42.176667 IP 192.168.72.1.50513 > 192.168.72.144.9092: Flags [P.], seq 66:111, ack 302, win 255, length 45
// the broker acknowledges the packet at the network level and shuts down the connection (FIN bit)
22:03:42.176693 IP 192.168.72.144.9092 > 192.168.72.1.50513: Flags [.], ack 111, win 14600, length 0
22:03:42.177619 IP 192.168.72.144.9092 > 192.168.72.1.50513: Flags [F.], seq 302, ack 111, win 14600, length 0
22:03:42.178687 IP 192.168.72.1.50513 > 192.168.72.144.9092: Flags [.], ack 303, win 255, length 0
// the client still seems to be busy waiting for a reply at this point - which causes the cpu to spike
Just looking at the code at github an initial take at the analysis (which means my analysis may be flawed in many ways because without a proper dev environment it is pretty hard to follow the code flow):
rd_kafka_transport_io_event() calls rd_kafka_sasl_io_event() when in state AUTH and checks for return value -1 to indicate an error (which would be logged). Everything else is assumed to be a success (?). rd_kafka_sasl_io_event() can return 0 when it gets 0 from rd_kafka_transport_socket_recvmsg() (via serveral additional functions in the call stack). The function containing the loop around all this (rd_kafka_broker_ua_idle()) does not have any intrinsic sleep and the functions down the call chain just try to set timeouts for the socket operations. Since we know that the socket is closed by the remote end all socket functions will immediately, thus creating a tight loop which explains the cpu spike.
While it is certainly worthwhile to fix this problem above (i.e. detect that remote has closed the socket and do the same) it is still unclear to me why the broker disconnects us in the first place (which is the root cause for the problem).
Either something is wrong with the SASL request we are sending or the other packet exchange I labeled with "not sure what this is" causes some disturbance.
I have nothing useful in the broker logs (also added the debug flag to the jaas file). The only errors I found were some instances of
[2018-08-11 21:18:26,252] ERROR Error while accepting connection (kafka.network.Acceptor)
java.lang.ArithmeticException: / by zero
at kafka.network.Acceptor.run(SocketServer.scala:354)
at java.base/java.lang.Thread.run(Thread.java:844)
But this is not consistent. I also had spikes without these messages. Generally they seem to be occurring if there are incoming connection attempts between the INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor) and INFO [SocketServer brokerId=3] Started 1 acceptor threads (kafka.network.SocketServer) while the broker starts up. Most likely something is not completely initialized at this stage.
@stefanseufert
Fantastic analysis!
I'll fix the librdkafka code to detect disconnects.
As for the disconnects themselves;
this is the broker's way (for older versions at least) to indicate that authentication failed.
The division by zero exception might indicate a bug instead, probably this one: https://issues.apache.org/jira/browse/KAFKA-6982
I've seen some activity in the linked issue. Anything to test for me?
I just tested the official 0.11.6-RC4 nuget package(s) and can positively confirm that the problem is now fixed. Thank you for including this into this release.
This is awesome. Really appreciate @stefanseufert and @edenhill for getting to the bottom of this issue.
All credits go to @stefanseufert !
Most helpful comment
I just tested the official 0.11.6-RC4 nuget package(s) and can positively confirm that the problem is now fixed. Thank you for including this into this release.