We are developing a medium-sized application that uses Kafka as a publish-subscribe message broker. The application crashes when we subscribe several consumers using wildcard topics. If we do not use wildcard topics, everything works as expected.
Most of the time the crash happens a few seconds after subscribing the consumer. The crash cause varies among access violation, stack overflow and heap corruption, and it is usually located in librdkafka.DLL, although sometimes other DLLs appear, such as ntdll.dll.
We have verified that exactly the same code, running exactly the same versions of the Confluent.Kafka NuGet package and dotnet, crashes only on Windows. If the code is executed on a Linux host, it never crashes.
We have also verified that this crash happens at least with versions 0.11.4, 0.11.6 and 1.0.0-RC4 of Confluent.Kafka nuget.
We have prepared a toy example below that reproduces the crash.
With version 1.0.0-RC4 of the Nuget, using dotnet 2.2.105, the following code reproduces the error in Windows 10 x64:
static void Main(string[] args)
{
var conf1 = new ConsumerConfig
{
GroupId = "test-consumer-group1",
BootstrapServers = "localhost:9092",
AutoOffsetReset = AutoOffsetReset.Latest
};
var conf2 = new ConsumerConfig
{
GroupId = "test-consumer-group2",
BootstrapServers = "localhost:9092",
AutoOffsetReset = AutoOffsetReset.Latest
};
var conf3 = new ConsumerConfig
{
GroupId = "test-consumer-group3",
BootstrapServers = "localhost:9092",
AutoOffsetReset = AutoOffsetReset.Latest
};
var conf4 = new ConsumerConfig
{
GroupId = "test-consumer-group4",
BootstrapServers = "localhost:9092",
AutoOffsetReset = AutoOffsetReset.Latest
};
var c1 = new ConsumerBuilder<Ignore, string>(conf1).Build();
var c2 = new ConsumerBuilder<Ignore, string>(conf2).Build();
var c3 = new ConsumerBuilder<Ignore, string>(conf3).Build();
var c4 = new ConsumerBuilder<Ignore, string>(conf4).Build();
c1.Subscribe("^tenants\\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\\.plants\\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\\.notifications\\..*");
c2.Subscribe("^tenants\\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\\.plants\\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\\.notifications\\..*");
c3.Subscribe("^tenants\\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\\.plants\\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\\.notifications\\..*");
c4.Subscribe("^tenants\\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\\.plants\\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\\.notifications\\..*");
Console.WriteLine("All consumers subscribed.");
CancellationTokenSource cts = new CancellationTokenSource();
Console.CancelKeyPress += (_, e) => {
e.Cancel = true; // prevent the process from terminating.
cts.Cancel();
};
while (true)
{
try
{
var cr1 = c1.Consume(cts.Token);
Console.WriteLine($"1. Consumed message '{cr1.Value}' at: '{cr1.TopicPartitionOffset}'.");
var cr2 = c2.Consume(cts.Token);
Console.WriteLine($"2. Consumed message '{cr2.Value}' at: '{cr2.TopicPartitionOffset}'.");
var cr3 = c3.Consume(cts.Token);
Console.WriteLine($"3. Consumed message '{cr3.Value}' at: '{cr3.TopicPartitionOffset}'.");
var cr4 = c4.Consume(cts.Token);
Console.WriteLine($"4. Consumed message '{cr4.Value}' at: '{cr4.TopicPartitionOffset}'.");
}
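catch (ConsumeException e)
{
Console.WriteLine($"Error occurred: {e.Error.Reason}");
}
catch (OperationCanceledException)
{
// Ctrl+C was pressed: leave the loop so the consumers below are closed cleanly.
break;
}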
}
c1.Close();
c2.Close();
c3.Close();
c4.Close();
}
The program prints "All consumers subscribed." and then dies after a few seconds. If the topics are replaced by "tenants\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\.plants\.a77b7bec-c9d5-468c-89d8-cc3dc293354f\.notifications\.a" (the other three ending in b, c and d respectively), the application does not crash.
In the code of the project the way in which we setup the consumers and manage the messages is much more involved (we use several async methods), but this minimal example reproduces the error exactly in the same way.
We can provide more details on our setup and even a zip file with the project if needed.
Please provide the following information:
Name of the application with errors: dotnet.exe, version: 2.2.27207.3, timestamp: 0x5c0ab1b7
Name of the module with errors: librdkafka.DLL, version: 0.0.0.0, timestamp: 0x5c99628f
Exception code: 0xc00000fd
Error offset: 0x00000000000cb36e
Identifier of the process with errors: 0x6938
Application with errors start time: 0x01d4eeb93b8ac8e1
Path to the application with errors: C:\Program Files\dotnet\dotnet.exe
Path to the application module with errors: C:\Users\vmartin\.nuget\packages\librdkafka.redist\1.0.0\runtimes\win-x64\native\librdkafka.DLL
Report identifier: 879113b6-44ff-4e07-94ce-1094048db72a
Name of the application with errors: dotnet.exe, version: 2.2.27207.3, timestamp: 0x5c0ab1b7
Name of the module with errors: ntdll.dll, version: 10.0.17763.404, timestamp: 0xbf6ea104
Exception code: 0xc0000374
Error offset: 0x00000000000faf89
Identifier of the process with errors: 0x6a54
Application with errors start time: 0x01d4eedb61eb4cea
Path to the application with errors: C:\Program Files\dotnet\dotnet.exe
Path to the application module with errors: C:\WINDOWS\SYSTEM32\ntdll.dll
Report identifier: 00e2b779-0858-46e6-b15b-590d80884540
thanks for the detailed summary - seems like a Windows only bug in librdkafka. we will look into this.
Should I repost this issue to edenhill/librdkafka?
Are 4 consumers required for the program to crash, or can fewer be used?
I used 4 in the example for simplicity; in our production setup we use more than 4. However, I've been playing a bit with the toy example I provided above, using different numbers of consumers, and these are the results:
|Consumers|Runs|Crashes|
|------------|------|--------|
|1|7|0|
|2|5|3|
|4|5|4|
|8|7|7|
So it seems the minimum number of consumers needed to crash the program is 2, but the chance of crashing increases with the number of consumers. With 8 consumers it crashed on every run; with 2 it crashed on roughly half of the runs. Again, if I remove regular expressions from the topics, it never crashes.
Thank you! Will try to reproduce
Hi, is there any progress on this? I can't debug my application on windows machines due to this bug. If there is any workaround I could try while the fix is being developed, it will be appreciated. Thanks!
Hi again, is there any progress on this? I sent you a test program that reproduces the error, and did a bit of research that suggested that the problem was related to wildcard topics when there is more than one consumer. We are still unable to debug our programs on Windows. Have you been able to reproduce the error with the toy program I posted above? If you need any other information do not hesitate to ask me. Thanks!
thanks for the detailed feedback - sorry, we haven't got to it yet.
We hit this issue today. I can confirm the behavior - it crashes on Windows if there are more than two consumers using wildcard subscription. As it crashes without any (managed) exception or stack trace, it took me a while to figure out what's going on.
I am seeing the same problem on a Windows 10 machine running version 1.0.1. We are using lz4 compression on the producer side (as mentioned in #482). Our consumer application creates two consumer objects that use wildcard subscriptions (with different topic patterns) on startup. Shortly afterwards, the application crashes, sometimes due to a StackOverflow and other times due to a MemoryAccessViolation.
@edenhill Have you been able to reproduce this yet? I cannot see any issues in the librdkafka repo that seem related to this bug.
I have the same problem on a Windows 10 machine running version 1.4.0. My application creates two consumer objects that use wildcard subscriptions with different topic patterns on startup. After a few seconds, the application crashes with a StackOverflow error.
I've tried the toy program above with version 1.4.0 and dotnet version 3.0.100 and the crash no longer happens. Even if I use version 1.0.0-RC4 of the library, which crashed before, it no longer crashes. So it seems the problem had to do with the dotnet SDK.
I encountered this issue today, and believe I've managed to locate the issue with the help of a crash dump.
Observations:
Occurs when there are two Consumers both subscribed to a regex topic
Does not occur when a Single Consumer subscribes to two regex topics
Does not occur when one Consumer subscribes to a regex topic and another Consumer subscribes to a non-regex topic
Occurs on Windows 10 64bit netcoreapp3.1 using Confluent.Kafka 1.3.0, 1.4.0 and 1.4.3
Call stack (Confluent.Kafka 1.4.3):
ntdll.dll!RtlFreeHeap() Unknown
librdkafka.DLL!free(void * pBlock) Line 51 C
librdkafka.DLL!re_regfree(Reprog * prog) Line 897 C
librdkafka.DLL!rd_regex_match(const char * pattern, const char * str, char * errstr, unsigned __int64 errstr_size) Line 152 C
> librdkafka.DLL!rd_kafka_topic_match(rd_kafka_s * rk, const char * pattern, const char * topic) Line 1476 C
librdkafka.DLL!rd_kafka_metadata_topic_match(rd_kafka_s * rk, rd_list_s * tinfos, const rd_kafka_topic_partition_list_s * match) Line 674 C
librdkafka.DLL!rd_kafka_cgrp_metadata_update_check(rd_kafka_cgrp_s * rkcg, int do_join) Line 3507 C
librdkafka.DLL!rd_kafka_parse_Metadata(rd_kafka_broker_s * rkb, rd_kafka_buf_s * request, rd_kafka_buf_s * rkbuf, rd_kafka_metadata * * mdp) Line 595 C
librdkafka.DLL!rd_kafka_handle_Metadata(rd_kafka_s * rk, rd_kafka_broker_s * rkb, rd_kafka_resp_err_t err, rd_kafka_buf_s * rkbuf, rd_kafka_buf_s * request, void * opaque) Line 1691 C
librdkafka.DLL!rd_kafka_buf_callback(rd_kafka_s * rk, rd_kafka_broker_s * rkb, rd_kafka_resp_err_t err, rd_kafka_buf_s * response, rd_kafka_buf_s * request) Line 464 C
librdkafka.DLL!rd_kafka_buf_handle_op(rd_kafka_op_s * rko, rd_kafka_resp_err_t err) Line 413 C
librdkafka.DLL!rd_kafka_op_handle_std(rd_kafka_s * rk, rd_kafka_q_s * rkq, rd_kafka_op_s * rko, int cb_type) Line 685 C
librdkafka.DLL!rd_kafka_op_handle(rd_kafka_s * rk, rd_kafka_q_s * rkq, rd_kafka_op_s * rko, rd_kafka_q_cb_type_t cb_type, void * opaque, rd_kafka_op_res_t(*)(rd_kafka_s *, rd_kafka_q_s *, rd_kafka_op_s *, rd_kafka_q_cb_type_t, void *) callback) Line 715 C
librdkafka.DLL!rd_kafka_q_serve(rd_kafka_q_s * rkq, int timeout_ms, int max_cnt, rd_kafka_q_cb_type_t cb_type, rd_kafka_op_res_t(*)(rd_kafka_s *, rd_kafka_q_s *, rd_kafka_op_s *, rd_kafka_q_cb_type_t, void *) callback, void * opaque) Line 500 C
librdkafka.DLL!rd_kafka_thread_main(void * arg) Line 1942 C
librdkafka.DLL!_thrd_wrapper_function(void * aArg) Line 579 C
[External Code]
It looks to me that two Consumers, both using regex, would assign, use, and free the g static variable https://github.com/edenhill/librdkafka/blob/master/src/regexp.c#L82 in a non-threadsafe manner, resulting in what I believe is a double free in the call stack above.
@JohnLampitt Great find! We'll fix this for the upcoming librdkafka v1.5.0 release.