Confluent-kafka-dotnet: OutOfOrderSequenceException

Created on 17 Dec 2018  路  33Comments  路  Source: confluentinc/confluent-kafka-dotnet

Description

Hi Team,

I am getting OutOfOrderSequenceException,

Please find my settings

"LingerMs": 20,
"BatchNumMessages": 1000,
"MessageMaxBytes": 1000000,
"MaxInFlight": 4,
"Acks": -1,
"MessageSendMaxRetries": 1000000,
"MessageTimeoutMs": 30000,
"EnableIdempotence": true

How to reproduce

I am doing a load test. I am generating a load of 100kb message string in a loop and sending 1 million messages. I am getting the Out of order sequence exception. Is it because of the settin gthe idempotence but not the transactional.id?.
Kindly help..

Checklist

Please provide the following information:

  • [1-beta2 ] Confluent.Kafka nuget version:
  • [2.0 ] Apache Kafka version:
  • [ ] Client configuration:
  • [ ] Operating system:
  • [ ] Provide logs (with "debug" : "..." as necessary in configuration)
  • [ ] Provide broker log excerpts
  • [ ] Critical issue
bug librdkafka

All 33 comments

first thing to try is reference the librdkafka.redist 1.0.0-RC4 package. 1.0-beta2 references 1.0.0-PRE1 iirc and the idempotent producer functionality in that version was buggy.

@mhowlett : Sure, I will update the library and let you know. Thanks

@mhowlett : After upgrading to the RC4, we get the same error.

@Tenseiga Can you reproduce this with debug set to eos,msg,broker and provide us with logs?
Thanks

Make sure to reproduce on librdkafka.redist 1.0.0-RC4

to be sure librdkafka.redist 1.0.0-RC4 is being used, it would be worth checking the value of Library.VersionString.

@mhowlett Library.VersionString is 1.0.0-Pre2. Does it equal to 1.0.0-RC4?

@mhowlett The Library.VersionString says 1.0.0-Pre2, in the nuget repo there is bo pre2, I see only RC4, assuming this is pre2 is RC4/RC5. I have collected the logs.
@edenhill Please find the statistics that you have asked.
statistic.txt

I don't think that's librdkafka v1.0.0-RC4.

@edenhill : I have updated the librdkafka.redist to RC4 and rebuilt the Confluent.Kafka.dll and using in my library. I still get 1.0.0-pre2
image

BTW you will see the COnfluent.Kafka version as 2.0.0,its my local change.
Am I doing know?

@mhowlett ?

@edenhill : internally it creates a delegate to _'rd_kafka_version_str_' method.

Thing is there is no 1.0.0-pre2 librdkafka version:
https://www.nuget.org/packages/librdkafka.redist/

@mhowlett any help on this :)

I hope to release -beta3 later today which will reference librdkafka RC5.

Alternatively you can use the Library.Load method to explicitly load the librdkafka dependency.

are you on .net framework? I'm not sure if nuget will get the right package in that case (if an explicit librdkafka.redist is referenced) - this may be the problem. It works in my usual environment which is .net core on macos.

Sorry, I forgot to update librdkafka's internal version number so v1.0.0-PRE2 could mean any version since v1.0.0-PRE1 to ..RC-4.
Please try to reproduce the issue on Matt's beta3, when it gets out.

@mhowlett : I am using dotnet core 2.1, win 10 pro, vs 2017

the issue may be related to memory management because we won't encounter the exception if we send the same 100kb message for 1 millions times. However, if we send 1 millions different messages (100kb per message), we will get the exception.

@edenhill : Then I am using RC4 release. Yet getting the same issue.

@Tenseiga Can you reproduce this with debug=msg,eos,broker,protocol and provide the complete debug logs? Please also provide your producer configs. Thanks

@edenhill : Please find the attached. My producer settings

"Topic": "debug-1m-5",
"MessageSize": 102400, 
"MessageCount": 1000000,
"Producer": {
  "LingerMs": 20,
  "BatchNumMessages": 1000,
  "MessageMaxBytes": 1000000,
  "MaxInFlight": 4,
  "Acks": -1,
  "MessageSendMaxRetries": 1000000,
  "MessageTimeoutMs": 30000,
  "EnableIdempotence": true
}

Statistics.txt

Thanks for the help

Those are statistics (without line breaks...), I need the debug logs (written to console/stderr).

@edenhill : Please find the Log.
Trace.txt

BTW My Partition count is 6 and Replication factor 3 , with 3 broker 3 zookeeper set up in Kubernetes.

Thank you, that's very helpful!

@Tenseiga Can you verify what the Debug property was set to for the Trace.txt log run?

Its "msg,eos,broker,protocol".

This a fresh set of traces, Might help you

Trace.zip

After thorough analysis of the log files and idempotence code I've come to the conclusion that this is caused by the message timing out due to message.timeout.ms.

What happens, at a high level is:

  1. messages 1,2,3,4 are sent to the broker
  2. message 1 is successfully acked by the broker
  3. the broker hangs, causing the request timeout to kick in for message 2 that is in-flight
  4. messages 2,3,4 are re-inserted on the producer queue
  5. the partition timeout scanner runs (once per second) and sees that message 2 has timed out, removes message 2. Producer queue is now 3,4
  6. the broker comes up again
  7. producer sends messages from producer queue to the broker: 3,4
  8. the broker expected message 2, not 3, so returns an OutOfOrderSequence error
  9. producer receives the error and raises a fatal exception

While this issue was discussed briefly in the KIP-91 discussion thread, the issue exists in the Java client as well.

I'll look into what the best solution is for this.

A lower message.timeout.ms will trigger this behaviour more frequently, the workaround is thus to increase message.timeout.ms towards the max value.

@Tenseiga I don't think you were running librdkafka v1.0.0-RC4.
The fatal error message was changed in RC3 to include the topic+partition, but in your log message it is missing, so I think you're using an older version.

FATAL [thrd:10.12.8.136:9092/bootstrap]: Fatal error: Broker: Broker received an out of order sequence number: ProduceRequest with 9 message(s) failed with rewound sequence number on broker 0 (PID{Id:240,Epoch:0}, base seq 25084 < next seq 26464): last error _TIMED_OUT (actions Retry,MsgPossiblyPersisted, base seq 26457, 644ms ago)

https://github.com/edenhill/librdkafka/commit/c2af9b3c49cad80a40da1847068c400f71f37b1b#diff-28f3a6d6e1e2b74d67ec19602f826162R2011

I've tried to reproduce the issue on master (RC5 equivalent) without luck, so I think this specific issue has been fixed.
Can you verify by upgrading to confluent-kafka-dotnet v1.0-beta2 (which contains librdkafka v1.0.0-RC5) ?

@edenhill : I have already provided screenshot where I rebuilt the library using the RC4 release. Even my publish folder had RC4 release. BTW I got it from Nuget .

Yes, but the logs indicate a prior version was being run, so it must have been picked up an older version at runtime for some reason.

I had checked the my csproj as well, it was pointing to RC4 release, I cant understand how it can pick a old version.
image

Was this page helpful?
0 / 5 - 0 ratings