Certain broker communication failures are not bubbled up as exceptions on the producing side, when producing via ProduceAsync(). This means the failure to produce to a topic goes unreported, leading to potential loss of data.
Unfortunately I do not have clear replication steps. We stumbled upon the issue after a few days of normal Kafka operation, no idea what caused it. Restarting the machine where our Kafka brokers live fixed the communication problem.
After we figured out no messages are getting across, the only indication of a problem was obtained by adding the debug flag to the producer config ({ "debug", "broker,topic,msg"}):
BROKERFAIL| [thrd:xxx.xxx.xxx.xxx:9092/bootstrap]: xxxx.xxx.xxx.xxx:9092/0: failed: err: Local: Broker transport failure: (errno: No error)
This is very cryptic and not very useful, further than signaling a problem. It is also buried in a sea of other messages.
Please provide the following information:
Are you checking the Error property on the Message returned from ProduceAsync? In 0.11.x, produce errors do not result in exceptions - the task completes successfully, with the Error field set to something other than NoError. In 1.0-beta, this has been changed to behave as you are expecting.
@mhowlett that is a very good point, I'm not testing the result message right now, I was assuming exceptions would be raised. Thank you for pointing out how it actually works, I'll start checking the message for now.
another thing that's not obvious/confusing: OnError messages should mostly not be acted on (in 0.11.x, no events emitted by this event should be considered fatal - the client will recover from them all). 1.0-beta adds an IsFatal flag to this event to make how these should be handled clear.