Is your feature request related to a problem? Please describe.
The current implementation of the redelivery count only increases the counter when either:
However, the ack timeout is tracked by the consumer so consider the following situation:
All of the sudden all consumers are getting OOM killed and the only solution is to clear the backlog for the topic.
Describe the solution you'd like
A possible solution is to increase the redelivery count when the message is received by the consumer. That way the count will increase if any of these happen:
Describe alternatives you've considered
The solution I'm taking now (which is specific for our use case) is to check the publish timestamp of messages before they're processed in the consumer. That way if a message has been sitting for a long time it should be removed from the queue. This is not a generic solution and not very robust.
The redelivery count is used by DLQ, therefore we'd better increase it after ack timeout expires or a nack happens. Otherwise, if user enable the DLQ, messages may be put into the dead letter topic in advance
@codelipenghui But doesn't DLQ suffer the same problem that I describe? A "bad" message will never go to the DLQ since it kills consumers before the redelivery count goes up.
If the redelivery count is increased when the message is received, can't we change the implementation of DLQ so it checks whether the message should go to DLQ:
@frejonb I may not understand this issue. what do you mean by "bring down" all the consumers?
From the example situation, the consumer got OOM is because the timeout too long, and hold too many messages in consumer?
@jiazhai the reason a consumer can get OOM killed is because it received a "bad message". The processing of this bad message can lead to an application crash (the consumer) for many reasons. For example if the processing of the bad message requires too much RAM, then it may be OOM killed. But the application can also be unresponsive for many reasons.
The particular cause is not important here. My point is that the motivation for implementing DLQ was to prevent this https://github.com/apache/pulsar/issues/189, https://github.com/apache/pulsar/pull/2400. But in the situation when the consumer is not responsive, the redelivery count is not increased after ack timeout and therefore there's no way to get rid of the message.
@frejonb Thanks for the explanation. I get it now. When message redeliver around consumers, A killer message will kill all the consumers.
It make sense.
@codelipenghui What do you think of change the behavior?
@jiazhai @frejonb If i understand correctly, the problem is the redelivery count not increase but the message was actually redelivered since only increase redelivery count when client send redelivery request. Currently, we maintain the redelivery count at broker side, i think we'd fix it at broker side, the brokers need to guarantee the correctness of redelivery count cannot completely depends on client redelivery un-ack messages requests.
@frejonb Does this can solve your problem?
@codelipenghui sounds good
Most helpful comment
@jiazhai @frejonb If i understand correctly, the problem is the redelivery count not increase but the message was actually redelivered since only increase redelivery count when client send redelivery request. Currently, we maintain the redelivery count at broker side, i think we'd fix it at broker side, the brokers need to guarantee the correctness of redelivery count cannot completely depends on client redelivery un-ack messages requests.
@frejonb Does this can solve your problem?