I am having troubles with the two methods in the Producer: GetWatermarkOffsets and QueryWatermarkOffsets.
From initial testing:
I am just curious if I am not implementing GetWatermarkOffsets correctly, as well as if these methods are supposedly doing the same thing.
While QueryWatermarkOffsets() queries the broker for the requested offsets, GetWatermarkOffsets provides an optimized alternative that uses the latest cached known values for the High and Low watermarks, but this only works if the consumer is actively consuming the requested partitions.
If there is no cached watermark it will return the invalid/unset offset which is -1001.
I have to say that this is an awesome! Now that I understand the difference that is. Thank you for giving me an efficient feature like this!
@edenhill By latest cached values, do you mean the consumer lag ?
The high watermark offset will be the last stored offset on the topic/partition (not the current consumer position).
I'm going to mark this method as unstable in the upcoming release because I find it a bit confusing that the call doesn't depend on the specific consumer (and people, quite logically, often expect that it does). At some point we will likely have an Admin client (in addition to Producer and Consumer), and this functionality might be better restricted to being there.
UPDATE: Oh, that won't work for GetWatermarkOffsets - it needs to be exposed via a consumer client (and we want to retain the feature). Still, I will probably flag this as unstable as I find it a little confusing as is.
@Dexu5 Each time the consumer fetches messages for a partition it will also receive the current high watermark of that partition - i.e., the offset of the last/newest message in the partition + 1.
The low watermark is the oldest message in the partition and there is no automatic functionality to get updated on when this changes so it is polled at every statistics.interval.ms instead - but since old messages are only removed when log segments are rolled over and retention policy kicks in the low watermark changes very infrequently (compared to the high watermark).
The consumer lag is the difference between the high watermark and the consumer's current position. So the main use of the cached GetWatermarkOffsets API is to allow a consumer application to calculate its consumer lag in an efficient manner.
The non-cached QueryWatermarkOffsets API will ask the broker for the low and high watermark offsets and is thus much slower, the typical use-case is something more one-off:ish, like initiating a bsearch or similar.
@edenhill @mhowlett Thank you so much for clearing the doubts. This surely got me leads on further processing on my project.