google-cloud-go 🚀 - pubsub: Large number of duplicate messages suddenly

@cristiangraz We have a working reproduction in Java. I'll see if I can also repro it here. Could you let us know how long the deadline on your subscription is?

pongad on 3 Oct 2017

@pongad Is the deadline different from MaxExtension in the ReceiveSettings struct? Just want to confirm I give you the right thing, MaxExtension is 10 minutes and MaxOutstandingMessages in this instance was set to 10.

cristiangraz on 3 Oct 2017

@cristiangraz Sorry for the confusion, I was referring to deadline configured on pubsub subscription. You can get it from either the cloud console, or run

cfg, err := subscription.Config(ctx)
_ = err // handle err
fmt.Println(cfg.AckDeadline)

I sent a PR which should fix the problem to the Java repo. I plan to fix all the design concerns there first, then I'll replicate it here.

pongad on 4 Oct 2017

@pongad the deadline on the subscription is set to 10 seconds (from cloud console). Also in this instance all of the workers are completing in ~2-3 seconds once they receive the message.

cristiangraz on 4 Oct 2017

Same here. We tried using MaxOutstandingMessages = 10 and got lots of duplicates. Switching back to the default was much better. AckDeadline was also set to 10sec.

jfbus on 5 Oct 2017

This symptom suggests that this is the same problem we're seeing in Java.

Our diagnosis is this: The perceived communication latency between the pubsub server and the client can be high, so the client thinks messages expire further into the future than they actually do. By the time the client sends a "please give me more time" request to the server, it's too late: the server has already considered the messages expired and resent them.
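
To make the timing in this diagnosis concrete, here is a minimal sketch. It is a simplified model, not the client's actual logic: it assumes the server starts the deadline clock when it sends the message, the client starts its own clock only when the message arrives, and the client asks for more time a fixed grace period before its own idea of the deadline. The names `extensionArrivesLate`, `grace`, and `delay` are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// extensionArrivesLate reports whether the "please give me more time" request
// is sent after the server-side deadline has already passed.
//
// Assumptions (illustrative only):
//   - the server's deadline clock starts when the message is sent;
//   - the message reaches the client `delay` later;
//   - the client sends the extension `grace` before its own idea of the
//     deadline, and the request itself travels instantly.
func extensionArrivesLate(ackDeadline, grace, delay time.Duration) bool {
	// Server-side time at which the client sends the extension request.
	sentAt := delay + (ackDeadline - grace)
	return sentAt > ackDeadline
}

func main() {
	deadline := 10 * time.Second
	grace := 5 * time.Second
	delays := []time.Duration{100 * time.Millisecond, 4 * time.Second, 6 * time.Second}
	for _, delay := range delays {
		fmt.Printf("delay=%v late=%v\n", delay, extensionArrivesLate(deadline, grace, delay))
	}
}
```

In this model the extension is late whenever the delivery delay exceeds the grace period, regardless of how quickly the worker itself finishes.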

pongad on 6 Oct 2017

@pongad With a pubsub deadline (in cloud console of 10s): Pubsub starts counting the 10s before client starts counting 10s (due to latency) -- so pubsub has already resent the message before client could request an extension? Am I understanding that correctly?

If a worker on our end is processing a message in < 2s, does that mean there could be up to 8s of latency between the pubsub server and our client? Is there something on the Pubsub server that has caused latency to increase recently? We're running on Google Container Engine, so trying to grasp what the latency issues might be inside the datacenter from GKE to Pubsub.

If we changed the subscription deadline to 30s for example, would that solve this issue?

We have MaxOutstandingMessages (of 10 in this instance) equal to the number of goroutines processing messages (in order to avoid pulling messages before they are ready to be processed).

Thanks for all your help so far.

cristiangraz on 6 Oct 2017

@cristiangraz There are a few details, so I need to answer out of order.

pubsub has already resent the message before client could request an extension? Am I understanding that correctly?

Yes, we believe that's the root cause.

If a worker on our end is processing a message in < 2s, does that mean there could be up to 8s of latency between the pubsub server and our client?

There are a few variables at play. According to this comment each response contains "a single batch as published". So, if you set MaxOutstandingMessages = 10, the server might send us 11 messages in the same response. The client will process up to 10 messages concurrently, but the 11th message needs to wait until one of the first 10 finishes. So the 11th message might appear to take 4 seconds to process, etc.
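
The "11th message" effect can be sketched with a little arithmetic. This is only an illustration under stated assumptions: every message takes the same processing time (2s here, taken from the question above), at most `limit` messages run concurrently, and messages start in batch order. The function name `apparentLatency` is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// apparentLatency returns how long the message at position idx (0-based) in
// a delivered batch appears to take, measured from the batch arriving on the
// client to that message finishing.
//
// Assumptions: limit > 0, every message needs the same processing time, and
// at most `limit` messages are processed at once, in batch order.
func apparentLatency(idx, limit int, processing time.Duration) time.Duration {
	wave := idx / limit // number of full waves that must finish first
	return time.Duration(wave+1) * processing
}

func main() {
	const limit = 10
	processing := 2 * time.Second
	for _, idx := range []int{0, 9, 10} {
		fmt.Printf("message %d: %v\n", idx+1, apparentLatency(idx, limit, processing))
	}
}
```

Under these assumptions the first ten messages each appear to take 2s, while the 11th appears to take 4s even though its own processing time is unchanged.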

Is there something on the Pubsub server that has caused latency to increase recently? We're running on Google Container Engine, so trying to grasp what the latency issues might be inside the datacenter from GKE to Pubsub.

I'm actually not 100% sure. After I reproduced the problem, the logs showed that there has to be some latency going on, but I cannot tell which component (gRPC? pubsub server? firewall? etc) is introducing the latency.

If we changed the subscription deadline to 30s for example, would that solve this issue?

That would certainly help. I should note that if you set it too high, it will take longer for the message to be redelivered if your machine crashes. If you decide to increase the deadline, could you let us know if it's helping?

I created a PR to Java client (after merging, I can replicate in other langs) that will make the client send "I need more time" message to the server earlier to account for the latency.

pongad on 6 Oct 2017

I am running into the same issue and it's incredibly frustrating. In my debugging, I set ReceiveSettings.NumGoroutines = 1, MaxOutstandingMessages = 1 and ReceiveSettings.MaxExtension = 10 * time.Minute.

With the default subscription.AckDeadline = 10seconds:

If I ack the msg as soon as I receive it, I get 100% deletion rate.
If I wait 100ms after (just sleep), I deleted 196 of 200 msgs.
If I wait 200ms after, I deleted 102 of 200 msgs.
If I wait 300ms after, I deleted only 68 of 200 msgs.

I also created a subscription w/ AckDeadline: 5 * time.Minute. The deletion/ack rate is slightly better. The last run I did w/ 300ms delay ack'ed 123 of 200 msgs.

Did Google Pubsub change something on their end? Seems like it's all happening in the last couple of days.

zhenjl on 6 Oct 2017

@pongad I tried to change the subscription deadline in the cloud console, but I didn't see any option for it. Is there any way to get more info internally from the Pubsub team about possible latency issues? It's definitely worrisome to see latency > 10-20ms inside of GKE, let alone the possibility of latency in the several seconds (if it's not related to streaming pull).

w.r.t. MaxOutstandingMessages and streaming pulls -- So if I set a value of 10, I should receive 10 messages? The only exception being that messages published in batch are also fetched in batch causing the streaming pull to possibly retrieve more than 10 messages and wait to process them? If that's correct, in our instance we were seeing duplicate messages mostly in our domain system from an order. The order items are batch published in a single call to the fulfillment system, but the fulfillment system handles each message individually and publishes one at a time to the domain system (no batch publishing). They all are published around the same time, but not in a batch call. The domain system is where I am seeing the largest number of duplicates. If that's the only exception to more than MaxOutstandingMessages being received, then in our case there shouldn't be any messages that have been pulled but are waiting to begin processing? Is it possible something else is causing the latency?

This is only anecdotal as I haven't set up a test as thorough as @zhenjl, but from what I can observe duplicates are more likely when multiple messages are coming in around the same time (although not originating from a batch publish). When a worker only has a single message to process, I don't see any signs of duplicates so far -- even in the same workers.

cristiangraz on 7 Oct 2017

I made a CL that should help with this.

@zhenjl Thank you for the measurements. This is in line with our expectation. If you ack the message very quickly, the acks will reach the server before the server expires the message, so there'd be fewer duplicates.

@cristiangraz You're right, I'm sorry. The only way to "change" the deadline right now is to delete the subscription and recreate. Hopefully the CL would mean you don't have to do that.

The fact that you're seeing this problem in your domain system is indeed suspicious. Could you let us know how you're publishing? If you're publishing by topic.Publish in the pubsub package, the publisher tries to batch messages together to increase performance. Could that be the problem?

I'll continue to work with the pubsub team to fix this.

pongad on 9 Oct 2017

@pongad Yeah we are using topic.Publish -- I suppose if several independent goroutines are calling Publish the client could be batching them.

cristiangraz on 10 Oct 2017

We investigated this more. We believe the messages are buffered on a machine between your VM and the pubsub server. This explains the latency: pubsub server can send messages into the buffer faster than we can read it out, so some messages spend a few seconds in there. We're working to fix this.

In the immediate term, you can try reverting to v0.11.0. That version pulls messages from a slower endpoint that shouldn't be affected by this problem. If you see a compile error like undefined: gax.MustCompilePathTemplate, you need to also revert gax-go to version v1.0.0. Please let me know if this helps.

pongad on 10 Oct 2017

Thanks @pongad. We rolled out a fix for all of our workers on Friday so they are unique by (subscription name, message id) using SQL locks, so we're currently protected from duplicate concurrent messages. I had some issues with v0.11.0 so am going to stay pinned to the current version until a fix is ready.

I appreciate all of your help digging into this, it sounds like you may have found the cause behind the latency. If you need any additional info please let me know.

cristiangraz on 10 Oct 2017

+1 on this issue. We've been trying to determine for days why we aren't processing through our queues and reverting to v0.11.0 seems to have resolved the issue. We were showing a ton of Google Cloud Pub/Sub API errors in our cloud console.

cmoad on 17 Oct 2017

I'm also seeing this issue. I had actually changed my ack deadline in the Google Cloud UI to 300 seconds, but noticed my messages being redelivered every 10. The UI showed me the deadline was 300 seconds though.

Recreating my subscription fixed that problem, but is there a reason that edit option isn't disabled if it isn't supported?

ProfessorBeekums on 17 Oct 2017

@ProfessorBeekums Subscription.Receive gets the ack deadline from the service at the beginning, and then remembers it for the life of the call. You have to call Receive again to get the change.

jba on 17 Oct 2017

Any updates on this issue? We are facing the same issue and this is a real pain.

tecbot on 27 Oct 2017

It's an active area of work internally. There should be fixes rolled out in the next couple of weeks.

jba on 27 Oct 2017

I'm getting the same problem with this version: https://github.com/GoogleCloudPlatform/google-cloud-go/commit/8c4ed1f54434ff9ea67929c91a4a10db57a52780

My theory of this problem is the following:

This line:
https://github.com/GoogleCloudPlatform/google-cloud-go/blob/8dff92c85f4225d90bdd100ea741d9903acc259e/pubsub/iterator.go#L58

It always uses the 5 seconds before the deadline. Let's say we have 10 seconds ack deadline on the subscription.

OK case:

| Time | Client | Server |
| ------- | ------------- | ------------- |
| 0 | Pull a message | Send a message |
| 5 | Ask for extension | Quickly accept the extension |

Bad case (assuming it retries on failure):

| Time | Client | Server |
| ------- | ------------- | ------------- |
| 0 | Pull a message | Send a message |
| 5 | Ask for extension | Somehow Erroring, hanging almost 5 seconds |
| 10 | Retrying modifyAckDeadline | Got it, but it's too late |

Originally I started to experience this problem with this issue:
https://github.com/GoogleCloudPlatform/google-cloud-go/issues/382

As I recall, when I used the legacy client lib, I sent modifyAckDeadline manually, with a grace period of 15 seconds and an AckDeadline of 60 seconds. At that time, it worked much better.

That makes me think of something like:

keepAlivePeriod := Max(po.ackDeadline - 5*time.Second, po.ackDeadline * 1/4)

for an easy mitigation. WDYT?
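
The expression above is pseudocode, so here is a minimal runnable sketch of it for comparison. The helper names `keepAliveMax` and `keepAliveMin` are hypothetical and not part of the client. `keepAliveMax` follows the expression as written; with `Max`, the result equals `ackDeadline - 5s` for any deadline of roughly 6.7s or more. `keepAliveMin` is an assumed variant shown only to illustrate what firing earlier would look like; the thread does not say which was intended.

```go
package main

import (
	"fmt"
	"time"
)

// keepAliveMax implements the expression exactly as written in the comment:
// the larger of (ackDeadline - 5s) and (ackDeadline / 4).
func keepAliveMax(ackDeadline time.Duration) time.Duration {
	a := ackDeadline - 5*time.Second
	b := ackDeadline / 4
	if a > b {
		return a
	}
	return b
}

// keepAliveMin is a hypothetical variant that takes the smaller of the two
// values, so the extension request is sent earlier for long deadlines.
func keepAliveMin(ackDeadline time.Duration) time.Duration {
	a := ackDeadline - 5*time.Second
	b := ackDeadline / 4
	if a < b {
		return a
	}
	return b
}

func main() {
	for _, d := range []time.Duration{10 * time.Second, 60 * time.Second} {
		fmt.Printf("ackDeadline=%v max=%v min=%v\n", d, keepAliveMax(d), keepAliveMin(d))
	}
}
```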

tmatsuo on 8 Nov 2017

@tmatsuo From our analysis, things look slightly different from what you just mentioned.

In your table, the server sends and the client receives at about the same time. Our analysis shows that this isn't really the case. There could be significant buffering between client and server. So things might look kind of like this instead:

Your solution might help to a degree. However, it's always possible for enough messages to back up that changing keep alive period won't help.

We're fixing this partly server-side so things look like this:

The change [1] is already made to the client here.

Last week, we estimated that the server-side fix should be rolling out this week. I'll keep this thread updated.

pongad on 8 Nov 2017

@pongad Alright, it sounds promising :) Thanks!

tmatsuo on 8 Nov 2017

@pongad

FWIW, although your scenario looks promising, I think my suggestion or a similar approach will also be needed, because I often observe duplicated messages a certain time after the pull request was made.

tmatsuo on 8 Nov 2017

duplicated messages a certain time after the pull request was made

I'm not 100% sure I parse this right. Are you saying the duplicates arrive a certain time after you ack the message?

pongad on 8 Nov 2017

@pongad

No. I mean that the dups seem to be happening for me not only at the first iteration of modack, but also on one of the subsequent iterations of modack. I'm not 100% confident though.

tmatsuo on 8 Nov 2017

@pongad
I think some of the modack failures in our case might have been because of network congestion (and short grace period of course), sorry for the noise.

tmatsuo on 15 Nov 2017

The server-side fix has landed. Could you try again and let us know if you're seeing fewer dups?

There is an untagged change on master that takes full advantage of the server feature, but we believe the fix should also work on the latest tagged version.

pongad on 22 Nov 2017

Still having the same number of duplicates here (europe-west-1)...

jfbus on 22 Nov 2017

What commit are you running the Go client at?

jba on 22 Nov 2017

Still using v0.12.0

jfbus on 22 Nov 2017

Thank you for letting us know @jfbus . I have reported this to the pubsub team.

pongad on 23 Nov 2017

@pongad
Is it expected to be fixed in v0.16.0?
I have tried the v0.16.0 Go client but still see duplicated messages.
What I am doing is: sleep 10 seconds, then ACK.
The "Acknowledgment Deadline" was set to 300 seconds.

dengliu on 29 Nov 2017

@dengliu I believe the fix hasn't been tagged yet. Would it be possible for you to try from master?

pongad on 6 Dec 2017

It looks to me like the change was added in v0.16.0 with https://github.com/GoogleCloudPlatform/google-cloud-go/commit/c9fc9dd1cdea3f1cfd34d266f624232ddf462083.

I am still having this issue on v0.17.0. In my case I suspect that my upload speed is getting saturated. I think some clarification on connection delays would be beneficial on this article: https://cloud.google.com/pubsub/docs/faq#duplicates

roger6106 on 15 Dec 2017

I tested the Go client with v0.16.0 this week, but it didn't seem to be fixed.

dengliu on 15 Dec 2017

@roger6106 @dengliu The pubsub team just rolled out more fixes that should further reduce the duplicates. The fix landed only a couple of days ago. Could you let us know if you're still seeing this problem?

If you are, could you let us know how long you're taking to process each message? It'd help us reproduce the problem.

pongad on 18 Dec 2017

@pongad what did you roll out? Did anything change on the server side?
We have been seeing the pull rate sporadically become much lower than normal since 3 days ago.

dengliu on 19 Dec 2017

@dengliu I checked with the pubsub team; this is indeed a problem on the server. They are working to resolve this.

pongad on 20 Dec 2017

We're also seeing strange pull/push ratios: ~3k published to a topic, with a single sub (single & several consumers) yielding ~200 msgs/sec consumed.

marklr on 4 Jan 2018

@marklr I'm not completely sure I follow. How quickly do you process the messages? Receive limits the number of concurrently processing messages to make sure your computer doesn't run out of resources. So if you take a while to process messages, it will pull slowly.

pongad on 5 Jan 2018

Hey @pongad,

I've experimented with multiple consumers, both within GCP's network and outside it, with multiple variations of the consumer code (Python and Go).

I tried truncating the handler to simply ack the message and still end up with a consumption rate of 10% of the publishing rate.

I also experimented with the Cloud Dataflow pubsub to bigquery template and its consumption rate was similar.

I have a single sub, with a 10m deadline and consumers with a minimal handler - checking the publishTime field vs now yields 0 so there's no delivery delay... And yet I'm seeing less than 100 messages per second vs 3k being published.

marklr on 5 Jan 2018

@marklr Can you see if the suggestions in this thread help you? If you have further questions, let's discuss on the linked thread instead. Your concern seems to be around performance, not message duplication.

pongad on 5 Jan 2018

Exact same issue as marklr. There is something seriously messed up with the current clients. The Go client is at least reliable compared to the Python client, but right now (with 10 pubsub consumer replicas) in GKE we get spikes to 700ish pull operations per second (but looking at the graph it has stabilized to 70/s), spike of 70 acks/s (stabilized to 1/s) and around 10,000 modify ack deadline operations per second (constant, no correlation to consumption rate). We have the same behavior with an ack deadline set to 10 seconds, and to 100 seconds.

What the hell. This has been an issue since September. Why hasn't it been fixed.

anorth2 on 25 Jan 2018

@anorth2, the issue @marklr is having has to do with low message processing rates in Subscription.Receive. Is that what you're experiencing? If so, what are your ReceiveSettings?

jba on 25 Jan 2018

@jba sub.ReceiveSettings.MaxOutstandingMessages = 25

anorth2 on 25 Jan 2018

As mentioned in #824, you can turn on the firehose by setting both MaxOutstanding fields to -1.

I'm curious about the stats you quoted. How did you measure the number of modify ack deadline operations, for example? When you use Receive, there shouldn't be any ModifyAckDeadline RPCs. Nor should there be Acknowledge or Pull RPCs. Those are all replaced by the StreamingPull call.

What version of the Go client are you using?

jba on 25 Jan 2018

@jba apologies, they are all streamingpull operations. Measured via stackdriver. Why would setting MaxOutstanding to unlimited increase the ack rate when the message pull rate is already close to 1000/s?

anorth2 on 25 Jan 2018

@jba, if you go to "Monitoring" in cloud console then resources -> pub/sub, you can pull up metrics about your subscription. It separates acks from the rest.

@anorth2 Are there any details you can provide? How long do you need to process a message? Does the issue occur right at start up or after 10 hours? etc. We have a load test we periodically run and we are able to consistently pull and ack ~300MB worth of messages every second. Perhaps it's a bug that doesn't act up in our test cases.

I (or Jean) can investigate this.

pongad on 26 Jan 2018

@pongad We changed our consumer to ack immediately (to remove any possibility of time out). It behaves this way immediately. Notably this behavior only exists in kubernetes. Running locally (inside a docker container) it executes perfectly.

anorth2 on 26 Jan 2018

I don't know if it's relevant, but our consumers (affected by the duplicates issue) also run on kubernetes (GKE).

jfbus on 26 Jan 2018

@marklr, @anorth2, @jfbus and anyone else experiencing problems with Receive: we've been working on understanding the problem better.

One important fact I noticed when looking carefully at the Stackdriver Monitoring UI: the "StreamingPull message operations" and "StreamingPull Acknowledge message operations" are delta metrics, and the graphs are created with a "mean" aligner by default, so each data point represents the average number of messages since the last data point. But the graph is a line graph, which (to me, at least) suggests a rate. For example, I just published about 6000 messages over the course of a minute, or 60 msgs/sec. They were received by a simple program that used the default MaxOutstandingMessages of 1000 and acked each message after 20 seconds, for a maximum ack rate of 50 msgs/sec. These are the graphs I got:

[Graphs: StreamingPull message operations; StreamingPull Acknowledge message operations]
It's not clear where the numbers on the y-axis come from, and it looks like there's a gap of about 1,000 messages between pulls and acks, but that is misleading: the acks happen over a longer period of time. You have to integrate under the graph to compare. I find looking at rates to be much clearer. When I switched the aligner to "rate" (three-dot menu, Edit, Show more options, Aligner), I got these graphs:

[Graphs: StreamingPull message operations; StreamingPull Acknowledge message operations]
These clearly show that I was publishing at 60 msgs/sec and acking at a peak 50 msgs/sec, exactly as expected.

Apologies if this is obvious to you all, but it wasn't to me.

None of that implies that there isn't a problem with PubSub on GKE. To try to collect more data, I added OpenCensus instrumentation to the client, with several measures related to streaming pull. The measures include stream opens and RPC retries as well as pull, ack, nack and modack counts. The code is in the latest commit of this repo. The program I linked to above exports the measures to StackDriver, and also logs them. When adding a graph to the UI, search for metrics containing "cloud.google.com/go/pubsub". Avoid the ones that end in "/cumulative"; they are obsolete.

I ran that program as a standalone pod on GKE, using kubectl logs -f to watch the logs directly. I didn't see anything that looked unusual. Every message eventually was acked. The number of modacks was five times the number of messages, but that is expected: we modack immediately when we get a message and then every 5s (when the ack deadline is 10s), so 5 modacks for 20s of processing is about right. (The number would be much lower if the ack deadline were longer). I did see Unavailable errors cause the stream to reopen every 90s or so, but crucially, that was only during idle periods: while messages were actively being pulled, the stream never failed.

So in short, I'm unable to reproduce the problem, and in fact I'm still not really clear on what the symptoms are. I hope the new metrics will shed some light.

jba on 6 Feb 2018

We haven't heard any updates from those affected by this, and we currently have no way to reproduce. Closing until we have actionable information.

jba on 5 Mar 2018

@jba We updated the vendored pubsub package to the latest version in one service which was affected by this bug in the past, to test it again. The issue still exists. The unacked message counter increases linearly and we always get the same messages over and over again, even though we ack them immediately on the client side. How can we help you debug this problem?

tecbot on 7 Mar 2018

@tecbot:

Are you also running on GKE? If so, does the same problem occur if you run your docker container directly on GCE?

Please share as much of your Receive code as you're comfortable with. Especially useful are the ReceiveSettings you're using.
Could you enable the client instrumentation we recently added? Details are in my comment above.

jba on 7 Mar 2018

I'll unassign myself from this, but I'll keep an eye on it.

pongad on 8 Mar 2018

@jba:

Are you also running on GKE? If so, does the same problem occur if you run your docker container directly on GCE?

Yep, running on GKE but we have not tested it on GCE yet, but we had this service in the past in our own datacenter with the same error, so it shouldn't depend on it.

Please share as much of your Receive code as you're comfortable with. Especially useful are the ReceiveSettings you're using.

Simplified, our Receive looks like this (the removed code is protobuf parsing, and it uses a different fn to execute). One possibly important point is that we delay the execution for 20s; the real fn can also take 2min to complete (depending on the data).

const delay = 20 * time.Second

func Do(ctx context.Context, sub *pubsub.Subscription, fn func([]byte) error) error {
    err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
        // delay the execution
        select {
        case <-ctx.Done():
            m.Nack()
            return
        case <-time.NewTimer(delay).C:
        }
        if err := fn(m.Data); err != nil {
            m.Nack()
            return
        }
        m.Ack()
    })
    if err != context.Canceled {
        return err
    }
    return nil
}

Creating the subscription, we set MaxOutstandingMessages to 30 to limit throughput:

    ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()

    client, err := pubsub.NewClient(ctx, projectID)
    if err != nil {
        ...
    }
    sub := client.Subscription(subID)
    ok, err := sub.Exists(ctx)
    if err != nil {
        ...
    }
    if !ok {
        sub, err = client.CreateSubscription(ctx, subID, pubsub.SubscriptionConfig{
            Topic:       client.Topic(topicID),
            AckDeadline: 60 * time.Second,
        })
        if err != nil {
            ...
        }
    }
    sub.ReceiveSettings.MaxOutstandingMessages = 30
    return sub

Could you enable the client instrumentation we recently added? Details are in my comment above.

We added the instrumentation; here are multiple graphs. What we can see so far is that there are intervals without any Acks, followed by a burst of Acks. That doesn't look right to me.

Graph for 12 hours:
[Graph: pubsub_12hours]

Graph for 3 hours:
[Graph: pubsub_3hours]

Graph for a time frame without peaks:
[Graph: pubsub_extract]

Here is a graph from Stackdriver:
[Screenshot, 2018-03-08 10:40:26]

tecbot on 8 Mar 2018

@jba we tested a different service without any delay and there is the same issue. There is always only a short period of time where messages come in, and then there is a ~20 min break without any messages. This repeats every time. It may be worth mentioning that both topics have a low throughput on the publisher side, ~1 new msg/s. Do you bundle/buffer msgs on the server side before publishing them to stream subscriptions?

tecbot on 8 Mar 2018

@jba any updates? We tested more scenarios and for us it's clear that it depends on the publish rate of the topic. We always have problems with subscriptions if the publish rate is low. We can see that subscribers don't receive messages at all, or only with a really big delay of ~10-20 minutes, while at the same time the Stackdriver unacked message count increases. Subscriptions with a publish rate of ~1000/s have no problems.

Edit: We have now added a "ping" message that is published to the topic every 100ms, which "resolves" the stuck subscribers. If we stop the pinger, the subscribers get stuck instantly.

tecbot on 13 Mar 2018

Thanks for all the info. We don't have a theory yet about what you're seeing. The server doesn't buffer, and it doesn't rebundle—if the publisher sends 3 messages in one Publish RPC, then the subscriber will receive those three messages in one chunk.

The fact that your "ping" messages unstick the subscribers is very interesting; we haven't observed that before.

I'm trying to look at the black graphs but am having trouble interpreting the y-axis. The raw counters always increment, so I don't think you could be using those directly since your graphs go down. Are you showing a delta? For example, in the third graph (between peaks), the average value for Ack is about 88. Is that 88 acks over the last 10s sampling interval? (OpenCensus sends sample data every 10s, unless you call SetReportingPeriod.) Is it 88 acks per second? Neither makes sense with the numbers you provided: if you're handling 30 messages every 2 minutes or so, your ack rate should be .25/s.

jba on 14 Mar 2018

@tecbot Something is funky, or I'm misunderstanding.

For the top graphs: I think your graphs are saying that acks are happening steadily at around 90 per minute (or per second? I think it's saying per minute... but not sure), which seems good.

For the bottom graphs: Could you also provide a graph showing num_undelivered_messages, as well as _instrumentation around the time it takes to ack messages_? The two reasons for the spikiness I could think of: there actually aren't any undelivered messages, or the processing time is ~20-40m. I think it is really important to have instrumentation around process time for us to understand what's going on with your app.

Furthermore, I've taken your code and replicated it (see program here). I'm not sure how you're sending, so I used the gcloud console to send a message each second (actually, initially 2s, then 1s - you'll see a slight uptick early on). Here are the charts:

[Screenshot of the charts, 2018-03-14 10:27:02 AM]

So unfortunately I'm unable to repro. Could you describe the way you're publishing? I'm interested in the following:

Are you using a program to publish? If so which language? If not which tool?
Could you describe any publish settings you've configured?
Could you provide a repro, if publishing via code?

edit: If you need any help instrumenting your app to show time to process the message, please let me know - happy to provide code snippet!
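
For readers who want that kind of instrumentation, here is a minimal sketch (not the snippet offered above). It assumes the processing function has the same `func([]byte) error` shape as `fn` in the Receive code earlier in the thread; the names `processTimes`, `timed`, and `slowest` are hypothetical. It only records wall-clock time per call and does not export anything to Stackdriver.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// processTimes collects per-message processing durations.
// It is safe for concurrent use by multiple handler goroutines.
type processTimes struct {
	mu        sync.Mutex
	durations []time.Duration
}

// record stores one observed processing duration.
func (p *processTimes) record(d time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.durations = append(p.durations, d)
}

// count returns how many durations have been recorded.
func (p *processTimes) count() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.durations)
}

// slowest returns the longest recorded duration, or 0 if none were recorded.
func (p *processTimes) slowest() time.Duration {
	p.mu.Lock()
	defer p.mu.Unlock()
	var longest time.Duration
	for _, d := range p.durations {
		if d > longest {
			longest = d
		}
	}
	return longest
}

// timed wraps fn so that the time from the start of processing to its
// return is recorded, whether or not fn succeeds.
func timed(p *processTimes, fn func([]byte) error) func([]byte) error {
	return func(data []byte) error {
		start := time.Now()
		err := fn(data)
		p.record(time.Since(start))
		return err
	}
}

func main() {
	var p processTimes
	handler := timed(&p, func(data []byte) error {
		time.Sleep(20 * time.Millisecond) // stand-in for real work
		return nil
	})
	for i := 0; i < 3; i++ {
		_ = handler([]byte("payload"))
	}
	fmt.Printf("handled %d messages, slowest took %v\n", p.count(), p.slowest())
}
```

In a real subscriber the wrapped function would be called from inside the Receive callback, so the recorded time covers everything between the callback starting and the ack or nack decision.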

jadekler on 14 Mar 2018

FWIW we also encounter a similar experience using PubSub on GKE.

We reproduce this behavior when we suddenly stop consuming a subscription and then start acknowledging messages again.
When we notice such behavior, the number of undelivered messages is in the tens of thousands at least.

When we restart consuming the subscription again by acking the messages, we see a surge in duplicated messages until the number of unack'ed messages drops to 0.
We also notice that the rate of duplicate messages decreases as the number of undelivered messages drops.
Once the messages no longer accumulate in the queue we see almost no duplicate messages.

sfriquet on 7 Apr 2018

@sfriquet:

stop consuming a subscription and then start acknowledging messages again

So you exit from the call to Receive, then wait a while, then call Receive again?

When we restart consuming the subscription again by acking the messages, we see a surge in duplicated messages...

The timer for a message's ack deadline begins when the message arrives on the client, or after 10 minutes, whichever comes first. Messages arrive on the client in bulk, and may be buffered (by proxies or the client's network layer) before they arrive. If you ack slowly, messages may time out, and then you will see duplicates.

For example, say there are 10,000 undelivered messages when you call Receive, and you process them one at a time (one subscriber process, NumGoroutines=1, MaxOutstandingMessages=1). You ack each message after 1 second. After 10 minutes you have seen 600 messages, but there may be many more that have left the server and are in a buffer somewhere. Ten seconds later (assuming the default ack deadline of 10s), these messages will time out. You will see and ack them as you continue processing, but the server will have already marked them as expired and will redeliver them.
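
The arithmetic in this example can be written out as a small sketch. The numbers come from the example above; the function name `seenBefore` is hypothetical, and the "at risk" figure is only an upper bound, since the thread does not say how many messages are actually buffered.

```go
package main

import (
	"fmt"
	"time"
)

// seenBefore returns how many messages a single sequential worker has
// handled after `elapsed`, given a fixed time per message and a backlog size.
func seenBefore(elapsed, perMessage time.Duration, backlog int) int {
	n := int(elapsed / perMessage)
	if n > backlog {
		return backlog
	}
	return n
}

func main() {
	backlog := 10000
	perMessage := time.Second
	timerCap := 10 * time.Minute // point at which the ack timer starts regardless

	seen := seenBefore(timerCap, perMessage, backlog)
	fmt.Printf("seen after %v: %d\n", timerCap, seen)
	// Upper bound only: messages still unseen may or may not be sitting
	// in a buffer with their ack deadline about to start running.
	fmt.Printf("unseen, at risk if already buffered: up to %d\n", backlog-seen)
}
```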

The bottom line is that it can be hard to distinguish this behavior—which is "normal," though undesirable—from a bug in the PubSub client or service.

It would help if you could give us more data.

jba on 10 Apr 2018

@jba

So you exit from the call to Receive, then wait a while, then call Receive again?

Yes and no. We notice duplicates in 2 circumstances.

1. Indeed, stopping the client then restarting it later.
2. When we have to Nack messages, e.g. because of a validation issue on our end, we suddenly start Nack'ing messages to the point that the queue isn't being consumed at all.

Note that we see duplicates when we start consuming the queue again. What I mean is we only see duplicates when the queue has accumulated and we're trying to dequeue it.

[Graph: pubsub_dup_rate]

Also, slightly unrelated I guess, but it looks like Nack'ed messages are put back at the front of the queue, so that they are immediately fetched again by the clients. We saw a few times our pipeline completely stalling because of a relatively small number of "invalid" messages that kept being retried.

For example [...]

Thanks for the example, if I understand this right:

When the client pulls messages, an undefined? amount of messages are actually pulled from the server (more than MaxOutstandingMessages, I presume).
After 10 minutes + the subscription deadline, if they haven't been seen by the client, they will time out and will eventually be sent again, even though they would eventually have been processed by the client, which can lead to duplicates.

So that'd mean that if a queue has accumulated too many undelivered messages, such that they can't be processed within (10 minutes + subscription deadline), then duplicates are to be expected. Is that right?

That'd match what we see then. What could we possibly do to mitigate this?

It would help if you could give us more data.

What would help?

In terms of settings, we use:

Subscription has a 600s deadline
Receive settings have NumGoroutines=1, MaxOutstandingMessages=8.
Message processing time: p99 is < 10s, p95 is < 1s

Thanks a lot for the clarification.
In the meantime, we added a cache layer to filter out messages based on their ID.

sfriquet on 10 Apr 2018

When the client pulls messages, an undefined? amount of messages are actually pulled from the server

The number of messages pulled will usually be the same as the number published together. For very high publish rates, the server may additionally batch separate messages that arrive close in time.

We saw a few times our pipeline completely stalling because of a relatively small number of "invalid" messages that kept being retried.

Nacking a message will indeed cause it to be redelivered, perhaps promptly. One solution to this is to have a separate topic for invalid messages. If the messages are permanently invalid, then that topic should probably be subscribed to by something that alerts humans. If they are temporarily invalid, then the subscriber should ack the message, sleep on it a while, then republish it to the original topic. (There is no feature that publishes a message after a delay, or at a particular time in the future).
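
One way to picture that routing is the sketch below. It only models the decision, not the Pub/Sub calls themselves: the `verdict` and `plan` types, the `planFor` function, and the topic names are all hypothetical. In a real subscriber, the plan would presumably be carried out with the message's `Ack` method and a publish to the chosen topic.

```go
package main

import "fmt"

// verdict is a hypothetical result of validating a received message.
type verdict int

const (
	valid verdict = iota
	temporarilyInvalid
	permanentlyInvalid
)

// plan describes what the subscriber should do with a message.
type plan struct {
	ack         bool   // ack rather than nack, so the message is not redelivered promptly
	republishTo string // topic to republish to; empty means do not republish
	delayFirst  bool   // sleep for a while before republishing
}

// planFor maps a validation verdict to a plan, following the advice above.
func planFor(v verdict, originalTopic, invalidTopic string) plan {
	switch v {
	case temporarilyInvalid:
		// Ack, wait, then send the message back to where it came from.
		return plan{ack: true, republishTo: originalTopic, delayFirst: true}
	case permanentlyInvalid:
		// Ack and hand the message to a topic that alerts humans.
		return plan{ack: true, republishTo: invalidTopic}
	default:
		// Valid messages are processed and acked as usual.
		return plan{ack: true}
	}
}

func main() {
	for _, v := range []verdict{valid, temporarilyInvalid, permanentlyInvalid} {
		fmt.Printf("%+v\n", planFor(v, "orders", "orders-invalid"))
	}
}
```

The point of the design is that no branch ever nacks, so a handful of bad messages cannot monopolize the subscription.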

So that'd mean that if a queue has accumulated too many undelivered messages, such that they can't be processed within (10 minutes + subscription deadline), then duplicates are to be expected. Is that right?

Almost. They don't have to be _processed_ by your code in that time, but they do have to make it onto the client. That may be a minor distinction or a large one, depending on a number of factors.

You should be able to mitigate the problem by

increasing the ack deadline
increasing throughput (adding processes, or increasing MaxOutstandingMessages).

Discarding the dups, as you're doing, is also fine. When you say you added a "cache layer," do you mean a separate process/server? Because you may be able to get away with having each client process get rid of just the duplicates it sees. That won't be perfect, but it may be enough, since your system has to tolerate duplicates anyway.
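
A per-process filter of that kind could look roughly like the sketch below. The `deduper` type is a hypothetical helper, not part of the pubsub package. It remembers a bounded number of recent message IDs and evicts the oldest one when full, so it only catches duplicates that arrive at the same process while the original ID is still remembered.

```go
package main

import (
	"fmt"
	"sync"
)

// deduper remembers the most recent message IDs seen by this process.
type deduper struct {
	mu    sync.Mutex
	seen  map[string]struct{}
	order []string // IDs in insertion order, used as a ring buffer once full
	next  int      // index of the oldest entry once the buffer is full
}

// newDeduper returns a deduper that remembers up to capacity IDs.
func newDeduper(capacity int) *deduper {
	if capacity < 1 {
		capacity = 1
	}
	return &deduper{
		seen:  make(map[string]struct{}, capacity),
		order: make([]string, 0, capacity),
	}
}

// Seen records id and reports whether it was already remembered.
// It is safe to call from concurrent Receive callbacks.
func (d *deduper) Seen(id string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[id]; ok {
		return true
	}
	if len(d.order) < cap(d.order) {
		d.order = append(d.order, id)
	} else {
		// Evict the oldest ID to keep memory bounded.
		delete(d.seen, d.order[d.next])
		d.order[d.next] = id
		d.next = (d.next + 1) % len(d.order)
	}
	d.seen[id] = struct{}{}
	return false
}

func main() {
	d := newDeduper(2)
	for _, id := range []string{"a", "b", "a", "c", "a"} {
		fmt.Println(id, d.Seen(id))
	}
}
```

In a Receive callback, the idea would be to check the message's ID first and, if it has been seen, ack the message and return without processing it again.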

jba on 10 Apr 2018

Thanks for the advice.

When you say you added a "cache layer," do you mean a separate process/server? Because you may be able to get away with having each client process get rid of just the duplicates it sees.

Reading your explanations again, it does seem that the duplicates would be on a per-client basis. In that case, 'caching' at the client level should work too and be much simpler to implement and maintain.

sfriquet on 11 Apr 2018

duplicates would be on a per client basis

I don't think that's right. The service will load balance messages (including re-sent ones) across all streams in a subscription. However, if you only have a handful of streams, a significant fraction will end up on the same one. For instance, if you have two streams (processes), then each will see half the messages, and get half the dups. So a per-client solution will weed out a quarter of the dups. I guess that's not a great ratio, now that I do the math. In any case, my point was that a per-client solution is much simpler architecturally, and maybe it gives you enough de-duping to make your system perform adequately.

jba on 12 Apr 2018

It seems like this issue is haunting us as we have it in multiple projects now.
After following this whole discussion and spending time studying the internals of this library, we came to the conclusion that the current library is designed for pipelines with a fairly stable rate of incoming messages and a high processing throughput.

In our case, we receive spikes of messages every 5 minutes, the processing time of one message varies and can sometimes take up to a couple of seconds, and there is no expected correlation between the number of incoming messages and the speed at which we want to process them.

If our understanding is correct, the streaming pull strategy used in this library can eventually fetch more messages than the MaxOutstandingMessages, which from a developer experience point of view is a bit hard to understand. I do understand now that this allows for a very high throughput in some scenarios. However it also introduces all issues discussed in this thread.

On our side, we tried the non-streaming pull approach, and so far it seems to address the problems. However, our solution required us to re-implement parts of this pubsub client in order to re-create some of the features we needed.

Is there any chance you could introduce a parameter letting the user choose between the experimental streaming pull and the Pull API endpoint? It seems like the latter respects MaxOutstandingMessages and would work fine in our use case.

Otherwise, if you plan to eventually deprecate the Pull endpoint in favour of StreamingPull, is there any chance we could implement an option forcing the client to respect the max outstanding messages? Even a hack to begin with could help us solve our issue, for example having the client immediately Nack all messages received beyond MaxOutstandingMessages.

I hope this all makes sense. We feel like our current implementation re-invents the wheel, and given that you mentioned earlier that you were working on this case, I wanted to share our experiments and expectations. I hope this is somewhat useful.

rayrutjes on 22 May 2018

@rayrutjes, we'll pass your comments along to the PubSub team.

jba on 23 May 2018

@cristiangraz I'm curious, why is this issue closed?

sfriquet on 13 Jul 2018

@jba Can we reopen this since people are still experiencing problems? I don't want this issue to lose visibility.

johnreutersward on 25 Jul 2018

@sfriquet @johnreutersward The initial issue I was having regarding excessive duplicate messages has been fixed. There were lots of other unrelated comments on this issue that went quiet, but it looks like there are some additional cases (like this one https://github.com/GoogleCloudPlatform/google-cloud-go/issues/778#issuecomment-380049492) related to duplicate messages that I missed. Apologies, reopening.

@jba Will leave this up to you or the team to close this whenever it's ready to be closed.

cristiangraz on 25 Jul 2018

As of afb80096eae340697e1153d7af9a5a418ba75067, we support synchronous mode for Receive. If you want more control over throughput, please try it. (It ensures that you will never pull more than MaxOutstandingMessages from the Pub/Sub service at a time.)
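
For anyone trying it, enabling synchronous mode might look roughly like the following. Treat this as a sketch rather than an official example: the project ID and subscription name are placeholders, the value 8 is just the MaxOutstandingMessages figure mentioned earlier in the thread, and it assumes a client version that includes the `Synchronous` field on `ReceiveSettings`.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// "my-project" and "my-subscription" are placeholder names.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription")

	// Synchronous mode: never pull more than MaxOutstandingMessages
	// from the service at a time.
	sub.ReceiveSettings.Synchronous = true
	sub.ReceiveSettings.MaxOutstandingMessages = 8

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Process the message here, then ack it.
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

This trades throughput for tighter control over how many messages are in flight.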

I'm going to close this now. Reopen if you are still experiencing many duplicates and synchronous mode isn't helping.

jba on 2 Oct 2018
