Google-cloud-ruby: Pubsub stops pulling messages

Created on 18 Dec 2020  路  43Comments  路  Source: googleapis/google-cloud-ruby

Thanks for stopping by to let us know something could be better!

Environment details

  • OS: Heroku app
  • Ruby version: 2.6.6
  • Gem name and version: google-cloud-pubsub (1.10.0)

Problem

We are using Google Pubsub as a backend for our background jobs processing. For some months we've been having some issues with the system in which it suddenly stops processing jobs. We suspect that this happens when there's an elevated error rate in our system.

The last time it happened I added some logs to see if I could find something. We are logging for each subscriber's stream pool these things:

I will share the information for one topic, but mostly the whole system was behaving the same way. For one topic we have this configuration:

{ streams: 3, inventory: 1000, callback_threads: 6, push_threads: 3 }

Here are the logs that I previously mentioned:

[{"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>4, "stream_status"=>"running"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>293, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>331, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>329, "stream_status"=>"running"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>267, "stream_status"=>"running"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>277, "stream_status"=>"running"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>286, "stream_status"=>"running"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>299, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>300, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>256, "stream_status"=>"running"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>298, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>289, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>327, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>329, "stream_status"=>"running"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>278, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>304, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}},
 {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>282, "stream_status"=>"paused"},
  "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>317, "stream_status"=>"paused"},
  "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>318, "stream_status"=>"paused"}}]

As you can see the last logs got stuck with these values as inventory: 282, 317, 318. And they were like that until we restarted the process.

We've been with this issue for lots of months and we can't figure out what's the problem. The only thing we know is that it usually happens when there's an elevated error rate in our jobs.

I saw some other issues related but none of them helped.

pubsub question

All 43 comments

@alanhala Thanks for reporting this issue, and for including the helpful log output.

Are you able to update to the latest version of google-cloud-pubsub, currently 2.3.0?

https://rubygems.org/gems/google-cloud-pubsub/versions

Do you see any similarity with #7068?

@quartzmo yes. We are working on the same system. So it's the same issue :/

@alanhala Thanks for reporting this issue, and for including the helpful log output.

Are you able to update to the latest version of google-cloud-pubsub, currently 2.3.0?

https://rubygems.org/gems/google-cloud-pubsub/versions

I can try updating. But it's hard for us to reproduce the issue. Did something change that might fix the problem?

Did something change that might fix the problem?

Not that I can see, but it's definitely best to get on the latest version as we try to identify the problem, since the fix will be applied on the trunk.

As you can see the last logs got stuck with these values as inventory: 282, 317, 318. And they were like that until we restarted the process.

When is the last date and time this occurred?

As you can see the last logs got stuck with these values as inventory: 282, 317, 318. And they were like that until we restarted the process.

When is the last date and time this occurred?

It was that Friday around 17:00 UTC

Have you been able to update to google-cloud-pubsub v2.3.0?

Have you been able to update to google-cloud-pubsub v2.3.0?

Yes, we released it today. But I don't know when we will be able to see the problem again :/

Yes, we released it today. But I don't know when we will be able to see the problem again :/

Great. I just want to make sure there are no other issues blocking your update, in case we have a good guess at the problem and want to push out a potential fix.

The only thing we know is that it usually happens when there's an elevated error rate in our jobs.

Did you mention what errors you're getting?

Did you mention what errors you're getting?

They were just application errors. Nothing related to gcp pubsub.

More information that might be useful. We have a logic in our background job system that detects jobs that are already in progress (duplicated messages). Around that day we can see a spike on the duplicated jobs.

Thanks. Are there any memory constraints or other limits in your Heroku environment that the inventory of messages might be hitting? (I don't know anything about the message payloads.)

No, the process uses the half of the memory more or less constantly. I checked that day and I couldn't find any weird things on the memory usage.

@quartzmo we've just experienced another outage in our system. Similar situation. Inventory gets stuck and the stream status is paused. Fixed after restarting the dyno.

I'm uploading the logs from this outage too:

{
  "2020-12-28T17:50:00.133676Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:51:00.135089Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:52:00.136075Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:53:00.136653Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:54:00.137387Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:55:00.138621Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:56:00.139280Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:57:00.139900Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:58:00.140471Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T17:59:00.143907Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>4, "stream_status"=>"running"}},
 "2020-12-28T18:00:00.162226Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>4, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>5, "stream_status"=>"running"}},
 "2020-12-28T18:01:00.419795Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>167, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>167, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>164, "stream_status"=>"running"}},
 "2020-12-28T18:02:01.324135Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>251, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>256, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>254, "stream_status"=>"running"}},
 "2020-12-28T18:03:00.498053Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>170, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>173, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>186, "stream_status"=>"running"}},
 "2020-12-28T18:04:00.552828Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>4, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>5, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:05:00.553394Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:06:00.554284Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>1, "stream_status"=>"running"}},
 "2020-12-28T18:07:00.554904Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:08:00.555404Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:09:00.555951Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:10:00.556550Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:11:00.557117Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:13:00.558265Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:14:00.558834Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:15:00.572242Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>0, "stream_status"=>"running"}},
 "2020-12-28T18:16:00.587310Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>213, "stream_status"=>"running"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>238, "stream_status"=>"running"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>236, "stream_status"=>"running"}},
 "2020-12-28T18:17:00.588111Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:18:00.588889Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:19:00.589642Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:20:00.590403Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:21:00.591136Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:22:00.593101Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:23:00.594592Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:24:00.595936Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:25:00.597653Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}},
 "2020-12-28T18:26:00.599246Z"=>
  {"stream_0"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>334, "stream_status"=>"paused"},
   "stream_1"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"},
   "stream_2"=>{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>337, "stream_status"=>"paused"}}
}

As you can see, there are some spikes on the inventory and it gets recovered. But then at 2020-12-28T18:16:00.587310Z it goes up again and it gets stuck there. Is there any other information that I can provide to you that would be helpful for debugging the issue?

@quartzmo any news? Do you need something from our side?

@alanhala Sorry for the delay. I'll try to look at this again as soon as I can.

@quartzmo we've just had another outage. Did you have some time to take a look at this?

I have been looking but so far haven't had any insights. You wrote that your configuration is:

{ streams: 3, inventory: 1000, callback_threads: 6, push_threads: 3 }

But callback_threads and push_threads are not parameters for Subscription#listen. Can you share your actual source code call(s) to this method and parameter values? Have you ever tried setting the inventory below 300? I was also wondering if you have experimented with other inventory configuration settings, such as max_outstanding_bytes. I know you wrote that memory is not a problem, but what is the typical size of message data?

Please share as much relevant source code as possible.

How long does the block passed to listen take to complete? For example, in my own attempts to reproduce your issue, I have introduced a sleep 20 in the block I pass to listen. By saturating the topic with messages, I can get the stream pool to print log statements similar to yours:

streams: [{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>425, "stream_status"=>"paused"}, {"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>409, "stream_status"=>"paused"}]

But they always resume working again:

streams: [{"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>456, "stream_status"=>"running"}, {"background_thread_status"=>"sleep", "callback_thread_pool_running"=>true, "inventory"=>464, "stream_status"=>"running"}]

I have been looking but so far haven't had any insights. You wrote that your configuration is:

{ streams: 3, inventory: 1000, callback_threads: 6, push_threads: 3 }

But callback_threads and push_threads are not parameters for Subscription#listen. Can you share your actual source code call(s) to this method and parameter values? Have you ever tried setting the inventory below 300? I was also wondering if you have experimented with other inventory configuration settings, such as max_outstanding_bytes. I know you wrote that memory is not a problem, but what is the typical size of message data?

callback_threads and push_threads are the values we use for the threads key.

subscriber_options = {
  streams: streams,
  inventory: inventory,
  threads: { callback: callback_threads, push: push_threads }
}
subscriber =
  subscription.listen(subscriber_options) do |message|
    process(message)
  end
subscriber.on_error do |error|
  send_error(error)
end

As for the inventory size or max_outstanding_bytes, we haven't tried with other values. There's another subscription that has an inventory of 200, but nothing less than that.

Here's a chart of the jobs duration for the last 30 days.

image

Surprisingly, the top one is not the topic that stops pulling messages :/. Also unfortunately, Appsignal doesn't track the duration of the failed transactions (in this case jobs).

@alanhala The bottom line (green) is a graph of successful transactions only then? So these jobs succeed very quickly? What do you know about the failures?

More code that might be relevant. As I mentioned before, we have some logic around checking whether a job is already being processed or not (detect duplicated messages). This is how we handle it:

begin
  job.perform_now
  @gcp_message.ack!
rescue JobAlreadyInProgress
  @gcp_message.modify_ack_deadline!(10.minutes.to_i)
end

So when we detect a duplicated message, we modify the ack deadline and let the other process to ack the message. We don't want to ack the message because the other process might fail processing the message and we don't want to lose messages. By doing this, we leave the responsibility of acking the message to the other process. Is this logic correct? I couldn't figure out how to replicate the duplication locally, but looking at our logs I found this:

[
  {
    "insertId": "6eba10f6-f152-424b-8403-b6c4bb841581",
    "jsonPayload": {
      "job_id": "c1fa2eac-f188-4f44-b42a-36840ba6b438",
      "exception": null,
    },
    "timestamp": "2020-11-05T18:40:30.732168Z",
  },
  {
    "insertId": "0ce3ccd3-8dda-430c-9363-7594fd601e3f",
    "jsonPayload": {
      "exception": null,
      "job_id": "c1fa2eac-f188-4f44-b42a-36840ba6b438",
    },
    "timestamp": "2020-11-05T18:30:34.397501Z",
  },
  {
    "insertId": "2b293280-ba51-4d34-8fda-13fc8f826667",
    "jsonPayload": {
      "exception": "ApplicationJob::Exclusion::JobAlreadyInProgress",
      "job_id": "c1fa2eac-f188-4f44-b42a-36840ba6b438"
    },
    "timestamp": "2020-11-05T18:30:29.412439Z",
  }
]

The second log acked the message and as you can see, after 10 minutes, there's another entry. When there's an outage, there's an increase in the JobAlreadyInProgress errors.

@alanhala The bottom line (green) is a graph of successful transactions only then? So these jobs succeed very quickly? What do you know about the failures?

Each line represents a subscription. So only the purple is the one that takes some time. But as mentioned, that subscription is not the one that gets stuck. As for the failures, I don't think they take a lot more time. I'll try to get more information about them.

When there's an outage, there's an increase in the JobAlreadyInProgress errors.

Why do you think this increase in duplicate jobs occurs?

@alanhala Where/when do you ack the message? Is there any chance that if the request to ack the message failed (I believe ack is done on a best-effort basis), this could result in endless extensions of the ack deadline?

@quartzmo sorry for the delay. I wasn't able to reply last week.

Why do you think this increase in duplicate jobs occurs?

I don't know. There must be something going on that, under some conditions, we start receiving more duplicated messages.

@alanhala Where/when do you ack the message? Is there any chance that if the request to ack the message failed (I believe ack is done on a best-effort basis), this could result in endless extensions of the ack deadline?

I'll share some charts from gcp to see if they help. They are from the last outage. At 12:38, we stopped processing jobs.

image
image
image
image
image
image

Still wondering: Where/when do you ack the message? Is there any chance that if the request to ack the message failed (I believe ack is done on a best-effort basis), this could result in endless extensions of the ack deadline?

If the job succeeds we call ack the message. We use ActiveJob and we have a retry policy for any StandardError. That handler rescues the exception and ends up publishing a new message with the attempts value updated. The original message is acked (because the exception is not bubbled up).
Any other unexpected error that wasn't caught by ActiveJob, it bubbles up and reaches a global rescue that ends up modifying the ack deadline of the received message.

Apart from checking if the job is already in progress, we also check if the job was already finished. If the job was already finished, then we ack the message. So in your scenario, if there's a failure in the ack then we would receive the same message after some time. And when that happens we would ack the message again. Is that what you meant?

Exactly, I was wondering if the original ack request was not processed by Cloud Pub/Sub (without returning an error), would you continue to try to ack the message in your logic for detecting duplicates, or would you indefinitely extend the ack deadline? Since I haven't seen the mechanism for detecting duplicates, I just don't know how it works.

No, we keep trying to ack the message because the job was marked as finished.

So, for a given message, which you determine the job has been finished, you repeatedly try to ack it, and yet pubsub keeps sending it?

if the scenario that you described happens, then yes, we would keep trying to ack it. But I don't think it's what going on here. Looking at the logs we can see that for some reason there's a spike in the inventory. We also see that the message callback is not called (if the job is already finished we would see a log entry for that and we don't have any).

So the scenario is:

  • Inventories get filled.
  • Message callback is not called
  • No more jobs are processed and no more messages are pulled from pubsub because the inventory is paused.

we can see that for some reason there's a spike in the inventory.

I know you're concerned about deadlock, but I haven't been able to reproduce it. What do you think about trying to solve this problem by addressing the cause of the spike?

You wrote earlier:

When there's an outage, there's an increase in the JobAlreadyInProgress errors.

I'm still trying to understand why this might be happening.

What do you think about trying to solve this problem by addressing the cause of the spike?

Sure! I just checked the last outage that we had.
Jobs performed chart
image

Duplicated jobs
image

As you can see there, we didn't have any duplicated messages before the outage. That's why I think it's not the cause.

BTW is there a way to reproduce the scenario in which I receive the same message twice locally? I tried creating the subscriber using two streams and one of them retaining a message, but I couldn't make it work.

is there a way to reproduce the scenario in which I receive the same message twice locally?

As described in At-Least-Once delivery:

Pub/Sub will repeatedly attempt to deliver any message that has not been acknowledged. While a message is outstanding to a subscriber, however, Pub/Sub tries not to deliver it to any other subscriber on the same subscription. The subscriber has a configurable, limited amount of time -- known as the ackDeadline -- to acknowledge the outstanding message. Once the deadline passes, the message is no longer considered outstanding, and Pub/Sub will attempt to redeliver the message.

Thus, to increase your chances of redelivery, I think you want to send just a few messages, shorten the ack deadline for those messages, and do not ack them.

What do you think about trying to solve this problem by addressing the cause of the spike?

Sure! I just checked the last outage that we had.
Jobs performed chart
image

Duplicated jobs
image

As you can see there, we didn't have any duplicated messages before the outage. That's why I think it's not the cause.

BTW is there a way to reproduce the scenario in which I receive the same message twice locally? I tried creating the subscriber using two streams and one of them retaining a message, but I couldn't make it work.

Just want to mention that we started to perform jobs again when we restarted the process

I think that in order to debug this issue further, we will need to access some logs and server-side data tied specifically to your project, topic, and subscription. Please open a support ticket from the Cloud console with these pieces of information that references this GitHub issue so we can continue the investigation. Thanks!

is there a way to reproduce the scenario in which I receive the same message twice locally?

As described in At-Least-Once delivery:

Pub/Sub will repeatedly attempt to deliver any message that has not been acknowledged. While a message is outstanding to a subscriber, however, Pub/Sub tries not to deliver it to any other subscriber on the same subscription. The subscriber has a configurable, limited amount of time -- known as the ackDeadline -- to acknowledge the outstanding message. Once the deadline passes, the message is no longer considered outstanding, and Pub/Sub will attempt to redeliver the message.

Thus, to increase your chances of redelivery, I think you want to send just a few messages, shorten the ack deadline for those messages, and do not ack them.

Ok, I was running some tests with this and I have one question. I have a subscription with 10 seconds as ack deadline. The subscription has a subscriber with two streams. One stream receives the message and it has 10 seconds to get that message out from the inventory and ack it. If it doesn't manage to do that, then gcp will redeliver the message to the same subscription, probably to the other stream. At that point I have the same message in two different inventories/streams.
If one inventory acks the message but the other one modifies the deadline, then what happens? It depends on the order in which both things happen?

If one inventory acks the message but the other one modifies the deadline, then what happens? It depends on the order in which both things happen?

It probably does not always depend on the order in which the client sends the AcknowledgeRequest and ModifyAckDeadlineRequest. In any case, you should anticipate the possibility of receiving the message again before the service removes the message from the subscription.

From the REST API docs for acknowledge:

Acknowledges the messages associated with the ackIds in the AcknowledgeRequest. The Pub/Sub system can remove the relevant messages from the subscription.

Acknowledging a message whose ack deadline has expired may succeed, but such a message may be redelivered later. Acknowledging a message more than once will not result in an error.

Closing due to lack of activity. Please re-open if any new information becomes available/.

Was this page helpful?
0 / 5 - 0 ratings