Google-cloud-python: Pub/Sub: publish message hangs waiting for previous publish to timeout

Created on 20 May 2019 · 2 Comments · Source: googleapis/google-cloud-python

Environment details

OS type and version: 4.9.125-linuxkit GNU/Linux
Python version and virtual environment information: Python 2.7.16
google-cloud-pubsub package version: 0.41.0

Steps to reproduce

  1. Publish a (first) message to PubSub that fails
  2. Time out on the future object returned by the call to publish(), _before_ the grpc publish call in the batch thread returns
  3. Publish a (second) message to the same topic. This call hangs until the previous call to grpc publish in the batch thread returns. As the default timeout is currently 10 minutes (!), this call can take up to 10 minutes to return.

Hanging for 10 minutes is surprising behaviour for an asynchronous API.

Code example

See this gist for code and instructions on how to reproduce this issue.

Wot I think is going on here

This is wild conjecture that I have no supporting evidence for.

That being said, I think the issue starts when the batch thread gets stuck in this call to grpc publish. At this point it is holding the lock _state_lock and will continue to hold it for 10 minutes, until the call to grpc publish times out.

When the client application calls publish() in the main thread for the second time, it will try to acquire the same lock _state_lock. As this lock is already being held by the batch thread, the main thread hangs and doesn't return from the call to publish().
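The conjecture above can be illustrated with a small, self-contained sketch. It does not use google-cloud-pubsub at all: ToyBatch, its methods, and measure_blocked_publish are hypothetical names invented for illustration, and time.sleep stands in for the blocking grpc publish call. Only the lock name _state_lock comes from the issue. The sketch assumes the batch thread holds the lock for the whole duration of the slow call, which is the behaviour being guessed at here.

```python
import threading
import time


class ToyBatch(object):
    """Hypothetical stand-in for the real batch; not the library's code."""

    def __init__(self, rpc_seconds):
        self._state_lock = threading.Lock()
        self._rpc_seconds = rpc_seconds
        self._messages = []
        self.commit_started = threading.Event()

    def commit(self):
        # Runs in the batch thread: the lock is held for the whole slow call.
        with self._state_lock:
            self.commit_started.set()
            time.sleep(self._rpc_seconds)  # stands in for the blocking grpc publish
            self._messages = []

    def publish(self, message):
        # Runs in the caller's thread: needs the same lock to enqueue a message.
        with self._state_lock:
            self._messages.append(message)


def measure_blocked_publish(rpc_seconds=0.5):
    """Return how long a second publish() waits while commit() holds the lock."""
    batch = ToyBatch(rpc_seconds)
    worker = threading.Thread(target=batch.commit)
    worker.start()
    batch.commit_started.wait()  # the batch thread now holds _state_lock
    start = time.time()
    batch.publish(b"second message")  # blocks until commit() releases the lock
    waited = time.time() - start
    worker.join()
    return waited


if __name__ == "__main__":
    print("publish() waited %.2f seconds" % measure_blocked_publish(0.5))
```

With a simulated call of half a second, the second publish() waits for roughly that long. Scale the sleep up to a 10 minute timeout and the hang described in the steps to reproduce follows.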

Labels: question, pubsub, triaged for GA

All 2 comments

@asnr Thank you for the effort and the detailed steps to reproduce the issue.

I can confirm that the issue is reproducible, either by using the linked Docker application, or by simply disabling the internet connection and running the test publisher script (without creating the topic and subscription, that is).

The cause of the long delay is that the lock in the underlying batch (an object that batches publish requests) is held for too long. It also turned out that the fix for it is essentially the same as #7686.
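One common way to keep a lock from being held for too long is to take it only while touching shared state and to make the slow call after releasing it. The sketch below shows that general pattern and is not a copy of the actual fix or of #7686. ToyBatchShortLock, its methods, and measure_short_lock are hypothetical names, and time.sleep stands in for the blocking publish request.

```python
import threading
import time


class ToyBatchShortLock(object):
    """Hypothetical batch that releases its lock before the slow call."""

    def __init__(self, rpc_seconds):
        self._state_lock = threading.Lock()
        self._rpc_seconds = rpc_seconds
        self._messages = []
        self.rpc_started = threading.Event()

    def commit(self):
        # Hold the lock only long enough to take ownership of pending messages.
        with self._state_lock:
            pending, self._messages = self._messages, []
        # The slow call runs without the lock, so publishers are not blocked.
        self.rpc_started.set()
        time.sleep(self._rpc_seconds)  # stands in for sending `pending`

    def publish(self, message):
        with self._state_lock:
            self._messages.append(message)


def measure_short_lock(rpc_seconds=0.5):
    """Return how long publish() waits while the simulated call is in flight."""
    batch = ToyBatchShortLock(rpc_seconds)
    batch.publish(b"first message")
    worker = threading.Thread(target=batch.commit)
    worker.start()
    batch.rpc_started.wait()  # the slow call is now running, lock released
    start = time.time()
    batch.publish(b"second message")  # returns without waiting for the call
    waited = time.time() - start
    worker.join()
    return waited


if __name__ == "__main__":
    print("publish() waited %.4f seconds" % measure_short_lock(0.5))
```

In this toy version the second message is simply queued for a later commit. A real batch also has to track its status so that a batch already being sent stops accepting messages, which the sketch leaves out.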

I will open a follow-up PR that also includes tests, and mention the creators of the original PR as co-authors.

@Dan4London Just FYI, the pull request for this issue that you reported in the other thread has been created.
