Google-cloud-node: PubSub deadlock with flushQueue / ConnectionPool.isConnected() always false

Created on 20 Sep 2017  Β·  20Comments  Β·  Source: googleapis/google-cloud-node

Environment details

  • OS: Debian Jessie
  • Node.js version: 8.5.0
  • google-cloud-node version: 0.14.2

Steps to reproduce

I think there is an issue with queue flushing and reconnecting in the PubSub client. Here's a minimal example to reproduce it (locally, with the latest version of PubSub emulator):

const pubsub = require('@google-cloud/pubsub')();

let last = Date.now();

function onError(err) {
  console.log('ON ERROR', err);
}
function onMessage(message) {
  console.log('ON MESSAGE', message.attributes, (Date.now() - last) / 1000);
  last = Date.now();
  message.ack();
}

const topicName = `my-topic-${Math.floor(Math.random() * 10000)}`;
const subscriptionName = `my-subscription-${Math.floor(Math.random() * 10000)}`;

pubsub.createTopic(topicName, () => {
  const topic = pubsub.topic(topicName);
  const publisher = topic.publisher();
  topic.createSubscription(subscriptionName, () => {
    const subscription = pubsub.subscription(subscriptionName);
    subscription.on('error', onError);
    subscription.on('message', onMessage);

    publisher.publish(new Buffer('x'.repeat(4000)), { batch: 'first' }, (err, messageId) => {
      console.log('DONE publishing immediately', err, messageId);
    });
    setTimeout(() => {
      publisher.publish(new Buffer('x'.repeat(4000)), { batch: 'second' }, (err, messageId) => {
        console.log('DONE publishing after timeout', err, messageId);
      });
    }, 2000);
  });
});

process.on('SIGINT', () => {
  const subscription = pubsub.subscription(subscriptionName);
  subscription.removeListener('message', onMessage);
  subscription.removeListener('error', onError);

  process.exit();
});

Output:

DONE publishing immediately null 229
ON MESSAGE { batch: 'first' } 1.36
DONE publishing after timeout null 230
ON MESSAGE { batch: 'second' } 1.988
ON MESSAGE { batch: 'second' } 10.008
ON MESSAGE { batch: 'second' } 10.006

As you can see the second message is re-delivered over and over again.

What I could find out so far is the following:

In
https://github.com/GoogleCloudPlatform/google-cloud-node/blob/badd1b8e32cf0bbcab6265fa4975c8c25f7116d0/packages/pubsub/src/subscription.js#L261
this.connectionPool.isConnected() always returns false for me.

Digging deeper I found out that in
https://github.com/GoogleCloudPlatform/google-cloud-node/blob/badd1b8e32cf0bbcab6265fa4975c8c25f7116d0/packages/pubsub/src/connection-pool.js#L364
connection.isConnected is always undefined for me, hence ConnectionPool.isConnected() always returns false.

So since connectionPool.isConnected() always returns false, the first ack() call will call setFlushTimeout_() in:
https://github.com/GoogleCloudPlatform/google-cloud-node/blob/badd1b8e32cf0bbcab6265fa4975c8c25f7116d0/packages/pubsub/src/subscription.js#L263

This will in fact run this.flushQueues_() exactly once (so the first message is actually acknowledged), but since this.flushQueues_() does not reset this.flushTimeoutHandle_ all subsequent calls to this.setFlushTimeout_() will bail out here:
https://github.com/GoogleCloudPlatform/google-cloud-node/blob/badd1b8e32cf0bbcab6265fa4975c8c25f7116d0/packages/pubsub/src/subscription.js#L980-L982

So basically I end up with two questions:

  • Shouldn't this.flushQueues_() make sure to always unset this.flushTimeoutHandle_ after the queues have been flushed? Otherwise this.setFlushTimeout_() can effectively be called only exactly once during the runtime of the process to trigger this.flushQueues_(), all subsequent calls will have no effect. I assume this is a bug.
  • Why is connection.isConnected always undefined? Is that another bug in the connectionPool.isConnected() method?
pubsub

Most helpful comment

Nice catch! I'll update the version number tonight and hopefully we can finally put this to bed :). Thanks again for all your help

All 20 comments

Shouldn't this.flushQueues_() make sure to always unset this.flushTimeoutHandle_ after the queues have been flushed?

Yep, that's a bug alright!

Why is connection.isConnected always undefined? Is that another bug in the connectionPool.isConnected() method?

That sounds bad, we set that value to true once we receive the metadata event, which is essentially just the headers. Are you seeing this in production or on the emulator?

I only tested with the emulator so far...

@callmehiphop I have tested this on a real GKE instance with real Pub/Sub and the problem doesn't appear there. So it seems to be related to PubSub Emulator again.

@callmehiphop @ctavan we had the problem last week on production using the last version of the library & had to downgrade to the previous one.

OK good to know. I wasn't able to deploy our code to a real environment with pubsub>=0.14.x so far because it doesn't pass our CI (where the emulator is used).

@ctavan I've opened a PR (#2627) which I think will resolve your issue, if you have the bandwidth would you mind testing it out? We really appreciate all the feedback and help you've given us!

@callmehiphop with the branch from #2627 I currently get:

/app/node_modules/@PRIVATE/pubsub/src/connection-pool.js:231
    client.waitForReady(deadline, function(err) {
           ^

TypeError: client.waitForReady is not a function
    at /app/node_modules/@PRIVATE/pubsub/src/connection-pool.js:231:12
    at Immediate._onImmediate (/app/node_modules/@PRIVATE/pubsub/src/connection-pool.js:330:5)
    at runCallback (timers.js:785:20)
    at tryOnImmediate (timers.js:743:5)
    at processImmediate [as _immediateCallback] (timers.js:714:5)

Strange! Can you tell me what version of grpc is installed on your project?

$ yarn list | grep grpc
β”œβ”€ @google-cloud/[email protected]
β”‚  β”œβ”€ grpc@^1.3.1
β”‚  β”œβ”€ @google-cloud/common-grpc@^0.4.0
β”‚  β”œβ”€ @google-cloud/common-grpc@^0.4.0
β”‚  β”œβ”€ grpc@^1.2
β”œβ”€ [email protected]

Yeah that seems like it should be correct, I'm not quite sure what's happening here because I can't reproduce that error against both the emulator and upstream api.

Hmmm. For debugging I added

    console.log(util.inspect(client, { depth: null, showHidden: true, showProxy: true }));

before https://github.com/callmehiphop/gcloud-node/blob/3013edb9fcc28269629db7780df45241b3dea9ca/packages/pubsub/src/connection-pool.js#L230-L231

and the output is:

Client { '$channel': Channel {} }

Any idea how to further debug this?

I'm fairly certain that anything on the prototype chain will not be shown via util.inspect, so you may want to use a for..in loop or similar to see the available members.

Are you testing with the example code you provided in the issue overview? I've been trying to
reproduce with that but have been unsuccessful.

@callmehiphop you're obviously right. Iterating over the client instance shows me the following keys:

$channel
createSubscription
getSubscription
updateSubscription
listSubscriptions
deleteSubscription
modifyAckDeadline
acknowledge
pull
streamingPull
modifyPushConfig
listSnapshots
createSnapshot
deleteSnapshot
seek

Does that give you any hint on why waitForReady is not there?

I'm testing with the exact code from the initial issue description (except that I'm using your branch as pubsub package).

@callmehiphop I started digging a bit deeper and manually upgraded to [email protected], now I get:

/app/node_modules/google-gax/node_modules/grpc/src/node/src/client.js:753
    this.$channel = new grpc.Channel(address, credentials, options);
                    ^

TypeError: Channel's second argument must be a ChannelCredentials
    at new Client (/app/node_modules/google-gax/node_modules/grpc/src/node/src/client.js:753:21)
    at Immediate._onImmediate (/app/node_modules/@PRIVATE/pubsub/src/connection-pool.js:324:19)
    at runCallback (timers.js:785:20)
    at tryOnImmediate (timers.js:743:5)
    at processImmediate [as _immediateCallback] (timers.js:714:5)

When iterating over the credentials argument I see that it only contains one property: compose?!

Which version of grpc are you using?

Argh, ok. Now I was messing up with yarn and grpc versions. I finally managed to make sure that only [email protected] is installed which is pulled in through the grpc@^1.3.1 dependency of @google-cloud/[email protected] and the grpc@^1.2 dependency of [email protected].

So with [email protected] and your branch from #2627 the issue described in here is effectively fixed.

Nonetheless I believe that the grpc dependency version needs to be updated to prevent the issues that I was still seeing due to older versions of grpc that were still in the specified dependency version range and were not updated due to being fixed by the yarn.lock file and were consequently breaking the whole thing.

So it did not work with [email protected] (client.waitForReady is not a function) and it does work with [email protected].

Nice catch! I'll update the version number tonight and hopefully we can finally put this to bed :). Thanks again for all your help

Just for the record: I figured out that [email protected] was the first grpc release where your patch works well.

Anyways thanks for fixing the issue! Can't wait to finally upgrade to the new pubsub client πŸ˜‰ …

@ctavan this fix should be published, thanks for the help.

@callmehiphop I can confirm that @google-cloud/[email protected] fixes the issue outlined in here.

Just FTR in my specific setup I had to uninstall @google-cloud/datastore and @google-cloud/pubsub and then re-install both packages due to a limitation in yarn's module resolution.

yarn remove @google-cloud/datastore @google-cloud/pubsub
yarn add @google-cloud/datastore @google-cloud/pubsub

Just upgrading @google-cloud/pubsub still left [email protected] installed in node_modules and for whatever reason node was not picking up [email protected] from node_modules/@google-cloud/pubsub/node_modules/grpc but instead was continuing to use [email protected] from node_modules, probably because I import @google-cloud/datastore before @google-cloud/pubsub in my code…

In general I believe it would be a good thing to make sure that all of the google-cloud packages get their grpc dependency updated soon.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

positlabs picture positlabs  Β·  3Comments

VikramTiwari picture VikramTiwari  Β·  3Comments

bamapookie picture bamapookie  Β·  5Comments

jgeewax picture jgeewax  Β·  4Comments

jackzampolin picture jackzampolin  Β·  3Comments