Google-cloud-node: Unacceptable cold start get() performance

Created on 24 Jan 2019  Â·  97Comments  Â·  Source: googleapis/google-cloud-node

Environment details

  • OS: Cloud Functions runtime
  • Node.js version: all
  • npm version: all
  • @google-cloud/firestore version: all

Steps to reproduce

The first get() performed against Cloud Firestore may take multiple seconds (5-8s) to return. In a serverless environment with cold starts, this is an introduction of unacceptable latency and is additionally very difficult to mitigate or work around (since cold starts in Cloud Functions cannot be fully prevented).

Perhaps a mechanism should be added to use the REST API for simple calls while the gRPC connection is being established to avoid/mitigate this issue.

/cc @hiranya911 @samcorcos @schmidt-sebastian

feature request performance

Most helpful comment

Our team has created a thorough breakdown and minimal benchmark scenarios for this issue.

Direct comparison using Google SDKS via REST vs gRPC: google-cloud-grpc-issue

Please see the full repo for context, but here are the main takeaways:

Takeaways

We would happily tradeoff slightly slower query times for consistent query times 10 times out of 10. The REST version especially when deployed to serverless is significantly more consistent and reasonable than the gRPC version with it not being uncommon for our end users to wait 10-30s for initial results which is absolutely unacceptable.

This problem is compounded significantly by our serverless business logic using multiple Google APIs to handle requests (like Firestore + Logs), each of which has the same redundant slow initialization issue.

This is a serious performance issue and there have been some similar reports, but it's hard to believe that this hasn't been acknowledged as a fundamental weakness of Google Cloud, at least with the happy path on Node.js using their official gRPC-based SDKs.

Our options at this point are:

  1. Switch all of our business logic and future applications away from Google Cloud.
  2. Switch all of our business logic to use REST versions of GCP APIs which would be really awkward to maintain and work with and is likely a non-starter.
  3. Resolve this core issue and move on to more important things.

@schmidt-sebastian @mbleigh @JustinBeckwith we've done our part in breaking down as simple and clear of a repro example as possible. Please help us and other customers choose option #3.

cc @samcorcos @cielo who have also investigated this issue independently.

All 97 comments

Greetings! Can you share how you're measuring this? It would be cool to have a consistent benchmark so we can work from the same baseline.

Filed this on behalf of @samcorcos who reached out to me -- if he's willing he can maybe share the code that's producing the issue.

FWIW- I wouldn't expect gRPC to be slower than REST for get() calls in the nodejs-firestore SDK (it might be slower in the mobile firebase SDK which uses a listener under the covers for get(), but nodejs-firestore should do a direct gRPC call).

5-8s sounds quite slow. Has this been observed recently?

Yes, within the last few days (again, @samcorcos shared .an example of this happening with me).

@mikelehen @JustinBeckwith I'm happy to share the repo and the logs with you as well :) What's a good email address for you guys?

As for the measurement, this is a screenshot from the logs:

screen shot 2019-01-23 at 09 48 08

Function execution start to first log: 0.2s
First db call start to finish: 7s
second db call start to finish: 0.1s

I initially assumed it was a cold start issue, but it looks like it's actually something to do with the DB. This is above my pay grade, so I don't know the specifics, but the function itself is pretty simple:

const onAssign = async (snap, context) => {
console.log('function start ', Date.now())
  const { userId, contactId } = context.params
  const smallestCadence = 1

  console.log('first db start ', Date.now())
  const doc = await db
    .collection('users')
    .doc(userId)
    .collection('contacts')
    .doc(contactId)
    .get()
  console.log('first db end', Date.now())

  const contact = {
    contactId,
    ...doc.data()
  }

  console.log('second db call ', Date.now())
  if (!contact.lastContact) {
    await db
      .collection('users')
      .doc(userId)
      .collection('contacts')
      .doc(contactId)
      .set({ urgency: 'red' }, { merge: true })
  } else {
    // otherwise, determine the urgency and assign
    const lastContact = moment(contact.lastContact)
    const today = moment()
    const delta = lastContact.diff(today, 'days')
    const urgency = determineUrgency(delta, smallestCadence)

    console.info(`Set urgency of user ${userId} to ${urgency}`)

    await db
      .collection('users')
      .doc(userId)
      .collection('contacts')
      .doc(contactId)
      .set({ urgency }, { merge: true })
  }
  console.log('second db end ', Date.now())
}

Currently running:

    "firebase-admin": "~6.0.0",
    "firebase-functions": "^2.1.0",

On Node 8

I first noticed this on the first call after a deploy, but I imagine it could also be an issue on cold start even if it's not soon after a deploy.

At @JustinBeckwith's request, I rewrote the query to use the googleapis node client and the issue went away.

As you can see from the screenshots below, function start to first db end took 0.5 seconds, as opposed to the version that uses the Firebase package which can take 6-10 seconds to execute the first query.

screen shot 2019-01-27 at 16 25 01

Here's the updated code:

const { google } = require('googleapis')
const firestore = google.firestore('v1beta1')

const myFunction = async (snap, context) => {
  const { userId, contactId } = context.params

  // NOTE need to export these within the function because Firebase Functions does not allow upper-case
  // variable names for some reason?
  // Error: Invalid config name onCadenceAssignToContact.GOOGLE_APPLICATION_CREDENTIALS, cannot use upper case.
  process.env.GOOGLE_APPLICATION_CREDENTIALS = './credentials/myapp-firebase-adminsdk-3jp1h-56c62c3fbd.json'
  process.env.GCLOUD_PROJECT = 'myapp'

  console.log('function start ', Date.now())

  const auth = await google.auth.getClient({
    // Scopes can be specified either as an array or as a single, space-delimited string.
    scopes: [
      'https://www.googleapis.com/auth/cloud-platform',
      'https://www.googleapis.com/auth/datastore'
    ]
  })

  console.log('first db start ', Date.now())
  // https://cloud.google.com/firestore/docs/reference/rest/v1/projects.databases.documents/get
  const name = `projects/myapp/databases/(default)/documents/users/${userId}/contacts/${contactId}`
  const result = await firestore.projects.databases.documents.get({
    name,
    auth
  })
  console.log('first db end ', Date.now())

  console.log(JSON.stringify(result.data))
}

Thank you so much for testing this with the googleapis npm module. This gives us a pretty clear picture that it's something in either this module, gax, or grpc :/

We have looked at the cold start performance with the Cloud Functions team before (@laurenzlong and others). One of the changes we made is that we now lazy-load google-gax and grpc. This loading happens just before the first request is sent to the Firestore backend. Back when we did this, we were able to shave off 5-6 seconds in the module initialization time. Instead, this time is now "hidden" in the async request flow.

I don't know if there are a ton of options that we have until we can switch to the pure-JS GRPC client. The GRPC dependency is huge, and rewriting the client using a different network stack is likely out of scope for these server libraries.

@schmidt-sebastian What about the possibility of using the REST API as a fallback while gRPC is loading as @mbleigh suggested? The googleapis package doesn't have the same issue.

Perhaps a mechanism should be added to use the REST API for simple calls while the gRPC connection is being established to avoid/mitigate this issue.

"Using the REST API as a fallback while gRPC is loading" is "rewriting the client to use a different network stack" and is infeasible. Several of the APIs we expose here are based on streaming RPCs, which don't have an equivalent via REST. The only way to fix this is to make it cheaper to depend upon gRPC from Cloud Functions. That will additionally benefit all One Platform-based APIs, not just Firestore.

It turns out that google-gax has some experimental support for depending on a pure javascript version of gRPC if you set the GOOGLE_CLOUD_USE_GRPC_JS environment variable. This will work under the experimental Node 8 runtime for Cloud Functions (but not Node 6). This could help, but may not help much until google-gax breaks its dependency on the native grpc package.

In any case, the cold start issue isn't anything we have the resources to fix ourselves. The grpc-node project was essentially created to address this problem and we'll let them work through it.

Hi folks, there is no actionable way this project can resolve this issue. It appears relevant bugs on other projects have been linked, so I'm going to go ahead and close this issue. You're still welcome to discuss workarounds and impact here, but tracking the resolution will happen in the other projects.

Thanks, everyone!

Note that the 1.0.2 release includes support for GRPC JS, which reduces the load time of Firestore on GCF to around 3 seconds. This can be enabled on the Node 8 runtime with the GOOGLE_CLOUD_USE_GRPC_JS flag mentioned above.

@schmidt-sebastian would you be willing to post some data here? It would be super helpful.

I am not getting any reproducible results at the moment and see loading time with both GRPC and GRPC JS at around 3 to 4 seconds.

Can someone tell me how to set the GOOGLE_CLOUD_USE_GRPC_JS environment variable in Firestore? I don't see any args for setting environment variables to firebase deploy and I don't see anywhere in the Firestore web console. Your help is very much appreciated.

@hitchingsh If you use the latest version of firebase-admin or @google-cloud/firestore, grpc-js will be used automatically. There's no need to set any environment variable.

Good news! The latest release uses grpc-js by default now.

@mikelehen, Thanks!

Note, its still getting a 3 second delay on my first call to the admin firestore database in node JS. Is there anyway to speed it up

const admin = require('firebase-admin');
const functions = require('firebase-functions');

var initializedFirestore = false

if (!initializedFirestore) {
  admin.initializeApp();
  initializedFirestore = true
}

const firestore = admin.firestore();

// Note, this is highly abbreviated and my actual code is a lot more 
// complex but I also get a 3 second hit after uploading the function 
// on the first call to Firestore:

firestore.collection('myCollection').doc(docId).get()

@hitchingsh Can you verify the versions that you are using? firebase-functions hasn't yet been updated to require firebase-admin version 8.x, which is the release that pulls in the smaller GRPC core (@grpc/grpc-js instead of grpc). It would help us if you can show us your output of npm list.

With regards to the 3 second delay on my first call to the admin firestore database in node JS 8. Here's my package.json

{
"name": "functions",
"description": "Cloud Functions for Firebase",
"engines": {
"node": "8"
},
"scripts": {
"lint": "eslint .",
"serve": "firebase serve --only functions",
"shell": "firebase functions:shell",
"start": "npm run shell",
"deploy": "firebase deploy --only functions",
"logs": "firebase functions:log"
},
"dependencies": {
"firebase-admin": "^8.0.0",
"firebase-functions": "^2.3.1",
"twilio": "^3.31.1"
},
"devDependencies": {
"eslint": "^5.16.0",
"eslint-plugin-promise": "^4.1.1"
},
"private": true
}

and here's my npm list
npm-list.txt

This did catch my eye but could use your help on it
UNMET PEER DEPENDENCY [email protected]

I'm having basically the same issue - read to the bottom here with some hope for fix. I've manually specified [email protected] to try to fix - but I'm not sure if it's being used, or if its bundling 7.0, since thats what it seems firebase-functions 2.3.1 requires.

You have the right dependencies locally, so it should be faster than with grpc. One possibility is that the dependencies installed on GCF do not match your local dependencies. To help us look at that, it would help to get the actual version number of the SDK you are using. This is logged when you invoke setLogFunction(console.log).

Note that there might not be too much we can do if you are indeed running on @grpc/grps-js. My hope is that somehow you still are not, and that we can get you on @grpc/grps-js :)

Ok, it says, when I do admin.firestore.setLogFunction(console.log)

Firestore (2.1.1) 2019-06-08T00:23:35.069Z Rgzqk [WriteBatch.commit]: Using transaction for commit

Is Firestore 2.1.1 the correct version of Firestore? Also, note these lines:

Firestore (2.1.1) 2019-06-08T00:23:35.069Z Rgzqk [WriteBatch.commit]: Using transaction for commit
Firestore (2.1.1) 2019-06-08T00:23:37.276Z ##### [Firestore]: Initialized Firestore GAPIC Client

It appears it took over two seconds to initialize the GAPIC Client.

@hitchingsh Thanks for getting back to us! The latest version of @google-cloud/firestore is 2.2, but there are no changes between 2.1.1 and 2.2 that affect the performance. 2.1.1 (and in fact all 2.x releases) already contains the much lighter GRPC JS networking layer. We will look further into improving startup time, but at this point, I unfortunately don't have a short term fix for you.

If you are not already initializing Firestore in the module scope, I would advise you to update your code to follow the pattern here: https://cloud.google.com/functions/docs/bestpractices/tips#performance ("Use global variables to reuse objects in future invocations"). Once a Firestore instance is initialized, you can use it across function invocations.

Is there a way to kickoff the gRPC loading _without_ making a .get() call in Firestore?

Since App Engine has support for warmup requests, it would be great to be able to complete the loading/initialization of gRPC and other required modules as part of the warmup. That way when the first .get() call is made, everything would already be loaded and the call would return quickly as expected.

Sure, we can just make a .get() call to an arbitrary document during the warmup, but it's really just the loading/init we're after so I was curious if there was some other way to kick it off.

I don't know of one offhand. You could probably get the grpc code loaded into memory by just adding require('gprc-js') or whatever... but doing an actual get() would be a much better warmup, since it would actually start the connection to the backend, etc.

The cold start problem for Firestore continues to be a major performance issue for us. The basic problem is that there is no minimum time for Cloud functions to timeout so even though I have a scheduler hitting everyone of my cloud functions every half hour and hit them all whenever I upload, I still get cold start times which are much slower. I already have optimized my code so unnecessary code is not loaded and everything is saved. However, for cold starts gRPC is adding at least 3 seconds, which creates a poor user interface experience.

I re-opened this issue, and moved it over to the google-cloud-node repository so we could investigate the grpc slow start issue more holistically. This isn't just an issue for firestore, it seems to be an interaction with GCF and gRPC.

@JustinBeckwith Just an FYI, I created this package as a workaround to keep the same syntax but use the REST API and it's been working without a hitch.

If there's a way to check, "Has gRPC loaded yet?" you could run "If yes, use gRPC. If no, use REST" as @mbleigh recommended in the original post.

https://github.com/samcorcos/firestore-rest

Any updates on this issue?

@arsen @samcorcos, since this issue was opened (as @schmidt-sebastian indicates) we've moved to @grpc/grpc-js for most of our libraries (such as @google-cloud/firestore).

We've also addressed some issues in our authentication library, which was leading to potentially slow start times in some specific environments.

I would be curious if you're still seeing slow initial calls to get() when using @google-cloud/firestore?

Hi @bcoe,
Yes, I still see ~3 seconds initial call to get(), even with the latest @google-clouse/firestore:2.5 version.

Could you please point to the issue in @grpc/grpc-js that is related to this, I couldn't find one.

Thanks.

While I love many aspects of Firebase Firestore, and Cloud Functions, Firestore's use of 'grpc' protocol under the hood does not simply go well with Cloud Functions' serverless architecture. It is very expensive to create a new 'grpc' client, and I do not think its protocol was ever meant to be used in short-lived processes or vms such as in cloud functions. Fortunately, if you are owner of an application that receives constant volume of traffic, and none of them come in burst, then constant number of cloud functions vms will always be active, and your clients will likely not see any latency issue.

The sad thing is that many app developers who only use high level frameworks are not aware of this negative aspect, and jump in, until they face this issue down in the road. When you see a simple Firestore transaction request, that takes 300ms ~ 500ms in traditional server or dev environment, taking more than 6,000ms ~ 9000ms in cloud functions (10~12 times longer), you somewhat lose the benefit of "You pay by 100ms increments compute time" that Firebase Function team advertises because compute resources are spent on 'grpc' initialization.

The issue is somewhat severe because each virtual instance of cloud function can handle at most 1 request at any given time, and if you happen to receive multiple burst requests coming in short period of time, all those requests will end up spinning up new virtual instance of cloud function, which all of them will start up a new 'grpc' client from the scratch, and they will all end up taking few seconds.

This "slow first request" issue is not just 'grpc' specific issue, and is also present in setting up HTTP connection pool. However, it is not as severe as setting up 'grpc'. Also, DynamoDB & Lambda from AWS side also has "slow first request" issue, but not as severe as Firestore & Cloud functions because DynamoDB SDK uses HTTP under the hood & DynamoDB itself provides well around 10~20ms latency compared to Firestore's 100ms latency.

My recommendation is that you should not try to build HTTP endpoint that use Cloud Firestore in it using Cloud Functions or in any serverless architecture. You will end up with bad user experience due to high latency. Instead, use Firestore Client side SDK and let your client side applications directly interact with Firestore. Firestore client side SDK is available in many different languages, and it also provides REST endpoint for those languages which they do not cover.

If you must have a HTTP endpoint with Firestore and want serverless environment, one possible mitigation is to use Cloud Run, which is similar to Cloud Functions, except that each instance of Cloud Run will be able to simultaneously handle up to 80 requests at the same time. It comes with Docker startup time overhead, but at least, each VM instance will be able to handle 80 requests simultaneously reducing chance of "slow first request" when requests number increases. (However, if requests come in bulk in a short period of time when there are no Cloud Run instances running, you will still end up creating many Cloud Run instances and all of them will end up with slow responses)

If you must have a HTTP endpoint with Firestore, and if low response time is absolutely necessary, you will have to go with traditional server setup.

If you are doing event based work, one possible mitigation is to utilize Cloud Functions Cron support (https://firebase.googleblog.com/2019/04/schedule-cloud-functions-firebase-cron.html) to get the jobs done in batch.

And Yes, 'grpc' works perfectly fine in regular long-lived traditional servers or docker instances, and I use it in my server applications. I just don't think it works well for short-lived processes.

@bcoe I'm also still experiencing significant slow cold start times (>4 seconds) on a very simple cloud function that involves firestore. Using [email protected], @google-cloud/[email protected] and @grpc/[email protected]. Should we expect any changes on this front, or simply switch away from cloud functions for this usage?

@schmidt-sebastian have you done any digging into profiling cold starts recently, I'm curious if there remain any improvements we could make at the client library layer.

Unfortunately, this has not been a priority for us lately. An area of improvement that we carved out a while ago is https://github.com/googleapis/nodejs-firestore/blob/master/dev/protos/protos.json, which is loaded at startup and contains types for the v1beta1 and the Admin Protos that are not used in the client.

@cielo
Thank you for your comment, it helps for me.
One question, you said do not use HTTP endpoint that use Cloud Firestore in it using Cloud Functions, and use Firestore client side SDK instead.
But I think most of Firebase users use SDK in Cloud Functions which contains 'get()' and it still made a delay for ~3 sec. Doesn't it use gRPC?
My project use HTTP endpoint from client to Cloud Functions, and connect Cloud Firestore from Cloud Functions. Is this path inefficient? It's just for security. Anyone can't connect DB and get data directly from client.

Cloud Firestore provides security rules which can be used to prevent unauthenticated access. You could use that in trivial firestore operations (e.g. Updating user's login record, updating user's own data, etc). Firestore Client SDK provides low latency for users for better user experience.

For operations that shouldn't be permitted by users (e.g. purchase/transaction related), you could put them behind a server. If you (or your users) can withstand (3~4s added latency due to cold start), using cloud functions is fine. With Cloud Run, that latency can be reduced further due to better vcpu resources, and its ability to handle up to 80 requests in a single instance.

Our team has created a thorough breakdown and minimal benchmark scenarios for this issue.

Direct comparison using Google SDKS via REST vs gRPC: google-cloud-grpc-issue

Please see the full repo for context, but here are the main takeaways:

Takeaways

We would happily tradeoff slightly slower query times for consistent query times 10 times out of 10. The REST version especially when deployed to serverless is significantly more consistent and reasonable than the gRPC version with it not being uncommon for our end users to wait 10-30s for initial results which is absolutely unacceptable.

This problem is compounded significantly by our serverless business logic using multiple Google APIs to handle requests (like Firestore + Logs), each of which has the same redundant slow initialization issue.

This is a serious performance issue and there have been some similar reports, but it's hard to believe that this hasn't been acknowledged as a fundamental weakness of Google Cloud, at least with the happy path on Node.js using their official gRPC-based SDKs.

Our options at this point are:

  1. Switch all of our business logic and future applications away from Google Cloud.
  2. Switch all of our business logic to use REST versions of GCP APIs which would be really awkward to maintain and work with and is likely a non-starter.
  3. Resolve this core issue and move on to more important things.

@schmidt-sebastian @mbleigh @JustinBeckwith we've done our part in breaking down as simple and clear of a repro example as possible. Please help us and other customers choose option #3.

cc @samcorcos @cielo who have also investigated this issue independently.

Greetings folks! And thank you for continuing to push us here. We absolutely hear you, and we're actively looking again at the issues going on. We don't know what the outcome is going to be, but we are picking this back up again. I know this is less than ideal, and appreciate y'all being patient.

Specifically - @transitive-bullshit - that repository you all created is amazing, and is going to go a long way to helping us debug what's happening.

@transitive-bullshit Do you mind turning on logging for one of your GRPC calls? While I do see slightly lower numbers for GRPC on my MacBook Pro as well (running on the generally pretty terrible network connection in my home), I am not getting numbers anywhere close to what you reported.

Single GET (GRPC with logging):

main: 707.532ms
main: 681.069ms
main: 670.499ms
main: 643.907ms
main: 622.395ms

Single GET Rest:

main: 503.748ms
main: 598.759ms
main: 560.120ms
main: 471.086ms
main: 586.161ms

Multiple GET (GRPC with logging):

main: 1136.914ms 
main: 968.463ms
main: 1049.292ms
main: 990.815ms
main: 1017.922ms

Multiple GET REST:

main: 1047.665ms
main: 969.549ms
main: 990.563ms
main: 990.316ms
main: 941.437ms

The main slowdown for GRPC is the loading of the client during the initial request.

@JustinBeckwith I'm glad that you're picking this issue back up! @transitive-bullshit thank you so much for putting that repo together. I just wanted to add my voice to the choir in emphasizing that this is a deal-breaking issue for me. I would also happily take slightly slower query times if it meant the first query was brought down to a reasonable execution time. Firestore in serverless is essentially broken right now for anything but background tasks that don't have to happen in real time.

@schmidt-sebastian I think the slowdowns are being reported when running in Cloud Functions environment specifically -- may not repro locally.

@transitive-bullshit Thank you for putting it together! While @schmidt-sebastian is looking at the Firestore library, I will see how google-gax + @grpc/grpc-js could contribute to low latencies of the first request.

One change that we already made, that can actually affect the cold start time on GCF (not locally), is that we moved asynchronous authentication pipeline away from the constructor (GCF does not expect a lot of async stuff happening in the constructor, so sometimes the auth got effectively throttled which caused a set of latency issues). It may or may not help fix the huge numbers you're reporting, but this is one fix that I will check first (it's already in master for all libraries, but not yet released for both Firestore and Logging). update: it's actually released in Firestore 3.6 but not yet in logging.

I will keep you folks updated here with the findings!

@mbleigh The benchmarks pointed to in their GitHub repo are meant to be run locally.

@transitive-bullshit We are wondering if you are hitting a 10 seconds timeout during credential detection. This should not happen if you are using environment variables to configure Application Default Credentials, but it seems like an avenue we should explore.

It would help us if you could enable additional logging for this by setting the environment variable DEBUG_AUTH=true. You can do this in addition to invoking setLogFunction(console.log).

@schmidt-sebastian I think the slowdowns are being reported when running in Cloud Functions environment specifically -- may not repro locally.

@mbleigh No, this is happening locally for me on my Macbook Pro and a solid network. It occurs every time I restart my local server which mimics a cold start from serverless functions.

My local network is a strong 1GB down btw, and given that these results don't repro with the REST version and that I'm also seeing the same latency on my AWS lambda deployments, I don't believe it's a network issue.

@schmidt-sebastian your numbers look much more reasonable -- I definitely thought that a consistent 10s+ seemed to point to something being very wrong.

Unfortunately, when I run these benchmarks locally with DEBUG_AUTH=true node benchmark-multiple-gets.js I don't get any additional logs. Is there another way to enable logs?

We are wondering if you are hitting a 10 seconds timeout during credential detection. This should not happen if you are using environment variables to configure Application Default Credentials, but it seems like an avenue we should explore.

As described in the repo, I'm using GOOGLE_APPLICATION_CREDENTIALS to point to a service account json file.

Also, just wanted to thank @JustinBeckwith and @schmidt-sebastian for weighing in here -- if this is such an important 95% happy path issue for a vocal few of us here, you can imagine how many potential users it's affecting overall. Looking forward to getting this thing debugged / resolved and let me know what I can do on my end to help.

Thanks!

I tracked down the grpc logging environment variables. Here are the detailed results:

GRPC_TRACE=all GRPC_VERBOSITY=DEBUG node benchmark-multiple-gets.js 

using gRPC via @google-cloud/firestore
2020-03-26T20:37:00.441Z | resolving_load_balancer | firestore.googleapis.com:443 IDLE -> IDLE
2020-03-26T20:37:00.447Z | connectivity_state | firestore.googleapis.com:443 IDLE -> IDLE
2020-03-26T20:37:00.447Z | dns_resolver | Resolver constructed for target firestore.googleapis.com:443
2020-03-26T20:37:00.450Z | channel | firestore.googleapis.com:443 createCall [0] method="/google.firestore.v1.Firestore/BatchGetDocuments", deadline=Thu Mar 26 2020 16:42:00 GMT-0400 (Eastern Daylight Time)
2020-03-26T20:37:00.452Z | call_stream | [0] Sending metadata
2020-03-26T20:37:00.453Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-03-26T20:37:00.453Z | call_stream | [0] write() called with message of length 117
2020-03-26T20:37:00.455Z | call_stream | [0] deferring writing data chunk of length 122
2020-03-26T20:37:00.455Z | dns_resolver | Resolution update requested for target firestore.googleapis.com:443
2020-03-26T20:37:00.457Z | resolving_load_balancer | firestore.googleapis.com:443 IDLE -> CONNECTING
2020-03-26T20:37:00.457Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-03-26T20:37:00.457Z | connectivity_state | firestore.googleapis.com:443 IDLE -> CONNECTING
2020-03-26T20:37:00.457Z | resolving_load_balancer | firestore.googleapis.com:443 CONNECTING -> CONNECTING
2020-03-26T20:37:00.458Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-03-26T20:37:00.458Z | connectivity_state | firestore.googleapis.com:443 CONNECTING -> CONNECTING
2020-03-26T20:37:10.551Z | dns_resolver | Resolved addresses for target firestore.googleapis.com:443: [172.217.12.202:443,2607:f8b0:4006:803::200a:443]
2020-03-26T20:37:10.552Z | pick_first | IDLE -> IDLE
2020-03-26T20:37:10.552Z | resolving_load_balancer | firestore.googleapis.com:443 CONNECTING -> IDLE
2020-03-26T20:37:10.552Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-03-26T20:37:10.552Z | connectivity_state | firestore.googleapis.com:443 CONNECTING -> IDLE
2020-03-26T20:37:10.553Z | pick_first | Connect to address list 172.217.12.202:443,2607:f8b0:4006:803::200a:443
2020-03-26T20:37:10.554Z | subchannel | 172.217.12.202:443 refcount 0 -> 1
2020-03-26T20:37:10.554Z | subchannel | 2607:f8b0:4006:803::200a:443 refcount 0 -> 1
2020-03-26T20:37:10.554Z | subchannel | 172.217.12.202:443 refcount 1 -> 2
2020-03-26T20:37:10.554Z | subchannel | 2607:f8b0:4006:803::200a:443 refcount 1 -> 2
2020-03-26T20:37:10.554Z | pick_first | Start connecting to subchannel with address 172.217.12.202:443
2020-03-26T20:37:10.554Z | pick_first | IDLE -> CONNECTING
2020-03-26T20:37:10.554Z | resolving_load_balancer | firestore.googleapis.com:443 IDLE -> CONNECTING
2020-03-26T20:37:10.554Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-03-26T20:37:10.555Z | connectivity_state | firestore.googleapis.com:443 IDLE -> CONNECTING
2020-03-26T20:37:10.555Z | subchannel | 172.217.12.202:443 IDLE -> CONNECTING
2020-03-26T20:37:10.558Z | pick_first | CONNECTING -> CONNECTING
2020-03-26T20:37:10.558Z | resolving_load_balancer | firestore.googleapis.com:443 CONNECTING -> CONNECTING
2020-03-26T20:37:10.558Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-03-26T20:37:10.559Z | connectivity_state | firestore.googleapis.com:443 CONNECTING -> CONNECTING
2020-03-26T20:37:10.559Z | subchannel | 2607:f8b0:4006:803::200a:443 IDLE -> CONNECTING
2020-03-26T20:37:10.559Z | pick_first | CONNECTING -> CONNECTING
2020-03-26T20:37:10.559Z | resolving_load_balancer | firestore.googleapis.com:443 CONNECTING -> CONNECTING
2020-03-26T20:37:10.559Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-03-26T20:37:10.559Z | connectivity_state | firestore.googleapis.com:443 CONNECTING -> CONNECTING
2020-03-26T20:37:10.680Z | subchannel | 172.217.12.202:443 CONNECTING -> READY
2020-03-26T20:37:10.680Z | pick_first | Pick subchannel with address 172.217.12.202:443
2020-03-26T20:37:10.680Z | pick_first | CONNECTING -> READY
2020-03-26T20:37:10.680Z | resolving_load_balancer | firestore.googleapis.com:443 CONNECTING -> READY
2020-03-26T20:37:10.680Z | channel | Pick result: COMPLETE subchannel: 172.217.12.202:443 status: undefined undefined
2020-03-26T20:37:10.691Z | connectivity_state | firestore.googleapis.com:443 CONNECTING -> READY
2020-03-26T20:37:10.691Z | subchannel | 172.217.12.202:443 refcount 2 -> 3
2020-03-26T20:37:10.691Z | subchannel | 172.217.12.202:443 refcount 3 -> 2
2020-03-26T20:37:10.692Z | subchannel | 2607:f8b0:4006:803::200a:443 refcount 2 -> 1
2020-03-26T20:37:10.694Z | subchannel | 2607:f8b0:4006:803::200a:443 CONNECTING -> READY
2020-03-26T20:37:11.045Z | call_stream | [0] attachHttp2Stream from subchannel 172.217.12.202:443
2020-03-26T20:37:11.045Z | subchannel | 172.217.12.202:443 callRefcount 0 -> 1
2020-03-26T20:37:11.045Z | call_stream | [0] sending data chunk of length 122 (deferred)
2020-03-26T20:37:11.046Z | call_stream | [0] end() called
2020-03-26T20:37:11.046Z | call_stream | [0] calling end() on HTTP/2 stream
2020-03-26T20:37:11.190Z | call_stream | [0] received HTTP/2 headers frame
2020-03-26T20:37:11.191Z | call_stream | [0] receive HTTP/2 data frame of length 366
2020-03-26T20:37:11.192Z | call_stream | [0] parsed message of length 366
2020-03-26T20:37:11.192Z | call_stream | [0] filterReceivedMessage of length 366
2020-03-26T20:37:11.192Z | call_stream | [0] pushing to reader message of length 361
2020-03-26T20:37:11.204Z | call_stream | [0] received HTTP/2 trailing headers frame
2020-03-26T20:37:11.204Z | call_stream | [0] received HTTP/2 end of data flag
2020-03-26T20:37:11.204Z | call_stream | [0] pushing to reader message of length null
2020-03-26T20:37:11.205Z | call_stream | [0] ended with status: code=0 details=""
2020-03-26T20:37:11.205Z | subchannel | 172.217.12.202:443 callRefcount 1 -> 0
DaubZexsvlUE7beXCOEd: 10.892s
{
  state: 'enabled',
  schedule: '* * * * *',
  httpMethod: 'GET',
  name: 'test0',
  httpHeaders: {},
  userId: '5e7ae88a3720c1003cebcd0b',
  tags: [],
  description: '',
  url: 'https://saasify.sh',
  timezone: 'America/New_York'
}
2020-03-26T20:37:11.210Z | call_stream | [0] HTTP/2 stream closed with code 8
2020-03-26T20:37:11.211Z | channel | firestore.googleapis.com:443 createCall [1] method="/google.firestore.v1.Firestore/BatchGetDocuments", deadline=Thu Mar 26 2020 16:42:11 GMT-0400 (Eastern Daylight Time)
2020-03-26T20:37:11.211Z | call_stream | [1] Sending metadata
2020-03-26T20:37:11.211Z | channel | Pick result: COMPLETE subchannel: 172.217.12.202:443 status: undefined undefined
2020-03-26T20:37:11.211Z | call_stream | [1] write() called with message of length 117
2020-03-26T20:37:11.212Z | call_stream | [1] deferring writing data chunk of length 122
2020-03-26T20:37:11.212Z | call_stream | [1] attachHttp2Stream from subchannel 172.217.12.202:443
2020-03-26T20:37:11.212Z | subchannel | 172.217.12.202:443 callRefcount 0 -> 1
2020-03-26T20:37:11.212Z | call_stream | [1] sending data chunk of length 122 (deferred)
2020-03-26T20:37:11.212Z | call_stream | [1] end() called
2020-03-26T20:37:11.212Z | call_stream | [1] calling end() on HTTP/2 stream
2020-03-26T20:37:11.361Z | call_stream | [1] received HTTP/2 headers frame
2020-03-26T20:37:11.362Z | call_stream | [1] receive HTTP/2 data frame of length 380
2020-03-26T20:37:11.362Z | call_stream | [1] parsed message of length 380
2020-03-26T20:37:11.362Z | call_stream | [1] filterReceivedMessage of length 380
2020-03-26T20:37:11.362Z | call_stream | [1] pushing to reader message of length 375
2020-03-26T20:37:11.364Z | call_stream | [1] received HTTP/2 trailing headers frame
2020-03-26T20:37:11.364Z | call_stream | [1] received HTTP/2 end of data flag
2020-03-26T20:37:11.364Z | call_stream | [1] pushing to reader message of length null
2020-03-26T20:37:11.364Z | call_stream | [1] ended with status: code=0 details=""
2020-03-26T20:37:11.364Z | subchannel | 172.217.12.202:443 callRefcount 1 -> 0
M5G2MBgoPUv4Ur1K7Zu0: 154.886ms
{
  tags: [],
  description: '',
  url: 'https://saasify.sh',
  timezone: 'America/New_York',
  state: 'enabled',
  schedule: '* * * * *',
  httpMethod: 'GET',
  name: 'simple cron test',
  httpHeaders: {},
  userId: '5e7aa4e477a4d81ed8eecd4d'
}
2020-03-26T20:37:11.365Z | call_stream | [1] HTTP/2 stream closed with code 8
2020-03-26T20:37:11.366Z | channel | firestore.googleapis.com:443 createCall [2] method="/google.firestore.v1.Firestore/BatchGetDocuments", deadline=Thu Mar 26 2020 16:42:11 GMT-0400 (Eastern Daylight Time)
2020-03-26T20:37:11.366Z | call_stream | [2] Sending metadata
2020-03-26T20:37:11.366Z | channel | Pick result: COMPLETE subchannel: 172.217.12.202:443 status: undefined undefined
2020-03-26T20:37:11.366Z | call_stream | [2] write() called with message of length 117
2020-03-26T20:37:11.366Z | call_stream | [2] deferring writing data chunk of length 122
2020-03-26T20:37:11.366Z | call_stream | [2] attachHttp2Stream from subchannel 172.217.12.202:443
2020-03-26T20:37:11.366Z | subchannel | 172.217.12.202:443 callRefcount 0 -> 1
2020-03-26T20:37:11.367Z | call_stream | [2] sending data chunk of length 122 (deferred)
2020-03-26T20:37:11.367Z | call_stream | [2] end() called
2020-03-26T20:37:11.367Z | call_stream | [2] calling end() on HTTP/2 stream
2020-03-26T20:37:11.531Z | call_stream | [2] received HTTP/2 headers frame
2020-03-26T20:37:11.532Z | call_stream | [2] receive HTTP/2 data frame of length 391
2020-03-26T20:37:11.532Z | call_stream | [2] parsed message of length 391
2020-03-26T20:37:11.532Z | call_stream | [2] filterReceivedMessage of length 391
2020-03-26T20:37:11.532Z | call_stream | [2] pushing to reader message of length 386
2020-03-26T20:37:11.533Z | call_stream | [2] received HTTP/2 trailing headers frame
2020-03-26T20:37:11.534Z | call_stream | [2] received HTTP/2 end of data flag
2020-03-26T20:37:11.534Z | call_stream | [2] pushing to reader message of length null
2020-03-26T20:37:11.534Z | call_stream | [2] ended with status: code=0 details=""
2020-03-26T20:37:11.534Z | subchannel | 172.217.12.202:443 callRefcount 1 -> 0
OK7ZMTiLCNCbbcf4euWe: 169.386ms
{
  description: '',
  url: 'https://puppet-master.sh/',
  timezone: 'America/New_York',
  state: 'enabled',
  schedule: '*/5 * * * *',
  httpMethod: 'GET',
  name: 'simple cron test 2',
  httpHeaders: {},
  userId: '5e7aa4e477a4d81ed8eecd4d',
  tags: []
}
2020-03-26T20:37:11.535Z | call_stream | [2] HTTP/2 stream closed with code 8
2020-03-26T20:37:11.537Z | channel | firestore.googleapis.com:443 createCall [3] method="/google.firestore.v1.Firestore/BatchGetDocuments", deadline=Thu Mar 26 2020 16:42:11 GMT-0400 (Eastern Daylight Time)
2020-03-26T20:37:11.537Z | call_stream | [3] Sending metadata
2020-03-26T20:37:11.537Z | channel | Pick result: COMPLETE subchannel: 172.217.12.202:443 status: undefined undefined
2020-03-26T20:37:11.538Z | call_stream | [3] write() called with message of length 117
2020-03-26T20:37:11.538Z | call_stream | [3] deferring writing data chunk of length 122
2020-03-26T20:37:11.538Z | call_stream | [3] attachHttp2Stream from subchannel 172.217.12.202:443
2020-03-26T20:37:11.538Z | subchannel | 172.217.12.202:443 callRefcount 0 -> 1
2020-03-26T20:37:11.538Z | call_stream | [3] sending data chunk of length 122 (deferred)
2020-03-26T20:37:11.539Z | call_stream | [3] end() called
2020-03-26T20:37:11.539Z | call_stream | [3] calling end() on HTTP/2 stream
2020-03-26T20:37:11.629Z | call_stream | [3] received HTTP/2 headers frame
2020-03-26T20:37:11.630Z | call_stream | [3] receive HTTP/2 data frame of length 392
2020-03-26T20:37:11.630Z | call_stream | [3] parsed message of length 392
2020-03-26T20:37:11.630Z | call_stream | [3] filterReceivedMessage of length 392
2020-03-26T20:37:11.630Z | call_stream | [3] pushing to reader message of length 387
2020-03-26T20:37:11.631Z | call_stream | [3] received HTTP/2 trailing headers frame
2020-03-26T20:37:11.631Z | call_stream | [3] received HTTP/2 end of data flag
2020-03-26T20:37:11.632Z | call_stream | [3] pushing to reader message of length null
2020-03-26T20:37:11.632Z | call_stream | [3] ended with status: code=0 details=""
2020-03-26T20:37:11.632Z | subchannel | 172.217.12.202:443 callRefcount 1 -> 0
ncQYovHShkRWfIyVKZSy: 97.752ms
{
  description: '',
  url: 'https://ssfy.sh/dev/hello-world?name=test2',
  timezone: 'America/New_York',
  state: 'enabled',
  schedule: '* * * * *',
  httpMethod: 'GET',
  name: 'test2',
  httpHeaders: {},
  userId: '5e766a8359bf49e6a55ce9ac',
  tags: []
}
main: 11.317s

The key line that seems to be the 10 second bottleneck seems to be related to DNS:

2020-03-26T20:37:00.458Z | connectivity_state | firestore.googleapis.com:443 CONNECTING -> CONNECTING
2020-03-26T20:37:10.551Z | dns_resolver | Resolved addresses for target firestore.googleapis.com:443: [172.217.12.202:443,2607:f8b0:4006:803::200a:443]

I don't have any non-standard DNS-related settings on my Mac's network and this same issue repros on AWS Lambda afaict.

@transitive-bullshit Can you try running the same code but passing the native C++ gRPC as a parameter? npm install grpc, and then

const grpc = require('grpc');

...
const db = new Firestore({ grpc });

This will give us one more data point.

@alexander-fenster the problem goes away if I use the native grpc client 😄💯😄

node benchmark-multiple-gets.js

using gRPC via @google-cloud/firestore
DaubZexsvlUE7beXCOEd: 820.689ms
M5G2MBgoPUv4Ur1K7Zu0: 105.515ms
OK7ZMTiLCNCbbcf4euWe: 159.942ms
ncQYovHShkRWfIyVKZSy: 125.42ms
main: 1.215s

Here is the verbose output with grpc logging enabled via GRPC_TRACE=all GRPC_VERBOSITY=DEBUG node benchmark-multiple-gets.js.

I am the primary developer of grpc-js. I have a guess about what the problem is: the grpc-js library currently makes two DNS requests: a request for A/AAAA records, to get addresses, and a request for TXT records, to get service config information. Currently we wait for both before reporting the result, but I think that is not actually necessary. I will publish a version soon that changes that and we can see if it helps.

Thanks for the quick update @murgatroid99 😄

I have published grpc-js version 0.7.4 with my possible fix. Can you try it out?

I'm still seeing the same issue after updating to v0.7.4.

Here are the verbose logs along with some output from yarn showing the version of @grpc/grpc-js being used.

Update: so apparently just updating @grpc/grpc-js in my package isn't enough to force google-gax to use it.. Passing the updated @grpc-grpc-js directly to Firestore resolves the issue. (updated verbose logs) 💯

const { Firestore } = require('@google-cloud/firestore')
const grpc = require('@grpc/grpc-js')

console.log('using gRPC via @google-cloud/firestore')
const db = new Firestore({ grpc })

Thank you so much @murgatroid99!

I'm assuming that once this update gets propagated to google-gax and the other Google SDKs, things will just work without having override the grpc constructor option to Firestore and other SDKs?

Yes, we will roll this out across all SDKs, at which point google-gax will directly pin to the new version. Thanks for your help!

In the meantime, is there any reason to use the JS vs native grpc client?

Once this patch rolls out, I'd recommend closing this issue to keep things focused btw.

The C++ library can have difficulty building in different environments, so
platform compatibility and ease of use are the main reasons for the JS
version.

On Sat, Mar 28, 2020, 8:07 PM Travis Fischer notifications@github.com
wrote:

In the meantime, is there any reason to use the JS vs native grpc client?

Once this patch rolls out, I'd recommend closing this issue to keep things
focused btw.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/googleapis/google-cloud-node/issues/2942#issuecomment-605551364,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAAH7S4PHJIT6LXE3JXUJLRJ23ODANCNFSM4IHKGAPQ
.

As for google-gax, we'll make a release of 1.15.x branch on Monday (2.x already uses @grpc/grpc-js 0.7.x but it will take some time for Firestore to switch to google-gax 2.x).

If you update your dependencies, you should now be using @grpc/[email protected].

Will this have any improvement on cold starts of firebase functions that get values from firestore (without using any special libraries) once released?

Currently we are not directly using @google-cloud/firestore module but its being referred from firebase-admin.

The loading times of this module are really unacceptable. Please check the screenshot on the loading times from a trace. It's nearly 1.4 secs just for loading the module. Can someone suggest what can be done?

Screenshot 2020-04-14 at 7 29 11 PM

@IljaDaderko yes, it should.

@manwithsteelnerves yeah, I agree this seems pretty unoptimized. I would suggest the Google team look at lazy requires though without some serious work we'll still have to pay the load costs either at initial load or at initial request, and I'd probably prefer to pay them at initial load given that my users are likely waiting on initial requests.

@transitive-bullshit Ty for answering. Is there a plan / eta on when it will hit firebase?

@transitive-bullshit Thanks for the reply.
As cloud function instance is not subjected to stay idle for any specified time after it serves a request, It's a bit worrying. And during my tests it always has this cold boot unfortunately :|.

Not sure how others are handling this as this seems to happen for every function call.

Finally, an issue where the problem isn't being denied. I was starting to feel like a conspiracy theorist. Anyway, I still want to vent a bit. Before you read this, just remember I do love google and firebase. I'm just frustrated.

That being said... Why would a boot take so long for something that is meant to be called and destroyed _frequently_? That's what firebase functions are for! That logic doesn't make sense to me.

Calling functions from web means doing CORS as well. This means that the preflight as well as the request combined take 8-9s to complete. That's unacceptable for any and all features you could offer.

... Unless, maybe you're building a lag simulator or something.

Google/Firebase keep saying "but... cold boot", like that's supposed to mean something. Yes, it's a cold boot. I get it. Functions need to be torn down to keep it affordable and flexible. But if you know that's what this is supposed to be doing, why implement other mechanisms that make things even slower?

Real world example of why this sucks: with stripe and apple pay, when a customer changes their shipping address I need to call a function to check the delivery cost. This takes 10 seconds. From the user's perspective this is unacceptable. All they did was change the shipping address, and now they're staring at a spinner. You have to remember, not everyone gets millions of requests a second to keep their functions warm and running, so that argument shouldn't be an argument imo.

Now, I've been at this for several months now and I'm getting tired of people telling me to:

  • Combine my functions
  • Lazy-load dependencies
  • Keep total bundle size low
  • Check my regions

I've done these things. I've read all the issues. I've contacted support. It's just not working properly right now and that's a problem. It really does make me happy to see that it's not being denied everywhere.

That feels good... Thanks for letting me rant.

Now, how can I help! I want this working as my teeny tiny startup depends on it. Can I test something? Fix something? As much as I enjoy seeing this issue, it has been open for over a year as well.

@RWOverdijk definitely on the same page but my 10s+ initial connections are now resolved after the most recent fixes in this thread. Have you tried overriding the grpc client to use the native one (see my previous replies in this thread)?

@transitive-bullshit Thanks for a fast reply I appreciate it!

Do you mean this?

const { Firestore } = require('@google-cloud/firestore')
const grpc = require('@grpc/grpc-js')

console.log('using gRPC via @google-cloud/firestore')
const db = new Firestore({ grpc })

And if so, would that work with updates and such as well?

@RWOverdijk yes that should work as a short-term solution. Previous google peeps in this thread created a fix for the normal grpc-js which they're supposed to update this thread once it has propagated to all the node.js sdks.

Note that I'm not affiliated with google in any ways. Just trying to help out.

@transitive-bullshit And helping out you are! It already helps that I don't feel like an angry old man shouting at a cloud.

A quick try seems to reveal that stuff like FieldValues aren't in there. I'll give this a more serious try tomorrow. Thanks :)

@transitive-bullshit I got around to giving this a go. In general it appears faster but not always.

Also for some odd reason it's not doing preflight requests anymore which makes a huge difference, but also makes an equal amount of huge no sense to me.

Anyway... It's a workaround that makes things a bit better but the difference isn't always that visible.

@RWOverdijk my advice is that you have to make this actionable for the google peeps.

See my previous minimal repro example https://github.com/saasify-sh/google-cloud-grpc-issue

Fork or create your own version of this with reproducible timings. Then we may be able to help figure out what you're seeing.

@transitive-bullshit you seem to be very familiar with the topic, so I wanted to ask if you could explain if this fix can be somehow tested in firebase cloud functions, if yes, any chance you could provide an example?

Update: so apparently just updating @grpc/grpc-js in my package isn't enough to force google-gax to use it.. Passing the updated @grpc-grpc-js directly to Firestore resolves the issue. (updated verbose logs) 💯

const { Firestore } = require('@google-cloud/firestore')
const grpc = require('@grpc/grpc-js')

console.log('using gRPC via @google-cloud/firestore')
const db = new Firestore({ grpc })

Thank you so much @murgatroid99!

I'm assuming that once this update gets propagated to google-gax and the other Google SDKs, things will just work without having override the grpc constructor option to Firestore and other SDKs?

I tried the mentioned fix. Initially I thought it was working but doesn't seem to be. I see nearly 4secs being taken for loading firestore module alone which doesn't make it fit for a production app :|

Screenshot 2020-05-15 at 10 10 05 PM

@manwithsteelnerves how are you generating this trace? I am seeing < 300ms for module load ....

image

I’m using ‘require-so-slow’ module to capture the trace. The above trace i got it from firebase project without any emulation.

Do you have any other profiler which can be used with functions?

Hey wait, I'm the maintainer of that module 😆 That's what I used too! But on my local macbook. Did you run it inside of google cloud functions, or on your local machine?

Hey wait, I'm the maintainer of that module 😆 That's what I used too! But on my local macbook. Did you run it inside of google cloud functions, or on your local machine?

Oh nice :) The above trace one is from firebase functions. This is invoked from 256MB memory cloud function. On local it looks fine < 1sec but on real environment I see such unacceptable performance.

Also can someone suggest what could be the reason for even the function execution start is even delayed a lot? (I see around 5secs before even the exact project module loading starts). I'm using require-in-the-middle module to log the loading of dependencies.

Screenshot 2020-05-16 at 11 22 02 PM

Edit : Anyone can help on this? I see it quite common this kind of behaviour. If it logs its taking time for loading a module, I can be able to fix it. But its just idle for so long and starting which is definitely specific to the core of the product.

Today I saw something which is totally disappointing!
I tested with a very simple subscriber function and saw a delay of nearly 7secs just to start the function.

"Function execution started" message which is added automatically in the logs by the cloud functions took 7 secs.

Screenshot 2020-05-19 at 12 44 53 PM

You can see in the above message that the real function execution started after 6.9secs and then the actual project code starts(require / import code). Here, the function I have is very basic one too which leads to make me worried that I can't go production with these kind of delays.

import * as functions       from 'firebase-functions';

exports.cronTest =  functions.pubsub.topic('CronTest').onPublish(async (message, context) => {
  console.log('This is test function for testing cold start!');
});

Hey wait, I'm the maintainer of that module 😆 That's what I used too! But on my local macbook. Did you run it inside of google cloud functions, or on your local machine?

Did you get a chance to check on production?

@manwithsteelnerves I could imagine require-in-the-middle and require-so-slow stepping on each other's toes, since they're both overriding require hooks.

Just to eliminate a variable, what happens if you do something like this:

const requireSoSlow = require('require-so-slow');

import * as functions       from 'firebase-functions';

exports.cronTest =  functions.pubsub.topic('CronTest').onPublish(async (message, context) => {
  console.log('This is test function for testing cold start!');
});

What are your local require times for firebase-functions vs., cloud functions.


import * as functions       from 'firebase-functions';

exports.cronTest =  functions.pubsub.topic('CronTest').onPublish(async (message, context) => {
  //console.log('This is test function for testing cold start!');
});

I tried with the above code and it took 5.6secs to finish

  1. 5334.827399 ms just to start execution of function
  2. 277ms for exact function code execution.

May I know the average cold start values I can expect If i'm just importing firestore module alone? This can help to decide if I can use firebase in production or not.

Screenshot 2020-05-26 at 11 13 08 PM

I'm also experiencing this every single day.
I did a short test from the API consumer perspective yesterday - The cold start timings in there are really just unacceptable.

  • Wed, 27 May 2020 20:27:30 GMT -> 6920 ms
  • Wed, 27 May 2020 20:27:54 GMT -> 348 ms
  • Wed, 27 May 2020 20:29:14 GMT -> 498 ms
  • Wed, 27 May 2020 20:32:15 GMT -> 519 ms
  • Wed, 27 May 2020 20:36:17 GMT -> 473 ms
  • Wed, 27 May 2020 20:41:21 GMT -> 707 ms
  • Wed, 27 May 2020 20:51:26 GMT -> 555 ms
  • Wed, 27 May 2020 21:10:28 GMT -> 5540 ms
  • Wed, 27 May 2020 21:11:28 GMT -> 604 ms
  • Wed, 27 May 2020 21:13:39 GMT -> 424 ms
  • Wed, 27 May 2020 21:29:35 GMT -> 482 ms
  • Wed, 27 May 2020 21:36:30 GMT -> 319 ms
  • Thu, 28 May 2020 09:33:48 GMT -> 5016 ms

Especially the last timing blows my mind, that's my auth middleware responding with 403 - no Firestore calls whatsoever, just verifying whether the id token is valid and it takes 5 seconds! 🤯

I don't know if this is useful to anyone, but I'm moving everything to app engine, this gives more control and might help with performance. It doesn't solve the actual issue, of course, but it might be a workaround to get the performance needed.

If this is a dumb/unhelpful comment let me know, I can delete it.

@RWOverdijk Let me know if that improves anything - I'm also considering just switching to a "dedicated" Node instance that gives me full control

Here's the log of the function call that took 5016ms from an API consumer perspective.
The log says the execution only took 1073ms so as @manwithsteelnerves pointed out there must be a 4 second delay even before the "Function execution started" message is printed.

image

1073ms is something I would consider to be ok for a cold start and even that is still far from the warm ~500ms

I have created the following script for people to test module load times during cold starts:

index.js:

const requireSoSlow = require('require-so-slow');
const shim = require('require-so-slow/build/src/shim')

require('@google-cloud/firestore')

let events;
exports.main = async (req, res) => {
  if (!events) {
    events = shim.getAndClearEvents()
  }
  res.set('Access-Control-Allow-Origin', '*');
  res.status(200).send(events);
}

In my own testing, I found that they're fairly dependent on the amount of memory allocated to the cloud function (_which stands to reason_):

256MB saw around 1.5 seconds fairly consistently:

256m

512MB ~750ms:

512m

1GB ~400ms

1gb


I'm pretty sure that the variable times in cold starts are no longer attributable to our client libraries (_the original culprit was the grpc module, which we moved off of earlier this year).

If folks are seeing significant variation in cold start times, I think there's a good chance it's related to bootstrapping in GCF prior to the time that user code executes.

If this is the case, I think we'd do better to open a public issue on the Cloud Function product (_which I can help make sure gets to the appropriate folks_).

Hello everyone and thank you @bcoe, i’m wondering the timing for a 2GB function and also ask you of you would you recommand as a workaround to set every fonction as 1GB !
Thanks a lots for all you work !

  1. 5334.827399 ms just to start execution of function

@bcoe Thanks for the insights. Can you please explain what could be the reason for such a delayed function start? Should I consider 5.3 secs as total time it took for just bootstrapping GCF?

So, I've successfully moved my stuff to app engine to test and the initial cold boot is still around 4 seconds, but now this only happens once because I've set my min instances to 1, it's a lot faster now. Also because I've limited my max instances and tweaked my target_cpu_utilization and max_concurrent_requests it's a lot faster. It no longer (wrongly) thinks it needs to scale up.

So now my schedulers and triggers are deployed to firebase functions and my api has been deployed to app engine.

Again, not a solution, but for me at least a workaround I can live with.

Can you please explain what could be the reason for such a delayed function start? Should I consider 5.3 secs as total time it took for just bootstrapping GCF?

I think I will need to open a ticket with the GCF folks (_early next week_) and start a conversation around this. I'm not sure what's happening in that initial 5 seconds, and I think it's good to put on their radar.

As promised, I've opened an issue on the Cloud Functions issue tracker:

https://issuetracker.google.com/issues/158014637

I'm closing and locking this issue in favor of this new public thread. If you feel I've missed any pertinent details, please update it accordingly.

I will make sure this new issue is brought to the GCF team's attention.

Was this page helpful?
0 / 5 - 0 ratings