Firebase-tools: Request: Automatically retry failed function deploys that fail due to "You have exceeded your deployment quota"

Created on 9 Sep 2020 · 14 Comments · Source: firebase/firebase-tools

Problem & Background

We have ~70 functions and deploy them all at once as part of CI. Whenever we deploy, a random set of functions fails to deploy due to "You have exceeded your deployment quota". This used to block us completely a few weeks ago, when there was still a quota on build time. Since builds have moved to Cloud Build, as far as I can tell there are no more limits and you simply pay for your build time. Which is awesome, no more blocking the build due to quotas!

Unfortunately there still seem to be some other quotas/rate limits that cause the deploy to fail. From what I can tell, these might simply be too many write requests against the Cloud Functions API (or perhaps Cloud Build?).

The error message suggests deploying with --only. Unfortunately it is not easy for us to split these up into separately deployed functions. It is also impossible to do when a dependency is updated or when we change a data model or utility library that is used by different functions. Analyzing which functions changed and have to be redeployed is not possible (for us) automatically and has to be done manually. This then brings new pains, as automated deploys after code review become impossible.

Right now this means our pipelines all fail regularly, and we manually retry them until every function has deployed successfully once.

Suggestion

Since retrying the failed functions right after always works, my suggestion would be to retry failed deploys that hit code 8 here:

https://github.com/firebase/firebase-tools/blob/3994bf7209a71ed5dde100c69ca96dc2d9c3634e/src/deploy/functions/release.js#L97-L105

I suppose this could be implemented here:

https://github.com/firebase/firebase-tools/blob/3994bf7209a71ed5dde100c69ca96dc2d9c3634e/src/deploy/functions/release.js#L526

If this would be a way forward I'm happy to create a PR for this.
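
A minimal sketch of what such a retry could look like, not the actual firebase-tools implementation. It assumes the failed operation surfaces a numeric `code` property, where 8 is the gRPC status `RESOURCE_EXHAUSTED`. The names `isQuotaError` and `retryOnQuota`, the attempt count, and the backoff delays are all hypothetical and chosen for illustration.

```typescript
// gRPC status code for RESOURCE_EXHAUSTED (google.rpc.Code value 8).
const RESOURCE_EXHAUSTED = 8;

// Assumption: the error object carries a numeric `code` property.
function isQuotaError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === RESOURCE_EXHAUSTED
  );
}

// Retries `attempt` with exponential backoff, but only for quota errors.
// Any other error, or a quota error on the last attempt, is rethrown.
async function retryOnQuota<T>(
  attempt: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let i = 1; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (!isQuotaError(err) || i >= maxAttempts) {
        throw err;
      }
      // Wait 1x, 2x, 4x, ... the base delay before the next attempt.
      await sleep(baseDelayMs * Math.pow(2, i - 1));
    }
  }
}
```

The `sleep` parameter is injectable so the delay can be replaced in tests; a caller would wrap each per-function deploy request in `retryOnQuota`.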

functions

Kamshak

👍6 🚀1

Most helpful comment

I have 122 functions that I deploy, and after updating my firebase-tools to v8.10.0, firebase-tools no longer tells me which functions failed to deploy because of the rate limiting. Having something that retries the failed functions would be fantastic.

KyleFoleyImajion on 11 Sep 2020

👍2

All 14 comments

@joehan is there something we could do in deploy to rate limit or retry in these cases?

samtstern on 10 Sep 2020

Agree, nice idea

LeoLfL on 15 Sep 2020

Definitely it would be awesome to have it. We have the same problem (we have 74 functions) and have to redeploy them manually. It's a big pain and stops us from doing automatic deployments for firebase functions.

dzmitrynik on 15 Sep 2020

I have the same problem with approx. 90 functions. Not being able to rely on CI deploy is very limiting.

mvergarair on 22 Sep 2020

@joehan friendly ping, did you get a chance to take a look yet? I'm happy to take a stab at this in a MR if it has a chance of being accepted.

Kamshak on 28 Sep 2020

Hey @Kamshak, thanks for the reminder, this one slipped past me! With the recent switch to using Cloud Build, I don't see any reason we shouldn't add retries here. I would be more than happy to review a PR for this.

joehan on 28 Sep 2020

❤1

Not sure if this is a bug or not, but did anyone else notice that sometimes the CLI tells you which functions failed to deploy due to rate limiting, but then other times the CLI doesn't tell you (even though some failed due to rate limiting)?

KyleFoleyImajion on 30 Sep 2020

@KyleFoleyImajion how many functions do you have in your project currently?

I've started looking into this and there are 2 cases where you can run into rate limits:
1) Write API to update/create functions (done at the beginning of each deploy)
   This one was easy to solve without changing much of the code.
2) Even if 1) succeeds, you can get rate limited by the Cloud Build API
   A bit more tricky. You need to do a new wait request and poll the new long running task for status.

The solution is therefore a bit more complicated than I thought originally. The code right now is also written in a way that makes it a bit difficult to add retries. I've run into two problems:
1) Logging / Messages: The logging is done in the API itself. There is for example no way to update a function right now without having the error logged, even if you retry after that error and the error therefore doesn't have to be shown.
2) Mutable State: A lot of the API methods mutate the objects you pass in, which means for example an "error" property might stay after a retry, since the retried operation object only merges in new fields but doesn't delete old ones.
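
The mutable-state problem can be illustrated with a small standalone sketch. The `Operation` shape and both helper functions are hypothetical and do not come from the firebase-tools source; they only mimic the "merge new fields, never delete old ones" pattern described above.

```typescript
// Hypothetical operation shape, for illustration only.
interface Operation {
  name: string;
  done?: boolean;
  error?: { code: number; message: string };
}

// Mimics the problematic pattern: new fields are written into the existing
// object, but fields missing from the update are never cleared.
function mergeInPlace(op: Operation, update: Partial<Operation>): Operation {
  if (update.done !== undefined) op.done = update.done;
  if (update.error !== undefined) op.error = update.error;
  return op;
}

// One possible alternative: build a fresh object per attempt, so nothing
// from a failed attempt survives into the retried one.
function freshAttempt(op: Operation, update: Partial<Operation>): Operation {
  return { name: op.name, ...update };
}

const failed: Operation = {
  name: "helloWorld",
  error: { code: 8, message: "quota exceeded" },
};

// The retry succeeds and reports only `done: true`.
const retriedInPlace = mergeInPlace({ ...failed }, { done: true });
const retriedFresh = freshAttempt(failed, { done: true });

console.log(retriedInPlace.error !== undefined); // true: the stale error is still attached
console.log(retriedFresh.error !== undefined); // false: the fresh object carries no old error
```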

@joehan I'm not sure how far you are OK with larger changes to the codebase here. Currently I'm thinking the best solution would be to refactor it a little bit. My idea right now would be:

Support deploys for at least 200 functions. Print a warning when deploying more than 60 functions to inform users that using --only would be better.
Have no hard limit on how many functions can be deployed. Simply take longer.
Log when a build for a function has been started in cloud build and when it succeeds, so that users can see deploy progress. (similar to now - except that with concurrency limit you would get the log only when the update was started).
Log a deploy summary at the end (same as it is now)

1) Treat a single function deploy as one operation that wraps everything needed to perform a deploy. Specifically:
   1) Do the create/update request against the Cloud Functions API to get an operation
   2) Poll the operation for error / success / rate limit
2) Move logging out of the API client, and move retry logic for errors that happen on a request level into the API client.
3) Have a coordinator class/function that uses the Queue/Throttler (already used in other parts of the API) to control concurrency and retries for deploys and long running operations. Concurrency would be set to 60 (build requests per minute for Cloud Build API).
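
The coordinator idea in 3) could be sketched roughly as follows. This is a standalone illustration and deliberately does not use the existing Queue/Throttler code in firebase-tools, whose API is not shown in this thread. `deployAll`, `DeployTask`, and `DeployOutcome` are hypothetical names, and each task is assumed to wrap one full function deploy (create/update request plus polling). Note that a concurrency cap is not the same as a per-minute rate limit, and this sketch retries immediately without backoff.

```typescript
// Hypothetical: one task performs a complete deploy of a single function.
type DeployTask<T> = () => Promise<T>;

interface DeployOutcome<T> {
  index: number;
  attempts: number;
  value?: T;
  error?: unknown;
}

// Runs tasks with at most `concurrency` in flight. A failed task is put back
// on the queue while `shouldRetry` allows it and attempts remain.
async function deployAll<T>(
  tasks: DeployTask<T>[],
  concurrency: number,
  maxAttempts: number,
  shouldRetry: (error: unknown) => boolean,
): Promise<DeployOutcome<T>[]> {
  const outcomes: DeployOutcome<T>[] = new Array(tasks.length);
  const queue = tasks.map((_, index) => ({ index, attempts: 0 }));

  async function worker(): Promise<void> {
    while (true) {
      const job = queue.shift();
      if (job === undefined) return; // nothing left for this worker
      job.attempts++;
      try {
        const value = await tasks[job.index]();
        outcomes[job.index] = { index: job.index, attempts: job.attempts, value };
      } catch (error) {
        if (job.attempts < maxAttempts && shouldRetry(error)) {
          queue.push(job); // retry later, after the jobs already queued
        } else {
          outcomes[job.index] = { index: job.index, attempts: job.attempts, error };
        }
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, tasks.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return outcomes;
}
```

Because every outcome records its attempt count and any final error, a deploy summary like the one mentioned above could be printed from the returned array instead of being logged inside the API client.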

Kamshak on 2 Oct 2020

Currently at 119 functions, and I deploy using firebase deploy --only functions. We don't have it automated at this point, but will be moving to that in the future.

Thanks for taking a look at this issue!

KyleFoleyImajion on 2 Oct 2020

@Kamshak I'm open to larger changes in this part of the codebase - it's one of our older and less healthy flows at this point. Fair warning, I am planning on making some large refactors to this path later this year that might make these suggested changes obsolete - totally understand if that changes your appetite for refactoring this.

To your specific suggestions:

> Logging / Messages: The logging is done in the API itself. There is for example no way to update a function right now without having the error logged, even if you retry after that error and the error therefore doesn't have to be shown.
> Mutable State: A lot of the API methods mutate the objects you pass in, which means for example an "error" property might stay after a retry, since the retried operation object only merges in new fields but doesn't delete old ones.

Agreed that these, along with the heavy reliance on long promise chains, are the main problems with this code.

> Log when a build for a function has been started in cloud build and when it succeeds, so that users can see deploy progress. (similar to now - except that with concurrency limit you would get the log only when the update was started).

I'm not sure if you'll be able to see when the Cloud Build itself starts from the CRUDFunction and GetOperation calls. In the interest of being as clear as possible, I think the logging here should be more along the lines of "Function deploy started" as opposed to "cloud build started". We also still support Node8 deploys, which don't use Cloud Build in the same way.

The rest of the design sounds good to me at a high level, particularly reusing the Queue/Throttler code

joehan on 12 Oct 2020

We have the same problem with only ~30 functions. In our case we deploy several times per day, or even several times per hour if we are really moving. We also leverage Cloud Build for other things (Cloud Run, Google App Engine), which perhaps makes the problem worse.

spencerwhyte on 31 Oct 2020

We have been facing the same issue lately. With ~60 functions that we deploy automatically from a CI/CD pipeline, the deployment fails randomly for 1-3 functions. However, checking the limits in GCP clearly shows that the rate limit was never hit for any of the deployments, so we're not sure what's going on here.

We have 2 identical projects, one for staging and one for production, and so far this issue seems to be happening only on the staging project, while everything seems to work fine on the production setup. It's a bit cryptic to debug and resolve this issue without knowing where the underlying problem could be. We have had issues with Cloud Functions in the past, and most of the time it turned out to be an issue on the GCP side itself, which was confirmed by the Firebase Support Team. Not sure whether we should already reach out to the Firebase Support Team or what 🤷

emadalam on 10 Nov 2020

We're encountering the same in a project with about 60 functions. Our workaround for now is to split out a few functions and deploy those separately.

-firebase deploy -f
+separate_functions="functions:one,functions:two,functions:three...." # About half of our cloud functions
+firebase deploy -f --only "$separate_functions"
+firebase deploy -f --except "$separate_functions"
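
The same workaround could be generalized to more than two batches by generating the `--only` filter strings from a list of function names. The helper below is a hypothetical sketch, not part of firebase-tools; it only builds strings in the `functions:name1,functions:name2` form used in the diff above and leaves running `firebase deploy` to the CI script.

```typescript
// Splits function names into batches and formats each batch as a value for
// `firebase deploy --only "<filter>"`. `onlyFilters` is a hypothetical helper.
function onlyFilters(functionNames: string[], batchSize: number): string[] {
  if (batchSize < 1) {
    throw new Error("batchSize must be at least 1");
  }
  const filters: string[] = [];
  for (let i = 0; i < functionNames.length; i += batchSize) {
    const batch = functionNames.slice(i, i + batchSize);
    filters.push(batch.map((name) => `functions:${name}`).join(","));
  }
  return filters;
}

// Example: three deploy invocations, one per printed filter string.
for (const filter of onlyFilters(["one", "two", "three", "four", "five"], 2)) {
  console.log(filter);
}
```

The batch size is a tuning knob; the thread does not establish a safe value, so it would need to be found by experiment for a given project.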