Sp-dev-docs: Large amounts of 503 errors in multiple tenants

Created on 27 Feb 2020 · 24 Comments · Source: SharePoint/sp-dev-docs

Question

Our application uses libcurl to call SharePoint REST APIs (e.g. https://site-url/_api/Web) across thousands of tenants. Since this February, it has been receiving large numbers of 503 errors in multiple tenants, severely hindering and almost breaking the application.

1. For some tenants the 503 errors occur only between 08:00 and 14:00 UTC on workdays.
2. The 503 errors also occur for tenants that have not been used for months, which suggests the issue is most likely not a tenant issue.
3. We suspect that something is wrong with our application ID. We ran identical versions of our app at the same time against the tenant mentioned in point 2, the only difference being the application IDs they used. One used our production application ID and the other an application ID that is seldom used. The instance with the production application ID received 503 errors 4/5 of the time, while the other received no errors at all.

nozhT

👍1

Most helpful comment

This has been really bad today (and most of last week).
36,000 503 errors on 150 different customer tenants. I really think Microsoft should publish a public announcement.
We get no updates from Premier Support.

To all ISVs out there having the same issue. Please contact me.
We need to put some pressure on Microsoft together!!!!

SchauDK on 23 Mar 2020

👍6

All 24 comments

Thank you for reporting this issue. We will be triaging your incoming issue as soon as possible.

msft-github-bot on 27 Feb 2020

Curious, how many calls are you making into SharePoint? Do you ever receive 429 errors in your solutions?

You will see 503s when the server is "too busy" as a result of being overloaded... which may coincide with the normal work hours you specified.

bcameron1231 on 27 Feb 2020

We see the same 503s too with our app making REST calls. Honestly, I don't think we make that many, but we had to write our own 503 handler to capture them, back off and try again, otherwise our application was useless for our customers. You can look at the server health in the response headers - 0 is excellent, 10 is overloaded, and we frequently see 9 on multiple tenants, even when it's the first REST call we've made that day.

There is a recommendation to set an ISV user agent header with your request so that MS can recognise it's an app making the calls, but that's not possible in modern browsers where the UA header is locked down.

Our experience is - expect many, many 503 errors, and code defensively against them. :(
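A "503 handler" of the kind described above could look roughly like the following. This is a minimal sketch, not the commenter's actual code: the `(status, headers)` response shape, the function names, and the default delays are all assumptions made for illustration, and it treats both 429 and 503 as retryable. If the response carries a numeric `Retry-After` header the sketch waits that long; otherwise it falls back to capped exponential backoff with jitter.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape for this sketch: (status_code, headers).
Response = Tuple[int, Mapping[str, str]]

RETRYABLE_STATUSES = {429, 503}


def retry_delay(headers: Mapping[str, str], attempt: int,
                base_delay: float, max_delay: float) -> float:
    """Pick the wait before retry number `attempt` (1-based)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = lowered.get("retry-after", "").strip()
    if retry_after.isdigit():
        # The server asked for a specific wait in seconds; honour it.
        return float(retry_after)
    # Otherwise use capped exponential backoff with full jitter.
    ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)


def call_with_backoff(send: Callable[[], Response],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until the status is not retryable or attempts run out."""
    response = send()
    for attempt in range(1, max_attempts):
        status, headers = response
        if status not in RETRYABLE_STATUSES:
            return response
        sleep(retry_delay(headers, attempt, base_delay, max_delay))
        response = send()
    # Either a success or the last failure after max_attempts sends.
    return response
```

Passing `sleep` as a parameter keeps the function testable without real waits; in production the default `time.sleep` applies.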

ng-marcus on 1 Mar 2020

I also found that uploading files can trigger 429 errors in the REST API. We had a file importer that was, to be fair, uploading multiple files (5 threads, but not at a huge bitrate). With that running I saw 429 errors on the REST API endpoint, so all activity in the SharePoint world seems to count across all endpoints.
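If uploads and REST calls really do count against one shared budget, one option is to route every outbound call through a single concurrency limiter rather than limiting each feature separately. The sketch below is an assumption-laden illustration: `SharedLimiter` is a hypothetical helper, and the right limit for any given tenant is not stated anywhere in this thread.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class SharedLimiter:
    """Caps how many calls run at once, across every caller that shares it."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, fn: Callable[..., T], *args) -> T:
        # Blocks until a slot is free, then releases it when fn returns or raises.
        with self._slots:
            return fn(*args)
```

Both the uploader threads and the REST client would call `limiter.run(...)` on the same instance, so five upload threads cannot crowd out other requests beyond the chosen limit.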

ng-marcus on 1 Mar 2020

Curious, how many calls are you making into SharePoint? Do you ever receive 429 errors in your solutions?

Our app runs on our customers' machines, which we do not have access to, so we do not know how many calls are made across all tenants. We inspected one of the tenants having this problem and our app makes roughly 3,000 to 7,000 API calls per day there. However, our API usage varies a lot from tenant to tenant and I wouldn't be surprised if there are tenants making hundreds of thousands of API calls per day.

We do receive 429 errors, but 90% of the time the error is a 503.
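A breakdown like "90% of errors are 503" is straightforward to compute from logged status codes, and a per-tenant version of it may be useful evidence in a support ticket. A minimal sketch, assuming the status codes have already been collected into a list (the function name is hypothetical):

```python
from collections import Counter
from typing import Dict, Iterable


def error_breakdown(statuses: Iterable[int]) -> Dict[int, float]:
    """Return each error status (>= 400) as a fraction of all errors seen."""
    errors = [status for status in statuses if status >= 400]
    if not errors:
        return {}
    counts = Counter(errors)
    return {status: count / len(errors) for status, count in counts.items()}
```

For example, a log containing nine 503s, one 429, and any number of 200s yields `{503: 0.9, 429: 0.1}`.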

You will see 503s when the server is "too busy" as a result of being overloaded... which may coincide with the normal work hours you specified.

That is what we initially suspected, but

- during work hours the SharePoint Online web pages are still responsive
- if we change our application to use another application ID, the 503 errors disappear
- successful responses carry x-sharepointhealthscore: 1 (responses with 503 errors don't seem to have that header), so the server should be capable of handling our requests

so we think something else is going on.
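Reading the health score mentioned above can be done with a small helper. This sketch assumes headers are available as a plain name-to-value mapping; the header name `x-sharepointhealthscore` comes from the comment itself, and the helper returns `None` when the header is absent, as reported for the 503 responses.

```python
from typing import Mapping, Optional


def health_score(headers: Mapping[str, str]) -> Optional[int]:
    """Return the x-sharepointhealthscore value, or None if missing or malformed."""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "x-sharepointhealthscore":
            try:
                return int(value)
            except ValueError:
                return None
    return None
```

Logging this value next to each status code makes it easy to show, as the poster does, that failures occur while the reported score is low.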

We see the same 503s too with our app making REST calls. Honestly, I don't think we make that many, but we had to write our own 503 handler to capture them, back off and try again, otherwise our application was useless for our customers. You can look at the server health in the response headers - 0 is excellent, 10 is overloaded, and we frequently see 9 on multiple tenants, even when it's the first REST call we've made that day.

We handle 503 errors by exponentially backing off but it doesn't work - the subsequent retries still fail and our app gives up after retrying for roughly an hour.
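To make "retrying for roughly an hour" concrete, the sketch below lists the waits an exponential backoff would use within a fixed time budget. The base delay, doubling factor, and 300-second cap are illustrative assumptions, not the poster's actual settings, and jitter is left out so the arithmetic is easy to follow.

```python
from typing import List


def backoff_schedule(base: float = 1.0, factor: float = 2.0,
                     cap: float = 300.0, budget: float = 3600.0) -> List[float]:
    """List successive waits (seconds) that fit within the total budget."""
    delays: List[float] = []
    total = 0.0
    delay = base
    while total + min(delay, cap) <= budget:
        step = min(delay, cap)  # never wait longer than the cap
        delays.append(step)
        total += step
        delay *= factor
    return delays
```

With these defaults the schedule holds 19 retries totalling 3511 seconds: nine doubling waits (1 to 256 seconds) followed by ten capped waits of 300 seconds. Without a cap (`cap=float("inf")`), only 11 retries fit in the hour.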

It would make sense if the health score were 9 or 10, but to our surprise it turned out to be 0 or 1.

There is a recommendation to set an ISV user agent header with your request so that MS can recognise it's an app making the calls, but that's not possible in modern browsers where the UA header is locked down.

We can set it but we don't want to do that before understanding the consequences. If we set the User-Agent header, will Microsoft increase our app's API limits for us or will they contact us and tell us to stop sending so many requests?
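For reference, Microsoft's published throttling guidance describes a decorated User-Agent of the form `ISV|CompanyName|AppName/Version` (or `NONISV|...` for non-ISV apps). The helper below is a sketch that builds such a string; the function name and validation are assumptions, the format should be confirmed against the current guidance, and it does not answer the question above about how Microsoft treats decorated traffic.

```python
def isv_user_agent(company: str, app: str, version: str, is_isv: bool = True) -> str:
    """Build a decorated User-Agent string such as ISV|Contoso|Importer/1.0."""
    for part in (company, app, version):
        # Reject empty parts and the separator characters used by the format.
        if not part or "|" in part or "/" in part:
            raise ValueError(f"invalid User-Agent component: {part!r}")
    prefix = "ISV" if is_isv else "NONISV"
    return f"{prefix}|{company}|{app}/{version}"
```

With libcurl, the resulting string would be passed through the `CURLOPT_USERAGENT` option.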

nozhT on 2 Mar 2020

Same issue here - and it seems that this is getting worse over time.
Every now and then various API calls return a status 503. Setting the user agent header as proposed does not make any difference. As all of the current SDKs (including CSOM, OfficeDevPnP) do not handle 5xx errors with any retry logic, this is getting more and more problematic…
It's comparable to the situation in October last year where we had seen a lot of 500s, now it's 503s. https://github.com/SharePoint/sp-dev-docs/issues/4924

cwdata on 20 Mar 2020

👍1

We also encounter lots of 503s, and sometimes 429s, in CSOM (using the PnP context), PnP cmdlets, PnP apply template, and Graph GET, POST, PUT, and PATCH queries to Groups/Teams.

It's starting to disrupt critical systems in our customers' environments. As stated above, the back-off retries do seem to help a bit, but there are still lots of cases where the errors are blocking our automation tooling.

@andrewconnell @VesaJuvonen we can't see any Service health messages in this area. Are there any known recent issues that are causing this behaviour? Is there anything you would suggest us to do besides creating Microsoft support requests on each tenant where this happens?

advdberg on 20 Mar 2020

503's & 429's are totally different things. 429's are your responsibility as you're being throttled & you need to back off on your calls.

As for the 503's, this looks like #4924... correct? If so, MSFT recommended submitting issues using the SP tenant admin center support options (ref https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-558107240)

andrewconnell on 20 Mar 2020

👍1

If you are in general getting these throttling issues, please do report them also using the tenant administrative support tooling to tenant support. If you get a response where your ticket is declined since it's about dev topics, please share a screenshot with me, so that we can deal with this internally more efficiently.

There is a huge uptake on the cloud usage and we are working on increasing capacity as best as we can, but we are definitely interested in getting these reports.

VesaJuvonen on 20 Mar 2020

👍3

Please see a long explanation & guidance I posted in #4924 related to 50X issues as it applies here as well: https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513

andrewconnell on 21 Mar 2020

👍3

It is definitely a Microsoft issue. It even 503's its own requests:

bryqu on 23 Mar 2020

@SchauDK seeing the same thing, lots of 503 errors, causing issues for our customers

mcgeeky on 23 Mar 2020

@SchauDK @bryqu @mcgeeky see https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513

IMHO, everyone gets a pass right now as no system was ever designed for THIS many people to migrate to working from home this fast with this little notice. As @VesaJuvonen said above:

There is a huge uptake on the cloud usage and we are working on increasing capacity as best as we can, but we are definitely interested in getting these reports.

This isn't the forum for 500's and 503's... as I explained here, https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513, these are platform issues which should be submitted as a support ticket via your tenant admin center.

andrewconnell on 23 Mar 2020

@andrewconnell I'm aware of https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513
We've already asked customers to submit tickets. I'm not asking for support in this forum, we're discussing the issue that many ISVs/customers have right now.

@VesaJuvonen I'm expressing my opinion that Microsoft should publish an official announcement that they don't have the capacity to keep the services running. That I can tell my customers and they'll hopefully understand. But right now it is our product that isn't working, so thousands of people can't do their job.

SchauDK on 23 Mar 2020

👍2

@SchauDK
Normally, if enough tenants raise a ticket MS will set the state of the service to something like "disrupted" which will be visible on the admin page.

Regarding the issue itself:
In the past we were more and more simply trusting that things "just work". So most of the code, including SDKs and frameworks, does not include any catch and retry logic for 50x errors. There are SLAs around cloud services but I think mostly they do not cover such intermittent issues.
So the basic question here is: Is it a design flaw to just trust an API to deliver a result every time we call it? Should we rather include retry logic that handles such errors?
As we could see, these things happen. Be it because of bugs or as a result of a rapidly changing environment.
What do you think?

cwdata on 23 Mar 2020

@cwdata We're already using exponential backoff for 50x errors. We're ok with that for background jobs until we reach a retry count of 10. Then we have a problem.
But what's even worse is user API calls with retry logic. Imagine a user interface where it takes minutes to render and eventually you need to throw a 503 error at the user.
If this is what we need to get used to, Microsoft might as well just abandon all APIs as it makes it impossible to build 3rd party solutions on top of Office365.
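The difference between background jobs and user-facing calls described above can be expressed as two retry policies. In this sketch only the background retry count of 10 comes from the comment; the interactive limits, the one-hour deadline, and all names are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int         # how many retries are allowed after the first attempt
    deadline_seconds: float  # total time the caller is willing to wait


# Background jobs can tolerate many retries over a long period.
BACKGROUND = RetryPolicy(max_retries=10, deadline_seconds=3600.0)
# A user is waiting, so fail fast instead of blocking the UI for minutes.
INTERACTIVE = RetryPolicy(max_retries=2, deadline_seconds=5.0)


def should_retry(policy: RetryPolicy, retries_so_far: int,
                 elapsed_seconds: float, next_delay: float) -> bool:
    """Retry only if both the retry count and the time budget allow it."""
    if retries_so_far >= policy.max_retries:
        return False
    return elapsed_seconds + next_delay <= policy.deadline_seconds
```

An interactive call that has already spent 4 seconds and would need to wait 2 more gives up immediately, while the same situation in a background job retries.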

I really hope that Microsoft will solve these capacity issues soon. And when it is solved an idea could be that large enterprise customers can buy extra capacity. I'm sure they're willing to pay for it as they've built their business on the O365 platform.

SchauDK on 23 Mar 2020

👍2

There will be an announcement coming very soon about the throttling which kicked in to prioritize end-user traffic over non-end-user traffic. And yes, even 503's are throttling responses.

robinmeure on 23 Mar 2020

👍3

@SchauDK We are experiencing the exact same situation. 503 responses started last Thursday and have now become an expectation during business hours. Our mitigation strategy has been bolstering our retry logic along with moving some processes to off hours. We use mostly the SharePoint Online REST API. The part that is particularly frustrating on this front is that if you were to read health advisories you would be left to think that everything is ok. We have customers (just like you) that we are responsible for providing updates and guidance. Please someone at Microsoft help us help you.
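Moving work to off hours needs a definition of "peak". The sketch below uses the 08:00 to 14:00 UTC workday window reported in the original question as a stand-in; that window was observed only for some tenants, so treat it as an assumption and adjust per tenant. The function name is hypothetical.

```python
from datetime import datetime, time, timezone

# Assumed peak window, taken from the hours reported in the question.
PEAK_START = time(8, 0)
PEAK_END = time(14, 0)


def in_peak_window(moment: datetime) -> bool:
    """True if a timezone-aware moment falls on a workday between 08:00 and 14:00 UTC."""
    utc = moment.astimezone(timezone.utc)
    if utc.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return PEAK_START <= utc.time() < PEAK_END
```

A scheduler could defer non-urgent jobs while `in_peak_window(datetime.now(timezone.utc))` is true. Pass timezone-aware datetimes; a naive value would be interpreted as local time by `astimezone`.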

dsm0880 on 23 Mar 2020

👍2

In the https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513 comment referenced many times in this thread it says

If you are getting 500's or 503's, these are considered product issues. These status codes mean there is a problem with the service. 500 is internal server error & 503 is service unavailable. There's nothing we in the community can do here to investigate nor are the MSFT folks in the extensibility areas (SPFx/APIs/etc) who frequent this list can do. Microsoft support has to engage.

Maybe it should be revised a little to explicitly state that it is also your own responsibility to handle server-side throttling.

@robinmeure looking forward to the announcement. I guess any API call (REST/JSOM/CSOM), regardless of the context, is non-end-user traffic.

SchauDK on 23 Mar 2020

@robinmeure Any timeline or where we can expect an announcement?

dsm0880 on 24 Mar 2020

It has been published as a message in the message center under MC207439.
Furthermore, there is another message SP207374 about provisioning.

robinmeure on 24 Mar 2020

@robinmeure Thanks, I got this. Makes sense, but it does not say anything on the REST API 503 issues as noted above. Any additional details there?

dsm0880 on 24 Mar 2020

Closing this issue as "answered". If you encounter a similar issue(s), please open up a new issue. See our wiki for more details: Issue-List: Our approach to closed issues

msft-github-bot on 27 Mar 2020
