Our application uses libcurl to call SharePoint REST APIs (e.g. https://site-url/_api/Web) to process thousands of tenants. Since this February, our application has been hitting large numbers of 503 errors across multiple tenants, which severely hinders and almost breaks our application.
Thank you for reporting this issue. We will be triaging your incoming issue as soon as possible.
Curious, how many calls are you making into SharePoint? Do you ever receive 429 errors in your solutions?
You will see 503s when the server is "too busy" as a result of being overloaded... which may coincide during the normal work hours you specified.
We see the same 503s too with our app making REST calls. Honestly, I don't think we make that many, but we had to write our own 503 handler to capture them, back off and try again, otherwise our application was useless for our customers. You can look at the server health in the response headers - 0 is excellent, 10 is overloaded, and we frequently see 9 on multiple tenants, even when it's the first REST call we've made that day.
There is a recommendation to set an ISV user agent header with your request so that MS can recognise it's an app making the calls, but that's not possible in modern browsers where the UA header is locked down.
Our experience is - expect many many 503 errors, and code defensively against them. :(
I also found that uploading files can trigger 429 errors in the REST API. We had a file importer that was, to be fair, uploading multiple files - 5 threads uploading files, but not at a huge bitrate. With that running I saw 429 errors in the REST API endpoint - so all activity in the SharePoint world seems to count across all endpoints.
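For readers wondering what a handler like the one described above might look like, here is a minimal Python sketch (not the commenter's actual code). It assumes the health score arrives in the `X-SharePointHealthScore` response header and that throttled responses may carry a standard `Retry-After` header in its seconds form. The function names `retry_delay` and `health_score` and the default timings are hypothetical.

```python
import random

# Status codes treated as "back off and try again" in this sketch.
RETRYABLE = {429, 503}


def _lowered(headers):
    """Return a copy of the headers with lower-cased names."""
    return {name.lower(): value for name, value in headers.items()}


def retry_delay(status, headers, attempt, base=1.0, cap=120.0, rng=random.random):
    """Return the seconds to wait before retrying, or None if not retryable.

    A numeric Retry-After header wins. Otherwise fall back to exponential
    backoff with full jitter, capped at `cap` seconds. The HTTP-date form
    of Retry-After is not handled here.
    """
    if status not in RETRYABLE:
        return None
    retry_after = _lowered(headers).get("retry-after", "").strip()
    if retry_after.isdigit():
        return float(retry_after)
    return rng() * min(cap, base * 2 ** attempt)


def health_score(headers):
    """Return the server health score (0 is best, 10 is worst) or None."""
    value = _lowered(headers).get("x-sharepointhealthscore", "").strip()
    return int(value) if value.isdigit() else None
```

Logging `health_score(...)` next to each failed request makes it easier to tell whether the 503s line up with a high score or, as reported later in this thread, show up while the score is 0 or 1.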
Curious, how many calls are you making into SharePoint? Do you ever receive 429 errors in your solutions?
Our app runs on our customers' machines, which we do not have access to, so we do not know how many calls are made across all tenants. We inspected one of the tenants having this problem and our app makes roughly 3k to 7k API calls per day there. However, our API usage varies a lot from tenant to tenant and I wouldn't be surprised if there are tenants making hundreds of thousands of API calls per day.
We do receive 429 errors, but 90% of the time the error is a 503.
You will see 503s when the server is "too busy" as a result of being overloaded... which may coincide during the normal work hours you specified.
That is what we initially suspected, but
We see the same 503s too with our app making REST calls. Honestly, I don't think we make that many, but we had to write our own 503 handler to capture them, back off and try again, otherwise our application was useless for our customers. You can look at the server health in the response headers - 0 is excellent, 10 is overloaded, and we frequently see 9 on multiple tenants, even when it's the first REST call we've made that day.
We handle 503 errors by backing off exponentially, but it doesn't work - the subsequent retries still fail and our app gives up after retrying for roughly an hour.
It would make sense if the service health were 9 or 10, but to our surprise the service health turned out to be 1s or 0s.
There is a recommendation to set an ISV user agent header with your request so that MS can recognise it's an app making the calls, but that's not possible in modern browsers where the UA header is locked down.
We can set it but we don't want to do that before understanding the consequences. If we set the User-Agent header, will Microsoft increase our app's API limits for us or will they contact us and tell us to stop sending so many requests?
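For reference, setting the header from a non-browser client is a one-liner. The sketch below assumes the `ISV|CompanyName|AppName/Version` format from Microsoft's throttling guidance. The company name, app name and version are placeholders, and the helper `isv_user_agent` is hypothetical. It does not answer the question of what Microsoft does with the header.

```python
import urllib.request


def isv_user_agent(company, app, version):
    """Build a User-Agent string in the assumed ISV|Company|App/Version format."""
    return f"ISV|{company}|{app}/{version}"


# Placeholder identity; substitute your own company and product.
user_agent = isv_user_agent("Contoso", "TenantProcessor", "1.0")

# Building the request does not send anything over the network.
request = urllib.request.Request(
    "https://site-url/_api/Web",
    headers={"User-Agent": user_agent, "Accept": "application/json"},
)
```

With libcurl the equivalent would be the `CURLOPT_USERAGENT` option on the easy handle.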
Same issue here - and it seems that this is getting worse over time.
Every now and then various API calls return a status 503. Setting the user agent header as proposed does not make any difference. As all of the current SDKs (including CSOM, OfficeDevPnP) do not handle 5xx errors with any retry logic this is getting more and more problematic…
It's comparable to the situation in October last year where we had seen a lot of 500s, now it's 503s. https://github.com/SharePoint/sp-dev-docs/issues/4924
We also encounter lots of 503s, and sometimes 429s, in CSOM (using PnP context), PnP cmdlets, PnP apply template, and Graph GET, POST, PUT and PATCH queries to Groups/Teams.
It's starting to disrupt critical systems in our customers' environments. As stated above, the back-off retries do seem to help a bit, but there are still lots of cases where the errors are blocking our automation tooling.
@andrewconnell @VesaJuvonen we can't see any Service health messages in this area. Are there any known recent issues that are causing this behaviour? Is there anything you would suggest us to do besides creating Microsoft support requests on each tenant where this happens?
503's & 429's are totally different things. 429's are your responsibility as you're being throttled & you need to back off on your calls.
As for the 503's, this looks like #4924... correct? If so, MSFT recommended submitting issues using the SP tenant admin center support options (ref https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-558107240)
If you are in general getting these throttling issues, please do report them also using the tenant administrative support tooling to tenant support. If you get a response where your ticket is declined since it's about dev topics, please share a screenshot with me, so that we can deal with this internally more efficiently.
There is a huge uptake in cloud usage and we are working on increasing capacity as best as we can, but we are definitely interested in getting these reports.
Please see a long explanation & guidance I posted in #4924 related to 50X issues as it applies here as well: https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513
This has been really bad today (and most of last week).
36,000 503 errors on 150 different customer tenants. I really think Microsoft should publish a public announcement.
We get no updates from Premier Support.
To all ISVs out there having the same issue. Please contact me.
We need to put some pressure on Microsoft together!!!!
It is definitely a Microsoft issue. It even 503's its own requests:
@SchauDK seeing the same thing, lots of 503 errors, causing issues for our customers
@SchauDK @bryqu @mcgeeky see https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513
IMHO, everyone gets a pass right now as no system was ever designed for THIS many people to migrate to working from home this fast with this little notice. As @VesaJuvonen said above:
There is a huge uptake in cloud usage and we are working on increasing capacity as best as we can, but we are definitely interested in getting these reports.
This isn't the forum for 500's and 503's... as I explained here, https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513, these are platform issues which should be submitted as a support ticket via your tenant admin center.
@andrewconnell I'm aware of https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513
We've already asked customers to submit tickets. I'm not asking for support in this forum, we're discussing the issue that many ISVs/customers have right now.
@VesaJuvonen I'm expressing my opinion that Microsoft should publish an official announcement that they don't have the capacity to keep the services running. That I can tell my customers and they'll hopefully understand. But right now it is our product that isn't working, so thousands of people can't do their job.
@SchauDK
Normally, if enough tenants raise a ticket MS will set the state of the service to something like "disrupted" which will be visible on the admin page.
Regarding the issue itself:
In the past we were increasingly just trusting that things "just work". So most of the code, including SDKs and frameworks, does not include any catch and retry logic for 50x errors. There are SLAs around cloud services, but I think they mostly do not cover such intermittent issues.
So the basic question here is: Is it a design flaw to just trust that an API will deliver a result every time we call it? Would it be better to include retry logic that handles such errors?
As we could see, these things happen. Be it because of bugs or as a result of a rapidly changing environment.
What do you think?
@cwdata We're already using exponential backoff for 50x errors. We're ok with that for background jobs until we reach a retry count of 10. Then we have a problem.
But what's even worse is user-facing API calls with retry logic. Imagine a user interface that takes minutes to render and eventually has to throw a 503 error at the user.
If this is what we need to get used to, Microsoft might as well just abandon all APIs as it makes it impossible to build 3rd party solutions on top of Office365.
I really hope that Microsoft will solve these capacity issues soon. And when it is solved, an idea could be that large enterprise customers can buy extra capacity. I'm sure they're willing to pay for it as they've built their business on the O365 platform.
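The difference between background jobs and user-facing calls in this comment can be sketched as a retry loop with two limits: a retry count (10, as mentioned above) and an optional time budget for calls a user is waiting on. This is a rough Python illustration, not the commenter's code. `ServiceUnavailable`, `call_with_retry` and the default delays are hypothetical names and values.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by the supplied callable on a 429/503 response (hypothetical)."""


def call_with_retry(fn, max_retries=10, budget_seconds=None, base=1.0, cap=60.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fn(), retrying on ServiceUnavailable with capped exponential backoff.

    Gives up after max_retries retries, or earlier if the next wait would
    push the total elapsed time past budget_seconds.
    """
    start = clock()
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ServiceUnavailable:
            if attempt == max_retries:
                raise  # retry count exhausted
            delay = min(cap, base * 2 ** attempt)
            over_budget = (budget_seconds is not None
                           and clock() - start + delay > budget_seconds)
            if over_budget:
                raise  # a waiting user should see the error now
            sleep(delay)
```

A background job might call `call_with_retry(job)` with the defaults, while a UI request might pass something like `budget_seconds=5` so the user gets an error quickly instead of a page that takes minutes to render.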
There will be an announcement coming very soon about the throttling which kicked in to prioritize end-user traffic over non-end-user traffic. And yes, even 503s are throttling responses.
@SchauDK We are experiencing the exact same situation. 503 responses started last Thursday and have now become an expectation during business hours. Our mitigation strategy has been bolstering our retry logic along with moving some processes to off hours. We use mostly the SharePoint Online REST API. The part that is particularly frustrating on this front is that if you were to read health advisories you would be left to think that everything is ok. We have customers (just like you) that we are responsible for providing updates and guidance. Please someone at Microsoft help us help you.
In the comment https://github.com/SharePoint/sp-dev-docs/issues/4924#issuecomment-602026513, referenced many times in this thread, it says:
If you are getting 500's or 503's, these are considered product issues. These status codes mean there is a problem with the service. 500 is internal server error & 503 is service unavailable. There's nothing we in the community can do here to investigate nor are the MSFT folks in the extensibility areas (SPFx/APIs/etc) who frequent this list can do. Microsoft support has to engage.
Maybe it should be revised a little to explicitly state that it is also your own responsibility to handle server-side throttling.
@robinmeure looking forward to the announcement. I guess any API call (REST/JSOM/CSOM), regardless of the context, is non-end-user traffic.
@robinmeure Any timeline or where we can expect an announcement?
It has been published as a message in the message center under MC207439.
Furthermore, there is another message SP207374 about provisioning.
@robinmeure Thanks, I got this. Makes sense, but it does not say anything on the REST API 503 issues as noted above. Any additional details there?
Closing this issue as "answered". If you encounter a similar issue(s), please open up a new issue. See our wiki for more details: Issue-List: Our approach to closed issues