Describe the bug
Sometimes when listing blobs or blob containers, if the response body is very large, the method being invoked throws an "Unclosed root tag" error.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I can list blobs all day long with no problems.
Screenshots
N/A
Additional context
The repro is difficult to get going. I can demo over Teams.
It sounds like we don't get the full XML from the response text. As far as I know, we don't set a limit on the size of the response body text, but our dependency node-fetch may have some limit because they use Buffer. However, the Buffer limit is 2^32, which is a large number.
@MRayermannMSFT do you have any idea on how large the response body is? Is the xml response looking fine in fiddler or other proxies?
I did a quick look in Fiddler a few days ago and things looked fine, though the body was too big to render in Fiddler. I'll double check though, and I'll get an exact body size. Will comment again once I've done so.
@jeremymeng I did some testing & monitoring and I think you're right that this is a Buffer issue. Once I saw the process' memory hit almost 2 gigs (~1.9) in process explorer, I started seeing the "Unclosed root tag" error.
I saw it when running many, many, many list blob calls in parallel, but one of our users saw it while just doing a single list blob containers call. For my calls, each body was about 3 million bytes. I'm not sure how big the body was in the user's case, or what other things they might have been doing at the same time.
So whether or not the issue reproduces seems to definitely depend on:
So, assuming we're right about this being a Buffer issue, what can be done about it? From my perspective, I'd want the SDK to be more resilient to this type of error, though I understand there has to be a limit to that resilience eventually. At a minimum, I would expect the default exponential retry policy (we don't pass in a policy) to kick in, but it doesn't seem to. Though maybe that's by design since the web request itself succeeded, so y'all don't think there's a reason to retry.
What are your thoughts?
I doubt that it is related to concurrent calls since each list call has its own request/response.
Though maybe that's by design since the web request itself succeeded
Yes, I think that's the case. This error happens during parseXML() on response.text(), so when we get there it means there were no errors getting the response.
What is your user scenario? Just listing all blobs? Is it possible that you list them by pages (with some reasonable page size) and then add each page's result into the final list?
The service by default returns up to 5000 results per page. I am a bit surprised that we would end up going over the Buffer limit.
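Something along these lines (a sketch, not code from this thread; the containerSasUrl parameter and the 1000-item page size are placeholders):

```ts
import { BlobItem, ContainerClient } from "@azure/storage-blob";

// Sketch: accumulate blobs page by page instead of materializing one huge
// response body at once. The SAS URL and page size below are assumptions.
async function listAllBlobs(containerSasUrl: string): Promise<BlobItem[]> {
  const containerClient = new ContainerClient(containerSasUrl);
  const allBlobs: BlobItem[] = [];
  for await (const page of containerClient
    .listBlobsFlat()
    .byPage({ maxPageSize: 1000 })) {
    allBlobs.push(...page.segment.blobItems);
  }
  return allBlobs;
}
```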
I doubt that it is related to concurrent calls since each list call has its own request/response.
True that they each have their own request/response, but if the calls are happening in the same process, then they have to share memory. If you have enough calls going on at once, and all of those calls are listing thousands of blobs, I can see how you might get to 2GB of memory.
What is your user scenario? Just listing all blobs? Is it possible that you list them by pages (with some reasonable page size) and then add each page's result into the final list?
Yes, we're attempting to list many blobs at once. Depending on what we're listing, it could mean just one list call at a time, or it could mean many at a time. Obviously I can set a limit on how many we do at a time, but I'd also like to be as fast as possible. As for your code snippet, yeah, I'm basically doing that.
Could this be an issue with parseXML(str: string, opts: SerializerOptions = {}), which accepts a string as input and is thus affected by the V8 heap size limit?
If so, we may be able to get around this by using Buffer, which stores data outside of the V8 heap.
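For context, both limits being discussed here can be inspected directly in Node (a quick sketch, not part of the repro):

```ts
import { constants } from "buffer";

// Largest Buffer Node will allocate; the data lives outside the V8 heap.
console.log("Max Buffer length:", constants.MAX_LENGTH);
// Largest string V8 will hold; a string input to parseXML() is bound by this.
console.log("Max string length:", constants.MAX_STRING_LENGTH);
```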
I see the error now when running @MRayermannMSFT's repro code on Linux. A first look shows that node-fetch gave us an incomplete Body.text() after it concatenated the stream data into a buffer and then converted it to a string.
For anyone who is following along, the repro Jeremy refers to can be found in my fork of electron quick start:
git clone https://github.com/MRayermannMSFT/electron-quick-start
git checkout listing-issue-electron-11
You then set the SAS URI in NodeProc.js to a SAS URI for a container with a lot of blobs in it, and then run:
npm install
npm start
It seems that, depending on your OS and machine specs, you might have to tweak the for loop in NodeProc.js a bit to get a repro.
So I could repro the issue even without SDK code, using node-fetch or even the built-in https module. Now I suspect that either the service ends the streaming prematurely, or something deeper is happening in the lower-level Node.js stack. I pushed a couple of commits on top of @MRayermannMSFT's repro project to my fork: https://github.com/jeremymeng/electron-quick-start/tree/listing-issue-electron-11
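The gist of the https-based check is roughly the following (a sketch reconstructed from the log output below; the function name, listUrl parameter, and the EnumerationResults tag checks are my assumptions, not the exact code in the fork):

```ts
import * as https from "https";

// Sketch: fetch one listing page with the built-in https module and report
// whether the XML body arrived complete, mirroring the log fields below.
function checkListing(listUrl: string): void {
  https.get(listUrl, (res) => {
    const chunks: Buffer[] = [];
    res.on("data", (chunk: Buffer) => chunks.push(chunk));
    res.on("end", () => {
      const body = Buffer.concat(chunks).toString("utf8");
      console.log("accumBytes:", Buffer.byteLength(body));
      console.log(
        "hasOpenTag:", body.includes("<EnumerationResults"),
        "| hasCloseTag:", body.includes("</EnumerationResults>")
      );
      console.log(res.headers);
    });
  });
}
```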
Here's one error I got when using https. @ljian3377 is it possible to check the service side log to see anything unusual, given the request id below?
accumBytes: 4072902
hasOpenTag: true | hasCloseTag: false
{
'transfer-encoding': 'chunked',
'content-type': 'application/xml',
server: 'Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0',
'x-ms-request-id': 'd2b913ab-101e-0060-0307-c930a0000000',
'x-ms-version': '2019-12-12',
date: 'Thu, 03 Dec 2020 00:05:06 GMT',
connection: 'close'
}
listing failed! Error: incomplete XML response when listing for 85-page1
at doListingUntilDoneHttpsGet (/home/azuser/workin/electron-quick-start/NodeProc.js:78:17)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at async Promise.all (index 85)
From the server log, I see that this request was aborted as it timed out after sending back three chunks.
Note that the service will set a timeout for each operation if it's not set in the request.
It timed out after flushing back 3 chunks of 239586 bytes, each taking 20~30 s. The total WriteLatencyInMs is 50307, while the total time taken on the server is 51138 ms.
So were you testing with a limited bandwidth? Use a VM in the same region as the storage account, or manually set a larger timeout (which currently seems to be set to 30 s).
And an issue here is that our SDK doesn't handle this kind of timeout error well when the response headers have already been received but the body is incomplete.
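For the raw https repro, one way to ask for a longer server-side operation timeout is the Blob service's timeout query parameter on the listing URL (a sketch; the helper name and the 120-second value are assumptions, and whether the service honors a given value depends on the operation's documented limits):

```ts
// Hypothetical helper: build a List Blobs URL that requests a longer
// per-operation timeout via the Blob service's `timeout` query parameter.
function withLongerTimeout(containerSasUrl: string, seconds = 120): string {
  const listUrl = new URL(containerSasUrl);
  listUrl.searchParams.set("restype", "container");
  listUrl.searchParams.set("comp", "list");
  listUrl.searchParams.set("timeout", String(seconds));
  return listUrl.toString();
}
```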
From the server log, I see that this request was aborted as it timed out after sending back three chunks.
So it's the server who closed the connection after 30s by default?
So were you testing with a limited bandwidth?
It's possible that with many concurrent requests my Azure VM's network bandwidth got throttled. Regardless, this can happen in real customer scenarios when they have a slow network.
And an issue here is that our SDK doesn't handle this kind of timeout error well when the response headers have already been received but the body is incomplete.
The problem is that the transport layer didn't get any errors. It got an end event on the stream just like in normal cases, so it's unclear what would indicate an error.
So it's the server who closed the connection after 30s by default?
Yes.
The problem is that the transport layer didn't get any errors. It got an end event on the stream just like in normal cases, so it's unclear what would indicate an error.
https://en.wikipedia.org/wiki/Chunked_transfer_encoding says "The transmission ends when a zero-length chunk is received." I thought we wouldn't get that terminating "end chunk" in the timeout case?
You are the expert on network here. Let me know if there is anything I can do to help.
Also, can we verify that the body length is the same as the Content-Length before we parse it?
As the printed log shows, there is no Content-Length header, probably because chunked encoding is used.
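One signal that may still be available at the transport level is Node's IncomingMessage.complete flag, which is documented to be false when the connection terminated before the full message (including the terminating zero-length chunk) was received. A sketch under that assumption, not what the SDK currently does:

```ts
import * as https from "https";

// Sketch: detect a truncated chunked response before handing it to parseXML().
// The 'end' event can still fire, but res.complete stays false when the socket
// closed before the message was fully transmitted.
function fetchListing(listUrl: string): Promise<string> {
  return new Promise((resolve, reject) => {
    https.get(listUrl, (res) => {
      const chunks: Buffer[] = [];
      res.on("data", (chunk: Buffer) => chunks.push(chunk));
      res.on("end", () => {
        if (!res.complete) {
          reject(new Error("Incomplete response body; consider retrying."));
          return;
        }
        resolve(Buffer.concat(chunks).toString("utf8"));
      });
      res.on("error", reject);
    });
  });
}
```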
So after some offline discussions, we think it would be helpful to retry on this error, assuming that it's unlikely that the service would send us an incomplete response in normal cases. It may be easier to handle this error in core-http's retry policy, if other services also close connections after a timeout.
Storage has its own retry policy and doesn't use the retry policies from core-http, so Storage's behavior would need to change. In addition, the order of the two policies would need to be reversed in order for the retry policy to retry on the PARSE_ERROR from the deserialization policy.
I am not sure yet what the impact of reversing the order of these two policies would be. I will open a draft PR for further discussion.
Or we could add another retry policy specific to this error, after deserialization.
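Roughly, such a policy could look like the sketch below against @azure/core-http (the names, retry count, and structure are mine, not what the eventual PR implements). For it to see the PARSE_ERROR it would have to sit outside the deserialization policy in the pipeline, which is the ordering point above.

```ts
import {
  BaseRequestPolicy,
  HttpOperationResponse,
  RequestPolicy,
  RequestPolicyFactory,
  RequestPolicyOptions,
  RestError,
  WebResourceLike,
} from "@azure/core-http";

// Hypothetical factory for a policy that retries only when deserialization
// failed with RestError.PARSE_ERROR, e.g. because the XML body was truncated.
export function parseErrorRetryPolicy(maxRetries = 3): RequestPolicyFactory {
  return {
    create: (nextPolicy: RequestPolicy, options: RequestPolicyOptions) =>
      new ParseErrorRetryPolicy(nextPolicy, options, maxRetries),
  };
}

class ParseErrorRetryPolicy extends BaseRequestPolicy {
  constructor(
    nextPolicy: RequestPolicy,
    options: RequestPolicyOptions,
    private readonly maxRetries: number
  ) {
    super(nextPolicy, options);
  }

  public async sendRequest(request: WebResourceLike): Promise<HttpOperationResponse> {
    let lastError: unknown;
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        // Clone so each retried attempt sends a fresh copy of the request.
        return await this._nextPolicy.sendRequest(request.clone());
      } catch (err) {
        // Retry only parse errors from the deserialization policy;
        // everything else is rethrown immediately.
        if (!(err instanceof RestError) || err.code !== RestError.PARSE_ERROR) {
          throw err;
        }
        lastError = err;
      }
    }
    throw lastError;
  }
}
```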
Opened PR #13076
Closing this now that #13076 is merged.