Describe the bug
Sometimes when listing blobs or blob containers, if the response body is very large, the method being invoked throws an "Unclosed root tag" error.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I can list blobs all day long with no problems.
Screenshots
N/A
Additional context
The repro is difficult to get going. I can demo over Teams.
It sounds like we don't get the full XML from the response text. As far as I know, we don't set a limit on the size of the response body text, but our dependency node-fetch may have some limit because they use Buffer. However, the Buffer limit is 2^32, which is a large number.
@MRayermannMSFT do you have any idea on how large the response body is? Is the xml response looking fine in fiddler or other proxies?
I did a quick look in Fiddler a few days ago and things looked fine, though the body was too big to render in Fiddler. I'll double check though, and I'll get an exact body size. Will comment again once I've done so.
@jeremymeng I did some testing & monitoring and I think you're right that this is a Buffer issue. Once I saw the process' memory hit almost 2 gigs (~1.9) in process explorer, I started seeing the "Unclosed root tag" error.
I saw it when running many, many, many list blob calls in parallel, but one of our users saw it while just doing a single list blob containers call. For my calls, each body was about 3 million bytes. I'm not sure how big the body was in the user's case, or what other things they might have been doing at the same time.
So whether or not the issue reproduces seems to definitely depend on:
So, assuming we're right about this being a Buffer issue, what can be done about it? From my perspective, I'd want the SDK to be more resilient to this type of error, though I understand there has to be a limit to that resilience eventually. At a minimum, I would expect the default exponential retry policy (we don't pass in a policy) to kick in, but it doesn't seem to. Though maybe that's by design since the web request itself succeeded, so y'all don't think there's a reason to retry.
What are your thoughts?
I doubt that it is related to concurrent calls since each list call has its own request/response.
Though maybe that's by design since the web request itself succeeded
Yes, I think that's the case. This error happens during parseXML() on response.text(), so when we get there it means there were no errors getting the response.
What is your user scenario? Just listing all blobs? Is it possible that you list them by pages (with some reasonable page size) and then add each page's result into the final list?
The service by default returns up to 5000 results per page. I am a bit surprised that we would end up going over the Buffer limit.
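Something along these lines (a sketch, not code from this thread; the containerSasUrl parameter and the 1000-item page size are placeholders):

```ts
import { BlobItem, ContainerClient } from "@azure/storage-blob";

// Sketch: accumulate blobs page by page instead of materializing one huge
// response body at once. The SAS URL and page size below are assumptions.
async function listAllBlobs(containerSasUrl: string): Promise<BlobItem[]> {
  const containerClient = new ContainerClient(containerSasUrl);
  const allBlobs: BlobItem[] = [];
  for await (const page of containerClient
    .listBlobsFlat()
    .byPage({ maxPageSize: 1000 })) {
    allBlobs.push(...page.segment.blobItems);
  }
  return allBlobs;
}
```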
I doubt that it is related to concurrent calls since each list call has its own request/response.
True that they each have their own request/response, but if the calls are happening in the same process, then they have to share memory. If you have enough calls going on at once, and all of those calls are listing thousands of blobs, I can see how you might get to 2GB of memory.
What is your user scenario? Just listing all blobs? Is it possible that you list them by pages (with some reasonable page size) and then add each page's result into the final list?
Yes, we're attempting to list many blobs at once. Depending on what we're listing, it could mean just one list call at a time, or it could mean many at a time. Obviously I can set a limit on how many we do at a time, but I'd also like to be as fast as possible. As for your code snippet, yeah, I'm basically doing that.
Could this be an issue with parseXML(str: string, opts: SerializerOptions = {}), which accepts a string as input and is thus affected by the V8 heap size limit?
If so, we may be able to get around this by using Buffer, which stores data outside of the V8 heap.
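For context, both limits being discussed here can be inspected directly in Node (a quick sketch, not part of the repro):

```ts
import { constants } from "buffer";

// Largest Buffer Node will allocate; the data lives outside the V8 heap.
console.log("Max Buffer length:", constants.MAX_LENGTH);
// Largest string V8 will hold; a string input to parseXML() is bound by this.
console.log("Max string length:", constants.MAX_STRING_LENGTH);
```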
I see the error now when running @MRayermannMSFT's repro code on Linux. A first look shows that node-fetch gave us an incomplete Body.text() after it concatenated the stream data into a buffer and then converted it to a string.
For anyone who is following along, the repro Jeremy refers to can be found in my fork of electron quick start:
git clone https://github.com/MRayermannMSFT/electron-quick-start
git checkout listing-issue-electron-11
You then set the SAS URI in NodeProc.js to a SAS URI for a container with a lot of blobs in it, and then run:
npm install
npm start
It seems that, depending on your OS and machine specs, you might have to tweak the for loop in NodeProc.js a bit to get a repro.
So I could repro the issue even without SDK code, using node-fetch or even the built-in https module. Now I suspect that either the service ends the streaming prematurely, or something deeper is happening in the lower-level Node.js stack. I pushed a couple of commits on top of @MRayermannMSFT's repro project to my fork: https://github.com/jeremymeng/electron-quick-start/tree/listing-issue-electron-11
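The gist of the https-based check is roughly the following (a sketch reconstructed from the log output below; the function name, listUrl parameter, and the EnumerationResults tag checks are my assumptions, not the exact code in the fork):

```ts
import * as https from "https";

// Sketch: fetch one listing page with the built-in https module and report
// whether the XML body arrived complete, mirroring the log fields below.
function checkListing(listUrl: string): void {
  https.get(listUrl, (res) => {
    const chunks: Buffer[] = [];
    res.on("data", (chunk: Buffer) => chunks.push(chunk));
    res.on("end", () => {
      const body = Buffer.concat(chunks).toString("utf8");
      console.log("accumBytes:", Buffer.byteLength(body));
      console.log(
        "hasOpenTag:", body.includes("<EnumerationResults"),
        "| hasCloseTag:", body.includes("</EnumerationResults>")
      );
      console.log(res.headers);
    });
  });
}
```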
Here's one error I got when using https. @ljian3377 is it possible to check the service side log to see anything unusual, given the request id below?
accumBytes: 4072902
hasOpenTag: true | hasCloseTag: false
{
'transfer-encoding': 'chunked',
'content-type': 'application/xml',
server: 'Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0',
'x-ms-request-id': 'd2b913ab-101e-0060-0307-c930a0000000',
'x-ms-version': '2019-12-12',
date: 'Thu, 03 Dec 2020 00:05:06 GMT',
connection: 'close'
}
listing failed! Error: incomplete XML response when listing for 85-page1
at doListingUntilDoneHttpsGet (/home/azuser/workin/electron-quick-start/NodeProc.js:78:17)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at async Promise.all (index 85)
From the server log, I see that this request was aborted as it timed out after sending back three chunks.
Note that the service will set a timeout for each operation if it's not set in the request.
It timed out after flushing back 3 chunks of 239586 bytes, each taking 20~30 s. The total WriteLatencyInMs is 50307, while the total time taken on the server is 51138 ms.
So were you testing with a limited bandwidth? Use a VM in the same region as the storage account, or manually set a larger timeout (which currently seems to be set to 30 s).
And an issue here is that our SDK doesn't handle this kind of timeout error well when the response headers have already been received but the body is incomplete.
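For the raw https repro, one way to ask for a longer server-side operation timeout is the Blob service's timeout query parameter on the listing URL (a sketch; the helper name and the 120-second value are assumptions, and whether the service honors a given value depends on the operation's documented limits):

```ts
// Hypothetical helper: build a List Blobs URL that requests a longer
// per-operation timeout via the Blob service's `timeout` query parameter.
function withLongerTimeout(containerSasUrl: string, seconds = 120): string {
  const listUrl = new URL(containerSasUrl);
  listUrl.searchParams.set("restype", "container");
  listUrl.searchParams.set("comp", "list");
  listUrl.searchParams.set("timeout", String(seconds));
  return listUrl.toString();
}
```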
From the server log, I see that this request was aborted as it timed out after sending back three chunks.
So it's the server who closed the connection after 30s by default?
So were you testing with a limited bandwidth?
It's possible that with many concurrent requests my Azure VM's network bandwidth got throttled. Regardless, this can happen in real customer scenarios when they have a slow network.
And an issue here is that our SDK doesn't handle this kind of timeout error well when the response headers have already been received but the body is incomplete.
The problem is that the transport layer didn't get any errors. It got an end event on the stream just like in normal cases, so it's unclear what would indicate an error.
So it's the server who closed the connection after 30s by default?
Yes.
The problem is that the transport layer didn't get any errors. It got an end event on the stream just like in normal cases, so it's unclear what would indicate an error.
https://en.wikipedia.org/wiki/Chunked_transfer_encoding says "The transmission ends when a zero-length chunk is received." I thought we wouldn't get that terminating "end chunk" in the timeout case?
You are the expert on network here. Let me know if there is anything I can do to help.
Also, can we verify that the body length is the same as the Content-Length before we parse it?
As the printed log shows, there is no Content-Length header, probably because chunked encoding is used.
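One signal that may still be available at the transport level is Node's IncomingMessage.complete flag, which is documented to be false when the connection terminated before the full message (including the terminating zero-length chunk) was received. A sketch under that assumption, not what the SDK currently does:

```ts
import * as https from "https";

// Sketch: detect a truncated chunked response before handing it to parseXML().
// The 'end' event can still fire, but res.complete stays false when the socket
// closed before the message was fully transmitted.
function fetchListing(listUrl: string): Promise<string> {
  return new Promise((resolve, reject) => {
    https.get(listUrl, (res) => {
      const chunks: Buffer[] = [];
      res.on("data", (chunk: Buffer) => chunks.push(chunk));
      res.on("end", () => {
        if (!res.complete) {
          reject(new Error("Incomplete response body; consider retrying."));
          return;
        }
        resolve(Buffer.concat(chunks).toString("utf8"));
      });
      res.on("error", reject);
    });
  });
}
```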
So after some offline discussions, we think it would be helpful to retry on this error, assuming that it's unlikely that the service would send us an incomplete response in normal cases. It may be easier to handle this error in core-http's retry policy, if other services also close connections after a timeout.
Storage has its own retry policy and doesn't use the retry policies from core-http, so Storage's behavior would need to change. In addition, the order of the two policies would need to be reversed in order for the retry policy to retry on the PARSE_ERROR from the deserialization policy.
I am not sure yet what the impact of reversing the order of these two policies would be. I will open a draft PR for further discussion.
Or we could add another retry policy specific to this error, after deserialization.
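Roughly, such a policy could look like the sketch below against @azure/core-http (the names, retry count, and structure are mine, not what the eventual PR implements). For it to see the PARSE_ERROR it would have to sit outside the deserialization policy in the pipeline, which is the ordering point above.

```ts
import {
  BaseRequestPolicy,
  HttpOperationResponse,
  RequestPolicy,
  RequestPolicyFactory,
  RequestPolicyOptions,
  RestError,
  WebResourceLike,
} from "@azure/core-http";

// Hypothetical factory for a policy that retries only when deserialization
// failed with RestError.PARSE_ERROR, e.g. because the XML body was truncated.
export function parseErrorRetryPolicy(maxRetries = 3): RequestPolicyFactory {
  return {
    create: (nextPolicy: RequestPolicy, options: RequestPolicyOptions) =>
      new ParseErrorRetryPolicy(nextPolicy, options, maxRetries),
  };
}

class ParseErrorRetryPolicy extends BaseRequestPolicy {
  constructor(
    nextPolicy: RequestPolicy,
    options: RequestPolicyOptions,
    private readonly maxRetries: number
  ) {
    super(nextPolicy, options);
  }

  public async sendRequest(request: WebResourceLike): Promise<HttpOperationResponse> {
    let lastError: unknown;
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        // Clone so each retried attempt sends a fresh copy of the request.
        return await this._nextPolicy.sendRequest(request.clone());
      } catch (err) {
        // Retry only parse errors from the deserialization policy;
        // everything else is rethrown immediately.
        if (!(err instanceof RestError) || err.code !== RestError.PARSE_ERROR) {
          throw err;
        }
        lastError = err;
      }
    }
    throw lastError;
  }
}
```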
Opened PR #13076
Closing this now that #13076 is merged.