While looking into this performance bug I noticed we're wasting a lot of time just downloading JSON and deserializing it. Some of those files are 300-400 MB for some internal MS projects. We need a way to know whether the content has changed without having to download the whole JSON response. If the SHA signature is the same, we just keep using the existing cache file.
I would like the package metadata response to include a SHA signature as a response header, so that we don't have to download the large JSON unless the content has changed. Currently we download the same giant JSON, save it to a cache file for a day or a few hours, and then download the same thing again, which is very costly.
For example, I got the response below while debugging. I just want to add the SHA signature of the JSON body to these headers.
Response header:
{Pragma: no-cache
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-TFS-ProcessId: ed817a2c-34d1-4679-820b-66c83e93dd3b
Strict-Transport-Security: max-age=31536000; includeSubDomains
ActivityId: c17c92f8-5fb1-40d6-b44a-b04658b005fe
X-TFS-Session: c17c92f8-5fb1-40d6-b44a-b04658b005fe
X-VSS-E2EID: c17c92f8-5fb1-40d6-b44a-b04658b005fe
X-VSS-UserData: 198f24b6-2967-63e3-8fad-660b44018526:[email protected]
X-FRAME-OPTIONS: SAMEORIGIN
X-Packaging-Migration: NuGetBlobMetadataV5
Request-Context: appId=cid-v1:936cbc5b-f0cf-47f3-b153-0eeabd846595
Access-Control-Expose-Headers: Request-Context
X-Content-Type-Options: nosniff
X-MSEdge-Ref: Ref A: ECDA26842C4C4BCFBD255BE84167CBE7 Ref B: WSTEDGE0613 Ref C: 2020-06-15T05:41:19Z
Cache-Control: no-cache
Date: Mon, 15 Jun 2020 05:41:24 GMT
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
}
@nkolev92 @zivkan If you have any comment please add here.
Could you tell us more about your scenario? What package source are you using? NuGet servers should mitigate this by splitting the package metadata resources into smaller chunks: https://docs.microsoft.com/en-us/nuget/api/registration-base-url-resource#registration-pages-and-leaves
For example, nuget.org splits a package's metadata if it has over 128 versions. Is this not enough for your needs?
It's connected to an internal MS Cortana project. I wasn't aware of the 128-version split until now; most likely it helps downloads go smoothly. But my problem is more about not downloading the same content again and again unless there is a change.
The memory profiler shows we're wasting a lot of time simply downloading and deserializing JSON, and we finally crash because we run out of memory. If you wish, I can share more details over Teams about what is going on.

I've spoken to the client team several times about this subject. Any asset that is cached by the client is eligible for a client cache check flow and should be considered in the broader context of HTTP caching paradigms (i.e. this is not a NuGet-specific problem).
The typical approach uses some combination of etags or cache control headers. For NuGet client, cache control headers are ignored. They are essentially hard coded in the protocol resources which is a bit strange by the way from an HTTP perspective. It works so far so let's leave this alone.
This leaves etags. I recommend not inventing a new header or approach but rather leveraging this pattern, which is well adopted by many HTTP servers. For example, etags are supported by Azure Blob Storage out of the box. Etags are opaque but can serve the same purpose as what you described. The client sends If-None-Match: etag-previously-fetched and gets 304 Not Modified if the content has the same etag as before, meaning the client can stop there, with no more HTTP requests.
This does mean more bookkeeping for client side. You need to remember the etag values. But this is potentially even better than using a crypto hash on a large blob on disk.
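To make the bookkeeping concrete, here is a minimal sketch of the If-None-Match exchange described above. It is not NuGet client code: `ETagCache`, the `fetch` callable and `fake_server` are hypothetical names invented for illustration, and the transport is stubbed out so the flow can be followed without a network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# (status, etag, body) as returned by a hypothetical transport function.
Response = Tuple[int, Optional[str], bytes]
Fetch = Callable[[str, Dict[str, str]], Response]


@dataclass
class ETagCache:
    """Remembers the last etag and body seen for each URL (assumed design)."""
    fetch: Fetch
    entries: Dict[str, Tuple[str, bytes]] = field(default_factory=dict)

    def get(self, url: str) -> bytes:
        headers: Dict[str, str] = {}
        cached = self.entries.get(url)
        if cached is not None:
            # Ask the server to skip the body if nothing changed.
            headers["If-None-Match"] = cached[0]

        status, etag, body = self.fetch(url, headers)

        if status == 304 and cached is not None:
            return cached[1]  # unchanged: reuse the stored copy
        if status == 200:
            if etag is not None:
                self.entries[url] = (etag, body)
            return body
        raise RuntimeError(f"unexpected status {status} for {url}")


def fake_server(url: str, headers: Dict[str, str]) -> Response:
    """Stand-in for a real HTTP call; serves one fixed document."""
    current_etag, current_body = '"v1"', b'{"count": 1}'
    if headers.get("If-None-Match") == current_etag:
        return 304, current_etag, b""
    return 200, current_etag, current_body
```

With `ETagCache(fetch=fake_server)`, the first `get` receives a 200 with the full body and the second receives a 304 with an empty body, yet both calls return the same bytes to the caller.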
NuGet.org already has etags on our registration blobs so you can start trying it out right now.
There are still some questions to be answered, of course:
I agree with Loic that step 1 would just be checking that Azure DevOps is using the tools already available: appropriate page size. But I fear given the size of the blobs and the short cache times you'll be hit with this big download/parse pain any way you slice it.
If they are not gzipping that's another "free" improvement. We added gzip to NuGet registration protocol several years ago.
A server-side concern with returning a hash header is that you must compute the entire response body before returning it to the client. For a compute-based service like ASP.NET MVC, this means using a trailer (limited support) or buffering the entire content, calculating the hash, returning the hash as a header, and then returning the response body. For servers based on Azure Blob Storage, this means buffering the entire response body in memory and doing a PUT to Blob Storage with both the hash and content in the same request (to avoid race conditions).
This is a LOT more painful than an opaque string like etag which can be implemented as simply as a string representing the time of last modification.
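The difference can be sketched in a few lines. This is an illustration only, with hypothetical function names, not how any particular server computes its headers: a content hash needs every byte of the body before the header value is known, while an opaque validator can come from metadata alone.

```python
import hashlib
from typing import Iterable


def sha256_of_body(chunks: Iterable[bytes]) -> str:
    """A content hash is only known after every chunk has been seen."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def etag_from_last_modified(last_modified_ns: int) -> str:
    """An opaque validator derived from metadata alone; the body is never read.

    Encoding the timestamp as quoted hex is an arbitrary choice for this sketch.
    """
    return f'"{last_modified_ns:x}"'
```

The first function has to consume the whole body, which is why a hash header forces buffering or a trailer; the second can be emitted before the first byte of the body is produced.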
But my problem is more about not downloading the same content again and again unless there is a change.
The memory profiler shows we're wasting a lot of time simply downloading and deserializing JSON, and we finally crash because we run out of memory.
I think the challenge is slightly different.
The out of memory issues would only happen if we attempt to use/retain all that data at once.
We can deserialize gigabytes of data easily as long as we don't try to retain or use all of it at once.
The specific example you have in mind is actually the flat container resource, which contains the versions. Whether we download this from a server and use it, or fetch it from disk and use it, will not affect the OOM incidents differently.
So while I do agree that frequently downloading a 300MB resource is an issue, it is a red herring when it comes to the OOM issue. The problem is that we use the whole 300MB resource.
I think paginating the package content (flat container) resource would be the way to improve that, but that would require a protocol change.
https://docs.microsoft.com/en-us/nuget/api/package-base-address-resource
tldr
I definitely think we should consider etags. However, that is more about reducing network load, rather than fixing OOM problems.
I think paginating the package content (flat container) resource would be the way to improve that, but that would require a protocol change.
I spoke to @erdembayar offline. I think the affected endpoint is registration index not flat container index. Azure DevOps does not seem to be leveraging paging in the problem case.
Totally agree that etag will not fix the OOM. I suggested getting rid of JObject as the first area of investigation.
This issue still stands on its own but is about network IO, as you said, not OOM.
Totally agree that etag will not fix the OOM. I suggested getting rid of JObject as the first area of investigation.
You were ahead of me then. I was just looking at the registration utility and all the JObjects passed around from that to the metadata resources, and thought the exact same thing.
I think the affected endpoint is registration index not flat container index.
Had a conversation yesterday and the file that was opened looked like flat container. nvm then
@nkolev92 , @joelverhagen , should we continue tracking this issue or can I close it?
I don't speak for the others, but my point of view is that E-Tag is the industry standard HTTP header for the scenario that Erick was writing about. Therefore, using E-Tag, rather than a NuGet protocol specific header would increase the chance that servers other than nuget.org implement it.
However, how to get E-Tags officially part of the NuGet protocol is another question. I'm not sure this helps answer the "leave it open, or close it" question. Sorry.
I agree that an e-tag will be great. However, is there a business reason to add one? How will it be used? What would be the benefit of the server adding it without a client scenario taking advantage of it?
Client can't use a header that doesn't exist 😊 But as a consequence of nuget.org using Azure Blob Storage, and exposing it almost directly (well, via a CDN, but all the blob storage headers come through), client already receives the E-Tag header from nuget.org (at least the v3 feed, but that's the only place it will benefit client).
The client scenario is when a customer's NuGet client has previously downloaded a file from the v3 protocol, but more than 30 minutes ago. Client can use this header to avoid re-downloading the same content again, and just keep re-using the existing file it has locally. Particularly where customers are on a slow internet connection and restore needs to download large files that have already been saved in the http-cache more than 30 minutes ago, this could provide non-trivial improvements to restore performance.
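The scenario above amounts to a three-way decision, sketched below. This is not the NuGet client's actual logic: `plan_request`, `Action` and `MAX_AGE_SECONDS` are hypothetical names, and the 30-minute window is simply the figure quoted in this comment.

```python
from enum import Enum
from typing import Optional

MAX_AGE_SECONDS = 30 * 60  # the 30-minute window mentioned above


class Action(Enum):
    USE_CACHE = "use cache"
    REVALIDATE = "conditional request"
    DOWNLOAD = "full download"


def plan_request(age_seconds: Optional[float], etag: Optional[str]) -> Action:
    """Decide how to satisfy a request for a possibly cached file.

    age_seconds is None when nothing is cached; etag is None when the
    server never supplied one for the cached copy.
    """
    if age_seconds is None:
        return Action.DOWNLOAD
    if age_seconds <= MAX_AGE_SECONDS:
        return Action.USE_CACHE
    if etag:
        # Stale but still verifiable: send If-None-Match and hope for a 304.
        return Action.REVALIDATE
    return Action.DOWNLOAD
```

Only the `REVALIDATE` branch is new compared with a purely time-based cache; without a stored etag, a stale file still falls back to a full download.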
At this time there's nothing for the server to do, except make sure that an E-Tag header keeps being provided if your implementation of the v3 protocol is re-architected. But this needs to become officially part of the protocol; until then, the client has no right to demand that any server implement this header.
Got it. Renamed the issue.
Related: https://github.com/NuGet/NuGetGallery/issues/8071
The etag sent back by server today is non-compliant. Therefore HttpClient does not parse the etag response header as expected. See the aforementioned issue for more details.