NuGetGallery: Add E-tags to V3 protocol

Created on 15 Jun 2020 · 13 Comments · Source: NuGet/NuGetGallery

Is your feature request related to a problem? Please describe.

While looking into this performance bug I noticed we're wasting a lot of time just downloading JSON and deserializing it. Some of those files are 300-400 MB for some internal MS projects. We need some way to know whether the content has changed without having to download the whole JSON response. If the SHA signature is the same, we just keep using the existing cache file.

Describe the solution you'd like

I want a SHA signature property in the package metadata response headers. That way we don't have to download a large JSON file unless its content has changed. Currently we download the same giant JSON, save it into a cache file for a day or several hours, and then download the same thing again, which is very costly.
For example, I got this response while debugging. I just want to add the SHA signature of the JSON body to the headers below.
Response header:
{
Pragma: no-cache
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-TFS-ProcessId: ed817a2c-34d1-4679-820b-66c83e93dd3b
Strict-Transport-Security: max-age=31536000; includeSubDomains
ActivityId: c17c92f8-5fb1-40d6-b44a-b04658b005fe
X-TFS-Session: c17c92f8-5fb1-40d6-b44a-b04658b005fe
X-VSS-E2EID: c17c92f8-5fb1-40d6-b44a-b04658b005fe
X-VSS-UserData: 198f24b6-2967-63e3-8fad-660b44018526:[email protected]
X-FRAME-OPTIONS: SAMEORIGIN
X-Packaging-Migration: NuGetBlobMetadataV5
Request-Context: appId=cid-v1:936cbc5b-f0cf-47f3-b153-0eeabd846595
Access-Control-Expose-Headers: Request-Context
X-Content-Type-Options: nosniff
X-MSEdge-Ref: Ref A: ECDA26842C4C4BCFBD255BE84167CBE7 Ref B: WSTEDGE0613 Ref C: 2020-06-15T05:41:19Z
Cache-Control: no-cache
Date: Mon, 15 Jun 2020 05:41:24 GMT
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
}

Additional context

@nkolev92 @zivkan If you have any comment please add here.

Performance V3 Feed Feature

Source

erdembayar

All 13 comments

Could you tell us more about your scenario? What package source are you using? NuGet servers should mitigate this by splitting the package metadata resources into smaller chunks: https://docs.microsoft.com/en-us/nuget/api/registration-base-url-resource#registration-pages-and-leaves

For example, nuget.org splits a package's metadata if it has over 128 versions. Is this not enough for your needs?

loic-sharma on 15 Jun 2020

It's connected to an internal MS Cortana project. I don't know about the 128-version split yet; most likely it helps downloads go smoothly. But my problem is more about not downloading the same content again and again unless there is a change.
The memory profiler shows we're wasting so much time simply downloading and deserializing JSON files, and finally crashing because we run out of memory. If you wish, I can share more details over Teams about what is going on.

erdembayar on 15 Jun 2020

I've spoken to the client team several times about this subject. Any asset that is cached by the client is eligible for a client cache check flow and should be considered in the broader context of HTTP caching paradigms (i.e. this is not a NuGet-specific problem).

The typical approach uses some combination of etags or cache control headers. For NuGet client, cache control headers are ignored. Cache durations are essentially hard-coded in the protocol resources, which is a bit strange from an HTTP perspective, by the way. It works so far so let's leave this alone.

This leaves etags. I recommend not inventing a new header or approach but rather leveraging this pattern, which is well adopted by many HTTP servers. For example, etags are supported by Azure Blob Storage out of the box. Etags are opaque but can serve the same purpose as what you described. The client sends If-None-Match: etag-previously-fetched and gets 304 Not Modified if the content has the same etag as before, meaning the client can stop there, no more HTTP requests.

This does mean more bookkeeping on the client side. You need to remember the etag values. But this is potentially even better than using a crypto hash on a large blob on disk.

NuGet.org already has etags on our registration blobs so you can start trying it out right now.
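
The If-None-Match exchange described above can be sketched as follows. This is a minimal Python sketch using only the standard library, not NuGet client code. The names `fetch_if_changed` and `opener` are hypothetical and exist only for this illustration; `opener` is there so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, cached_etag=None, opener=urllib.request.urlopen):
    """Return (status, etag, body); body is None when the server answers 304.

    `opener` is a hypothetical hook for this sketch so a fake transport can
    be substituted; by default the real urllib opener is used.
    """
    request = urllib.request.Request(url)
    if cached_etag:
        # Ask the server to skip the body if our cached copy is still current.
        request.add_header("If-None-Match", cached_etag)
    try:
        with opener(request) as response:
            # 200: new content, remember the new etag next to the cached file.
            return response.status, response.headers.get("ETag"), response.read()
    except urllib.error.HTTPError as error:
        # urllib reports 304 Not Modified as an HTTPError.
        if error.code == 304:
            return 304, cached_etag, None
        raise
```

On a 304 the caller keeps using the file it already has; on a 200 it overwrites the cached file and stores the returned etag for the next request.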

There are still some questions to be answered, of course:

getting Azure DevOps to support this "new" protocol (we don't own the endpoint making these big JSON blobs so we can just advise on protocol)
persisting etag on client (my suggestion has always been SQLite 😺 but I'm a fanboy so 🤷)
considering optimizations like putting leaf etag in parent documents
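
For the second item, persisting etags on the client, a minimal sketch with SQLite might look like this. It uses Python's built-in `sqlite3` module; the `ETagStore` class, the table name, and the column names are all assumptions made for illustration, not an existing NuGet client design.

```python
import sqlite3


class ETagStore:
    """Hypothetical client-side table mapping a resource URL to its last etag."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the sketch self-contained; a real cache would use a file.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS etags ("
            "url TEXT PRIMARY KEY, etag TEXT NOT NULL)"
        )

    def get(self, url):
        """Return the stored etag for url, or None if nothing was stored."""
        row = self._db.execute(
            "SELECT etag FROM etags WHERE url = ?", (url,)
        ).fetchone()
        return row[0] if row else None

    def put(self, url, etag):
        """Insert or overwrite the etag for url in a single transaction."""
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO etags (url, etag) VALUES (?, ?)",
                (url, etag),
            )
```

The stored value would be sent back as If-None-Match on the next request for the same URL.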

I agree with Loic that step 1 would just be checking that Azure DevOps is using the tools already available: appropriate page size. But I fear given the size of the blobs and the short cache times you'll be hit with this big download/parse pain any way you slice it.

If they are not gzipping that's another "free" improvement. We added gzip to NuGet registration protocol several years ago.
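
To get a feel for why gzip is close to "free" for this kind of payload, here is a small sketch that compares raw and gzipped sizes of a JSON document. The sample data is made up and only mimics the repetitive shape of package metadata; `gzip_sizes` is a hypothetical helper. On the wire, the client opts in with an Accept-Encoding: gzip request header.

```python
import gzip
import json


def gzip_sizes(document):
    """Return (raw_bytes, gzipped_bytes) for a JSON-serializable document."""
    raw = json.dumps(document).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Made-up entries that only mimic the repetitive shape of package metadata.
sample = {
    "items": [
        {"id": "Example.Package", "version": "1.0.%d" % n, "listed": True}
        for n in range(1000)
    ]
}
raw_size, gzipped_size = gzip_sizes(sample)
```

Comparing `raw_size` with `gzipped_size` on a real registration blob would show how much of the download is avoidable by compression alone.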

joelverhagen on 16 Jun 2020

👍1

A server-side concern with returning a hash header is that you must compute the entire response body before returning it to the client. For a compute-based service like ASP.NET MVC this means using a trailer (limited support) or buffering the entire content, calculating the hash, returning the hash as a header, and then returning the response body. For servers based on Azure Blob Storage, this means buffering the entire response body in memory, and doing a PUT to Blob Storage with both the hash and content in the same request (to avoid race conditions).

This is a LOT more painful than an opaque string like etag which can be implemented as simply as a string representing the time of last modification.
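
The difference can be sketched in a few lines. Both function names are hypothetical, and deriving the etag from a nanosecond timestamp is just one possible opaque value, assumed here for illustration.

```python
import hashlib


def etag_from_last_modified(last_modified_ns):
    """Opaque etag known before a single body byte is produced."""
    return '"%x"' % last_modified_ns


def etag_from_body(chunks):
    """Hash-based etag: every chunk must be seen before the header can be sent.

    Returns (etag, body). The body is buffered because headers go out first.
    """
    digest = hashlib.sha256()
    buffered = []
    for chunk in chunks:
        digest.update(chunk)
        buffered.append(chunk)  # held in memory until the hash is final
    return '"%s"' % digest.hexdigest(), b"".join(buffered)
```

The first function needs only metadata the server already has; the second forces the server to hold the whole response before it can write the first header.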

joelverhagen on 16 Jun 2020

"But my problem is more about not downloading the same content again and again unless there is a change.
The memory profiler shows we're wasting so much time simply downloading and deserializing JSON files, and finally crashing because we run out of memory."

I think the challenge is slightly different.
The out of memory issues would only happen if we attempt to use/retain all that data at once.
We can deserialize gigabytes of data easily as long as we don't try to retain and use all of it at once.
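
A minimal sketch of that idea, assuming the data arrives as separate registration pages: parse one page, keep only what is needed, and let the parsed page go before touching the next. The function name `iter_versions` is hypothetical; the `items`, `catalogEntry`, and `version` fields follow the documented registration page shape. Python's `json` module still parses each page in full, so memory is bounded per page, not per byte.

```python
import json


def iter_versions(page_documents):
    """Yield version strings page by page without retaining parsed pages.

    `page_documents` is any iterable of JSON strings, for example a generator
    that reads or downloads one page at a time.
    """
    for text in page_documents:
        page = json.loads(text)
        for leaf in page.get("items", []):
            yield leaf["catalogEntry"]["version"]
        # `page` is rebound on the next iteration, so the old one can be freed.
```

Peak memory then depends on the largest single page, which is why page size matters on the server side.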

The specific example you have in mind list_ is actually the flat container resource, which contains versions. Whether we download this from a server and use it, or fetch it from disk and use it, it will not affect the OOM incidents differently.

So while I do agree that downloading a 300 MB resource often is an issue, it is a red herring when it comes to the OOM issue. Our use of the 300 MB resource is the problem.

I think paginating the package content (flat container) resource would be the way to improve that, but that would require a protocol change.
https://docs.microsoft.com/en-us/nuget/api/package-base-address-resource

tldr

I definitely think we should consider etags. However, that is more about reducing network load, rather than fixing OOM problems.

nkolev92 on 17 Jun 2020

"I think paginating the package content (flat container) resource would be the way to improve that, but that would require a protocol change."

I spoke to @erdembayar offline. I think the affected endpoint is registration index not flat container index. Azure DevOps does not seem to be leveraging paging in the problem case.

Totally agree that etag will not fix the OOM. I suggested getting rid of JObject as the first area of investigation.

This issue still stands on its own but is about network IO, as you said, not OOM.

joelverhagen on 17 Jun 2020

"Totally agree that etag will not fix the OOM. I suggested getting rid of JObject as the first area of investigation."

You were ahead of me then. I was just looking at the registration utility and all the JObjects passed around from that to the metadata resources, and thought the exact same thing.

"I think the affected endpoint is registration index not flat container index."

Had a conversation yesterday and the file that was opened looked like flat container. nvm then

nkolev92 on 17 Jun 2020

@nkolev92 , @joelverhagen , should we continue tracking this issue or can I close it?

skofman1 on 7 Aug 2020

I don't speak for the others, but my point of view is that E-Tag is the industry standard HTTP header for the scenario that Erick was writing about. Therefore, using E-Tag, rather than a NuGet protocol specific header would increase the chance that servers other than nuget.org implement it.

However, how to get E-Tags officially part of the NuGet protocol is another question. I'm not sure this helps answer the "leave it open, or close it" question. Sorry.

zivkan on 7 Aug 2020

👍1

I agree that an e-tag will be great. However, is there a business reason to add one? How will it be used? What would be the benefit of the server adding it without a client scenario taking advantage of it?

skofman1 on 7 Aug 2020

Client can't use a header that doesn't exist 😊 But as a consequence of nuget.org using Azure Blob Storage, and exposing it almost directly (well, via a CDN, but all the blob storage headers come through), client already receives the E-Tag header from nuget.org (at least the v3 feed, but that's the only place it will benefit client).

The client scenario is when a customer's NuGet client has previously downloaded a file from the v3 protocol, but more than 30 minutes ago. Client can use this header to avoid re-downloading the same content again, and just keep re-using the existing file it has locally. Particularly where customers are on a slow internet connection and restore needs to download large files that have already been saved in the http-cache more than 30 minutes ago, this could provide non-trivial improvements to restore performance.
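
The decision described above can be sketched as a small pure function. The 30-minute window comes from the comment; the name `plan_request` and its return shape are hypothetical and only illustrate the flow, not how NuGet client is implemented.

```python
from datetime import timedelta

MAX_AGE = timedelta(minutes=30)  # the http-cache window mentioned above


def plan_request(cached_at, cached_etag, now):
    """Return ("use_cache", {}) or ("fetch", headers) for one cached resource."""
    if cached_at is None:
        # Nothing on disk yet: plain download.
        return "fetch", {}
    if now - cached_at < MAX_AGE:
        # Still fresh: no HTTP request at all.
        return "use_cache", {}
    if cached_etag:
        # Stale but we know the etag: revalidate instead of re-downloading.
        return "fetch", {"If-None-Match": cached_etag}
    return "fetch", {}
```

When the revalidation request comes back as 304 Not Modified, the client would refresh the cache timestamp and keep the existing file.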

At this time there's nothing for the server to do, except to make sure that an E-Tag header keeps being provided if your implementation of the v3 protocol is re-architected. But this needs to become officially part of the protocol; until then, client has no right to demand that any server implement this header.

zivkan on 7 Aug 2020

Got it. Renamed the issue.

skofman1 on 7 Aug 2020

The etag sent back by server today is non-compliant. Therefore HttpClient does not parse the etag response header as expected. See the aforementioned issue for more details.
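
The comment does not say in what way the header is non-compliant. As a general illustration only, the entity-tag grammar in RFC 7232 section 2.3 requires a double-quoted value with an optional W/ prefix, so a check like the following would reject, for example, an unquoted value. The function name `is_valid_entity_tag` is hypothetical.

```python
import re

# entity-tag = [ "W/" ] DQUOTE *etagc DQUOTE        (RFC 7232, section 2.3)
# etagc      = %x21 / %x23-7E / obs-text (%x80-FF)  (no spaces, no DQUOTE)
_ENTITY_TAG = re.compile(r'(?:W/)?"[\x21\x23-\x7e\x80-\xff]*"')


def is_valid_entity_tag(value):
    """Return True if value matches the RFC 7232 entity-tag grammar."""
    return _ENTITY_TAG.fullmatch(value) is not None
```

A strict parser that follows this grammar will refuse values that a lenient client would happily echo back in If-None-Match.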

joelverhagen on 7 Aug 2020

👍2
