Nugetgallery: Package ID service (enumerate all package IDs) for V3 feeds

Created on 4 May 2017 · 18 Comments · Source: NuGet/NuGetGallery

NuGet v3 feeds should provide a way to get the full list of package ids available on the feed.

This could potentially be done either with a new service in index.json or by adding an index to flatcontainer that lists all package ids.

Scenarios

nuget.exe list support for v3 feeds

As v3 feeds exist today, the only way to discover all package ids and versions from a feed is to read the catalog. This requires a large number of URL requests to get through all the edits and unneeded entries.

Having a way to get all the ids directly would allow a caller to then navigate to either the flat container lists or the registration blobs.

client performance

When resolving packages the nuget client searches each available feed for the complete list of versions available for every package id. If a user is referencing 100 packages (common due to system packages) and they add an additional feed containing a single package, the new feed will be queried 100 times and 99 of those will be 404s.

If the client could determine up front what package ids the feeds contained then it could avoid making 99 calls for packages that aren't on the feed.
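A minimal Python sketch of that filtering step, assuming each feed's id list has already been downloaded. The `sources_to_query` helper and the `id_index` mapping are hypothetical names for illustration, not part of any NuGet client.

```python
from typing import Dict, List, Optional, Set


def sources_to_query(package_id: str,
                     id_index: Dict[str, Optional[Set[str]]]) -> List[str]:
    """Return the feeds worth querying for package_id.

    id_index maps a feed name to the set of lower-cased ids it is known to
    contain, or to None when the feed publishes no id list (assumed shape).
    """
    wanted = package_id.lower()  # NuGet package ids are case-insensitive
    result = []
    for source, ids in id_index.items():
        # A feed without an id list still has to be queried, as happens today.
        if ids is None or wanted in ids:
            result.append(source)
    return result


id_index = {
    "NuGet.org": {"newtonsoft.json", "nunit"},
    "small-feed": {"contoso.internal"},  # made-up feed with a single package
    "legacy-feed": None,                 # no id list available
}
print(sources_to_query("Newtonsoft.Json", id_index))
```

Feeds that do not offer the id list fall back to today's behavior, so the optimization would be opt-in per source.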

Additional considerations

From discussing this with @joelverhagen and @ryuyu

Should the index be a flat file that can be requested in a single call or a tree that supports a large number of package ids?
Clients could cache this locally and use e-tags to optimize performance
The cache would need to be invalidated more aggressively to avoid blocking recently uploaded packages
This should be static to support a high volume of traffic
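The e-tag idea above could look roughly like the following Python sketch. It assumes a `fetch(url, etag)` callable that performs an HTTP GET with `If-None-Match` and returns `(status, etag, body)`. That signature, the `IdListCache` class and the example URL are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

# fetch(url, etag) -> (status, new_etag, body). Stands in for an HTTP GET that
# sends If-None-Match when etag is not None (assumed signature).
Fetch = Callable[[str, Optional[str]], Tuple[int, Optional[str], Optional[bytes]]]


class IdListCache:
    """Keeps the last body and e-tag per URL and revalidates on every read."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._entries: Dict[str, Tuple[Optional[str], bytes]] = {}

    def get(self, url: str) -> bytes:
        cached = self._entries.get(url)
        etag = cached[0] if cached else None
        status, new_etag, body = self._fetch(url, etag)
        if status == 304 and cached:
            return cached[1]  # unchanged on the server, reuse the local copy
        if status == 200 and body is not None:
            self._entries[url] = (new_etag, body)
            return body
        raise RuntimeError(f"unexpected status {status} for {url}")


calls = []


def fake_fetch(url, etag):
    # Simulated static file whose content never changes.
    calls.append(etag)
    if etag == '"v1"':
        return 304, '"v1"', None
    return 200, '"v1"', b'["a","b"]'


cache = IdListCache(fake_fetch)
first = cache.get("https://example.test/ids.json")
second = cache.get("https://example.test/ids.json")
```

The second call transfers no body, which is what keeps frequent revalidation cheap for both the client and a static host.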

Feature

Source

emgarten

👍3

Most helpful comment

I tried a proof of concept out to measure any performance increase this optimization could give. It's not at all ready for use, but this is my code:
https://github.com/joelverhagen/NuGet.Client/commit/9d17f37f317749ce49b7a411926f2d6f1bf517d9

I essentially added a new resource that skips checking flat container or registration blobs if the request ID is not in a hard-coded list of IDs per source.

I then performed a restore of the following project (dotnet new mvc) after clearing cache:

<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>netcoreapp1.1</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.AspNetCore" Version="1.1.2" />
    <PackageReference Include="Microsoft.AspNetCore.Mvc" Version="1.1.3" />
    <PackageReference Include="Microsoft.AspNetCore.StaticFiles" Version="1.1.2" />
    <PackageReference Include="Microsoft.Extensions.Logging.Debug" Version="1.1.2" />
    <PackageReference Include="Microsoft.VisualStudio.Web.BrowserLink" Version="1.1.2" />
  </ItemGroup>
</Project>

Note that all of the packages in the restore graph (205 packages in all) come from NuGet.org. The other sources I had configured did not have the packages.

I used the following NuGet.config: NuGet.org, 2 MyGet sources, and a VSTS source:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <clear />
    <add key="jver-sandbox" value="https://www.myget.org/F/jver-sandbox/api/v3/index.json" />
    <add key="rx" value="https://dotnet.myget.org/F/rx/api/v3/index.json" />
    <add key="knapcode" value="https://knapcode.pkgs.visualstudio.com/_packaging/knapcode-nugetprotocol/nuget/v3/index.json" />
    <add key="NuGet.org" value="https://api.nuget.org/v3/index.json" />
  </packageSources>
  <disabledPackageSources>
     <clear />
  </disabledPackageSources>
</configuration>

Here are my performance measurements.

Notes:

All times are in seconds.
"Before" means bits without the optimization (nuget.exe 4.3.0-preview1-4056 was used)
"After" means bits with the optimization
"NuGet.org first" means NuGet.org was the first package source in the configuration
"NuGet.org last" means NuGet.org was the last package source in the configuration

Attempt | Before, NuGet.org first | After, NuGet.org first | Before, NuGet.org last | After, NuGet.org last
--- | --- | --- | --- | ---
1 | 54.815 | 57.267 | 52.118 | 39.558
2 | 51.846 | 53.637 | 61.534 | 48.038
3 | 52.774 | 57.358 | 49.941 | 42.406
4 | 54.757 | 58.002 | 50.917 | 42.873
5 | 53.119 | 52.481 | 52.212 | 49.961
6 | 48.412 | 55.592 | 53.984 | 59.136
7 | 48.494 | 52.189 | 50.155 | 43.721
8 | 58.947 | 44.071 | 46.070 | 48.198
9 | 56.089 | 44.552 | 64.431 | 41.629
10 | 47.911 | 48.111 | 50.276 | 55.678
Average | 52.717 | 52.326 | 53.164 | 47.120

This indicates the following observations:

Order of sources seems to have some impact on restore performance. Presumably, sources that are further down in the list are attempted later, allowing more time to be wasted on requests that will eventually lead to 404 Not Found.
1. When NuGet.org was the first source in the list, no performance difference was observed.
2. When NuGet.org was the last source in the list, a performance difference was observed.
There is a measurable performance improvement (an 11% decrease in average restore time, from 53.164 s to 47.120 s, with NuGet.org last) when using this ID list concept. This should be considered a lower bound since I hard coded the list of IDs.
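For anyone who wants to check the arithmetic, here is a short Python sketch that recomputes the percentage decreases from the ten attempts in the table above. The `percent_decrease` helper is just an illustrative name.

```python
from statistics import mean

# Restore times in seconds, copied from the table above.
before_first = [54.815, 51.846, 52.774, 54.757, 53.119,
                48.412, 48.494, 58.947, 56.089, 47.911]
after_first = [57.267, 53.637, 57.358, 58.002, 52.481,
               55.592, 52.189, 44.071, 44.552, 48.111]
before_last = [52.118, 61.534, 49.941, 50.917, 52.212,
               53.984, 50.155, 46.070, 64.431, 50.276]
after_last = [39.558, 48.038, 42.406, 42.873, 49.961,
              59.136, 43.721, 48.198, 41.629, 55.678]


def percent_decrease(before, after):
    """Relative drop of the mean, in percent."""
    b, a = mean(before), mean(after)
    return (b - a) / b * 100


print(round(percent_decrease(before_first, after_first), 1))  # 0.7
print(round(percent_decrease(before_last, after_last), 1))    # 11.4
```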

The potential performance impact here is lower than I expected but still seems consequential. Another angle to consider is that this will likely help servers deal with load better since there will be many fewer 404 Not Found or terminated HTTP requests.

joelverhagen on 13 May 2017

❤2

All 18 comments

Imo this depends on the scenario. E.g. nuget.exe list will need such service, however nuget.exe list json should use search to retrieve its results. I wonder how many end users do a plain "list" without searching?

Yay on the client performance consideration though!

maartenba on 4 May 2017

@maartenba good point on list, I forgot it does searches also. The actual v3 list would need more design. But for general scenarios such as writing a script to download packages from a v3 feed or general discovery the id list would help.

emgarten on 4 May 2017

👍1

One scenario I am concerned about: feed proxies. Does this API mean that if feed A proxies B and C, this new endpoint will list all dependencies in B and C, too?

maartenba on 4 May 2017

Feedback

This could potentially be done either with a new service in index.json or by adding an index to flatcontainer that lists all package ids.

I thought more about this and I think it should be a new service in v3/index.json. We should definitely bake in the ability for this list to be broken up into pages, which does not really fit the flat container model. Additionally, some V3 implementations may not choose to support this feature. So, as we iterate on flat container protocol, I think it would be clearer to have this new thing as its own resource, versioned independently.

nuget.exe list support for v3 feeds

I think we really need to design "nuget.exe list for V3" separately. As @maartenba mentioned, there are multiple scenarios that "nuget.exe list for V2" supports:

1. enumerating listed package IDs
2. finding packages that match a search term
3. searching for the existence of an ID
4. (just thought of this one) listing all versions with -AllVersions

Scenario 1 could possibly be supported with this ID list, but we would need to decide how the "listed" state of a package affects this.

Scenario 2 and 3 could go through the search service, but it's also not clear how the "listed" state affects this.

Scenario 4 would have to essentially JOIN this ID list with flat container or registration.

client performance

I think we should focus on this as the primary goal. Given the lack of design around nuget.exe list for V3, we should take client performance as it is concrete and measurable.

Should the index be a flat file that can be requested in a single call or a tree that supports a large number of package ids?

As mentioned before, we should design in the tree idea. The page size can be flexible meaning server implementations can choose whether to put everything in one page or not.
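To make the tree idea concrete, here is a rough Python sketch of a client walking such an index. The JSON shape (a `pages` array whose entries either inline `ids` or point to a `url`) is purely an assumption for illustration, since no format has been agreed on. `load_json` stands in for an HTTP GET plus JSON parse.

```python
from typing import Callable, Dict, Iterator


def iter_package_ids(index_url: str,
                     load_json: Callable[[str], Dict]) -> Iterator[str]:
    """Yield every package id reachable from a paged id index (assumed shape)."""
    index = load_json(index_url)
    for page in index["pages"]:
        # A small server can inline everything in one page. A large one can
        # point each page at a separate document.
        ids = page.get("ids")
        if ids is None:
            ids = load_json(page["url"])["ids"]
        yield from ids


# Made-up documents standing in for a feed's responses.
documents = {
    "https://example.test/ids/index.json": {
        "pages": [
            {"ids": ["A", "B"]},
            {"url": "https://example.test/ids/page1.json"},
        ]
    },
    "https://example.test/ids/page1.json": {"ids": ["C"]},
}
all_ids = list(iter_package_ids("https://example.test/ids/index.json",
                                documents.__getitem__))
```

With this kind of layout the page size stays a server decision, and the client code is the same whether there is one page or thousands.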

Clients could cache this locally and use e-tags to optimize performance

This would not be necessary for all clients wanting to implement this new protocol, but our client should do this to save the download of a potentially large file. Given that new IDs are added pretty frequently, we should consider smaller page size.

This should be static to support a high volume of traffic

VSTS would still need to put this behind auth, right? I have always wondered why they do not push for the use of SAS tokens to eliminate the need for an app service in front of blob storage...

One scenario I am concerned about: feed proxies. Does this API mean that if feed A proxies B and C, this new endpoint will list all dependencies in B and C, too?

Yes. This should be relatively simple for the feed proxy by either using the catalog from A or B (if it exists) or by implementing smart caching of A and B using etags. Again, having smaller pages on A and B could optimize this.

More thoughts

I love the idea of this resource because it knits the V3 protocol together. It provides an easy way for clients to explore the entire corpus of packages on a source.

Data file size

I did some tests on how big this file would be on NuGet.org (AFAIK the source with the largest set of IDs). Today, there are 99551 IDs in the NuGetGallery database. If you put these in a JSON array (most naive approach), the file has the following characteristics:

2.49 MB uncompressed on disk.
751 KB gzipped using 7-zip on "normal" compression.

To me this is not a scary large download.
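A measurement like this can be reproduced with a few lines of Python. The sketch below uses placeholder IDs rather than the real NuGet.org list, and Python's `gzip` module rather than 7-zip, so its numbers will not match the ones above.

```python
import gzip
import json


def id_list_sizes(ids):
    """Return (uncompressed_bytes, gzipped_bytes) for a naive JSON array of ids."""
    raw = json.dumps(sorted(ids), separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Placeholder ids, NOT real data from NuGet.org.
sample_ids = [f"Contoso.Package{i}" for i in range(1000)]
raw_size, gzipped_size = id_list_sizes(sample_ids)
print(raw_size, gzipped_size)
```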

Client usage

The tricky part in my mind would be how clients use this file. This is a lot of data to keep in memory for each nuget.exe operation, so I think our NuGet client should persist the data in a query-able, on-disk data file. A SQLite database comes to mind. This would also be convenient for storage of etags.
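As a rough illustration of the on-disk idea, here is a Python sketch using the standard `sqlite3` module. The table names, columns and helper functions are all assumptions made up for this example, not an existing client schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the local id store. The schema is illustrative only."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS package_ids ("
        " source TEXT NOT NULL,"
        " id TEXT NOT NULL COLLATE NOCASE,"  # package ids are case-insensitive
        " PRIMARY KEY (source, id))"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " source TEXT NOT NULL, url TEXT NOT NULL, etag TEXT,"
        " PRIMARY KEY (source, url))"
    )
    return db


def replace_ids(db, source, ids):
    """Swap in a freshly downloaded id list for one source, atomically."""
    with db:  # commits on success, rolls back on error
        db.execute("DELETE FROM package_ids WHERE source = ?", (source,))
        db.executemany(
            "INSERT OR IGNORE INTO package_ids (source, id) VALUES (?, ?)",
            ((source, package_id) for package_id in ids),
        )


def source_has_id(db, source, package_id):
    row = db.execute(
        "SELECT 1 FROM package_ids WHERE source = ? AND id = ?",
        (source, package_id),
    ).fetchone()
    return row is not None


db = open_store()
replace_ids(db, "NuGet.org", ["Newtonsoft.Json", "NUnit"])
print(source_has_id(db, "NuGet.org", "newtonsoft.json"))
```

The lookup is a primary-key probe, so nothing has to be held in memory between nuget.exe invocations.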

Caching is the other tricky part. If we properly implement etags, we could have a pretty short cache time. Again, small page sizes would mean more round trips (albeit parallelizable) but more favorable granularity of caching.

Server updating the file

Server implementation for NuGet.org would probably have a catalog2ids job that follows behind the flat container and registration cursors updating the ID list.

joelverhagen on 4 May 2017

❤1

Great insights!

Just wondering how many times enumerating listed package IDs would happen typically (and wondering if the current autocomplete endpoint could serve this content with a minor modification in sorting/paging?)

maartenba on 4 May 2017

I like this idea!

Seeker1437 on 4 May 2017

We could make a very compelling case for this client and server work if we develop a proof of concept in the client and demonstrate the potential performance improvements.

For example, we could update the client to use a hard-coded list of IDs available on nuget.org and two small MyGet feeds. Then, run a restore on a project with NETStandard.Library as a dependency. Clear cache and time the restore with and without this optimization.

Theoretically this should be a lot faster since all of the dependencies will come from NuGet.org and we will not need to hit the MyGet feeds at all.

joelverhagen on 4 May 2017

https://github.com/NuGet/Home/issues/5184 does sound compelling, too :-) (slightly related)

maartenba on 9 May 2017

Perhaps https://github.com/NuGet/Home/issues/5184 would yield better performance? E.g. try the known origin for a package ID first, and in case of failure try the others.

maartenba on 13 May 2017

Restore will stop as soon as it finds an exact match from a remote source, which is why order can matter:
https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/NuGet.DependencyResolver.Core/ResolverUtility.cs#L390-L391

If floating versions are used then all sources will be checked since there is no exact match, so the id list would be helpful there.

11% decrease in time seems pretty huge considering that most of the time here is probably spent downloading the packages. So this probably takes off more than half the time of finding the right source.

I would expect NuGet/Home#5184 to be the exact same performance as @joelverhagen's hardcoded lists here. The noop from the resources would be very fast.

emgarten on 14 May 2017

@joelverhagen, @emgarten, is this something you are still considering, or is NuGet/Home#5184 the chosen feature?

skofman1 on 7 Jun 2017

I think these both make sense. We can implement them separately and still get value from both. This particular issue is a broader performance improvement for any package not in the user packages folder. Issue https://github.com/NuGet/Home/issues/5184 is an additional control for users that really care where a package comes from.

joelverhagen on 7 Jun 2017

This server feature issue would provide benefits to all users with newer clients without any changes required on their part, and it would work on the first restore.

NuGet/Home#5184 is not being considered on the client side currently. To make it work, the user would end up having to specify the source for hundreds of packages manually.

I would really like to see this id service move forward. I think this needs:

Design of the json file format (done by both client and server)
Generate the file on nuget.org (server)
Resource for reading and using the new service (client)

The catalog and registration blobs have a consistent design in how they page and list items, we could use the same thing for this. We just need to agree on what it would look like.

emgarten on 7 Jun 2017

11% decrease in time seems pretty huge considering that most of the time here is probably spent downloading the packages. So this probably takes off more than half the time of finding the right source.

Is there a part of the code I could instrument to more clearly show the performance improvement? I'm not sure what you mean by "finding the right source".

Even if this is the case, is it really worth it to add a new V3 resource and a lot of new client code for a small improvement for the whole restore operation?

joelverhagen on 8 Jun 2017

I'm not sure what you mean by "finding the right source".

The time it takes to query all sources for their list of package versions to find the source that actually contains the package.

is it really worth it to add a new V3 resource and a lot of new client code for a small improvement for the whole restore operation?

I don't think any other optimization could come close to improving the full download scenario by 11%. The amount of client code needed depends on how complex the paging and caching strategy is.

emgarten on 8 Jun 2017

discussed offline with @joelverhagen, this needs more investigation

emgarten on 8 Jun 2017

One idea that occurred to me over the weekend is that the resource could come in the form of a serialized bloom filter. This would be small and be really quick for the client to query against once downloaded.
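A minimal Python sketch of what querying such a filter could look like. It assumes SHA-256 based double hashing and the standard sizing formulas. The class and its parameters are illustrative only and do not define any serialization format.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, package_id: str):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(package_id.lower().encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hashes):
            yield (h1 + i * h2) % self.size

    def add(self, package_id: str) -> None:
        for pos in self._positions(package_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, package_id: str) -> bool:
        # False means "definitely not on this feed". True means "maybe".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(package_id))


known_ids = ["Newtonsoft.Json", "NUnit", "Microsoft.AspNetCore"]
feed_filter = BloomFilter(expected_items=len(known_ids))
for package_id in known_ids:
    feed_filter.add(package_id)
print(feed_filter.might_contain("newtonsoft.json"))
```

A false positive only costs one extra request that ends in a 404, which is what happens today anyway, and a Bloom filter never produces false negatives, so no package would be hidden from the client.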

joelverhagen on 8 Jan 2018

👍1
