NuGet v3 feeds should provide a way to get the full list of package ids available on the feed.
This could potentially be done either with a new service in index.json or by adding an index to flatcontainer that lists all package ids.
As v3 feeds exist today, the only way to discover all package ids and versions from a feed is to read the catalog. This requires a large number of URL requests to get through all the edits and unneeded entries.
Having a way to get all the ids directly would allow a caller to then navigate to either the flat container lists or the registration blobs.
When resolving packages, the NuGet client queries each available feed for the complete list of versions available for every package id. If a user is referencing 100 packages (common due to system packages) and they add an additional feed containing a single package, the new feed will be queried 100 times and 99 of those requests will be 404s.
If the client could determine up front what package ids the feeds contained then it could avoid making 99 calls for packages that aren't on the feed.
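A minimal sketch of the request count described above, assuming the client already holds a set of package ids per feed. The feed names, the ids, and the `feeds_to_query` helper are hypothetical and are not part of any NuGet client API. Ids are lowercased before comparison because NuGet package ids are case-insensitive.

```python
def feeds_to_query(package_id, feed_ids):
    """Return the feeds whose known id set contains package_id."""
    wanted = package_id.lower()
    return [feed for feed, ids in feed_ids.items() if wanted in ids]


# Hypothetical setup: 100 referenced packages, one feed that has all of them,
# and one extra feed that has exactly one of them.
referenced = [f"Package{i}" for i in range(100)]
feed_ids = {
    "big-feed": {p.lower() for p in referenced},
    "small-feed": {"package7"},
}

# Without an id list, every feed is asked about every package.
requests_without_list = len(referenced) * len(feed_ids)

# With an id list, a feed is only asked about ids it is known to contain.
requests_with_list = sum(len(feeds_to_query(p, feed_ids)) for p in referenced)

print(requests_without_list, requests_with_list)
```

In this setup the extra feed receives one request instead of 100, which is the saving of 99 calls described above.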
From discussing this with @joelverhagen and @ryuyu
Imo this depends on the scenario. E.g. nuget.exe list will need such a service; however, nuget.exe list json should use search to retrieve its results. I wonder how many end users do a plain "list" without searching?
Yay on the client performance consideration though!
@maartenba good point on list, I forgot it does searches also. The actual v3 list would need more design. But for general scenarios such as writing a script to download packages from a v3 feed or general discovery the id list would help.
One scenario I am concerned about: feed proxies. Does this API mean that if feed A proxies B and C, this new endpoint will list all dependencies in B and C, too?
> This could potentially be done either with a new service in index.json or by adding an index to flatcontainer that lists all package ids.
I thought more about this and I think it should be a new service in v3/index.json. We should definitely bake in the ability for this list to be broken up into pages, which does not really fit the flat container model. Additionally, some V3 implementations may choose not to support this feature. So, as we iterate on the flat container protocol, I think it would be clearer to have this new thing as its own resource, versioned independently.
nuget.exe list support for v3 feeds
I think we really need to design "nuget.exe list for V3" separately. As @maartenba mentioned, there are multiple scenarios that "nuget.exe list for V2" supports:
-AllVersions
Scenario 1 could possibly be supported with this ID list, but we would need to decide how the "listed" state of a package affects this.
Scenarios 2 and 3 could go through the search service, but it's also not clear how the "listed" state affects this.
Scenario 4 would have to essentially JOIN this ID list with flat container or registration.
client performance
I think we should focus on this as the primary goal. Given the lack of design around nuget.exe list for V3, we should take client performance as it is concrete and measurable.
Should the index be a flat file that can be requested in a single call or a tree that supports a large number of package ids?
As mentioned before, we should design in the tree idea. The page size can be flexible meaning server implementations can choose whether to put everything in one page or not.
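One way the tree idea could look, as a sketch only: an index document lists pages, and each page entry records the lowest and highest id it holds so a client can pick a single page to download. The `lower`/`upper` naming loosely mirrors how registration pages describe their bounds, but this schema, the page URLs, and the helper names are all assumptions made for illustration.

```python
from bisect import bisect_left


def build_index(ids, page_size):
    """Split sorted, lowercased ids into pages and describe them in an index."""
    ordered = sorted({i.lower() for i in ids})
    chunks = [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]
    pages = {f"page/{n}.json": chunk for n, chunk in enumerate(chunks)}
    index = {
        "count": len(chunks),
        "items": [
            {"@id": url, "lower": chunk[0], "upper": chunk[-1], "count": len(chunk)}
            for url, chunk in pages.items()
        ],
    }
    return index, pages


def find_page(index, package_id):
    """Return the URL of the only page that could hold package_id, or None."""
    wanted = package_id.lower()
    uppers = [item["upper"] for item in index["items"]]
    pos = bisect_left(uppers, wanted)
    if pos == len(uppers) or index["items"][pos]["lower"] > wanted:
        return None
    return index["items"][pos]["@id"]


def contains(index, pages, package_id):
    """Check membership by looking only at the one candidate page."""
    url = find_page(index, package_id)
    return url is not None and package_id.lower() in pages[url]


index, pages = build_index(["Newtonsoft.Json", "NUnit", "Moq", "Serilog", "xunit"], 2)
print(find_page(index, "NUnit"), contains(index, pages, "Dapper"))
```

A server that prefers a single flat file can pass a `page_size` at least as large as the id count, which yields an index with exactly one page.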
Clients could cache this locally and use e-tags to optimize performance
This would not be necessary for all clients wanting to implement this new protocol, but our client should do this to save the download of a potentially large file. Given that new IDs are added pretty frequently, we should consider a smaller page size.
This should be static to support a high volume of traffic
VSTS would still need to put this behind auth, right? I have always wondered why they do not push for the use of SAS tokens to eliminate the need for an app service in front of blob storage...
> One scenario I am concerned about: feed proxies. Does this API mean that if feed A proxies B and C, this new endpoint will list all dependencies in B and C, too?
Yes. This should be relatively simple for the feed proxy, either by using the catalog from A or B (if it exists) or by implementing smart caching of A and B using etags. Again, having smaller pages on A and B could optimize this.
I love the idea of this resource because it knits the V3 protocol together. It provides an easy way for clients to explore the entire corpus of packages on a source.
I did some tests on how big this file would be on NuGet.org (AFAIK the source with the largest set of IDs). Today, there are 99551 IDs in the NuGetGallery database. If you put these in a JSON array (most naive approach), the file has the following characteristics:
To me this is not a scary large download.
The tricky part in my mind would be how clients use this file. This is a lot of data to keep in memory for each nuget.exe operation, so I think our NuGet client should persist the data in a query-able, on-disk data file. A SQLite database comes to mind. This would also be convenient for storage of etags.
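As a rough illustration of the on-disk store suggested above, here is a sketch using Python's built-in `sqlite3` module. The table layout and function names are assumptions made for this example, not a proposed client design. It keeps one row per id and one etag per downloaded page.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache; pass a file path to persist it on disk."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "source TEXT, url TEXT, etag TEXT, PRIMARY KEY (source, url))"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS ids ("
        "source TEXT, url TEXT, id TEXT, PRIMARY KEY (source, id))"
    )
    return db


def store_page(db, source, url, etag, ids):
    """Replace the cached contents of one page in a single transaction."""
    with db:
        db.execute("DELETE FROM ids WHERE source = ? AND url = ?", (source, url))
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (source, url, etag))
        db.executemany(
            "INSERT OR REPLACE INTO ids VALUES (?, ?, ?)",
            [(source, url, i.lower()) for i in ids],
        )


def has_id(db, source, package_id):
    """Indexed lookup, so the full id list never has to be held in memory."""
    row = db.execute(
        "SELECT 1 FROM ids WHERE source = ? AND id = ?", (source, package_id.lower())
    ).fetchone()
    return row is not None


def get_etag(db, source, url):
    """Return the etag recorded for a page, or None if it was never stored."""
    row = db.execute(
        "SELECT etag FROM pages WHERE source = ? AND url = ?", (source, url)
    ).fetchone()
    return row[0] if row else None


db = open_cache()
store_page(db, "example-feed", "page/0.json", '"abc"', ["Moq", "NUnit"])
print(has_id(db, "example-feed", "nunit"), get_etag(db, "example-feed", "page/0.json"))
```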
Caching is the other tricky part. If we properly implement etags, we could have a pretty short cache time. Again, small page sizes would mean more round trips (albeit parallelizable) but more favorable granularity of caching.
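The etag revalidation described above can be sketched without any network access. `FakeFeed` is a hypothetical in-memory stand-in for a static id page; only the `If-None-Match` / `304 Not Modified` behavior it imitates is standard HTTP. The client re-downloads the page only when the etag has changed.

```python
class FakeFeed:
    """Hypothetical stand-in for a static id page served with an ETag."""

    def __init__(self, ids):
        self.ids = list(ids)
        self.version = 1
        self.full_downloads = 0

    def get(self, if_none_match=None):
        etag = f'"v{self.version}"'
        if if_none_match == etag:
            return 304, etag, None  # Not Modified: no body is sent
        self.full_downloads += 1
        return 200, etag, list(self.ids)

    def add(self, package_id):
        self.ids.append(package_id)
        self.version += 1  # any change to the page yields a new etag


class CachedIdList:
    """Client-side copy of the page that revalidates instead of re-downloading."""

    def __init__(self, feed):
        self.feed = feed
        self.etag = None
        self.ids = set()

    def refresh(self):
        status, etag, body = self.feed.get(if_none_match=self.etag)
        if status == 200:
            self.etag = etag
            self.ids = {i.lower() for i in body}
        return status


feed = FakeFeed(["Moq"])
cache = CachedIdList(feed)
statuses = [cache.refresh(), cache.refresh()]
feed.add("NUnit")
statuses.append(cache.refresh())
print(statuses, feed.full_downloads)
```

With smaller pages, a new id would invalidate only the page it lands in, so the other pages would keep answering with 304.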
Server implementation for NuGet.org would probably have a catalog2ids job that follows behind the flat container and registration cursors updating the ID list.
Great insights!
Just wondering how many times enumerating listed package IDs would happen typically (and wondering if the current autocomplete endpoint could serve this content with a minor modification in sorting/paging?)
I like this idea!
We could make a very compelling case for this client and server work if we develop a proof of concept in the client and demonstrate the potential performance improvements.
For example, we could update the client to use a hard-coded list of IDs available on nuget.org and two small MyGet feeds. Then, run a restore on a project with NETStandard.Library as a dependency. Clear cache and time the restore with and without this optimization.
Theoretically this should be a lot faster since all of the dependencies will come from NuGet.org and we will not need to hit the MyGet feeds at all.
https://github.com/NuGet/Home/issues/5184 does sound compelling, too :-) (slightly related)
I tried a proof of concept out to measure any performance increase this optimization could give. It's not at all ready for use, but this is my code:
https://github.com/joelverhagen/NuGet.Client/commit/9d17f37f317749ce49b7a411926f2d6f1bf517d9
I essentially added a new resource that skips checking flat container or registration blobs if the request ID is not in a hard-coded list of IDs per source.
I then performed a restore of the following project (dotnet new mvc) after clearing cache:
<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>netcoreapp1.1</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.AspNetCore" Version="1.1.2" />
    <PackageReference Include="Microsoft.AspNetCore.Mvc" Version="1.1.3" />
    <PackageReference Include="Microsoft.AspNetCore.StaticFiles" Version="1.1.2" />
    <PackageReference Include="Microsoft.Extensions.Logging.Debug" Version="1.1.2" />
    <PackageReference Include="Microsoft.VisualStudio.Web.BrowserLink" Version="1.1.2" />
  </ItemGroup>
</Project>
Note that all of the packages in the restore graph (205 packages in all) come from NuGet.org. The other sources I had configured did not have the packages.
I used the following NuGet.config: NuGet.org, 2 MyGet sources, and a VSTS source:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <clear />
    <add key="jver-sandbox" value="https://www.myget.org/F/jver-sandbox/api/v3/index.json" />
    <add key="rx" value="https://dotnet.myget.org/F/rx/api/v3/index.json" />
    <add key="knapcode" value="https://knapcode.pkgs.visualstudio.com/_packaging/knapcode-nugetprotocol/nuget/v3/index.json" />
    <add key="NuGet.org" value="https://api.nuget.org/v3/index.json" />
  </packageSources>
  <disabledPackageSources>
    <clear />
  </disabledPackageSources>
</configuration>
Here are my performance measurements.
Notes:
Attempt | Before, NuGet.org first | After, NuGet.org first | Before, NuGet.org last | After, NuGet.org last
--- | --- | --- | --- | ---
1 | 54.815 | 57.267 | 52.118 | 39.558
2 | 51.846 | 53.637 | 61.534 | 48.038
3 | 52.774 | 57.358 | 49.941 | 42.406
4 | 54.757 | 58.002 | 50.917 | 42.873
5 | 53.119 | 52.481 | 52.212 | 49.961
6 | 48.412 | 55.592 | 53.984 | 59.136
7 | 48.494 | 52.189 | 50.155 | 43.721
8 | 58.947 | 44.071 | 46.070 | 48.198
9 | 56.089 | 44.552 | 64.431 | 41.629
10 | 47.911 | 48.111 | 50.276 | 55.678
Average | 52.717 | 52.326 | 53.164 | 47.120
This indicates the following observations:
404 Not Found.
The potential performance impact here is lower than I expected but still seems consequential. Another angle to consider is that this will likely help servers deal with load better since there will be many fewer 404 Not Found or terminated HTTP requests.
Perhaps https://github.com/NuGet/Home/issues/5184 would yield better performance? E.g., try the known origin for a package ID first, and in case of failure try the others.
Restore will stop as soon as it finds an exact match from a remote source, which is why order can matter:
https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/NuGet.DependencyResolver.Core/ResolverUtility.cs#L390-L391
If floating versions are used then all sources will be checked since there is no exact match, so the id list would be helpful there.
11% decrease in time seems pretty huge considering that most of the time here is probably spent downloading the packages. So this probably takes off more than half the time of finding the right source.
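The 11% figure can be checked against the measurement table posted earlier in the thread. The sketch below recomputes the column averages and the relative decrease from the ten attempts. The values are copied from the table, whose units are not stated; the variable and function names are only for illustration.

```python
# Attempts 1-10 from the measurement table, one list per column.
before_first = [54.815, 51.846, 52.774, 54.757, 53.119, 48.412, 48.494, 58.947, 56.089, 47.911]
after_first = [57.267, 53.637, 57.358, 58.002, 52.481, 55.592, 52.189, 44.071, 44.552, 48.111]
before_last = [52.118, 61.534, 49.941, 50.917, 52.212, 53.984, 50.155, 46.070, 64.431, 50.276]
after_last = [39.558, 48.038, 42.406, 42.873, 49.961, 59.136, 43.721, 48.198, 41.629, 55.678]


def mean(values):
    return sum(values) / len(values)


def percent_decrease(before, after):
    """Relative drop in the average, as a percentage of the 'before' average."""
    return (mean(before) - mean(after)) / mean(before) * 100


print(f"NuGet.org first: {percent_decrease(before_first, after_first):.2f}%")
print(f"NuGet.org last:  {percent_decrease(before_last, after_last):.2f}%")
```

With NuGet.org listed last, the averages drop from about 53.164 to about 47.120, a decrease of roughly 11.4%. With NuGet.org listed first, the averages differ by less than 1%.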
I would expect NuGet/Home#5184 to be the exact same performance as @joelverhagen's hardcoded lists here. The noop from the resources would be very fast.
@joelverhagen, @emgarten, is this something you are still considering, or is NuGet/Home#5184 the chosen feature?
I think these both make sense. We can implement them separately and still get value from both. This particular issue is a broader performance improvement for any package not in the user packages folder. Issue https://github.com/NuGet/Home/issues/5184 is an additional control for users that really care where a package comes from.
This server feature issue would provide benefits to all users with newer clients without any changes required on their part, and it would work on the first restore.
NuGet/Home#5184 is not being considered on the client side currently; to make it work, the user would need to end up specifying the source for hundreds of packages manually.
I would really like to see this id service move forward. I think this needs:
The catalog and registration blobs have a consistent design in how they page and list items; we could use the same thing for this. We just need to agree on what it would look like.
> 11% decrease in time seems pretty huge considering that most of the time here is probably spent downloading the packages. So this probably takes off more than half the time of finding the right source.
Is there a part of the code I could instrument to more clearly show the performance improvement? I'm not sure what you mean by "finding the right source".
Even if this is the case, is it really worth it to add a new V3 resource and a lot of new client code for a small improvement for the whole restore operation?
> I'm not sure what you mean by "finding the right source".
The time it takes to query all sources for their list of package versions to find the source that actually contains the package.
> is it really worth it to add a new V3 resource and a lot of new client code for a small improvement for the whole restore operation?
I don't think any other optimization could come close to improving the full download scenario by 11%. The amount of client code needed depends on how complex the paging and caching strategy is.
Discussed offline with @joelverhagen; this needs more investigation.
One idea that occurred to me over the weekend is that the resource could come in the form of a serialized bloom filter. This would be small and be really quick for the client to query against once downloaded.
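To make the bloom filter idea concrete, here is a small self-contained sketch. The sizing, the SHA-256-based hashing scheme, and the byte layout are assumptions for illustration only; a real resource would need all of them specified so that clients and servers agree.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives only."""

    def __init__(self, size_bits, num_hashes):
        self.size_bits = size_bits
        self.num_hashes = num_hashes  # must stay below 256 for the seed byte
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, package_id):
        # Ids are lowercased because NuGet package ids are case-insensitive.
        key = package_id.lower().encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, package_id):
        for pos in self._positions(package_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, package_id):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(package_id)
        )

    def to_bytes(self):
        """Serialized form that a server could publish as a static blob."""
        return bytes(self.bits)


bloom = BloomFilter(size_bits=8192, num_hashes=4)
for package_id in ["Newtonsoft.Json", "NUnit", "Moq"]:
    bloom.add(package_id)
print(bloom.might_contain("nunit"), len(bloom.to_bytes()))
```

The error direction suits this use: a false positive costs one extra request that ends in a 404, while an id that was added is never reported as absent, so the client would not skip a feed that actually has the package.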