NuGetGallery: Improve search: Re-evaluate camel-split strategy

Created on 29 Jul 2020 · 8 Comments · Source: NuGet/NuGetGallery

_This issue has been moved from a ticket on Developer Community._


I don't know whether there is a limitation with nuget.org itself (the website is similarly basic) but searching for a nuget package in Visual Studio could be significantly improved if the search box treated space-separated strings as separate tokens and _ANDED_ these together when performing the search.

For example, when entering "imm coll l" I'd want to see "System.Collections.Immutable" - NOT every package that contains _either_ "imm" or "coll". (I can't imagine why this appears to be the default behaviour...)
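The AND behaviour requested above can be sketched as follows. This is a minimal illustration, not how nuget.org or Visual Studio search is implemented: the function names are hypothetical, and it assumes package IDs are split on dots, dashes, and underscores and that each query term is treated as a token prefix. It uses the first two terms of the example query.

```python
import re

def id_tokens(package_id: str) -> list[str]:
    """Split a package ID on dots, dashes, underscores, and spaces; lowercase the pieces."""
    return [t.lower() for t in re.split(r"[.\-_\s]+", package_id) if t]

def matches_all_terms(query: str, package_id: str) -> bool:
    """True only if EVERY space-separated query term is a prefix of some ID token."""
    tokens = id_tokens(package_id)
    return all(
        any(tok.startswith(term) for tok in tokens)
        for term in query.lower().split()
    )

# Both terms match a token prefix, so the package is kept.
print(matches_all_terms("imm coll", "System.Collections.Immutable"))  # True
# Only "coll" matches, so an AND search drops this package.
print(matches_all_terms("imm coll", "System.Collections.Generic"))    # False
```

With OR semantics the second package would also be returned, which is the clutter described above.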

Also, I'd like to be able to filter/rank based on number of downloads or "last updated" since there are a huge number of ancient packages which simply clutter up the result list


Original Comments

Feedback Bot on 5/8/2020, 01:34 AM:

Thank you for taking the time to provide your suggestion. We will do some preliminary checks to make sure we can proceed further. We'll provide an update once the issue has been triaged by the product team.

Peter Shaw on 5/15/2020, 04:27 AM:

In response to this tweet I posted:

https://twitter.com/shawty_ds/status/1261061801959178241

I'd also like to back this suggestion up.

I understand the need for some results to show up when a user is not sure what they are searching for, but right now as it stands, Visual Studio NuGet search IS UNUSABLE unless you know the EXACT spelling of the package you're looking for.

More to the point, the sub-string search is way too LAX. For example, I do a search for "MapWinGis" (I know this term is used in the package description, but is not the namespace name) and I get hundreds of results, simply because many package names have "gi", "ma", "in" or "wi" ("wi" being an incredibly popular term, since we are on Windows). Showing these results is absolutely not helpful.

Filtering and ranking the results list is a really good idea, as is improving the search to be more like, say, Google search. For example, "MapWin" "GIS" would return packages that contain those exact phrases, and "Collections" -"Microsoft" would return anything containing the exact phrase "Collections", but only if it does NOT contain the exact phrase "Microsoft".
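To make the proposed operator syntax concrete, here is a rough sketch of parsing quoted phrases and a leading minus sign. NuGet search does not offer this syntax according to this thread; `parse_query` and `matches` are hypothetical names, and matching is a plain case-insensitive substring test.

```python
import re

def parse_query(query: str):
    """Return (required, excluded) phrase lists from a query like '"Collections" -"Microsoft"'."""
    required, excluded = [], []
    # Each match is (optional minus, quoted phrase, bare word); one of the last two is empty.
    for minus, quoted, bare in re.findall(r'(-?)(?:"([^"]*)"|(\S+))', query):
        phrase = (quoted or bare).lower()
        if not phrase:
            continue
        (excluded if minus else required).append(phrase)
    return required, excluded

def matches(query: str, text: str) -> bool:
    """True if text contains every required phrase and none of the excluded ones."""
    required, excluded = parse_query(query)
    haystack = text.lower()
    return all(p in haystack for p in required) and not any(p in haystack for p in excluded)

print(parse_query('"MapWin" "GIS"'))                                           # (['mapwin', 'gis'], [])
print(matches('"Collections" -"Microsoft"', "System.Collections.Immutable"))     # True
print(matches('"Collections" -"Microsoft"', "Microsoft.Extensions.Collections")) # False
```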

As much as I dislike Google these days, I have to admit that its search operators make for a very powerful search syntax that lets you be really specific about what you're searching for. Having something similar in NuGet would be very, very helpful, especially when companies like "ThinkGeo" have literally filled the first 10 pages of NuGet with their "free (but not actually free)" GIS library. (Try a search for GIS in NuGet and see how long it takes before you see anything other than their offering.) It would allow us to filter out bad actors like this that are using NuGet to drive sales, and focus on finding what we need fast.

Search feature-request

All 8 comments

I don't think this item is particularly actionable. It appears the customer would like to use this project: https://github.com/MapWindow/MapWinGIS

This project does not appear to exist on nuget.org. Their documentation reads:

Get MW5
...
Go to the GitHub site to get the source code or download installers: github.com/MapWindow

For this comment:

More to the point, the sub-string search is way too LAX. For example, I do a search for "MapWinGis" (I know this term is used in the package description, but is not the namespace name) and I get hundreds of results, simply because many package names have "gi", "ma", "in" or "wi" ("wi" being an incredibly popular term, since we are on Windows)

Our prefix matching will only match results that start with MapWinGis. MapWinGis gets tokenized into map, win, and gis. There are many popular results that match to the tokens map and win (the latter is used by Windows packages).
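A camel-split of this kind can be approximated as below. This is only a sketch: the regular expression and the `camel_split` name are assumptions for illustration, not the analyzer the search service actually uses.

```python
import re

def camel_split(package_id: str) -> list[str]:
    """Split on separators, then on camel-case boundaries; return lowercase tokens."""
    tokens = []
    for piece in re.split(r"[.\-_\s]+", package_id):
        # Acronym runs (e.g. "GIS"), capitalized or lowercase words, or digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", piece))
    return [t.lower() for t in tokens]

print(camel_split("MapWinGis"))        # ['map', 'win', 'gis']
print(camel_split("StreamDeckSharp"))  # ['stream', 'deck', 'sharp']
print(camel_split("streamdecksharp"))  # ['streamdecksharp']
```

The last line shows why casing matters: an all-lowercase ID has no boundaries to split on, so it stays a single token.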

There has been a growing set of reports related to package search on nuget.org that suggest we should investigate moving away from camel-split. Currently, the space of packages matched is often very large due to "camel-split". We could evaluate in the future via an A/B test how things perform if we leave "MapWinGis" as a single token instead of "Map" "Win" and "Gis". It's not clear at this time whether it will help more cases than it would hurt, but it's certainly worth keeping track of as an option for evaluation.

To @loic-sharma's point, if you search "mapwin" or "mapwingis" (side-stepping the camel-split tokenization), no results are returned, so it's not clear what the ideal result at the top of the search results would be.

IMO nuget search should rank tags and downloads better than some arbitrary camelcase split.
I maintain a package called streamdecksharp, initially uploaded with an all-lowercase package id. Little did I know that this would mess up the ranking beyond repair.

Here are the search results from a few minutes ago for "stream deck" and "streamdeck"
image

image

I also ~recently~ changed the capitalization about two years ago in the nuspec file to "StreamDeckSharp" and hoped it would fix the index but nuget doesn't seem to update casing after the initial upload.

Is there any way to fix this or work around that (how could I update the casing of my nuget package?)

Maybe I'm biased but this is what the results for "stream deck" and "streamdeck" should look like:
image

Your rename to StreamDeckSharp did help, by the way. Perhaps it took some time to propagate. For searching purposes, the latest package ID casing is used for tokenization. Thank you for the thorough explanation. We will consider this change for the future.
https://www.nuget.org/packages?q=stream+deck
image

I changed that almost exactly two years ago (File diff) and every package since (and including) 0.2.0 had a camel case package id.
_I updated my previous comment because two years ago isn't really "recently"_

It looks like something was changed recently, because my package pushed yesterday was the first update that had any effect. It "fixed" the search result for "stream deck" IMO, but searching for "streamdeck" still yields results I don't understand. My project is listed below a package that has just about 100 downloads, was last updated about a year ago, and doesn't even specify a project site or license. And nuget ranks that package as more relevant?

Something is still off, because if you open the detail page of the package it's still lowercase.

Thanks for reporting this! These are definitely areas we can improve.

searching for "streamdeck" still yields results I don't understand

This is expected as we haven't implemented shingling yet.

The search service does something called tokenization: it chops the package ID StreamDeckSharp into the tokens Stream, Deck, and Sharp. However, we don't create "shingles" yet which are combinations of tokens like StreamDeck, or DeckSharp. This is tracked here: https://github.com/NuGet/NuGetGallery/issues/7390
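As a rough illustration of shingling, the sketch below joins adjacent tokens into longer combined tokens. The `shingles` function and its `max_size` parameter are hypothetical; the design tracked in the linked issue may differ.

```python
def shingles(tokens, max_size=None):
    """Return concatenations of two or more adjacent tokens, shortest first."""
    n = len(tokens)
    max_size = n if max_size is None else max_size
    out = []
    for size in range(2, min(max_size, n) + 1):
        for start in range(n - size + 1):
            out.append("".join(tokens[start:start + size]))
    return out

print(shingles(["stream", "deck", "sharp"]))
# ['streamdeck', 'decksharp', 'streamdecksharp']
```

If the shingle "streamdeck" were indexed alongside the individual tokens, a query for "streamdeck" could match StreamDeckSharp directly.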

I've added StreamDeckSharp to the list of results that would be improved by adding shingling.

Something is still off, because if you open the detail page of the package it's still lowercase.

The nuget.org website displays the casing that was used when the first version was uploaded. The search service will use the latest version's package ID for tokenization. Sadly, fixing this is a bit tricky so we haven't done it yet. See: https://github.com/NuGet/NuGetGallery/issues/3349

@wischi-chr, the tokenization was wrong due to a flaw in our index rebuild strategy. When we rebuild the Azure Search index, metadata comes from the database, but for package pushes after that point in time, the data comes from the NuGet package itself.

Typically database info is easily derived from the NuGet package or vice versa, but in this case (as @loic-sharma mentioned) our database has a flaw where only the first casing is used. Thanks for helping us connect the dots 👍.

We have shipped a change to our search service which makes all camel-split tokens required in the matched package. Previously only one had to be matched, but matching all tokens would give the result a boost. This has improved the situations mentioned above. As always, improving nuget.org search is an incremental process and some searches work better than others. We optimize our algorithm for the most common searches since this helps the most users, but there is a long tail of less common searches that we are interested in as well.
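The difference between the two matching modes can be sketched like this. The scoring numbers and the function names `score_any` and `score_all` are made up for illustration; the real service uses Azure Search scoring, which is not shown in this thread.

```python
def score_any(query_tokens, package_tokens):
    """Previous behaviour as described: one matching token is enough; matching all earns a boost."""
    hits = sum(1 for t in query_tokens if t in package_tokens)
    if hits == 0:
        return 0.0
    boost = 2.0 if hits == len(query_tokens) else 1.0  # hypothetical boost factor
    return hits * boost

def score_all(query_tokens, package_tokens):
    """New behaviour as described: every query token must appear in the package."""
    if all(t in package_tokens for t in query_tokens):
        return float(len(query_tokens))
    return 0.0

query = ["map", "win", "gis"]
windows_pkg = {"win", "forms"}
print(score_any(query, windows_pkg))  # 1.0, so the package is listed on the strength of "win" alone
print(score_all(query, windows_pkg))  # 0.0, so the package is filtered out
```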

Per the original poster, the MapWinGis search now returns a smaller set of packages, all related to GIS. There is no package with that exact name on nuget.org so it can't appear at the top. The ThinkGeo packages show up as mentioned.

@wischi-chr, the situation with StreamDeckSharp is improved a bit I think.

  • stream deck - StreamDeckSharp is rank 1
  • streamdeck - rank 7
  • stream deck sharp - rank 1
  • streamdeck sharp - rank 1
  • streamdeck sharp - rank 1
  • StreamDeckSharp - rank 1

Again for the StreamDeckSharp vs. streamdecksharp casing issue on the package details page, it is tracked here: https://github.com/NuGet/NuGetGallery/issues/3349#issuecomment-730649782. I've added additional context that this difference can also affect search results.

I will close this issue for now since we've made a moderate improvement in the reported areas. If you encounter additional problems, please open new issues. Thank you all for taking the time to report these pain points so that we can build a better NuGet package search, together.
