Chapel: Mason Search Integration to Mason Registry

Created on 22 May 2020 · 22 Comments · Source: chapel-lang/chapel

The problem

In #14968, it was suggested that mason search could rank mason packages by a "best match" metric. This matters for the future of the mason-registry, when several packages may have similar names or functionality. A ranking system based on package quality and maintenance should therefore be considered.

The solution

We create a .toml cache file on the registry that stores the name and points (the metric) of every package on the registry. Points are awarded on the basis of package quality, maintenance, and popularity. Some checks for awarding points could be:

Presence of README.md file
stability (chapel version)
tests (maybe check number of tests)
examples
size of the package ?
maintenance (version number of package)
popularity (left as future work)

So, every pull request to mason-registry shall be evaluated with those checks and the in-repo cache file shall be updated after the pull request is merged.
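
As a rough illustration of the checks listed above, the following Python sketch awards one point per passing check. The directory names (test, example), the one-point weights, and the function name score_package are assumptions made for this example; they are not part of mason.

```python
from pathlib import Path


def score_package(root):
    """Return (total, breakdown) for the package directory at `root`.

    Hypothetical scoring: one point for each check that passes.
    """
    root = Path(root)

    def has_files(subdir):
        # True when the subdirectory exists and contains at least one file.
        directory = root / subdir
        return directory.is_dir() and any(p.is_file() for p in directory.rglob("*"))

    breakdown = {
        "readme": int((root / "README.md").is_file()),
        "tests": int(has_files("test")),
        "examples": int(has_files("example")),
    }
    return sum(breakdown.values()), breakdown
```

Returning the breakdown alongside the total makes it possible to show a package author which checks contributed to the score.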

Design questions :

How would the mason-registry integrate with a user's mason search running locally? I assume the user will be required to update their $MASON_HOME/mason-registry, and mason search will look up queries in that directory's cache file.
How often will a user need to update their local mason-registry? Since mason-registry doesn't have versions as of now, how could a user check whether the local version of their cache file is old?
How would users be notified of this change? What error message should be displayed, and when?
Should there be a separate flag for mason search to enable this feature?

Major design issue :

There is also talk of moving the registry off of GitHub. How would that affect our feature? I'm in favor of keeping the registry on GitHub, as it is easy to maintain.

Tools mason Design

ankingcodes

👍1

All 22 comments

@ben-albrecht Kindly add the GSoC label to this issue.

ankingcodes on 22 May 2020

How would the mason-registry integrate with a user's mason search running locally? I assume the user will be required to update their $MASON_HOME/mason-registry, and mason search will look up queries in that directory's cache file.

How often will a user need to update their local mason-registry? Since mason-registry doesn't have versions as of now, how could a user check whether the local version of their cache file is old?

How would users be notified of this change ? What and when should an error message be displayed ?

Should there be a separate flag for mason search to enable this feature ?

Here's a starting idea: We can do something like git fetch --dry-run in each registry of MASON_REGISTRY to check if an update is needed. We can emit a message to run something like mason search --update if the user's registry is out of date.
This would need to be disabled if MASON_OFFLINE=true.
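
A minimal Python sketch of that idea, assuming each registry is a local git clone. The function name registry_needs_update and the injectable run parameter are hypothetical; only git fetch --dry-run and MASON_OFFLINE come from the comment above.

```python
import os
import subprocess


def registry_needs_update(registry_dir, env=None, run=subprocess.run):
    """Return True if `git fetch --dry-run` reports pending updates.

    Returns None when the check is skipped (MASON_OFFLINE=true) or when git
    fails, for example because there is no network. `run` is injectable so
    the function can be exercised without a real repository.
    """
    env = os.environ if env is None else env
    if env.get("MASON_OFFLINE", "").lower() == "true":
        return None
    result = run(
        ["git", "-C", str(registry_dir), "fetch", "--dry-run"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    # git lists the refs it would update; no output means the clone is current.
    return bool((result.stdout + result.stderr).strip())
```

A caller could print a hint to update the registry when this returns True and stay silent when it returns None.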

ben-albrecht on 8 Jun 2020

tests (maybe check number of tests)

I'd be cautious about requiring a fixed number of tests. Some more advanced things we could do are:

Ensure tests are passing in CI
Ensure tests run in some threshold considered a reasonable amount of time (unit tests should run quickly, though this may punish certain packages unfairly).

ben-albrecht on 8 Jun 2020

@Spartee @krishnadey30 In our last meeting, it was suggested that we explore other options for this issue (see also #14968). I have opened PR #15789, which improves the current implementation of mason search (read more at #15788). So now would be a good time to present your ideas for avoiding a cache, and for how they would integrate with mason search.

ankingcodes on 8 Jun 2020

How often will a user need to update their local mason-registry? Since mason-registry doesn't have versions as of now, how could a user check whether the local version of their cache file is old?

We could also look to see what other package managers do here. To start us off:

homebrew: notifies user if their registry is out of date and instructs them to run brew update
cargo: updates the registry automatically

ben-albrecht on 8 Jun 2020

Summarizing the discussion held on the weekly call:

The cache on mason-registry would have contents as follows :

[cache]
DataStructures = p1 
Gnuplot = p2
LocalAtomics = p3
Logging = p4
NumpyLike = p5 
StringUtils = p6
UUID = p7
csm = p8

where p1, p2, ..., p8 are the points awarded to the packages.

This cache is downloaded to the local .mason-registry and integrated with mason search.
- When a user types mason search <query>, the results are ordered by the points awarded to each package.
- If the query matches a package name exactly, or matches the initial characters of a package name, that package is given preference in the ranking.

Design issue:

How would the cache be updated ?

Obviously, this cache would be updated with every PR made to mason-registry, and we would append package names and points to it. So, how would the local registry be kept in sync with the GitHub registry?

A possible solution is to keep a separate tag or branch for each version of the mason-registry, and to record the current mason-registry version tag in the mason search code. Whenever the mason-registry version is updated and a new tag or branch is created, we also update the version in our code (something similar has been implemented for spack in mason external). Before running mason search, we check whether the tag of the local mason-registry matches the tag recorded in the code. If it doesn't match, we either present an error or automatically update the local mason-registry by pulling the specific updated tag from GitHub.

ankingcodes on 10 Jun 2020

@ben-albrecht @Spartee @krishnadey30 I have given a concise description of how mason search would be integrated with the cache. I have also proposed a solution to the design issue, which I think suits our situation for now, considering that we keep the mason-registry on GitHub.
Please check the format of the cache and confirm whether it's okay. If you have other ideas for the design issue, please share them in the comments below 👍

ankingcodes on 10 Jun 2020

I'd raise another point regarding the format of the cache: we could shift to something like JSON, where we keep descriptions, author names, source URL, and other information about a package in a single object. We might then not even need to download the entire mason-registry locally. Advantages of this in the future would be:

Users would need to download only a single file, i.e. the cache, which would contain all details about a package along with its rank.
We could implement additional options for mason search, such as mason search <package_name> --desc, mason search <package> --author, or mason search <package> --url, which would be very fast.
It would be easy to shift the registry off of GitHub, since many NoSQL databases store JSON-like documents.
JSON would definitely help when creating the website for the registry.

ankingcodes on 10 Jun 2020

I have the following suggestions and feedback:

Users may want to download a different version. Your proposal only talks about the latest one.
I like the idea of having a JSON file for the cache. I see the following advantages:
- An easy shift to the separate registry.
- Updating points will be easy
I suggest the following structure for the JSON:

[
  {
    "name": "gnuplot",
    "author": "someone",
    "releases": [
      {
        "version": 3.1,
        "releaseDate": "2020-06-12T00:00:00.000Z",
        "score": 0.6,
        "url": "some-random-url"
      },
      {
        "version": 2.5,
        "releaseDate": "2019-06-12T00:00:00.000Z",
        "score": 0.5,
        "url": "some-random-url"
      }
    ]
  },
  {
    "name": "cdo",
    "author": "someone",
    "releases": [
      {
        "version": 1.2,
        "releaseDate": "2020-04-12T00:00:00.000Z",
        "score": 0.6,
        "url": "some-random-url"
      },
      {
        "version": 1.1,
        "releaseDate": "2019-09-12T00:00:00.000Z",
        "score": 0.5,
        "url": "some-random-url"
      }
    ]
  }
]

Regarding the update of the cache file, I see two options:
1. Developer adds a new version: In this case, you add new data
2. Developer removes a version: You remove some data
  
I see two ways to do this: either download the whole cache file and replace the local copy, or download only the part that changed (I am not sure whether this is even possible). For now, I suggest downloading the whole file.
Also, you might want to compare the SHA of the cache files to check whether there is an update.
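
One way the SHA comparison could look, sketched in Python. How the registry would publish its digest (for example, in a small file next to the cache) is an assumption, as are the function names.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a large cache file is never loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_is_outdated(local_path, published_digest):
    """True when the local cache differs from the digest the registry published."""
    return file_sha256(local_path) != published_digest
```

Only the short digest has to be fetched to decide whether the whole cache file needs to be downloaded again.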

krishnadey30 on 12 Jun 2020

@krishnadey30 - I agree that each version of a package should get a score.

Each mason search query should choose the score of the latest package version that fulfills the query. When doing a mason search we already filter down to compatible Chapel versions. Beyond that, a user may specify a version range or specific version.
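
To make the selection rule concrete, here is a small Python sketch. It assumes the per-version scores are already loaded into a dictionary and that an earlier step has produced the set of versions compatible with the user's Chapel version; the function names and the sample numbers are made up for illustration.

```python
def parse_version(text):
    """Turn '0.10.2' into (0, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def latest_compatible_score(scores, compatible):
    """Return (version, score) for the highest version in `compatible`, or None.

    scores: {"0.1.0": 5, ...}
    compatible: versions that passed the Chapel version filter.
    """
    candidates = [version for version in scores if version in compatible]
    if not candidates:
        return None
    best = max(candidates, key=parse_version)
    return best, scores[best]
```

Comparing parsed tuples rather than raw strings matters because "0.10.0" sorts before "0.8.0" as text.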

Regarding format, I'll note that TOML has the advantages of:

Chapel TOML parsing/writing is much more mature than JSON
Everything else mason-related uses TOML

I believe we can support the format @krishnadey30 proposed with TOML. I agree that JSON may make life easier when integrating with other technologies down the road though (web interface / databases).

A few other thoughts:

Should we just log the SHA instead of the release date? A date can be inferred from the SHA and git URL.
Should we store the metrics that went into the score? A user or package author may be interested to learn what went into their score. If we cache the intermediate results of things like "test score", "documentation score" (where several of these may be 0 or 1), then we can later provide tools for users to show more information behind a score value.
- The disadvantage of caching this info is (1) more space, and (2) more disruptive changes to the cache file(s) if we add/remove/modify the metrics that go into a score.

ben-albrecht on 12 Jun 2020

I believe we can support the format @krishnadey30 proposed with TOML.

@ben-albrecht I believe TOML reader has an option to convert TOML to JSON. So, we can stick to TOML for now. However, I think, if we want a TOML type cache, then we should keep the initial version simpler. Although I presented the idea of having a detailed cache file, so that we could avoid downloading the mason-registry, but I am a bit skeptical now since many mason commands rely on the mason-registry being present at $MASON_HOME and the cache is currently introduced only for mason search. That said, could you confirm that the simpler version of mason that I have described above works for you ? Maybe you could modify that and provide an example of what you have in mind.

What do you think about the versioning using branch/tags ? I understand the concern that we will eventually move out from github, but I don't see that coming very soon in future.

ankingcodes on 12 Jun 2020

I believe TOML reader has an option to convert TOML to JSON. So, we can stick to TOML for now.

However, I think, if we want a TOML type cache, then we should keep the initial version simpler.

Agreed.

Although I presented the idea of having a detailed cache file, so that we could avoid downloading the mason-registry, but I am a bit skeptical now since many mason commands rely on the mason-registry being present at $MASON_HOME and the cache is currently introduced only for mason search.

My thinking is that the cache file can serve as a stepping stone towards this eventual goal, but we don't need to implement this any time soon, especially with how small the registry is today.

That said, could you confirm that the simpler version of mason that I have described above works for you ?

Sorry, can you link to this comment? I'm not sure which part you're looking for feedback on specifically.

What do you think about the versioning using branch/tags ? I understand the concern that we will eventually move out from github, but I don't see that coming very soon in future.

I may not fully understand this yet. Can you elaborate?

ben-albrecht on 17 Jun 2020

I would like to reiterate a point @krishnadey30 made earlier -- we will need to give each package _version_ a score, since the package metadata can change per version. When doing a mason search, we will use the score of the highest version available that fulfills the query to compare.

@ankingcodes - Can you post a comment with an updated TOML format proposal with the other metadata @krishnadey30 has suggested (accounting for other feedback, such as timestamp -> SHA).

ben-albrecht on 23 Jun 2020

👍1

@ben-albrecht I believe that in our last discussion, @Spartee made a point about how homebrew or npm simply update packages on their own, without asking the user for permission. Also, he suggested that we should only account for the latest version of packages for mason-registry.
In my opinion, we should not complicate the cache. I don't know how the score of a previous version of a package would impact mason search. I would stick to keeping the latest versions, which can always be updated. Also, storing package metadata in the cache makes sense only if we want to replace local mason-registry with the cache. Since, that's not the case and the intent of the cache is just to rank packages, we can keep it simple for now.

ankingcodes on 23 Jun 2020

Also, he suggested that we should only account for the latest version of packages for mason-registry.

@Spartee can correct me if I misunderstood him, but I interpreted that suggestion as we should only use the latest version of the mason-registry itself, i.e. this repository - which I agree with.

I think we should continue to track metadata per package version. Imagine if you are searching with an old version of Chapel, say 1.19.0. Your search will be limited to only packages that support 1.19, and therefore you will want mason search to account for scores of the older package versions (maybe version 1.0.0 has a score of 12, but version 0.8.0 has a score of 2).

Also, storing package metadata in the cache makes sense only if we want to replace local mason-registry with the cache. Since, that's not the case and the intent of the cache is just to rank packages, we can keep it simple for now.

I agree that we should keep it simple otherwise. I don't think we need more information than a package name, version, and score for an initial cache file.

Here's an example:

[cache]
[cache.DataStructures."0.1.0"]
score = 5
[cache.DataStructures."0.2.0"]
score = 7
[cache.Gnuplot."0.1.0"]
score = 3

ben-albrecht on 24 Jun 2020

Imagine if you are searching with an old version of Chapel, say 1.19.0. Your search will be limited to only packages that support 1.19, and therefore you will want mason search to account for scores of the older package versions (maybe version 1.0.0 has a score of 12, but version 0.8.0 has a score of 2).

@ben-albrecht I am not clear on this part. How is the cache related to the Chapel version? The registry contains packages that support different versions of Chapel. mason search (of any version) should look at the cache in the local mason-registry (which should be updated to the latest version of the registry) and return the highest-ranked package matching the query.

Here's an example:

If a version of DataStructures is updated and has a better score, why not overwrite the previous score ?

Currently the cache looks as follows:

[packages]
LocalAtomics=3
NumpyLike=5
.
.
.

ankingcodes on 24 Jun 2020

Some notes from our meeting today:

New proposed cache format:

[DataStructures."0.1.0"]
score = 5
[DataStructures."0.2.0"]
score = 7
[Gnuplot."0.1.0"]
score = 7

How will search work:

mason search plot

# Find all package names containing "plot"
# Open their TOML files and filter down to package versions that support the current version of Chapel
# Compute the distance between the query and package names (O(N^2))
# Compute a rank based on 
#   (1) string distance
#   (2) whether or not the package name starts with the search query
#   (3) score
# Print packages in descending order of rank

ben-albrecht on 24 Jun 2020

Just to capture some things discussed offline:

The advantage of including package versions in the cache is:

1) It allows us to filter searches down to only package versions that support a given Chapel version
2) It gives the cache the flexibility to include more package metadata later, if we were to use only the cache for mason search in the future.

ben-albrecht on 24 Jun 2020

@ben-albrecht @krishnadey30 I looked more into the edit distance algorithm (Levenshtein distance). Given two different strings s1 and s2, the algorithm finds the minimum number of single-character edits (replacements, insertions, deletions) required to derive s2 from s1.
eg: s1 = horse, s2 = ros.
Here edit distance = 3, since
operation 1 : horse -> rorse (replace h -> r)
operation 2 : rorse -> rose (remove r)
operation 3 : rose -> ros (remove e)

Wouldn't this be costly to the basic mason search ?
Also consider this case,
mason search Lo
If s1 = LocalAtomics, s2 = Lo, then edit distance = 10
If s1 = Logging, s2 = Lo, then edit distance = 5
Therefore, by min(edit distance), Logging shows up above LocalAtomics, which violates the alphabetical ordering feature added recently. However, .startsWith serves the purpose here and shows the correct ordering.
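
The numbers above can be checked with a standard dynamic-programming implementation of Levenshtein distance. This Python sketch is for illustration and is not taken from mason; the function name edit_distance is an assumption.

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming, O(len(s1) * len(s2))."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]
        for j, b in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


# Sorting by distance alone favors the shorter name.
by_distance = sorted(["LocalAtomics", "Logging"], key=lambda n: edit_distance(n, "Lo"))
```

Because deleting the extra characters dominates the cost for a short query, the distance mostly measures name length here, which is why it disagrees with the alphabetical prefix ordering.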

ankingcodes on 25 Jun 2020

I'd be OK removing the first part (1) of our rank computation, so it basically comes down to:

(1) Does the package name start with the search query?
(2) If not, print the packages in descending order of score

ben-albrecht on 25 Jun 2020

I looked into how other package managers are currently doing search on their registries :

npm : uses Elasticsearch on their indexed db. First priority is given to an exact match, then to package maintenance and popularity.
pypi : simply uses startsWith (results)
cargo : first preference to exact match, then priority to package quality along with alphabetical ordering (results)

ankingcodes on 25 Jun 2020

👍2

I think what I described is similar to cargo, and I like the simplicity of that (this can be revisited in the future when the registry has more packages).

first preference to exact match, then packages that start with query, then by package score.
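
That ordering can be expressed as a single sort key. The Python sketch below is one possible reading: case-insensitive substring matching, score order inside each group, and the alphabetical tie-break are assumptions, and the package names in the usage note other than Gnuplot are invented.

```python
def rank_results(query, scores):
    """Order matching package names: exact match, then prefix match, then score.

    scores: {package_name: score}, already filtered to compatible versions.
    """
    q = query.lower()
    matches = [name for name in scores if q in name.lower()]

    def sort_key(name):
        lowered = name.lower()
        return (
            lowered != q,               # False sorts first: exact match
            not lowered.startswith(q),  # then names starting with the query
            -scores[name],              # then higher score first
            lowered,                    # alphabetical tie-break
        )

    return sorted(matches, key=sort_key)
```

For example, with scores {"Gnuplot": 3, "plot": 1, "PlotUtils": 2, "Plotter": 6}, the query plot would list plot, Plotter, PlotUtils, Gnuplot.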

ben-albrecht on 25 Jun 2020

👍1
