Pkg.jl: curation & trust (registries are not the answer)

Created on 9 Jun 2020 · 39 Comments · Source: JuliaLang/Pkg.jl

This is a slightly unusual issue because I'm going to post a rant from Slack, edited for readability.


Everyone keeps assuming that a more curated registry is the right way to have a curated subset of packages. But I’ve said many times that’s wrong and that’s not what registries are for. I get why everyone thinks of registries for curation: registries include sets of packages and we want a curated set of packages, so seems like a great fit, right? Despite that, it’s entirely the wrong tool. The fundamental issue is that registries are about knowledge whereas curation is about trust. It’s fine to know about all public packages whether they're trustworthy or not. It’s a different thing to trust all of them and assume that it’s ok to install or use them.

Moreover, Julia Computing has already tried using a registry as a trusted subset of the public registry with the JuliaPro registry, so we have very real world experience with this and I can definitively say that it’s not a good approach, which I'll demonstrate by walking through a typical user experience with that approach.

Suppose someone using a curated registry is trying to follow an online tutorial, like people do. What’s their experience? They try installing ExperimentalPackage and what happens? Pkg says “I have no idea what that is.” They mess around for a while, wondering what’s wrong, trying it again, checking if they spelled it incorrectly. Eventually, if they're lucky, they may figure out: “oh, it’s not in the Curated registry, I guess I have to install the General registry—that’s annoying, why didn’t I have that installed in the first place?” So they install the General registry, follow along with the tutorial and move on with their lives. But... they’ve lost all protection they ever had against installing or using uncurated packages, forever. The user starts out confused and frustrated, becomes annoyed, and then ends up blithely unprotected. In short, it's an unmitigated user experience and security disaster.

Now suppose instead that the user has the normal General registry installed, so Pkg knows about ExperimentalPackage—knows what it is, how to install it, etc. But it also knows something else: there's a layer of metadata about what packages are in a “curated” set and there's a policy in place that Pkg knows about that any package in the “curated” set can be installed without approval but any other package requires an explicit approval via a prompt. Then the above experience is quite different: the person tries to follow along with the tutorial and gets a prompt saying

ExperimentalPackage (https://github.com/SomeUser/ExperimentalPackage.jl) is not pre-approved and may not be safe to use. Do you want to install it anyway?

  • [x] No, I don't trust this package, do not install it.
  • [ ] Yes, install this version of this package one time.
  • [ ] Trust and install this package now and in the future.

That way they immediately know what the problem is and they can take a look for themselves. Let’s say they look the package over and decide that it’s not dangerous and that all its dependencies are common, already-curated or trusted packages, so they install it and finish the tutorial. Yay. The user was never confused or frustrated.

Better still, the next time they try to install a different uncurated package, they are still protected. Unlike the scenario where we use registries for curation and they install all of General just to get one package, approving the installation of a single uncurated package doesn’t eliminate all protections.

The ultimate point is that registries are about knowledge: if a package is registered, you know what it is. Knowledge is safe: the more things you know about, the better. So the idea of a curated registry is like an ostrich sticking its head in the sand: “if I don’t know about all these bad packages, then I’ll be safe.” But of course, that's not the way things work. If things are potentially dangerous, you're safer knowing about them than not knowing about them. It’s way better to know about all the public packages, good or bad, and have a mechanism for deciding which ones to trust. Yes, we don’t _currently_ have a mechanism for that, but that does not mean that we should shove the functionality into the registry mechanism just because that mechanism already exists. We should, instead, build a layer for managing trust.


So this is the issue I'm opening for discussing that trust layer. But first, I had to post a bunch of text dispelling the apparently irresistible notion that we should use registries for this.

All 39 comments

The pieces that are needed...

  • A metadata layer: a way of associating information with packages and versions of packages. This could be binary like "curated" or not. But it could also be more complex stuff like who has contributed code to this package; various parties assigning a trust level to code based on a review.

  • A policy system: a mechanism for determining, for a given package version and based on information provided by the metadata layer, whether you should trust that version or not. There are three possibilities: trust, ask, deny. Trust means "just trust this, no questions asked". Ask means "someone needs to be prompted for approval to use this". Deny means "you cannot use this, it's already been decided."

Another way to look at this is that the metadata layer is what's shared among everyone—we all get the same information about what is and isn't curated and who committed to what things. But each person has to make a decision for themselves about what to trust or not based on that information.
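The split between shared metadata and per-user policy can be sketched as follows. This is only an illustration in Python (Pkg itself is written in Julia); the `Decision` enum and both policy functions are hypothetical names, not part of Pkg.

```python
from enum import Enum

class Decision(Enum):
    TRUST = "trust"  # install with no questions asked
    ASK = "ask"      # someone must approve before use
    DENY = "deny"    # already decided: cannot be used

def curated_only_policy(metadata):
    """One user's policy: trust curated packages, ask about the rest."""
    return Decision.TRUST if metadata.get("curated", False) else Decision.ASK

def locked_down_policy(metadata):
    """Another user's policy over the same metadata: deny the rest."""
    return Decision.TRUST if metadata.get("curated", False) else Decision.DENY

# Everyone sees the same shared metadata about a package...
shared_metadata = {"curated": False}

# ...but each policy reaches its own conclusion from it.
print(curated_only_policy(shared_metadata).value)  # ask
print(locked_down_policy(shared_metadata).value)   # deny
```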

cc: @sbromberger

Also, note that even though registries have the capacity to hold metadata, I anticipate that the next argument I'll end up having to make is that the trust metadata should probably not go into the registry even though it could. Why? Because the source of trust metadata need not be the same as the source of the registry. Different third parties can provide trust metadata about the same registry or registries. There could be an official "curated packages" metadata layer, but someone else may disagree with that data set and offer their own, alternative curation. A potential policy could be to combine these and only trust packages that are on both curated sets, or in either one. Some trust metadata may not be public. In short: let's not jump to the conclusion that because registries hold some metadata that they should hold this metadata too. Trust metadata should be external. Fortunately, UUIDs and tree hashes offer a perfect mechanism for providing metadata externally in an unambiguous way.
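To make the "both curated sets, or either one" idea concrete, here is a small Python sketch. The two curated sets and their UUIDs are made up for illustration, and the function names are hypothetical.

```python
from uuid import UUID

# Hypothetical curated sets from two independent metadata providers,
# keyed by package UUID. These UUIDs are invented for the example.
official = {
    UUID("11111111-1111-1111-1111-111111111111"),
    UUID("22222222-2222-2222-2222-222222222222"),
}
alternative = {
    UUID("22222222-2222-2222-2222-222222222222"),
    UUID("33333333-3333-3333-3333-333333333333"),
}

def trusted_by_all(*sources):
    """Strict combination: a package must appear in every curated set."""
    return set.intersection(*sources)

def trusted_by_any(*sources):
    """Lenient combination: a package may appear in any curated set."""
    return set.union(*sources)

print(len(trusted_by_all(official, alternative)))  # 1
print(len(trusted_by_any(official, alternative)))  # 3
```

Because the sets are keyed by UUID rather than by anything registry-specific, neither provider needs write access to a registry to publish its opinion.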

Just to clarify, @sbromberger, this is not a policy issue, it is intended to be a technical issue to discuss how to design and implement a trust layer in Pkg. In that light your post seems to be largely off topic and probably better suited for the ongoing discussion already on discourse.

The next concrete steps I'd like to take here are:

  1. Figure out a format for a trust metadata source (external to a registry).
  2. Figure out a mechanism for defining policies based on that metadata.

But just the fact that the package is registered is a certain sign of quality, isn't it?

@PetrKryslUCSD, I’m not sure what that’s meant to address. Is it relevant to constructing a trust metadata system?

@StefanKarpinski If you intend for this issue to be a technical discussion (not a policy discussion), would it be possible for you to copy the contents of @sbromberger's comment to https://github.com/JuliaRegistries/General/pull/16058, which provides a concrete implementation of some policy changes? (I would do it myself, but I am now unable to view that comment.)

> But just the fact that the package _is_ registered is a certain sign of quality, isn't it?

The fact that a package is registered should in no way imply anything about that package's quality.

That is certainly not true. I have had package registrations held up or rejected because they did not comply with some rules about documentation or dependencies. Passing tests is also checked.

Currently we have no official policy. Which means that different registry maintainers apply different standards, leading to inconsistent treatment of packages.

As Stefan pointed out, this discussion is off-topic for this issue. Let us move this discussion to https://github.com/JuliaRegistries/General/pull/16058#issuecomment-641527127

Ok, so here's one possibility. The metadata files live in $depot/metadata/$uuid.toml where uuid is the UUID of a package and has a format like this:

curated = true

["1.0-1.3"]
trust-level = 0.3

["1.3-1.7"]
reviewed = true
trust-level = 0.9

The top-level keys apply to any version of the package, the other key-value pairs apply if the header range matches, using the same compression scheme as the registry.
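As a rough illustration of how such a file could be resolved for one version, here is a Python sketch. It assumes the file has already been parsed into a dictionary whose range headers are strings such as "1.0-1.3", and it uses a simplified range rule (bounds are compared on as many components as they specify), which is an assumption rather than a description of the registry's exact compression scheme. `in_range` and `metadata_for` are hypothetical helpers.

```python
def parse_version(text):
    """Turn '1.3' or '1.3.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def in_range(version, header):
    """Check whether `version` falls inside a header such as '1.0-1.3'.

    Simplified assumption: '1.0-1.3' covers 1.0.0 through 1.3.x inclusive.
    """
    lo_text, _, hi_text = header.partition("-")
    lo = parse_version(lo_text)
    hi = parse_version(hi_text or lo_text)
    v = parse_version(version)
    return lo <= v[:len(lo)] and v[:len(hi)] <= hi

def metadata_for(parsed, version):
    """Merge top-level keys with every range section matching `version`."""
    result = {k: v for k, v in parsed.items() if not isinstance(v, dict)}
    for header, section in parsed.items():
        if isinstance(section, dict) and in_range(version, header):
            result.update(section)
    return result

# The example file from the comment, as a parsed dictionary.
parsed = {
    "curated": True,
    "1.0-1.3": {"trust-level": 0.3},
    "1.3-1.7": {"reviewed": True, "trust-level": 0.9},
}

print(metadata_for(parsed, "1.2.0"))
print(metadata_for(parsed, "1.5.1"))
print(metadata_for(parsed, "2.0.0"))
```

Note that the two example ranges overlap at 1.3; in this sketch the later section wins for such a version, which a real format would need to define explicitly.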

A policy is fundamentally a function that takes a collection of this kind of metadata and computes an allow/ask/deny value. I.e. it's a function where the input is a Dict{UUID, Dict{String, Any}} where the UUID is the metadata source UUID; the output is one of :allow, :ask or :deny.
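A policy with that shape might look like the following Python sketch. The thresholds, the key names `curated` and `trust-level`, and the source UUIDs are arbitrary choices for illustration; as discussed further down the thread, the keys are up to whoever curates the metadata source.

```python
from uuid import UUID

def example_policy(metadata_by_source):
    """Map {source UUID: {key: value}} to 'allow', 'ask' or 'deny'."""
    sources = list(metadata_by_source.values())
    levels = [m["trust-level"] for m in sources if "trust-level" in m]
    if any(level < 0.2 for level in levels):
        return "deny"   # some source rates this version as untrustworthy
    if any(m.get("curated") for m in sources):
        return "allow"  # at least one source has curated it
    if levels and min(levels) >= 0.8:
        return "allow"  # every rating we have is high
    return "ask"        # not enough information: let a human decide

# Two hypothetical metadata sources, identified by made-up UUIDs.
SOURCE_A = UUID("aaaaaaaa-0000-0000-0000-000000000001")
SOURCE_B = UUID("bbbbbbbb-0000-0000-0000-000000000002")

print(example_policy({SOURCE_A: {"curated": True}}))     # allow
print(example_policy({SOURCE_A: {"trust-level": 0.3}}))  # ask
print(example_policy({SOURCE_A: {"trust-level": 0.1},
                      SOURCE_B: {"curated": True}}))     # deny
```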

If we're considering installing a version of a package, we look up its metadata across the depot path (merging dicts as we go), then pass that dict-of-dicts to the policy function and if the result is :allow we install it without any further ado; if the result is :deny we refuse to install it with an error; if there are one or more package versions to be installed for which the policy function returns :ask, then we prompt the user about them, all at once, giving links to them and enough information for them to make an informed decision. For each one, they may be asked to decide between not installing, installing this one version for now, and remembering this package and trusting it. If we mark a package as trusted then the policy determination will be skipped in the future for that package and it will just be installed without prompting.
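The flow just described could be sketched like this in Python. Everything here is hypothetical: `plan_install`, the answer strings `no`/`once`/`always`, and the use of package names as keys (the proposal itself identifies packages by UUID) are assumptions made to keep the example short.

```python
def plan_install(candidates, policy, trusted, prompt):
    """Decide which package versions to install.

    candidates: {package name: merged metadata for that version}
    policy:     function returning 'allow', 'ask' or 'deny'
    trusted:    set of packages the user previously chose to remember
    prompt:     function taking a list of names and returning
                {name: 'no' | 'once' | 'always'}
    """
    to_install, to_ask = [], []
    for name, metadata in candidates.items():
        if name in trusted:
            to_install.append(name)  # policy is skipped entirely
            continue
        decision = policy(metadata)
        if decision == "deny":
            raise PermissionError(f"policy forbids installing {name}")
        if decision == "allow":
            to_install.append(name)
        else:
            to_ask.append(name)
    # A single prompt covers every package that needs approval.
    answers = prompt(to_ask) if to_ask else {}
    for name in to_ask:
        answer = answers.get(name, "no")
        if answer == "always":
            trusted.add(name)        # remembered for future installs
        if answer in ("once", "always"):
            to_install.append(name)
    return to_install

def simple_policy(metadata):
    return "allow" if metadata.get("curated") else "ask"

def scripted_prompt(names):
    # Stand-in for an interactive prompt: approve everything once.
    return {name: "once" for name in names}

trusted = set()
print(plan_install({"CuratedPkg": {"curated": True}, "ExperimentalPackage": {}},
                   simple_policy, trusted, scripted_prompt))
```

Approving one package with "once" leaves `trusted` empty, so the next uncurated package still triggers a prompt.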

An alternative approach for the policy part would be to allow the policy function to see the package's UUID and the version info and the metadata and allow the policy function to see the whitelist of trusted packages and make some kind of nuanced decision about whether to continue to trust it or not.

What is trust-level, and how do we reduce trust to a single metric?

It's made up. I'm just throwing some arbitrary key-value pairs in there. It's up to the curator of the metadata source what they put in there and up to the policy author (who may be the same person) what they do with it. This is just the bare-bones mechanism for building these things.

I'm generally against using code for configuration, but for trust policies, I can't think of any good reason why it shouldn't be just Julia code. The main reason would be if you want to put a UI on it for generating policies, but I'm not sure how necessary that would be.

Is $depot/metadata/$uuid.toml created (potentially) by a third-party? If so, isn't it a bit too flexible?

For example, why not set the trust level for each release? Version range doesn't seem to be "static" enough. A package may release a backport with a patch version bump after the trust metadata is created.

Also, since UUID is open and it might be possible to disguise as another package if you have a different set of registries than whoever compiled the metadata, wouldn't it be better to use the tree sha? It'd be better to have a format that allows us to upgrade to a stronger hash (e.g. after Git moved to SHA‑256).

I just had a great conversation with @DilumAluthge about a problem I've been noodling on for a while, and it makes sense to bring it up here, I think. (If not, please mark as OT.)

Whatever trust mechanism we use for curation needs to curate the registry BEFORE the registry content is downloaded. This will help prevent users from downloading metadata or other content that may be illegal in their country or environment. I can go into details elsewhere.

> Is $depot/metadata/$uuid.toml created (potentially) by a third-party? If so, isn't it a bit too flexible?

How so? You have to install a metadata source yourself.

> For example, why not set the trust level for each release? Version range doesn't seem to be "static" enough. A package may release a backport with a patch version bump after the trust metadata is created.

We could also use individual trees as keys, but then the default is that when a new release comes out it has no trust metadata except that which is associated with the package as a whole. But perhaps that would be good.

> Also, since UUID is open and it might be possible to disguise as another package if you have a different set of registries than whoever compiled the metadata, wouldn't it be better to use the tree sha? It'd be better to have a format that allows us to upgrade to a stronger hash (e.g. after Git moved to SHA‑256).

I'm unclear about what the attack model here is.

> I just had a great conversation with @DilumAluthge about a problem I've been noodling on for a while, and it makes sense to bring it up here, I think. (If not, please mark as OT.)
>
> Whatever trust mechanism we use for curation needs to curate the registry BEFORE the registry content is downloaded. This will help prevent users from downloading metadata or other content that may be illegal in their country or environment. I can go into details elsewhere.

I do think this is both off topic and logically impossible. A client side mechanism cannot filter the registry without downloading it. If the registry must be filtered before reaching the client, that must be done on the server side.

> > Also, since UUID is open and it might be possible to disguise as another package if you have a different set of registries than whoever compiled the metadata, wouldn't it be better to use the tree sha? It'd be better to have a format that allows us to upgrade to a stronger hash (e.g. after Git moved to SHA‑256).
>
> I'm unclear about what the attack model here is.

Is this scenario possible?

  • Registry1 has PkgA with UUID_A
  • Registry2 does not have PkgA
  • A metadata provider trusts PkgA in Registry1 and publishes that (suppose that this metadata provider covers multiple registries, say General and Registry1)
  • A user has Registry2 (and General) and installs the metadata
  • An attacker copies UUID_A and registers PkgA with UUID_A in Registry2, but with completely different content

Now the user installs PkgA from Registry2 with malicious content even though the metadata provider says it's safe.

I imagine that we would be certifying trust in tuples of the form (uuid, tree_hash). You wouldn't blindly trust all versions of a package. You would trust one or more specific versions of a package.
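A minimal Python sketch of why keying trust on (uuid, tree_hash) blocks the scenario above: the attacker can copy the UUID, but the malicious tree hashes differently, so the lookup fails. The UUID and both hashes are invented for the example, and `is_trusted` is a hypothetical helper.

```python
from uuid import UUID

UUID_A = UUID("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa")  # made-up UUID for PkgA

GENUINE_TREE = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"    # made-up hash
MALICIOUS_TREE = "deadbeefdeadbeefdeadbeefdeadbeefdeadbeef"  # made-up hash

# Attestations published by a metadata provider: specific trees only,
# never "every version of this UUID".
trusted_trees = {(UUID_A, GENUINE_TREE)}

def is_trusted(uuid, tree_hash):
    """True only if this exact content was attested for this package."""
    return (uuid, tree_hash) in trusted_trees

print(is_trusted(UUID_A, GENUINE_TREE))    # True: PkgA from Registry1
print(is_trusted(UUID_A, MALICIOUS_TREE))  # False: impostor in Registry2
```

The lookup never consults a registry, so the same attestation applies wherever that exact tree is registered.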

Yeah, that's my thought, too.

But, if we want to have a way to (mildly?) trust future releases, my guess is that the only way to do this is to have a way to trust the authors and provide a mechanism to sign/verify each release. I guess this should also be outside the registry mechanism and maybe somehow integrated in the metadata framework.

Ok, that attack does make sense. Having metadata apply to each tree hash makes sense. We can include a SHA256 hash of the same tree as well since SHA1 is broken. We should probably add that info to registries as well. The compression scheme doesn’t really make sense in that case. Another option is to scope metadata attestations to registries and then only apply claims to those registries. That implicitly assumes that the registry is fairly safe.

> Ok, that attack does make sense. Having metadata apply to each tree hash makes sense. We can include a SHA256 hash of the same tree as well since SHA1 is broken. We should probably add that info to registries as well. The compression scheme doesn’t really make sense in that case. Another option is to scope metadata attestations to registries and then only apply claims to those registries. That implicitly assumes that the registry is fairly safe.

Personally, I prefer the approach of storing (uuid, tree_sha1, tree_sha256). It just seems safer to me.

Also, if we have trust info for Foo 1.0.0, we can use that in any registry. So the trust info is more portable and thus more useful.

@StefanKarpinski Without thinking very long about this, I think decoupling the metadata layer and registry as much as possible would be a good guideline. At least that's how I digested the design philosophy you laid out in the OP. Exchanging only UUID and tree hash(es) between the metadata layer and the registry sounds like a good idea. I think avoiding trust being a function of the registry helps this, too. Also, it'd be nice to trust immutable data identified by the hash (which is the source of the portability @DilumAluthge mentioned) rather than a mutable data store like the registry.

Coincidentally, trust and code review (perhaps via crev) is also being discussed on the JuliaHub issue tracker https://github.com/JuliaComputing/JuliaHub/issues/2#issuecomment-641278038

> We can include a SHA256 hash of the same tree as well since SHA1 is broken. We should probably add that info to registries as well.

To keep options open for deeper integration with Crev in the future, it would be nice if this could instead be BLAKE2 following the algorithm described here: https://github.com/crev-dev/recursive-digest#algorithm. I realize that either adding a dependency on a library that implements BLAKE2 or writing a Julia implementation just for this purpose may be a nonstarter though.

> > We can include a SHA256 hash of the same tree as well since SHA1 is broken. We should probably add that info to registries as well.
>
> To keep options open for deeper integration with Crev in the future, it would be nice if this could instead be BLAKE2 following the algorithm described here: https://github.com/crev-dev/recursive-digest#algorithm.

I would be fine with storing the following:

  1. UUID of the package
  2. Tree SHA-1 hash - since that's what Git currently uses
  3. Tree SHA-256 hash - because that is planned to be the successor to SHA-1 (source 1, source 2, source 3)
  4. Tree BLAKE2 hash in the format required for compatibility with crev - this is because I think crev is our best option as far as getting some kind of whole-code review system into the Julia ecosystem.

But I think we should draw the line at those four. Certainly we can't store every hash in the world. I think those four pieces of information (UUID plus three tree hashes) are sufficient.
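One way to picture a record holding those four pieces of information is the Python sketch below. The class name, field names, and the choice to make the newer hashes optional are all assumptions for illustration, not a proposed format.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID

ALGORITHMS = ("sha1", "sha256", "blake2")

@dataclass(frozen=True)
class TrustedTree:
    """Hypothetical trust record: one package UUID plus up to three tree hashes."""
    uuid: UUID
    tree_sha1: str                      # what Git currently uses
    tree_sha256: Optional[str] = None   # planned successor to SHA-1
    tree_blake2: Optional[str] = None   # for compatibility with crev

    def matches(self, uuid, algorithm, digest):
        """Compare against whichever hash the caller has computed."""
        if algorithm not in ALGORITHMS:
            raise ValueError(f"unsupported algorithm: {algorithm}")
        expected = getattr(self, f"tree_{algorithm}")
        return uuid == self.uuid and expected is not None and expected == digest

# Placeholder values, not real hashes.
record = TrustedTree(
    uuid=UUID("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"),
    tree_sha1="sha1-placeholder",
    tree_sha256="sha256-placeholder",
)

print(record.matches(record.uuid, "sha256", "sha256-placeholder"))  # True
print(record.matches(record.uuid, "blake2", "anything"))            # False
```

A missing hash never matches, so a record that lacks a BLAKE2 digest cannot be satisfied by a client that only computed BLAKE2.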

> Coincidentally, trust and code review (perhaps via crev) is also being discussed on the JuliaHub issue tracker JuliaComputing/JuliaHub#2 (comment)

I've opened https://github.com/JuliaLang/Pkg.jl/issues/1858 as a meta-issue to track the integration of crev into Pkg.

> But I think we should draw the line at those four. Certainly we can't store every hash in the world.

IIUC, you don't need to store BLAKE2 in registries.

Also, I think it's actually OK to specify a release of a package with UUID and package version. This way, the registry and the metadata layer communicate via a UUID-version pair. Then the metadata layer can store a completely different hash (say BLAKE2). The metadata layer doesn't need to know about SHA1 because it is the system that ultimately verifies the package (maybe by interfacing with crev).

I don't know if it's possible to hook UUID into crev's data model, though: https://github.com/JuliaComputing/JuliaHub/issues/2#issuecomment-641703944

> Also, I think it's actually OK to specify a release of a package with UUID and package version.

I definitely want the trust metadata to store a tree hash for each version.

> I don't know if it's possible to hook UUID into crev's data model, though: JuliaComputing/JuliaHub#2 (comment)

I like your idea of doing https://pkg.julialang.org/package/$uuid as the source.

The trust metadata should store a tree hash for each version it trusts.

And registries should store a tree hash for each registered version.

If the hash function is "secure enough", then this gives us a good way of linking the trust metadata to the registry.

This is why we should at least have some "more secure" hash in addition to SHA-1, in both the trust metadata and the registry.

> > Also, I think it's actually OK to specify a release of a package with UUID and package version.
>
> I definitely want the trust metadata to store a tree hash for each version.

Yes, the metadata layer definitely must have _some_ kind of hash (maybe BLAKE2). What I meant was that the "key" to identify the release record in the registry can be the UUID-version pair.

> Yes, the metadata layer definitely must have _some_ kind of hash (maybe BLAKE2). What I meant was that the "key" to identify the release record in the registry can be the UUID-version pair.

I see.

I would be more comfortable saying that the "key" is the tuple (uuid, tree hash) for a "secure" hash. It just feels better to me. Plus it protects against that attack you mentioned above.

But also, I can see this situation (two registries, same package, same version number, but different trees) happening in a non-attack situation. Someone has a private registry, registers 1.0 of their package, continues to develop the package, and later registers it in General, but is careless and registers it there as 1.0 as well. Nothing malicious here, but still, we want a trust metadata repository (which should be agnostic of registry) to work with both registries simultaneously.

That makes sense. Yeah, I agree tree hash is much better for identifying a release.

I don't think I fully understand all the details, so I'll just state a couple of things that may help you figure it out yourselves:

Crev proofs can be arbitrarily extended. Implementations are supposed to ignore fields they don't understand. And except for Pkg and maybe some auxiliary aggregating tools, nothing would even look into package reviews of Julia packages. Each ecosystem would pretty much only care about its own subset. The trust proofs (WoT) are more valuable for sharing between communities.

I'd be happy to add any fields that you might need as optional fields in the reference implementation. I was aware that I was designing everything with limited knowledge, and I hoped that once more people started looking at it, the format would crystallize. uuid seems like a generic extension that would be easy to add. It's backward compatible anyway, and even if it wasn't, the format is still in version: -1 and I plan to bump it to 0 when a certain level of popularity is reached. Even after that, it will always be possible (though harder and more laborious for everyone) to roll out even newer versions if it is ever necessary.

I have used blake2 because it was fast and seemed like a rising star. It might have been unwise of me to go with a newer, less popular algorithm. But all hope is not lost. The whole recursive filesystem hash algorithm that crev uses is generic over the hashing algorithm. The library that implements it literally does not include any digest algorithm dependency and requires the caller to specify one. There would be little problem in adding a digest-algorithm field that defaults to blake2 if missing, and supporting a set of digest functions. It's not even an additional dependency for the Rust version of crev because I'm quite sure the ed25519-dalek crate used for crypto uses sha256 internally.
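To illustrate what "generic over the hashing algorithm" can mean, here is a small Python sketch of a recursive directory digest whose hash constructor is supplied by the caller. It is not crev's recursive-digest format (the byte layout here is invented), and it ignores details such as symlinks; `tree_digest` is a hypothetical name.

```python
import hashlib
import os
from pathlib import Path

def tree_digest(path, new_hash=hashlib.blake2b):
    """Hash a file or directory recursively with a caller-chosen digest.

    The tree walk knows nothing about the hash function: `new_hash` is
    any zero-argument constructor with update() and digest() methods.
    """
    path = Path(path)
    h = new_hash()
    if path.is_dir():
        h.update(b"D")
        # Sort entries so the result does not depend on listing order.
        for child in sorted(path.iterdir(), key=lambda p: p.name):
            h.update(os.fsencode(child.name) + b"\0")
            h.update(tree_digest(child, new_hash))
    else:
        h.update(b"F")
        h.update(path.read_bytes())
    return h.digest()
```

Calling `tree_digest(some_dir)` uses BLAKE2b by default, while `tree_digest(some_dir, hashlib.sha256)` walks the same tree with SHA-256, which mirrors the idea of a digest-algorithm field with a default.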

You seem to be talking about the metadata and such, that I don't fully understand, but just remember that you can add any extra fields you want in the proof and other clients will just ignore them.

Oh, and one more thing. crev is very modular. You can pretty much take any piece you want and do it differently. I very explicitly avoid coupling any layer/part with any other one.

One thing like this is the proof distribution. In cargo-crev I've implemented a system in which users just create GitHub repositories (example: my "proof repository") and Ids always advertise a URL that means "you can possibly find more of my stuff at this URL".

But it doesn't have to be this way. This system seemed like a good start because devs tend to already have GitHub accounts etc., and git gives a decent differential sync and also free hosting, but I don't think it will scale very well. Once users have a web of trust spanning thousands of people, downloading updates might become rather slow.

Proofs are detached from their context, so they can be distributed in any way. They could be downloaded via BitTorrent or from a central location, or sent via email. They could come in one big tarball, they can be cached and aggregated on bigger websites, and projects could even include their own reviews inside the package itself, and so on. The point is that crev does not mandate how the proofs are distributed.

When I started I didn't have any community buy-in etc., and I don't fully understand whether you do, but if you could, it would probably be better if the package registry itself allowed uploading and downloading review proofs (as well as trust proofs). That would possibly be much more efficient than having each user periodically download the repos of thousands of users. P2P distribution via git works OK so far, but I think we will eventually hit problems scaling it.
