Yarn: SHA1 -> SHA256

Created on 12 Oct 2016 · 67 comments · Source: yarnpkg/yarn

I have done a quick check but it looks like Yarn is using SHA1 for checksumming?

If so, I think it would be a good time to switch to SHA256.

Labels: cat-compatibility, cat-feature, high-priority, triaged


All 67 comments

It's worth noting that SHA1 is achievable in POSIX, but SHA256 is much harder to make work cross-platform.

SHA256 is much harder to make work cross-platform.

Just need to use PowerShell everywhere, it has it built-in ($hash = (Get-FileHash -Path $file -Algorithm SHA256).Hash) 😛

Node's crypto library supports SHA256; I think we'd use that.

> var crypto = require('crypto')
undefined
> crypto.createHash('sha256').update('Hello World').digest('hex');
'a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e'

Looks like there are three hashing functions in the codebase:

"hash" in src/util/misc, uses SHA256
"HashStream" in src/util/crypto, uses SHA1
"hash" in src/util/crypto, takes the algorithm as an argument and defaults to MD5

hash in src/util/misc and hash in src/util/crypto should probably be unified, and they should all use SHA256. We'd need to think about backwards compatibility, though.
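
For illustration, a minimal sketch of what a unified helper on top of Node's crypto module might look like - the function names and the SHA256 default are assumptions for this sketch, not the actual Yarn code:

```
const crypto = require('crypto');
const fs = require('fs');

// Hypothetical unified helper: hash a string or buffer, defaulting to SHA256.
function hash(content, algorithm = 'sha256') {
  return crypto.createHash(algorithm).update(content).digest('hex');
}

// Hypothetical streaming variant for tarballs, so large files aren't buffered in memory.
function hashFile(path, algorithm = 'sha256') {
  return new Promise((resolve, reject) => {
    const hasher = crypto.createHash(algorithm);
    fs.createReadStream(path)
      .on('data', chunk => hasher.update(chunk))
      .on('error', reject)
      .on('end', () => resolve(hasher.digest('hex')));
  });
}
```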

Thanks for the issue! The shasums we use are given to us by the npm registry so there's not much we can do about this unfortunately. 😢

@kittens - Can we compute our own hash once the file has been downloaded, and store that in the local cache?

might as well jump straight to SHA3

Now it is time to reconsider the issue.
A few problems to solve first:

  • forwards compatibility / migration path
  • speed concerns
  • should we just ignore sha1 coming from npm or use it anyway?

Maybe sha2/3 could be an additional field?

Looks like it would be safe to change the default to sha256. I'm not sure where the yarn.lock tgz hashes are checked, though.

They are checked here
https://github.com/yarnpkg/yarn/blob/master/src/fetchers/tarball-fetcher.js#L93


@bestander where is the value from the yarn.lock file passed in?

Hi y'all.

Just dropping in to point out that the npm registry has started serving more secure hashes with packages that have been published with it.

You'll find the new hashes in the dist.integrity field of package manifests, in the same object as dist.tarball and dist.shasum. The integrity field is a proper subresource integrity field. Please note that in order to be fully forward-compatible here, you should follow the subresource integrity spec and parse this field out, pick algorithms, etc (or just use the linked library). We will almost certainly use/abuse the ?opt syntax from the spec in the future, and eventually start shipping multi-hash strings. It's also possible we'll have multiple hashes for the same algorithm, only one of which will apply to the specific tarball being downloaded, but that's more a door to leave open in the future.

On the publishing side, you should continue generating sha1 hashes into shasum, but add an integrity field with a stronger algorithm. See https://github.com/npm/npm-registry-client/pull/157/files for the npm-registry-client implementation of it.

Right now, the registry only supports sha512 for tarballs that have been published using npm5, but we'll be generating integrity strings for all tarballs on the registry at some point in the near-ish future, so they should be more widely available then.

lmk if y'all have any questions! It's exciting to start moving on from sha1 and setting ourselves up for a smoother transition to sha3 and beyond :)

edit: p.s. I saw something about generating these from the command line. If you're relying on sha1sum for that, srisum should be a near-drop-in replacement.
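
For anyone picking this up, here is a minimal sketch (not Yarn's actual fetcher code; `verifyTarball` and `dist` are names assumed for the example) of how the dist.integrity field could be parsed and checked with the ssri module, falling back to the legacy shasum when no integrity string is present:

```
const crypto = require('crypto');
const ssri = require('ssri');

// dist is the manifest's dist object: { tarball, shasum, integrity? }
function verifyTarball(tarballBuffer, dist) {
  if (dist.integrity) {
    // Parse the SRI string instead of assuming a particular algorithm;
    // checkData returns the matching hash on success and false otherwise.
    const parsed = ssri.parse(dist.integrity);
    return ssri.checkData(tarballBuffer, parsed) !== false;
  }
  // Older packages only ship a sha1 shasum.
  return crypto.createHash('sha1').update(tarballBuffer).digest('hex') === dist.shasum;
}
```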

Thanks for commenting this, @zkat, very much appreciated!

AFAIK this is not a large change to do now, so a PR with a fix is welcome.

@bestander Is this still open? I would love to give it a shot :) -- Also, I hope I'm not asking for too much, but can you give me an idea on how I should get started?

@nikshepsvn yes it is, give it a go.

A good start would be to check where Yarn checks hashes, for example https://github.com/yarnpkg/yarn/blob/master/src/fetchers/tarball-fetcher.js#L83.
Another point: npm started returning sha256 or sha512 in package info; we should probably start using it.

@bestander Thank you so much, I'll do my best to understand as much as I can and get back to you with any questions -- is that okay?

Sure, no problem.


Currently NPM has SHA-512 for new packages and SHA-1 for old ones.

Well someone send us a PR

I would definitely not assume that field will be sha512. It just happens to be that right now, but that will not always be the case going forward. You should be able to parse SRI and use _that_.

I would love to work on this, but this is my first open source contribution and I'm a little confused with everything that is being said. If it isn't asking for too much, can someone walk me through the steps needed to accomplish this? Sorry for asking for so much.


Hey guys, is this something I should attempt, or would it be more efficient for me to find something else that is easier?

@nikshepsvn go ahead and give it a try.
Have a look at how the npm CLI does the sha validation; this is where it is done in Yarn: https://github.com/yarnpkg/yarn/blob/master/src/fetchers/tarball-fetcher.js#L83

@BYK @kittens @nikshepsvn Any progress on this issue?

Hey, I'm thinking of taking a crack at this... but just wondering - what is the desirable way to handle backward-compatibility?
npm differentiates between hashes, as mentioned in this thread, through SRI (i.e. prefixing them with sha1- or sha512-)... do we add that distinction? Or add it only if the hash is sha512 and leave it as is (without a prefix) for sha1, for backward compatibility? That would still break old versions if they encounter a sha512- prefix, though - are we fine with this?

@imsnif maybe add a new key to yarn.lock, e.g. "integrity":

"@babel/[email protected]", "@babel/code-frame@^7.0.0-beta.31":
  version "7.0.0-beta.32"
  integrety "sha256-8PmvR79Rp37fDU+bJYQJ4wkr8GQ9HhP6N9WbtmL6Kr8= sha384-7mvcIT3VrzxA75UsNq95ZgjGzZFodsMxGJZ6GVsj0tRmEC3nj1tHhj3RRIIdpx4d sha512-b11f7/tdtKIaK7IzVHVVcaHr3uNLW8c9hM44jNOs7P7HaV8uPaR9Ibdynysd/gef02bGN3U7t+wl8hOe7pzO0A=="
  resolved "https://registry.yarnpkg.com/@babel/code-frame/-/code-frame-7.0.0-beta.32.tgz#04f231b8ec70370df830d9926ce0f5add074ec4c"
  dependencies:
    chalk "^2.0.0"
    esutils "^2.0.2"
    js-tokens "^3.0.0"

@graingert - nice approach! I'm not entirely versed in yarn's validation - will it break if there's suddenly an extra key in the lockfile?

@imsnif well I tried that extra key as in https://github.com/yarnpkg/yarn/issues/816#issuecomment-346067131 and yarn worked fine (ignored the new key)

Very cool @graingert ! So to sum up:
The resolved field will remain exactly as it is now, and we'll add integrity with sha512 where possible and sha1 where not. New versions will validate the integrity field.
Are we fine with this? Does this need further discussion before I whip something up?

add integrity with sha512 where possible

No, it should always contain an SRI. Eg sha1 if that's the only one available.

@imsnif you should also sort the sri values so that yarn.lock is deterministic, eg sha1, sha256, sha384, sha512, sha3 ...

@imsnif sounds good to me but I'd also ask @bestander. One thing tho: instead of having multiple prefixed values under the integrity key, why not make it a higher-level key so it has sub keys:
```
"@babel/[email protected]", "@babel/code-frame@^7.0.0-beta.31":
  version "7.0.0-beta.32"
  resolved "https://registry.yarnpkg.com/@babel/code-frame/-/code-frame-7.0.0-beta.32.tgz#04f231b8ec70370df830d9926ce0f5add074ec4c"
  integrity:
    sha256 8PmvR79Rp37fDU+bJYQJ4wkr8GQ9HhP6N9WbtmL6Kr8=
    sha384 7mvcIT3VrzxA75UsNq95ZgjGzZFodsMxGJZ6GVsj0tRmEC3nj1tHhj3RRIIdpx4d
    sha512 b11f7/tdtKIaK7IzVHVVcaHr3uNLW8c9hM44jNOs7P7HaV8uPaR9Ibdynysd/gef02bGN3U7t+wl8hOe7pzO0A==
  dependencies:
    chalk "^2.0.0"
    esutils "^2.0.2"
    js-tokens "^3.0.0"
```

Hey @BYK - I'm actually nearing the final stages of this. While your suggestion does make sense, I already implemented it as one string using @zkat 's ssri module suggested above - which is great. It requires very little changes to implement and actually helps delete some code. :)

Implementing them as sub keys would require more work, and since yarn.lock is not meant for human consumption, I thought to defer to what decreases the code complexity - which is to just dump the (sorted) stringified ssri using the module. If, however, you feel this is worth it, I can make changes. What do you think?

@imsnif some packages from node don't have the full integrity string (eg sha512 etc), so you should calculate them as well.

Hey @BYK - I'm actually nearing the final stages of this. While your suggestion does make sense, I already implemented it as one string using @zkat 's ssri module suggested above - which is great. It requires very little changes to implement and actually helps delete some code. :)

Nice! Thank you very much :)

Implementing them as sub keys would require more work, and since yarn.lock is not meant for human consumption, I thought to defer to what decreases the code complexity - which is to just dump the (sorted) stringified ssri using the module. If, however, you feel this is worth it, I can make changes. What do you think?

yarn.lock file is meant for human consumption. That's probably the only reason (or at least the most important one) why it has its own format: easy to review but still can hold nested data structures.

If the output is an array, converting it into an object should be as easy as something like:

const HASH_SPLIT_RE = /^([a-z0-9]+)-(.+)$/;
const integrity = hashes.reduce((res, hash) => {
    // match() returns [fullMatch, algorithm, value]
    const [, name, value] = hash.match(HASH_SPLIT_RE);
    res[name] = value;
    return res;
}, {});

Am I missing something?

Hey @graingert - wanting to stick as closely as possible to the npm implementation (and make it future-proof as well), I opted to use sha1 when npm uses sha1 and sha512 (or whatever else) when npm uses it. Since packages having only a sha1 in the npm registry is a temporary state, and since developing a custom solution would require changing it when npm changes its behaviour in the future, I thought this would be the best way.
If there are concerns, I'd be happy to discuss them (though it would probably be easier to do once the PR is finished and we can look at the code).

@BYK - Alright. I'll take a look once everything else is working.

@imsnif feel free to ping us here or in Discord if you need any immediate help and thanks a lot one more time for doing this :)

Hey, @BYK - one question: is this a known issue?

error https://registry.npmjs.org/abab/-/abab-1.0.4.tgz: Fetch succeeded for undefined. 
However, extracting "https://registry.npmjs.org/abab/-/abab-1.0.4.tgz" 
resulted in hash "5faad9c2c07f60dd76770f71cf025b62a63cfd4e", 
which did not match the requested hash "5faad9c2c07f60dd76770f71cf025b62a63111cfd4e"

I'm referring to Fetch succeeded for undefined
It happens with the latest pull from master branch when there is a hash mismatch. I can fix it (but probably best to do so in a different PR) - but if it's not known I'd best open an issue before I forget. :)

@imsnif I remember seeing an issue around this earlier but I thought it was fixed. I'd say this is a new issue that we are not aware of so go ahead and send a PR if you have something handy. If not I'd appreciate a bug report. I'd close it as a duplicate if I find the old one open and similar enough :)

Great to see this progressing and thanks for thinking about backwards compat.
Looks good to me, too.

@BYK @imsnif If you decide to represent integrity in your own format, please note that there can be multiple valid entries for each individual hash, so the right data structure would end up being:

{
  integrity: { sha1: ['deadbeef', 'c0ffee'], sha3: [...] }
}

Note also that order _within an algorithm_ matters. And yes, this is definitely a feature that we're planning on using in the future. You should also be sure to preserve the ?options values for each individual hash. My 2c is that keeping it as a single string is by far the best representation, but if you -really- want line-by-line, I recommend splitting the SRI on spaces and keeping each entry whole instead of trying to split it up further:

{
    integrity: [
      'sha1-deadbeef',
      'sha1-c0ffee',
      'sha512-deadc0ffee?blah'
    ]
}

This will be the most compatible long-term.
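
A tiny sketch of that split-but-keep-whole approach (the values are the placeholder hashes from above, not real ones); sorting whole entries would also give the deterministic yarn.lock ordering asked for earlier:

```
// Keep each SRI entry whole so ?options suffixes survive a round trip.
const integrity = 'sha512-deadc0ffee?blah sha1-deadbeef sha1-c0ffee';

// Note: sorting reorders entries within an algorithm; skip it if that ordering matters.
const entries = integrity.split(/\s+/).filter(Boolean).sort();
// -> ['sha1-c0ffee', 'sha1-deadbeef', 'sha512-deadc0ffee?blah']

const roundTripped = entries.join(' ');
```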

@BYK - I personally think there is great value in consistency between the two clients here (not to mention adherence to the standard), and so I would opt for the string representation (such as the one created by ssri's stringify method). I say this even though I already implemented the method discussed above and would have to go back on it. :) What do you think?

@zkat - thanks for weighing in! There has been a concern raised here about the sorting of the sri string. Right now, the ssri module does not enforce any order: https://github.com/zkat/ssri/blob/latest/index.js#L81-L85 (unless I missed something?)
So I wonder if you would be open to the ssri module sorting the algorithms when stringifying? I'd be happy to issue a PR for that before continuing with this one.

@zkat thanks a lot for weighing in, super useful feedback!

@imsnif I still think having this as a real array as @zkat suggested would be better than having an opaque string unless there's something I'm missing :)

@BYK - So, my reasoning goes like this.

The pros of having the integrity field as a string are:

  1. yarn would be adhering to the standard: https://www.w3.org/TR/CSP2/#source-list-syntax, as linked from the sri standard: https://w3c.github.io/webappsec-subresource-integrity/#integrity-metadata-description.
  2. The yarn.lock integrity field would be 100% consistent with the package-lock.json integrity field, even going so far as using the same module for parsing/stringifying/validating. I think this is very valuable and would help protect against future changes in npm.

The cons are:

  1. A slightly less readable yarn.lock file.
  2. Potential problems with consistency and sorting.

Regarding the cons:

  1. I think this is the major rub. And I admit it's a trade-off. But on the other hand, I don't think there will be many cases in which users would need to scrutinize the hashes in the yarn.lock file, seeing as all errors that are given for inconsistencies are quite verbose. But I will of course defer to your experience in the matter.
  2. Even if providing a consistently sorted string would be an issue for the ssri module (re-reading @zkat's response, I think she mentioned the ordering of the hashes might be important - so I'd of course understand if they'd want to keep things as they are), the only thing that would change a yarn.lock in such a case is a change in the Object.keys function or in the ssri module itself (please correct me if I'm wrong and didn't think of something!) And even then, the only thing that would happen is a slight diff in the file, with all other functionality remaining intact.

In my opinion, the pros outweigh the cons here. But I will of course defer to your judgement if you disagree. :)

@imsnif I understand where you are coming from and thanks for sharing the relevant specs which helped me see a clearer picture. That said, the only reason for the large string blob concatenated with spaces seems to be the restrictions in HTML (you cannot have the same attribute name more than once in an element so when you need to express an array you either need children, which is not possible with every tag, or need a serialization method).

These are indeed a list of hash specs. If you ask me, we should denormalize this data even more, to make it explicit which part is the hash specifier and which part is the hash itself. I don't see any value in doing another pass of string parsing when we already have a format that supports lists and objects.

I haven't been a part of package-lock.json file format so I cannot comment on how or why they made the decision to use a single string blob for this field but I do not see a good reason to store serialized data in an already serialized data format.

@zkat - is it possible for you to provide the reasons why you have chosen to use this field as a string blob instead of a JSON array?

I don't see any value in doing another pass of string parsing when we already have a format that supports lists and objects.

To contradict myself, URLs themselves are serialized string blobs in an already serialized data format here, but when it comes to lists, I think this makes a significant improvement in terms of reviewing changes in the lockfile. Does that make sense? :)

@BYK - alright, I understand the reasoning. Thanks for taking the time to explain it.

@zkat - I too would be very interested in hearing your reasoning, if you'd be willing.

Otherwise though, I'll continue working on this (I've got some other issues to get through, so this won't block me for now). For the time being though, just to make sure, our agreed solution is @zkat's suggestion:

{
    integrity: [
      'sha1-deadbeef',
      'sha1-c0ffee',
      'sha512-deadc0ffee?blah'
    ]
}

right?

Or, rather:

    integrity:
      sha1-deadbeef
      sha1-c0ffee
      sha512-deadc0ffee?blah

@imsnif yup :)

what's the behavior with two different sha1s?

what's the behavior if the sha1 matches but the sha512 does not match?

@graingert - You raise valid and important questions, whose answers I plan to articulate with tests. So would it be fine if we go into the details after I submit the PR? I think it'll be much easier. :)
If you'd like a sneak preview, I'm validating with https://github.com/zkat/ssri (my current implementation uses integrityStream, but it's subject to change).

The reason for having had a string was purely to be close to the actual SRI spec. That is, providing them as single, whitespace-separated strings. I can see npm bumping the lockfileVersion in the future for the sake of making it an array, but it'll be a long time before it starts making any real impact on diffs at all.

While in theory there can be a bunch of stuff in that integrity field, in practice, it's almost always going to be a single entry (until we start upgrading algorithms or using more advanced options). The idea that it would significantly affect diffs didn't really cross my mind, cause it's a relatively stable field: if you get a diff, you'll likely get a diff for the entire entry, so you won't get individual lines for each hash anyway.

As far as ordering goes: I may have misspoken here as far as whether ordering within an algorithm matters. I think it's actually fine because individual hashes within an algorithm will just be tried until one succeeds.

When you have multiple algorithms, though, for security reasons, you must always use the strongest one that you know you can use. So if you have sha512 and sha1, you must always use sha512 and never, ever fall back to sha1. OTOH, if you see a sha3 and sha1, and you're on a version of Yarn that does not support sha3, you can use sha1. This is a future-proofing feature. You'll notice that ssri does its own algorithm ordering, and it's fairly conservative about it (again, for security reasons).
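
As a rough illustration of that precedence rule (a sketch only - the helper name and strength ranking here are assumptions, and ssri ships its own, more conservative algorithm-picking logic):

```
// Hypothetical helper: keep only the strongest algorithm this client supports,
// never falling back to a weaker one when a stronger supported one is present.
const STRENGTH = ['sha1', 'sha256', 'sha384', 'sha512']; // weakest -> strongest

function pickStrongest(entries, supported = STRENGTH) {
  const byAlgorithm = {};
  for (const entry of entries) {
    const algorithm = entry.split('-')[0];
    (byAlgorithm[algorithm] = byAlgorithm[algorithm] || []).push(entry);
  }
  const known = Object.keys(byAlgorithm).filter(a => supported.includes(a));
  if (known.length === 0) return null; // nothing we know how to verify
  known.sort((a, b) => STRENGTH.indexOf(a) - STRENGTH.indexOf(b));
  const strongest = known[known.length - 1];
  // Within the chosen algorithm, any listed hash may match; try them in order.
  return { algorithm: strongest, hashes: byAlgorithm[strongest] };
}

// pickStrongest(['sha1-deadbeef', 'sha512-deadc0ffee?blah'])
// -> { algorithm: 'sha512', hashes: ['sha512-deadc0ffee?blah'] }
```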

@BYK - slight hitch with the array approach. It is not backward compatible. Issuing the yarn command from latest with this yarn.lock file:

# THIS IS AN AUTOGENERATED FILE. DO NOT EDIT THIS FILE DIRECTLY.
# yarn lockfile v1


[email protected]:                                                                                                                                                                                                                           
  version "1.1.1"
  resolved "https://registry.npmjs.org/array-flatten/-/array-flatten-1.1.1.tgz#9a5f699051b1e7073328f2a008968b64ea2955d2"
  integrity:
    sha1-ml9pkFGx5wczKPKgCJaLZOopVdI=

Results in:

$ yarn
yarn install v1.3.2
error An unexpected error occurred: "Invalid value type 9:0 in /home/aram/code/yarn-test/yarn.lock".

If I am not mistaken, this is because the current lockfile format does not support arrays.

So seeing as we cannot use arrays (unless I'm missing something with the above? Please do correct me, as I'm fairly new to the codebase and could be overlooking something!), and we can't use objects (because the spec allows for multiple identical keys and npm plans on using that option)... that brings us back to strings?
I know that's not an ideal solution, as you mentioned, but given the circumstances - if my reasoning is correct - I think it's the best option. What say you?

Edit:
I already got the lockfile to parse/stringify arrays before thinking about this. It is not extremely difficult, but I feel it would be easier to save this breaking change for another occasion and handle all its ramifications separately.

Like I said above: you can totally get away with strings until such a time when you're ready for the breaking change. integrity will almost definitely only contain sha512 for at least a couple of years, is my bet. So if the issue is visual diffs, there will be no difference between the two approaches until a future where node integrates new algorithms and npm starts publishing using them (I think it'll be a while after node even adds sha3 support for npm to actually start using it -- sha512 is much more stable, implementation-wise, and there's no reason to believe it's going to be vulnerable any time soon)

@imsnif that's a bummer, for some reason I thought arrays were possible in the lockfile. Alright, let's go with the string and see how bad it hurts. I don't like this, but the only alternative without a breaking change seems to be treating arrays as objects with numeric indexes, which would make both the lockfile (0: sha256-xxxx) and the code using it quite weird.

Has there been any progress on this?

Given the strong security promises on the website, it is very disappointing to see SHA-1 checksums in use.

Hey @vog - you can follow our progress here: https://github.com/yarnpkg/yarn/pull/5042
Hoping to have this merged really soon.

Now that that PR has been merged, should this issue be closed?

The only missing piece now is that some dependencies (e.g. those installed from git repositories) still don't have an integrity field. npm includes an integrity field for these packages, so I'm assuming they must calculate it after downloading the package.

npm has not been generating integrity hashes for git dependencies for a while, because git packages can often change hashes unless devs are careful to make builds reproducible.

We have computed the very first chosen-prefix collision for SHA-1. In a nutshell, this means a complete and practical break of the SHA-1 hash function, with dangerous practical implications if you are still using this hash function. To put it in another way: all attacks that are practical on MD5 are now also practical on SHA-1.

https://sha-mbles.github.io/

Perhaps it would be worth using SHA1DC as a drop-in replacement?

Closing since SHA256 support has been released more than a year ago 🙂

@arcanis how did this manage to get left open for so long in "high-priority" when it was already fixed?
