Yarn: Usage of uncompressed tarballs

Created on 7 Oct 2016 · 16 comments · Source: yarnpkg/yarn

Something to consider as a future enhancement, post-launch

Some people may want to store tarballs of all their dependencies in their source control repository, for example if they want a fully repeatable/reproducible build that does not depend on npm's servers. Storing compressed tarballs in Git or Mercurial is generally bad news: every update to a package results in a new copy of the entire file in the repo, which can make the repo very large. Every time you clone the repo, the full history is transferred, including every previous version of all the packages, so even deleting the binary files has a lasting effect until you rewrite history to kill them.

Instead, we should try storing _uncompressed_ tarballs (i.e. .tar files). Since tar files are mostly plain text, in theory Git/Mercurial should be able to diff them more easily when a new version of a module is added while an old version is removed, and store just the delta rather than an entirely new blob.
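To make the "uncompressed tarball" idea concrete: a .tgz is just a gzip wrapper around a .tar, so stripping gzip leaves the inner archive byte-for-byte intact. A minimal sketch, using throwaway temp paths rather than Yarn's real layout:

```shell
# Build a tiny fake package, pack it both ways, and confirm the
# gunzipped .tgz is byte-identical to the uncompressed .tar.
set -e
dir=$(mktemp -d)
mkdir -p "$dir/pkg"
echo '{"name":"pkg","version":"1.0.0"}' > "$dir/pkg/package.json"

tar -cf "$dir/pkg-1.0.0.tar" -C "$dir" pkg           # uncompressed tarball
gzip -c "$dir/pkg-1.0.0.tar" > "$dir/pkg-1.0.0.tgz"  # compressed copy

gunzip -c "$dir/pkg-1.0.0.tgz" > "$dir/roundtrip.tar"
cmp "$dir/pkg-1.0.0.tar" "$dir/roundtrip.tar" && echo "identical"  # prints "identical"
```

Because the .tar's payload is mostly text, delta-based storage in Git/Mercurial has something to work with; the gzipped form looks like opaque binary to both.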

Related: this was implemented in Shrinkpack: https://github.com/JamieMason/shrinkpack/issues/40 and https://github.com/JamieMason/shrinkpack/commit/7b2f341408be4f0415714ec57534debfdaaa3fbf#comments. According to the comments on the commit, this actually sped up npm install, as npm no longer needed to decompress the archive every time. This makes sense, since you're removing the overhead of gzip from installation time.

cat-feature needs-discussion triaged


All 16 comments

A few arguments from an internal discussion:

  • There is no difference to Mercurial whether a file is binary or text (unclear about Git). Either way, larger binary files have a bigger negative impact on a source control system.
  • When a file changes, Mercurial tries to store only a diff, which is why large mutable files are better left uncompressed - more chance of saving some space.
  • The files from npm that we store in source control are saved as package-x.y.z.tar.gz; they are immutable and never change, so the optimisation above never kicks in.
  • For example, a full React Native node_modules is 200 MB and 37,000 files when installed. However, in the mirror we store 800 files of 25 MB total, with most .tar.gz files around 100 KB. That was considered fine for the Mercurial monorepo we have at FB.

That said, we can't deny the speed improvement of unpacking uncompressed tars, so there may be a reason to consider this feature.

+1

shrinkpack has become a huge part of our development workflow. When packages are upgraded and the build is "shrinkpacked", individual tar files are created only for the new packages, and the outdated versions are automatically dropped. That's because the names of the resulting .tar files are a function of the package versions. Here's a short snapshot of what a node_shrinkwrap directory would look like:

[screenshot: node_shrinkwrap directory listing]

You can follow the git history of this directory to figure out which dependencies were upgraded and when, e.g. react-native-animatable in this example...

[screenshot: git history for the react-native-animatable tarball]

...with quick and easy access to the backup:

[screenshot: the previous tarball version, available from git history]

With shrinkpack, the diffs in GitHub closely reflect the commit message and the actual changes being made. Committing and pushing the result of a new shrinkpack is a better experience, IMO, than doing the same after a yarn pack because, as mentioned, changes are handled at the package-version level rather than the repository-version level. So you're only pushing up individual .tar files, which is fast, especially if you're using Git LFS, and you don't need to touch your package.json version number at all.

@joncursi, we have an offline mirror feature that does what you want: https://yarnpkg.com/blog/2016/11/24/offline-mirror.
The only thing missing is cleanup, which we don't do on purpose because the tarball storage is shared by multiple projects.

@bestander very cool, thank you for sharing that blog post. I didn't pick up on this feature from the CLI docs. This would be a lovely addition to https://yarnpkg.com/en/docs/cli/config

I use shrinkpack locally in each project, rather than globally across multiple projects. I would like to do the same with yarn, which would require old tar files to be removed when packages are upgraded. I only care about maintaining the latest working version of each package; if I need to dig up an older version, it's always there in the git history. But I don't need or want to store it in the mirror forever.

My use case is to implement the mirror less for offline purposes and more for maintaining a concise set of package backups in case packages are suddenly unpublished from npm. Risk control. As far as I know, that was largely the intent behind shrinkpack in the first place.

Is there a smarter way to automate package removal from the mirror when a new package version is added? Perhaps a config option in .yarnrc to specify this (feature request)? At the moment it seems I have to manually do...

yarn add package@new-version && rm -rf yarn_mirror/package@old-version

Also, the same issue presents itself when removing a package from use in the repo entirely...

yarn remove package && rm -rf yarn_mirror/package-*

@joncursi, this is a bit off-topic for this issue; better to open an RFC discussion of what is needed.

As for the cleanup, it can be a 10-line JS/bash script you can run alongside yarn until we implement it.
The script should:

  • remove all files in the offline mirror that are not present in the yarn.lock file
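A minimal sketch of that script in shell, assuming the mirror holds .tgz files whose names appear verbatim in yarn.lock's `resolved` URLs. The mirror path, lockfile contents, and package names in the demo below are all made up for illustration:

```shell
# Hypothetical prune helper: delete any mirror tarball whose file name
# no longer appears anywhere in yarn.lock.
prune_mirror() {
  mirror=$1
  lockfile=$2
  for tarball in "$mirror"/*.tgz; do
    [ -e "$tarball" ] || continue
    # yarn.lock's "resolved" lines contain each tarball's file name
    if ! grep -q "$(basename "$tarball")" "$lockfile"; then
      echo "pruning $tarball"
      rm -f "$tarball"
    fi
  done
}

# Throwaway demo: 1.1.3 is referenced by the lockfile, 1.1.2 is not.
demo=$(mktemp -d)
mkdir "$demo/mirror"
: > "$demo/mirror/left-pad-1.1.2.tgz"
: > "$demo/mirror/left-pad-1.1.3.tgz"
echo '  resolved "https://registry.yarnpkg.com/left-pad/-/left-pad-1.1.3.tgz"' > "$demo/yarn.lock"
prune_mirror "$demo/mirror" "$demo/yarn.lock"
ls "$demo/mirror"   # only left-pad-1.1.3.tgz remains
```

A real version would want to handle scoped packages and unusual `resolved` URLs, but the name-matching idea is the whole trick.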

This issue is specifically for switching from compressed (.tar.gz) to uncompressed (.tar) tarballs, anything else should be discussed in a separate task 😄

From an implementation standpoint, what sort of risks and level of effort would you foresee simply by making this a flag that you can pass to the CLI? Shrinkpack is written so that uncompressed tarballs are the default, but you can opt into compressed packages with a flag. What would the impact be for simply implementing the inverse behavior (opt-in to uncompressed with a flag)?

It seems like this would address the issue of potentially unpleasant changes for those already using the offline mirror to commit modules locally, while allowing the uncompressed behavior for those who don't mind aliasing a couple of yarn commands.

Edit: Even more simply, the flag could just be defined in the .yarnrc

This is actually the main thing preventing us from switching to yarn, as it already admirably solves the determinism issue and the offline mirror feature (thanks for the link, btw!) takes care of the rest. However, it leaves us with the undesirable (from our perspective) situation of committing binary packages. In our experience, Git does very well with simple tar, as most updated packages are recognized as renamed with tiny deltas, and the compression does all the rest. Thus, the actual bandwidth used is dramatically lower.

Yarn puts the same tarballs that it downloads from the registry into the offline mirror folder.

To allow uncompressed tarballs, you would need to unzip each one first and then zip it again.

Also, the tarballs have versions in their file names, so git won't be able to track version updates as small diffs.


You wouldn't need to unzip and then zip again; you'd simply need to decompress the tarball. The inner .tar can stay the same, it just won't be compressed.

Not sure about Git, but Mercurial tracks copied files, so it could track new versions of dependencies as copies of old ones if they're similar enough.


Thanks, Daniel, good to know.

Although someone would need to show that this advanced Mercurial/Git tracking actually happens on a real example before we consider this change, right?


Hi @bestander, we use git with Bitbucket and npm + shrinkwrap on some projects. Here is what it looks like when the minor version of a tar changes:

[screenshot: Bitbucket diff showing the tarball tracked as a rename]

Here are sample tar files for the package from the screenshot that was tracked as renamed:

[attachment: tars.zip]

Thanks

Although someone would need to show that this advanced Mercurial/Git tracking actually happens on a real example before we consider this change, right?

I've been meaning to test it out, I just haven't had time to do so.

Hey there! It's been a while, and since you're busy, I thought I'd make this as painless as possible.

Check out this shrinkpack tar proof of concept

This seems like a reasonable idea after all.

So how would it work?

  1. (if the file is missing from the offline mirror) download the tar.gz from the registry
  2. unzip it and copy the tar file to the offline mirror
  3. unpack the tar into the cache
  4. if prune-offline-mirror is enabled, and a tarball of one version of a package was added to the offline mirror while another version was removed, register the add/remove with git/hg mv
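Steps 1-3 can be sketched in shell, simulating the registry download with a locally built .tgz so no network is involved. The directory names below are invented for the demo:

```shell
set -e
work=$(mktemp -d)
mkdir -p "$work/registry" "$work/mirror" "$work/cache"

# Stand-in for the registry's compressed tarball
mkdir "$work/pkg"
echo '{"name":"pkg","version":"2.0.0"}' > "$work/pkg/package.json"
tar -czf "$work/registry/pkg-2.0.0.tgz" -C "$work" pkg

# Steps 1+2: "download" the .tgz, but store only the uncompressed .tar
gunzip -c "$work/registry/pkg-2.0.0.tgz" > "$work/mirror/pkg-2.0.0.tar"

# Step 3: unpack from the mirror straight into the cache - no gunzip pass
tar -xf "$work/mirror/pkg-2.0.0.tar" -C "$work/cache"
ls "$work/cache/pkg"   # package.json
```

Subsequent installs go straight from the mirror's .tar to the cache, which is where the claimed CPU win comes from.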

Results:
A. Potential CPU wins, because the unzip in step 2 is skipped when installing from the offline mirror.
B. Space wins, if tarball contents are similar at step 4.
C. Checking in unzipped tarballs has a negative impact on repo size.
D. Step 4 seems a bit complex, with all sorts of edge cases.

So if A + B > C + D, then why not?
A, B and C can be measured, although D is subjective.
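The size cost C can be estimated locally by packing the same tree both ways and comparing. (Note that Git stores loose objects and packfiles zlib-compressed anyway, so the on-disk repo cost of an uncompressed .tar is smaller than this raw gap suggests.) A quick sketch with a synthetic text payload:

```shell
set -e
t=$(mktemp -d)
mkdir "$t/pkg"
seq 1 2000 > "$t/pkg/index.js"   # mostly-text payload, like typical JS packages

tar -cf  "$t/pkg.tar" -C "$t" pkg   # uncompressed
tar -czf "$t/pkg.tgz" -C "$t" pkg   # gzip-compressed

tar_size=$(wc -c < "$t/pkg.tar")
tgz_size=$(wc -c < "$t/pkg.tgz")
echo "tar: $tar_size bytes, tgz: $tgz_size bytes"
```

Running the same comparison over a real offline mirror, before and after a few dependency upgrades, would give the numbers needed to weigh A + B against C.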

Bumpity bump! I can work on this if you guys want.

@bfricka, of course, give it a try.
We would need to see a few real-life examples of the impact this feature provides, though.
