Berry: [Discussion] Zero-install & repository size

Created on 19 May 2019 · 16 Comments · Source: yarnpkg/berry

I really like the concept of getting rid of the yarn install step for deploying to production.

But I have projects that have 500MB of dependencies. Adding .yarn would dramatically increase the repo size.

The problem is that git hosts have limits on repo size. (E.g. GitHub recommends repo sizes to be <1GB.)

Git LFS could be a solution, but it seems to be fairly expensive.

I'm curious about your thoughts on this.

discussion

brillout

👍2

Most helpful comment

I feel like that is being erased

Zero-install is optional. If you don't like it, don't use it. It's a bit like ranting against this new weapon called "swords" because it's easier to cut your own fingers with it than with a club.

We're not stuck in the old days anymore where npm could take several minutes to install. The modern versions of the package managers that we use are fast, especially when you have cached versions of your dependencies on your system already.

I worked on package setups for the past 2+ years as my daily job. I saw situations where cached installs still amounted to more than two minutes. What's the cost in terms of feedback loop? What's the cost in actual CPU time? What's the cost in build failures because of bugs during yarn install?

Additionally, Yarn's main selling point, perhaps above its speed, is the stability of its builds. We aim to guarantee you zero surprises. With yarn install this is only partially true, because it's quite possible that you'll forget to run an install and unknowingly compile your code against bogus dependencies. Zero-installs are a way to push back the theoretical limits of this statement - you can now ensure that the state of your project is always right, regardless of where you are in the history.

Finally, whether you share those concerns isn't really the point - Yarn is used by millions nowadays. We're used by small companies, by medium companies, by very large companies. They have different use cases, and sometimes need different solutions. I believe this one is applicable to many scenarios (we're using it to develop Yarn itself, and I think it has proved great so far), but maybe you're not the target.

arcanis on 22 Jul 2019

👍14

All 16 comments

At the time I started considering this option I contacted some folks at GitHub to make sure it wouldn't become a problem on their side. From what they told me it should be perfectly fine to host this amount of binary data (at least on GitHub).

What's nice with this approach (and maybe it would be even more true in the wake of the GitHub package registry) is that it's open to various optimizations. For example, assuming that many projects use the same version of the Lodash archive, I would expect GitHub to eventually be able to merge them into a single copy in their "store". From the consumer perspective it wouldn't change a thing, except that their storage wouldn't be affected by the number of packages they use.

Finally, Zero-Install is optional and as always there's a tradeoff. Smaller libraries might not really need it, and they can just use the common yarn install workflow everyone is used to. Enterprise applications and large projects with many contributors, however, will likely be fine with trading some MB against the improved DevX and guaranteed stability.

arcanis on 19 May 2019

👍2

While cloud-based services like github and bitbucket might support large repositories because they've got a lot of resources, on-prem solutions are more limited. Our enterprise git server becomes totally unresponsive when a new designer is on-boarded and cloning the designs repo, leading to developer frustration, CI failures and—most importantly—CD failures. While an initial size of ± 250 MB is still okay, this size will increase significantly once we've updated our dependencies a couple of times.

Zero install is still a great feature though. It solves the "I switched branch and now a dependency is missing" problem, it speeds up CI significantly and it lowers our dependence on our on-prem npm registry.

I've been thinking about this for a while now, mostly when trying to sleep, and this is where I'm at:

Commit them into git
- + no install necessary, "clone & go"
- - the size of the git repository increases significantly upon updating dependencies
- - on-disk duplication of dependencies between projects
Use git-lfs
- + no install necessary, "clone & go"
- + size of the git repo stays small
- ? is this going to work well? we don't have large files but many files (berry's .yarn contains 2208 zipfiles, the entire folder is 180MB so on average about 81.5kB per zip)
- - on-disk duplication of dependencies between projects
Set yarn cache folder to a system folder
- + size of the git repo stays small
- + no on-disk duplication of dependencies between projects
- - install necessary
Use a separate git project mounted as .yarn submodule
- + no install necessary, "clone --recurse-submodules & go"
- + size of main git repo stays small, dependencies project can be retired and replaced once it becomes too large
- + dependencies project can be used in multiple projects, sharing dependencies
- - submodules are hard
- - on-disk duplication of dependencies between projects, unless you have git fu
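
The average archive size quoted under the git-lfs option can be checked with a quick calculation. This is only a sketch: the helper name `avg_zip_kb` is made up for illustration, and it assumes 1MB = 1000kB.

```shell
# Hypothetical helper: average archive size in kB, given a folder size in MB
# and a number of zip files. Assumes 1 MB = 1000 kB.
avg_zip_kb() {
  awk -v mb="$1" -v files="$2" 'BEGIN { printf "%.1f\n", mb * 1000 / files }'
}

# Figures quoted for berry's .yarn folder: 180MB across 2208 zip files.
avg_zip_kb 180 2208
```

Archives this small are well below the sizes git-lfs is normally used for, which is the doubt raised in that option.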

bgotink on 19 May 2019

👍5

A library cache-manager that manages the git submodule would be nice.

It would manage a .cache git submodule to be used by tools such as yarn or parcel.

Symlinks would be taken care of:

~~~shell
$ file .yarn
.yarn: symbolic link to ./cache/yarn
~~~
~~~shell
$ file frontend/.parcel
frontend/.parcel: symbolic link to ../cache/parcel
~~~

The library would abstract away the git submodule complexity and, from the user perspective, it would just work. The only thing the user would need to do is to save the cache repo address in a file .cacherepo:
~~~shell
$ cat .cacherepo
[email protected]:brillout/my-awesome-app__cache
~~~

Every time yarn runs, and prior to any operation, it would call require('cache-manager').init('.yarn/') where .yarn/ is the path of yarn's cache directory. It then automatically sets up the symbolic link and initializes the git submodule.

It would also occasionally git push --force and squash old commits to reduce the cache repo size.

If the user doesn't set the .cacherepo file then cache-manager is disabled and no .cache git submodule is created.

It would declutter the code repo while reaping the benefits of zero-install.
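
A rough sketch of the init step described above might look like the following. Everything here is hypothetical: cache-manager does not exist, the function name and arguments are invented for illustration, and the submodule set-up is left as a comment so that the sketch runs without a remote. The directory name `.cache` follows the submodule name used in the description.

```shell
# Hypothetical sketch of the proposed init step. "cache-manager" does not
# exist; every name here is illustrative only.
cache_manager_init() {
  project_dir=$1   # project root that may contain .cacherepo
  tool=$2          # tool name, e.g. "yarn"
  link=$3          # path the tool expects, e.g. ".yarn"

  # No .cacherepo file: the feature stays disabled and nothing is created.
  if [ ! -f "$project_dir/.cacherepo" ]; then
    echo "disabled"
    return 0
  fi

  # A real implementation would set up the submodule here, for example:
  #   git -C "$project_dir" submodule add "$(cat "$project_dir/.cacherepo")" .cache
  # This sketch only creates the directory.
  mkdir -p "$project_dir/.cache/$tool"

  # Point the tool's cache path at its folder inside .cache, unless
  # something already exists at that path.
  if [ ! -e "$project_dir/$link" ] && [ ! -L "$project_dir/$link" ]; then
    ln -s ".cache/$tool" "$project_dir/$link"
  fi
  echo "enabled"
}
```

Called as `cache_manager_init . yarn .yarn`, it prints `disabled` when there is no `.cacherepo` file and otherwise leaves `.yarn` as a symbolic link into `.cache/yarn`.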

@arcanis would yarn be interested in using such a library?

@bgotink what do you think? You seem to have thought a lot about this.

Would be nice to have other tools on board, such as parcel.

brillout on 11 Jun 2019

It's an interesting idea. The most complicated part would be this:

It would also occasionally git push --force and squash old commits to reduce the cache repo size.

At the moment the cache is implemented within the core and cannot be replaced, but I'd like to offer a way for plugins to replace it by whatever implementation they'd like. It's not too hard technically, the only subtlety is that contrary to how plugins currently work we could only have one cache system at a time.

Under this approach you wouldn't need symlinks etc - your cache implementation would just use the submodule as it is.

arcanis on 11 Jun 2019

Heh. Let's do some quick math on what to expect with this feature:

Say project has 250MB of dependencies (not uncommon)
Say during project's history dependencies are updated 20 times (not uncommon)

It means that to clone this project you need to download around 5GB of data vs downloading only 250MB (possibly cached from other projects) with yarn install. I didn't even mention branches.
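
The estimate above is a straight multiplication. It assumes that every update replaces the entire dependency set, so it is best read as a worst case. A minimal sketch (the function name is hypothetical):

```shell
# Hypothetical helper: worst-case data (in MB) pulled by a full clone when
# every dependency update re-adds the whole dependency set to the history.
worst_case_clone_mb() {
  deps_mb=$1
  updates=$2
  echo $(( deps_mb * updates ))
}

# Figures from the comment: 250MB of dependencies, updated 20 times.
worst_case_clone_mb 250 20
```

This prints 5000, i.e. the roughly 5GB quoted above.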

Other concerning things:

Cloning dependencies from GitHub is far less performant than downloading them from a CDN that lives 10ms from you instead of on the other side of the ocean. Git is also not perfect at parallelising these downloads.
People who worked in data science and tried to put "big" datasets in git repositories know git operations become slow when there are big files committed: checkout, merge, rebase you name it. It quickly becomes annoying.
When working with monorepos you don't always need all dependencies of all projects available. It's quicker to clone the monorepo and install only what you need.
For production you don't need devDependencies, only production dependencies.

So in short it shouldn't be named "zero-install", but "Install with git instead of yarn all dependencies that any project in this repository ever historically used, also install devDependencies even if you don't need them".

sheerun on 14 Jun 2019

👍5

Your analysis isn't much better than a guess. There are various factors in play. For one, you assume that each and every package will be upgraded 20 times. From my experience this is rarely the case. Yarn (and even more so the v2) is pretty good at reusing packages during upgrades. Upgrading from Webpack 3 to Webpack 4 adds only 80 packages, compared to the ~400 that are part of Webpack itself.

Of course, there's no denying that a git clone will be slower with more data - but it's still faster than a clone and an install - especially when you factor in the number of times you make a clone per day versus the number of times you run an install. The balance might tip at some point, but it remains to be seen how much time it takes in practical cases, and whether the possible solutions alleviate the issue.

People who worked in data science and tried to put "big" datasets in git repositories know git operations become slow when there are big files committed: checkout, merge, rebase you name it. It quickly becomes annoying.

The Zero Install approach originates from another feature, the Offline Mirror, which follows the exact same principle except that the installs still need to be run. It was released in 2016. Since then, we have never once heard that this feature was causing issues. In fact, not only did we hear the exact opposite, but I even witnessed it myself by working on such a codebase. So why does it work well?

Your tradeoffs are not everyone's tradeoffs. A large repository might be a cost that you don't wish to pay and that's ok, but for someone else this might not be true. Deployment stability and developer experience are two areas that are typically very hard to scale, but repository size is easily measurable and optimizable. And git clone --depth is a thing, too. Not perfect, but as the incentives shift so does the tooling.
Both the offline mirror and the zero-install approach are completely optional. If you are in the case I mentioned and don't wish to pay the cost, just put enableGlobalCache in the yarnrc file at the root of the repository and you won't have to think about it ever again. It's a default, but not a requirement.

So in short it shouldn't be named "zero-install", but "Install with git instead of yarn all dependencies that any project in this repository ever historically used, also install devDependencies even if you don't need them".

While not directly related, I don't find it productive or very ethical to post FUD on Twitter without even waiting to hear what others have to say about your findings.

arcanis on 14 Jun 2019

👍1

I guess this is a good feature for Facebook-sized private repositories where everyone is on the same page about git clone --depth and there are optimisations in place for the whole team for handling enormous repositories, but I would be annoyed if I found someone using this feature in the wild because of all the reasons I've mentioned.

Offline Mirror is a useful feature, and I don't think it's comparable because you don't need to create the mirror directory inside the project or commit it (e.g. you can upload it to S3 and download it from there for production). On the other hand, it seems that "zero-install" will encourage committing big files into repositories.

Exact numbers for my analysis don't matter because downloading even 2x more dependencies than necessary is not good. Also, you cannot avoid downloading devDependencies for production even if you use git clone --depth, and they usually weigh more than production dependencies.

You're right about the message on Twitter; I should have waited at least until you answered. Unfortunately all of my arguments still hold, and I posted about it because I would find it harmful if someone decided to do something like this on a public repository. I guess it would be fine if this feature could be enabled only with "private": true, because in such a case I don't care.

sheerun on 14 Jun 2019

I mean, some of your points make sense, and we don't necessarily have answers to all of them. Still, my opinion based on the people I discussed with is that Yarn caters to two different audiences: independent developers, and companies. The two don't always have the same needs, and having options for both is important.

Overall, I think we agree that the feature makes sense but the messaging should be made more clear. I think a table like the one @bgotink started (with the various pros and cons) would be a good addition to the documentation (maybe on a separate page, for example behind a "Should I use Zero-Install?" link). If you're willing to give us a hand, we'd be happy to review such a PR! 🙂

arcanis on 14 Jun 2019

👍2

I also agree. One more comment: I think one of the reasons why Yarn implemented this feature is to somewhat decentralise package management (a good cause) by committing all code, including dependencies, into git repositories. But I think it could backfire, because some operations on such repositories would be very hard to perform without a centralized service like GitHub (for example, git blame or git log -p -- package.json needs the whole history downloaded).

sheerun on 14 Jun 2019

Some quick data obtained from the Berry repository (which is about 6 months old). The size on-disk of the cloned repo is 253M. After running a tree filter + an aggressive gc, the size went down to 149M. The size of the cache before being purged is 88M. That would give 16M of extraneous data (~6%).
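
The arithmetic behind the 16M figure can be spelled out as below. This is only a sketch of the subtraction the comment appears to use; the helper names are invented for illustration.

```shell
# Hypothetical helper: data (in MB) that the tree filter removed from the
# repository but that is not part of the current cache.
extraneous_mb() {
  full_mb=$1       # size of the cloned repo
  filtered_mb=$2   # size after the tree filter and aggressive gc
  cache_mb=$3      # size of the cache before being purged
  echo $(( full_mb - filtered_mb - cache_mb ))
}

# Hypothetical helper: one value as a percentage of another, one decimal.
percent_of() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", part * 100 / whole }'
}

extra=$(extraneous_mb 253 149 88)
echo "extraneous: ${extra}M ($(percent_of "$extra" 253)% of the clone)"
```

With the quoted sizes this gives 16M, a little over 6% of the 253M clone.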

It would be interesting to run a similar experiment on a production application 🤔

arcanis on 24 Jun 2019

I think you might not have pruned all zip files from the berry repository. Here's how to properly do it:

git clone --mirror https://github.com/yarnpkg/berry
cd berry.git
du -sh .
161M .

Then download the BFG tool: https://rtyley.github.io/bfg-repo-cleaner/

java -jar ~/Downloads/bfg-1.13.0.jar --delete-files '*.zip' --no-blob-protection .
git reflog expire --expire=now --all && git gc --prune=now --aggressive
du -sh .
17M .

So the overhead seems to be about 850% (161M vs 17M, roughly 9.5 times the size)

If you do just a shallow clone, the repository is 97MB:

git clone --mirror https://github.com/yarnpkg/berry --depth 1
cd berry.git
du -sh .
97M .

This means a full clone downloads an extra 64MB of historical .zip dependencies and 80MB of current dependencies.
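
The breakdown can be reproduced from the three du figures. The sketch below follows the same subtractions as the comment and treats overhead as (size with zips − size without) / size without. The variable names are illustrative.

```shell
# Sizes in MB, as reported by du -sh in the comment above.
full_clone=161      # mirror clone with full history
shallow_clone=97    # mirror clone with --depth 1
no_zips=17          # full history after deleting every *.zip

# Zip archives that only exist in the history.
historical_zips=$(( full_clone - shallow_clone ))
# Zip archives in the latest commit (approximated the same way as above).
current_zips=$(( shallow_clone - no_zips ))
# Overhead relative to the zip-free repository, truncated to a whole percent.
overhead_pct=$(( (full_clone - no_zips) * 100 / no_zips ))

echo "historical=${historical_zips}MB current=${current_zips}MB overhead=${overhead_pct}%"
```

This prints `historical=64MB current=80MB overhead=847%`.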

sheerun on 24 Jun 2019

👍2

Some numbers from an angular repo at work:

# initial zero-install at angular 7
$ du -sh .git
108M    .git
$ du -sh .yarn
 90M    .yarn

# after updating to angular 8
$ du -sh .git
167M    .git
$ du -sh .yarn
110M    .yarn

Adding 60MB per upgrade is too much for us to safely commit it into our repository. The repo would grow by at least 200 MB per year and this is only our repo with the smallest number of dependencies (it's the root of our internal stack, all other repos depend on packages from this repo).
Especially internal dependencies are updated often, so I'd expect the number to be a lot more than 200MB/year for some other repos.
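
To make the projection concrete, here is a minimal sketch using the du figures above. The number of upgrades per year is an assumption picked for illustration, not a figure from the measurements, and the helper name is hypothetical.

```shell
# Hypothetical helper: growth in MB between two du measurements.
growth_mb() {
  echo $(( $2 - $1 ))
}

git_growth=$(growth_mb 108 167)    # .git before and after the Angular 8 update
yarn_growth=$(growth_mb 90 110)    # .yarn before and after the Angular 8 update

# ASSUMPTION: four comparable upgrades per year, chosen only for illustration.
upgrades_per_year=4
yearly_git_growth=$(( git_growth * upgrades_per_year ))

echo ".git grew by ${git_growth}MB, .yarn by ${yarn_growth}MB"
echo "projected .git growth: ${yearly_git_growth}MB per year"
```

The .git folder grows by 59MB while .yarn itself only grows by 20MB, since the history keeps the replaced archives as well as the new ones.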

bgotink on 26 Jun 2019

👍3

I am really struggling to see the benefits of zero install through the dozens of downsides that I see.

The entire reason I switched to Yarn was because of the way Plug'n'Play shares a global cache by default. When I think about how zero install is currently implemented, I feel like that is being erased. Yes, .yarn/cache will be smaller than node_modules, but is committing that folder to version control and wasting all of the corresponding disk space really worth not having to wait a few seconds for Yarn to install the dependencies on its own? We're not stuck in the old days anymore where npm could take several minutes to install. The modern versions of the package managers that we use are fast, especially when you have cached versions of your dependencies on your system already. If you have to clone down a git repo that already has the dependencies managed, not only do you have to download the dependencies that you don't have cached, you have to download the ones that you do have cached, and every single version of every dependency that the repo has ever used, even if it hasn't been used in years.

The community as a whole decided years ago that committing node_modules to source control was a bad idea. How is this any better?

partheseas on 22 Jul 2019

👍1

At the very least, I currently only see a way to opt out on a per project basis. Will it be possible to have a system wide or at least workspace wide opt out?

partheseas on 23 Jul 2019

Sure. Just put a ~/.yarnrc file with enableGlobalCache: true, and Yarn will have the same behavior as before (except that it'll still be using PnP).
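
As a sketch, the opt-out comes down to one line in a user-level configuration file. The helper below is hypothetical and takes the file path as an argument. The thread names the file ~/.yarnrc, while released versions of Yarn 2 read .yarnrc.yml, so use whichever name your version expects.

```shell
# Hypothetical helper: make sure the given Yarn configuration file contains
# the enableGlobalCache setting. The file path is supplied by the caller.
enable_global_cache() {
  rc_file=$1
  # Append the setting only if the key is not already present.
  if ! grep -q '^enableGlobalCache:' "$rc_file" 2>/dev/null; then
    printf 'enableGlobalCache: true\n' >> "$rc_file"
  fi
}

# Example (not run here): enable_global_cache "$HOME/.yarnrc"
```

Running the helper twice leaves a single `enableGlobalCache: true` line in the file.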

arcanis on 23 Jul 2019

👍3
