Pnpm: Content-addressable storage

Created on 9 Apr 2020  ·  34 Comments  ·  Source: pnpm/pnpm

Just came across this package, which "replaces files that have identical content with hardlinks". We could do the same. We already use hardlinks to link packages from the store to node_modules. Why don't we also hardlink identical files in the store? This could save even more disk space and potentially make installations even faster.

The easiest way to do it would be to save every file of every unpacked package in a location that contains the hash of the file (.pnpm-store/<file hash>), and to hardlink files from that location. However, how would we prune the store in this case? How could we know whether some files are no longer used?
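The idea above can be sketched in a few lines of Node.js. This is only an illustration under assumptions, not pnpm's actual implementation: the helper names `addToStore` and `linkFromStore`, the flat store layout, and the choice of SHA-256 are all hypothetical.

```typescript
import * as crypto from "crypto";
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical helper: copy a file into <storeDir>/<sha256 of its content>
// unless an entry with that hash already exists. Returns the store entry path.
function addToStore(storeDir: string, srcFile: string): string {
  const content = fs.readFileSync(srcFile);
  const hash = crypto.createHash("sha256").update(content).digest("hex");
  const target = path.join(storeDir, hash);
  if (!fs.existsSync(target)) {
    fs.mkdirSync(storeDir, { recursive: true });
    fs.writeFileSync(target, content);
  }
  return target;
}

// Hypothetical helper: expose a store entry inside a package directory
// as a hardlink, so both names point at the same inode.
function linkFromStore(storeFile: string, dest: string): void {
  fs.mkdirSync(path.dirname(dest), { recursive: true });
  fs.linkSync(storeFile, dest);
}

// Usage: two source files with identical content end up as one store entry.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "cafs-demo-"));
const store = path.join(root, "store");
const srcA = path.join(root, "a-index.js");
const srcB = path.join(root, "b-index.js");
fs.writeFileSync(srcA, "module.exports = 1;\n");
fs.writeFileSync(srcB, "module.exports = 1;\n");

const entryA = addToStore(store, srcA);
const entryB = addToStore(store, srcB);
linkFromStore(entryA, path.join(root, "packages", "a", "index.js"));
linkFromStore(entryB, path.join(root, "packages", "b", "index.js"));

console.log(entryA === entryB); // both packages share the same store entry
```

Because the store entry is keyed only by content, the second `addToStore` call writes nothing; it just returns the path that already exists.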

Progress

  • [x] implement CAFS #2487
  • [x] implement pruning of CAFS #2533
  • [x] in case of identical files, when one has executable bit and another does not, then they should be two different files in the store. #2504
  • [x] make side-effects cache work with CAFS (probably won't be done) #2562
performance XL breaking change feature

All 34 comments

I work on a proof of concept of a content-addressable filesystem for pnpm's store, to see if it will provide performance/disk space improvements.

Isn’t it basically what git does? 😀

It is.

So I did a PoC. The disk saving is about 10% when there are 1000 packages in the store. ~There is also a small boost in speed but I am not sure about that.~ There is also room for speed improvement. My current version is as fast as the current pnpm.

Also, there is almost no difference between using sha1 and sha256.


I guess, now the biggest question is, how to prune such a store. We need to save information about each file's dependents.

I'm not sure how you implemented this, but here is my idea: pruning is as simple as counting the number of hardlinks. The store folder contains files named by their hash, and packages hardlink to these files.

$ tree --inodes
.
├── [5513219]  packages
│   ├── [5513220]  a
│   │   └── [5510601]  index.js
│   └── [5513221]  b
│       ├── [5510601]  index.js
│       └── [5510603]  second.js
└── [5513218]  store
    ├── [5510603]  anotherhash
    └── [5510601]  somehash

4 directories, 5 files
$ find store/ -type f -links 1
# nothing to prune

# now remove package b
$ rm -r packages/b
$ tree --inodes
.
├── [5513219]  packages
│   └── [5513220]  a
│       └── [5510601]  index.js
└── [5513218]  store
    ├── [5510603]  anotherhash
    └── [5510601]  somehash

3 directories, 3 files
$ find store/ -type f -links 1
store/anotherhash
# ↑ this file has only one link, meaning there are no hardlinks to it other than the store entry itself.

BTW, there is a small caveat. If I have identical files, but one has the executable bit and the other does not, then they should be two different files in the store.

@alexeyten if the solution you described works on Windows, then it is perfect! (this is what I have found on SO)

Well, I've checked on Linux/ext4. I don't have NTFS (Windows) or APFS (macOS).
But it looks like these file systems have a link count too.

Actually, you don't need all links, you just need a number https://nodejs.org/api/fs.html#fs_stats_nlink
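A minimal sketch of pruning by link count using `fs.Stats.nlink`, mirroring the shell example from earlier in the thread. The function names `findPrunable` and `pruneStore` and the flat store layout are assumptions made for illustration, not pnpm's real code.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical helper: list store entries whose link count is 1,
// i.e. files that no package directory hardlinks to anymore.
function findPrunable(storeDir: string): string[] {
  return fs
    .readdirSync(storeDir)
    .map((name) => path.join(storeDir, name))
    .filter((file) => {
      const stats = fs.statSync(file);
      return stats.isFile() && stats.nlink === 1;
    });
}

// Hypothetical helper: delete every prunable entry and report what was removed.
function pruneStore(storeDir: string): string[] {
  const removed = findPrunable(storeDir);
  for (const file of removed) fs.unlinkSync(file);
  return removed;
}

// Usage: rebuild the packages/a, packages/b, store layout from the example.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "prune-demo-"));
const storeDir = path.join(root, "store");
fs.mkdirSync(storeDir);
fs.mkdirSync(path.join(root, "packages", "a"), { recursive: true });
fs.mkdirSync(path.join(root, "packages", "b"), { recursive: true });
fs.writeFileSync(path.join(storeDir, "somehash"), "shared content");
fs.writeFileSync(path.join(storeDir, "anotherhash"), "only used by b");
fs.linkSync(path.join(storeDir, "somehash"), path.join(root, "packages", "a", "index.js"));
fs.linkSync(path.join(storeDir, "somehash"), path.join(root, "packages", "b", "index.js"));
fs.linkSync(path.join(storeDir, "anotherhash"), path.join(root, "packages", "b", "second.js"));

const prunableBefore = findPrunable(storeDir); // every entry is still linked
fs.rmSync(path.join(root, "packages", "b"), { recursive: true });
const prunableAfter = findPrunable(storeDir).map((f) => path.basename(f));
console.log(prunableBefore.length, prunableAfter);
```

The check relies on the store and all node_modules directories living on the same file system, since hardlinks cannot cross file system boundaries.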

Thank you! That seems like what we need. Great! I will work on pnpm v5 then. This will be a killer feature! I created a WIP PR at #2487

@pnpm/collaborators let me know about any objections because I think a content-addressable storage in v5 is a really good idea.

how to prune the store in this case? How could we know if some files are not used anymore?

How about creating a database (e.g. ~/.pnpm/3/usages.json) that maps package names to the places that use them? Every time a user runs pnpm install, this database would be updated.

But if alexeyten's suggestion works well then just ignore this.

My main concern was regarding pruning. Managing state (through a usage.json, for instance) would cause a performance and disk space penalty. But using the filesystem's capabilities is perfect.

I'm not sure how to handle differences in file modes in general (not only the executable bit). E.g. should files that differ only in the writable bit be hardlinked? If so, what mode should the common file have?
Also, I'm not sure whether Windows has a concept of an executable bit or relies on the file extension.

I would prefer all files within node_modules to be read-only (i.e. -r--r--r-- or -r-xr-xr-x), and maybe within the cache too (if I'm not mistaken, in UNIX, write permission on a file is not required to delete or rename that file).

Regarding hardlinking files with different permissions, it is permitted as long as you own the file.

From http://kernel.opensuse.org/cgit/kernel/commit/?id=800179c9b8a1

Hardlinks:

On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.

The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.

I meant that all hardlinked files will have the same set of permissions. So if I have file 1.txt with -rw-r--r-- mode and file 2.txt with -r--r--r-- mode, both with the same content, should they be:

  1. two different entries in store?
  2. link to one entry?

In the second case, what file mode should the store entry (and hence both files) have?

P.S. git just ignores everything except executable bit.

I don't think differing permissions should affect the "sameness" of files with identical content. Packages are logically read-only, so rw bits should not be considered.

As for how to reconcile two versions with the same content but different permissions, we could:

  • Use the one with the most or least permissive state.
  • Always let the newest one win.

If we posit that packages are logically read-only, then the least permissive state makes sense. On the other hand, I sometimes temporarily edit package source within the store to debug problems or figure out undocumented behavior, so in that case the most permissive state makes sense.

As for the executable bit changing for an identical file, I think we should link to the most recent version. The assumption has to be that this was a deliberate change by the package author.

Most recent could be quite hard to track.
E.g. I add package@2 to my deps, and then add another package that depends on package@1. In terms of installation time, package@1 is more recent than package@2.

So I would stick with git's approach, which tracks only the x bit (and symlinks, btw).

There's also another solution.

All the files with default permissions that are not executables are saved to .pnpm-store/v3/files/xx/xxxxxxxxxx

We can create a subdir for files with exec permissions at .pnpm-store/v3/files/exec/xx/xxxxxxxxxx. Other modes may have different subdirectories.
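The layout proposed above could be computed roughly like this. It is a sketch under assumptions: the helper name `storePathFor` is hypothetical, and so is the reading of `xx/xxxxxxxxxx` as the first two hash characters followed by the rest of the hash.

```typescript
import * as path from "path";

// Hypothetical helper following the proposed layout:
//   <storeRoot>/files/xx/xxxxxxxx        for non-executable files
//   <storeRoot>/files/exec/xx/xxxxxxxx   for files with any executable bit set
// Assumption: "xx" is the first two characters of the content hash.
function storePathFor(storeRoot: string, hash: string, mode: number): string {
  const isExecutable = (mode & 0o111) !== 0; // owner, group, or other x bit
  const parts = isExecutable ? ["files", "exec"] : ["files"];
  return path.join(storeRoot, ...parts, hash.slice(0, 2), hash.slice(2));
}

// Usage: the same content hash maps to two different store entries
// depending on whether the file is executable.
const exampleHash = "ab12cd34ef";
const regularPath = storePathFor(".pnpm-store/v3", exampleHash, 0o644);
const executablePath = storePathFor(".pnpm-store/v3", exampleHash, 0o755);
console.log(regularPath);
console.log(executablePath);
```

Only the executable bits influence the path here, so two files that differ only in their rw bits would still share one entry.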

That would work. What do you plan to do about files with identical content but differing rw permissions?

By default, all the files are with write permission, right? I think we can have a subfolder for the readonly files. Like .pnpm-store/v3/files/readonly/xx/xxxxxxxxxx

What I propose is that rw should not be considered at all. There is no functional difference (from a package execution perspective) between two files with differing rw bits.

But I think it is valuable to be able to edit the dependencies for debugging purposes, I also do this frequently:

I sometimes temporarily edit package source within the store to debug problems or figure out undocumented behavior, so in that case the most permissive state makes sense.

In that case, if there are two versions with identical content but differing permissions, hardlink to the most permissive version. No need to keep two versions, they are functionally equivalent.

After some more optimizations, pnpm with content-addressable store is ~16% faster on a project with 1133 packages.

I wonder if I should use atomic write for adding files to the store... That will make it a bit slower.

Without atomic write, if someone stops pnpm, while it runs, some broken files may appear in the store. But by default, pnpm always verifies the content of the files, before linking, so that might be fine.

EDIT:

seems like renaming the file after creation is not increasing the installation time, so I'll add it
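For illustration, the write-then-rename approach could look like the sketch below. The helper name `writeFileAtomic` and the temporary-name scheme are assumptions; the sketch also assumes the temporary file and the target are on the same file system, where a rename replaces the target in one step on POSIX systems.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical helper: write to a temporary name next to the target,
// then rename it into place. If the process is interrupted mid-write,
// only the temporary file is left incomplete, never the target itself.
function writeFileAtomic(target: string, content: Buffer | string): void {
  const tmp = `${target}.${process.pid}.tmp`; // assumed naming scheme
  fs.writeFileSync(tmp, content);
  fs.renameSync(tmp, target);
}

// Usage
const atomicDir = fs.mkdtempSync(path.join(os.tmpdir(), "atomic-demo-"));
const atomicTarget = path.join(atomicDir, "somehash");
writeFileAtomic(atomicTarget, "file content");
console.log(fs.readdirSync(atomicDir));
```

A leftover temporary file from an interrupted run would still need cleaning up, which this sketch does not handle.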

As long as you can guarantee integrity, then optimize any way you can.

I released the first prerelease version of pnpm v5: [email protected]

Two things stand out:

  • with lockfile is much slower than the others (resolution?)
  • update is much slower than npm

Any ideas why?

@aparajita I wrote the code that does the update. The way it works is that it replaces the requested versions of all dependencies with "*" in the package.json and then runs the install command again. For example, from "react": "^16.2.0" to "react": "*".
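As a rough illustration of the rewrite step described above, assuming a plain name-to-range map (the function name `relaxVersions` is hypothetical and this is not pnpm's actual code):

```typescript
type Dependencies = Record<string, string>;

// Hypothetical helper: replace every requested version range with "*",
// so that a subsequent install is free to pick the latest versions.
function relaxVersions(deps: Dependencies): Dependencies {
  const relaxed: Dependencies = {};
  for (const depName of Object.keys(deps)) {
    relaxed[depName] = "*";
  }
  return relaxed;
}

// Usage
const relaxedDeps = relaxVersions({ react: "^16.2.0", lodash: "~4.17.0" });
console.log(relaxedDeps);
```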

This is purely a hunch that isn't based on any specific knowledge, but could it be that npm is somehow more eager to fetch newer versions during the initial install? Then when the version is set to "*" it doesn't need to "catch up" as much as yarn and pnpm.

The update is probably slow due to pnpm's resolution algorithm, which scans everything in node_modules, when an update happens. This is not in any way related to the changes that are done for the content-addressable storage, so we should discuss this in other issues.

Created #2503 to discuss resolution further.

~I will probably not implement side-effects cache for the new store. Side-effects cache is false by default anyway, and it was only added for Glitch. But Glitch still uses pnpm v2, and I haven't heard of any plans from them about an update.~
