Inspired by @layus's NixCon talk on reducing binary cache disk usage and network traffic after mass-rebuild upgrades, I came up with a different way of achieving the same goal without switching to the intensional store model or reimplementing full binary diffs.
Many store paths in Nix are identical except for the paths they depend on. Consider the hello package: currently the only differences between 17.09 and master are in the path references, two for glibc (the dynamic linker and the RUNPATH) and two self-references:
--- /nix/store/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs-hello-2.10 (17.09)
+++ /nix/store/0k5zxamwph8pi984y2w7x6xin9rsk600-hello-2.10 (master)
├── bin
│ ├── hello
│ │ ├── readelf --wide --program-header {}
│ │ │ Program Headers:
│ │ │ INTERP 0x000270 0x0000000000400270 0x0000000000400270 0x000053 0x000053 R 0x1
│ │ │ - [Requesting program interpreter: /nix/store/xzx1bv1d7z4mgg6sg6ly0jx609qvka4x-glibc-2.25-49/lib/ld-linux-x86-64.so.2]
│ │ │ + [Requesting program interpreter: /nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49/lib/ld-linux-x86-64.so.2]
│ │ ├── readelf --wide --dynamic {}
│ │ │ - 0x000000000000001d (RUNPATH) Library runpath: [/nix/store/xzx1bv1d7z4mgg6sg6ly0jx609qvka4x-glibc-2.25-49/lib]
│ │ │ + 0x000000000000001d (RUNPATH) Library runpath: [/nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49/lib]
│ │ ├── readelf --wide --decompress --hex-dump=.rodata {}
│ │ │ @@ -33,17 +33,17 @@
│ │ │ 0x00404c40 68656c70 2f3e0a00 2f6e6978 2f73746f help/>../nix/sto
│ │ │ - 0x00404c50 72652f77 35773476 3239716c 30717771 re/w5w4v29ql0qwq
│ │ │ - 0x00404c60 68637a6b 64787339 34697832 6c683769 hczkdxs94ix2lh7i
│ │ │ - 0x00404c70 6267732d 68656c6c 6f2d322e 31302f73 bgs-hello-2.10/s
│ │ │ + 0x00404c50 72652f30 6b357a78 616d7770 68387069 re/0k5zxamwph8pi
│ │ │ + 0x00404c60 39383479 32773778 3678696e 3972736b 984y2w7x6xin9rsk
│ │ │ + 0x00404c70 3630302d 68656c6c 6f2d322e 31302f73 600-hello-2.10/s
│ │ │ 0x00404c80 68617265 2f6c6f63 616c6500 00000000 hare/locale.....
│ │ │ @@ -136,12 +136,12 @@
│ │ │ 0x004052b0 00000000 00000000 2f6e6978 2f73746f ......../nix/sto
│ │ │ - 0x004052c0 72652f77 35773476 3239716c 30717771 re/w5w4v29ql0qwq
│ │ │ - 0x004052d0 68637a6b 64787339 34697832 6c683769 hczkdxs94ix2lh7i
│ │ │ - 0x004052e0 6267732d 68656c6c 6f2d322e 31302f6c bgs-hello-2.10/l
│ │ │ + 0x004052c0 72652f30 6b357a78 616d7770 68387069 re/0k5zxamwph8pi
│ │ │ + 0x004052d0 39383479 32773778 3678696e 3972736b 984y2w7x6xin9rsk
│ │ │ + 0x004052e0 3630302d 68656c6c 6f2d322e 31302f6c 600-hello-2.10/l
│ │ │ 0x004052f0 69620000 00000000 ib......
Thus, if all references to store paths were stripped out of the NAR serialization and stored out-of-band (i.e. in the .narinfo or in a separate file referred to by the .narinfo), only one copy of the .nar would need to be stored in the binary cache. The changes to the .narinfo for those two paths would then look like:
diff -ru binary-cache-old/0k5zxamwph8pi984y2w7x6xin9rsk600.narinfo binary-cache-new/0k5zxamwph8pi984y2w7x6xin9rsk600.narinfo
--- binary-cache-old/0k5zxamwph8pi984y2w7x6xin9rsk600.narinfo 2017-10-29 23:24:42.302081607 +0200
+++ binary-cache-new/0k5zxamwph8pi984y2w7x6xin9rsk600.narinfo 2017-10-29 23:59:13.703134245 +0200
@@ -1,5 +1,6 @@
StorePath: /nix/store/0k5zxamwph8pi984y2w7x6xin9rsk600-hello-2.10
-URL: 18ii6x4rh5i2gscl2jcqz1p7hpq57jsm8i8ij7hdbyb5y45g7bxn.nar.xz
+ReflessNarURL: 13cdkdq61b5mwa66bki41yzhz78sq7mzhb8hdxpsbi60g0hyimyh.nar.xz
+RefOffsets: [["/nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49",1024],["/nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49",3412],["/nix/store/0k5zxamwph8pi984y2w7x6xin9rsk600-hello-2.10",19928],["/nix/store/0k5zxamwph8pi984y2w7x6xin9rsk600-hello-2.10",21576]]
+ReflessNarHash: 1rwrk1829j2v99n942ad29f7byqz53h7w42ja8hzm5cjc3ynzm5h
Compression: xz
FileHash: sha256:18ii6x4rh5i2gscl2jcqz1p7hpq57jsm8i8ij7hdbyb5y45g7bxn
FileSize: 43104
diff -ru binary-cache-old/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs.narinfo binary-cache-new/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs.narinfo
--- binary-cache-old/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs.narinfo 2017-10-29 23:24:42.303081584 +0200
+++ binary-cache-new/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs.narinfo 2017-10-29 23:58:51.748658719 +0200
@@ -1,5 +1,6 @@
StorePath: /nix/store/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs-hello-2.10
-URL: 11m7hyrjibxw2jwpr2ndwrp76fsnwkb453im8y54srfj7g6mfbj2.nar.xz
+ReflessNarURL: 13cdkdq61b5mwa66bki41yzhz78sq7mzhb8hdxpsbi60g0hyimyh.nar.xz
+ReflessNarHash: 1rwrk1829j2v99n942ad29f7byqz53h7w42ja8hzm5cjc3ynzm5h
+RefOffsets: [["/nix/store/xzx1bv1d7z4mgg6sg6ly0jx609qvka4x-glibc-2.25-49",1024],["/nix/store/xzx1bv1d7z4mgg6sg6ly0jx609qvka4x-glibc-2.25-49",3412],["/nix/store/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs-hello-2.10",19928],["/nix/store/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs-hello-2.10",21576]]
Compression: xz
FileHash: sha256:11m7hyrjibxw2jwpr2ndwrp76fsnwkb453im8y54srfj7g6mfbj2
FileSize: 43100
Now, for reducing the network traffic to binary caches: as long as the ReflessNarHash is also stored in the Nix database, the ref-less NARs can be computed from already-built store paths. So, for example, a client that has the 17.09 version of hello locally and wants to install the master version doesn't need to download ReflessNarURL: 13cdkdq61b5mwa66bki41yzhz78sq7mzhb8hdxpsbi60g0hyimyh.nar.xz but can just compute it from the 17.09 version.
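A minimal sketch of how a client might derive the ref-less form from a path it already has, assuming references are stripped as whole store-path strings, as in the RefOffsets example. `strip_refs` and `toy_nar` are hypothetical helpers that are not part of Nix, the toy byte strings only stand in for real NAR contents, and the offsets are assumed to be positions in the ref-less stream. The hash is printed in hex, whereas Nix would use its own base-32 encoding.

```python
import hashlib


def strip_refs(data: bytes, refs: list[bytes]) -> tuple[bytes, list[tuple[str, int]]]:
    """Remove every occurrence of each reference from data.

    Returns the ref-less bytes plus (reference, offset) pairs, where offset
    is the position in the ref-less bytes at which the reference belongs.
    """
    # Try longer references first so a reference that is a prefix of
    # another one cannot shadow it.
    ordered = sorted(refs, key=len, reverse=True)
    out = bytearray()
    offsets: list[tuple[str, int]] = []
    i = 0
    while i < len(data):
        for ref in ordered:
            if data.startswith(ref, i):
                offsets.append((ref.decode(), len(out)))
                i += len(ref)
                break
        else:
            out.append(data[i])
            i += 1
    return bytes(out), offsets


OLD_GLIBC = b"/nix/store/xzx1bv1d7z4mgg6sg6ly0jx609qvka4x-glibc-2.25-49"
NEW_GLIBC = b"/nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49"
OLD_SELF = b"/nix/store/w5w4v29ql0qwqhczkdxs94ix2lh7ibgs-hello-2.10"
NEW_SELF = b"/nix/store/0k5zxamwph8pi984y2w7x6xin9rsk600-hello-2.10"


def toy_nar(glibc: bytes, self_path: bytes) -> bytes:
    # Hypothetical stand-in for a NAR: same layout, different references.
    return (b"INTERP " + glibc + b"/lib/ld-linux-x86-64.so.2\0"
            + b"RUNPATH " + glibc + b"/lib\0"
            + self_path + b"/share/locale\0")


old_stripped, old_offsets = strip_refs(toy_nar(OLD_GLIBC, OLD_SELF), [OLD_GLIBC, OLD_SELF])
new_stripped, new_offsets = strip_refs(toy_nar(NEW_GLIBC, NEW_SELF), [NEW_GLIBC, NEW_SELF])

# Both versions collapse to the same ref-less bytes, so the locally computed
# hash can be compared against the ReflessNarHash advertised by the cache.
print(hashlib.sha256(old_stripped).hexdigest() == hashlib.sha256(new_stripped).hexdigest())
```

Only the reference strings differ between the two offset lists; the offsets themselves are identical, which is what lets both .narinfo files point at one shared ref-less NAR.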
Nice!
Please take into account that just stripping all the references does not allow you to reinsert them, because you would no longer know where to reinsert what. To do that, the out-of-band info (in the narinfo) should also contain the locations of the strings to be replaced.
This is also more powerful than content-addressed storage with respect to bandwidth usage and binary cache disk space. Might be worth trying :-)
Yes, it is there in the example: ["/nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49",1024] means 'insert the string "/nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49" at offset 1024'.
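The "insert the string at offset N" rule could be sketched roughly as follows. This assumes the RefOffsets value is parsed as JSON and that each offset is a position in the ref-less stream (the proposal does not pin down which stream the offsets refer to). `reinsert_refs` is a hypothetical helper, and the short byte string is a toy stand-in for a ref-less NAR.

```python
import json


def reinsert_refs(refless: bytes, ref_offsets: list) -> bytes:
    """Rebuild the original bytes by inserting each reference at its offset.

    Offsets are assumed to index into the ref-less bytes. sorted() is stable,
    so references sharing an offset keep the order in which they were listed.
    """
    out = bytearray()
    prev = 0
    for ref, offset in sorted(ref_offsets, key=lambda pair: pair[1]):
        out += refless[prev:offset]  # copy untouched bytes up to the hole
        out += ref.encode()          # fill the hole with the reference
        prev = offset
    out += refless[prev:]            # copy the tail after the last hole
    return bytes(out)


GLIBC = "/nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49"

# Toy ref-less payload; the two holes sit at byte offsets 7 and 26.
refless = b"INTERP /lib/ld.so\0RUNPATH /lib\0"
ref_offsets = json.loads(json.dumps([[GLIBC, 7], [GLIBC, 26]]))

restored = reinsert_refs(refless, ref_offsets)
print(restored)
```

Sorting by offset and copying slices keeps the rebuild to a single pass, so the list in the .narinfo would not need to be stored in any particular order.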
I decided to experiment and run the numbers for this method on my ARM binary cache (with a minor change to make the RefOffsets: encoding smaller). The result: 68.8 GB of .nar.xz files and 56.8 MB of .narinfo files were compressed down to 52.1 GB of .rnar.xz files and 208.7 MB of .narinfo files. In other words, a 24% reduction in disk space use. I need to double-check whether I messed up somewhere, but if not, this does sound like it would be worth it.
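As a quick sanity check of the 24% figure, the totals can be recomputed from the four sizes quoted above. This assumes decimal units (1 GB = 1000 MB); the variable names are only illustrative.

```python
MB_PER_GB = 1000  # assumption: decimal units

# Sizes in MB, taken from the measurement above.
before = 68.8 * MB_PER_GB + 56.8    # .nar.xz + .narinfo
after = 52.1 * MB_PER_GB + 208.7    # .rnar.xz + .narinfo

reduction = 1 - after / before
narinfo_growth = 208.7 / 56.8

print(f"total reduction: {reduction:.1%}")
print(f"narinfo growth: {narinfo_growth:.2f}x")
```

The .narinfo files grow by more than a factor of three, but they are so small relative to the NARs that the overall reduction still comes out at roughly 24%.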
I think RefOffsets can be avoided by replacing each reference by a canonical value, e.g. h_n = hash("reference-
But that is vulnerable to the string appearing in other parts of the NAR dump?
I think the probability of that is small enough that we can safely ignore it? Similarly to how placeholder works.