I'm sorry for the very general name of this issue; if anyone can come up with a better title for this problem, please make a suggestion :-)
I am looking for the best solution for the following problem that I run into quite a lot:
The application I am trying to package has dependencies from its language's packaging system that are not tracked in nixpkgs (npm, rubygems, maven, rust crates, ...). There is tooling to adapt the dependency definitions from the language's package management system to nix (yarn2nix, bundix, ...), but since the original dependency definitions don't contain usable hashes for all dependencies, or are missing the hashes for git dependencies, these dependency definitions need to be combined with information from the internet. There are multiple solutions that I see being used:
Which one is actually more favorable? The first one results in difficult-to-maintain packages and spams nixpkgs with large files. The second one is not strictly pure, and it kind of works around nix by using a fixed-output derivation; I think I read edolstra discouraging it.
There was some discussion on IRC, but there was no conclusion, so I raise the issue here, because I think there should be a consensus on how to handle this problem in nixpkgs.
Maybe there is an even better third approach that I don't know yet.
Approach 2 (ab)uses fixed-output derivations. Why this might be a bad idea is described here: https://github.com/NixOS/nix/issues/2270
Glossary:
Here is a brain dump of what I have learned so far:
0. Fixed output derivations
The fixed-output derivation is the level zero support. It doesn't take much effort to create and maintain. The language package manager (LPM) commands can be used directly, as in the developer documentation.
The biggest downside is that the hash is not automatically invalidated when one of the input files changes. This can create surprising situations where the lockfile is updated but the old program is still running (because it's still reading from the old hash). The hash has to be invalidated manually by changing it to something else, running nix-build and waiting for nix to tell you the right hash. Then re-run the build from scratch.
Another downside is that not all the tools have a stable on-disk output. Two developers not sharing a binary cache might get different output hashes. I've seen that happen with the cargo tools for a while for example.
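To make the stale-hash problem concrete, here is a minimal sketch of such a fixed-output derivation. The name `example-deps`, the `./deps` directory and the empty fetch step are placeholders (assumptions, not taken from any real package); the actual LPM command would go where the comment indicates.

```nix
{ pkgs ? import <nixpkgs> { } }:

pkgs.stdenv.mkDerivation {
  name = "example-deps"; # hypothetical name
  src = ./.;             # assumed to contain the upstream lockfile

  buildPhase = ''
    # Placeholder: run the LPM's download command here, as described in the
    # upstream developer documentation, writing into ./deps.
    mkdir -p deps
  '';

  installPhase = ''
    cp -r deps $out
  '';

  # These attributes turn the derivation into a fixed-output derivation,
  # which is what allows network access during the build.
  outputHashMode = "recursive";
  outputHashAlgo = "sha256";
  # Deliberately wrong hash: the first build fails and prints the real one.
  outputHash = pkgs.lib.fakeSha256;
}
```

Note that the store path of a fixed-output derivation is derived from `outputHash` and the name, not from `src`. That is why editing the lockfile alone does not trigger a re-download once a matching output already exists in the store.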
1. Fake registry
A lockfile is generated and used to download all the dependencies with nix fetchers. Then the aggregate is used to start a fake registry process that the tools can talk to.
This solves the outdated hash problem, and since the APIs are usually publicly documented they are also pretty stable. The only implementation of that idea that I know of is https://github.com/nmattia/napalm
The biggest downside is that we need to build the API for all the languages.
2. Pre-download the dependencies
This is similar to (1) but instead of providing an API, the files are placed on disk where the LPM expects to find them. There is often an offline mode that we can re-use.
In my experience the on-disk locations are not necessarily documented or stable between releases. To implement it properly, the nix developer is often forced to look at the LPM source code to find the location and duplicate the logic in nix.
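A rough sketch of the pre-download idea, assuming a hypothetical dependency `foo-1.0.0.tgz` hosted at a placeholder URL. In practice the list would be generated from the lockfile, and the directory layout would have to match whatever the specific LPM expects, which is exactly the fragile part.

```nix
{ pkgs ? import <nixpkgs> { } }:

let
  # Hypothetical entries, as a lockfile converter might generate them.
  deps = [
    {
      name = "foo-1.0.0.tgz";
      path = pkgs.fetchurl {
        url = "https://registry.example.org/foo/-/foo-1.0.0.tgz"; # placeholder URL
        sha256 = pkgs.lib.fakeSha256; # placeholder, replace with the real hash
      };
    }
  ];
in
# linkFarm builds a directory containing one symlink per { name, path } entry.
pkgs.linkFarm "offline-cache" deps
```

The application derivation would then point the LPM's offline mode at this directory. Because every tarball is a separate fetcher call with its own hash, changing the lockfile changes the expression and the stale-hash problem of (0) goes away.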
3. Build each dependency in its own derivation
This is going even further than (2) in the LPM integration. The LPM usually controls traversing the dependency tree and running each individual build, which we hijack and replace with individual derivations. So here each dependency is built in isolation in its own little sandbox, and then everything is strung together and presented to the application at the end. Even more of the LPM's heuristics are encoded into Nix. Examples of that can be found for ruby and python.
The main advantage of this approach is that it minimizes the rebuilds between two releases. The rebuilds are more incremental. And in theory the dependencies can also be shared between two or more programs.
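As a small illustration of the python case, the following uses the existing `python3.withPackages` helper from nixpkgs. The two packages chosen are arbitrary examples; each one, and each of its transitive dependencies, is built as its own derivation and then composed into one environment.

```nix
{ pkgs ? import <nixpkgs> { } }:

# Every attribute of the package set is a separate derivation, so another
# environment that picks the same packages reuses the same store paths.
pkgs.python3.withPackages (ps: [ ps.requests ps.pyyaml ])
```

Updating only one of the packages rebuilds that package and whatever depends on it, which is the incremental behaviour described above.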
The main downside is that it's a lot of nix code that is running to build a single package. The Hydra evaluator is now running on a 64GB node because evaluating nixpkgs takes a lot of memory.
And while sharing is nice, in practice ruby/node/rust projects very rarely share exactly the same dependency set.
This is a great approach for a company monorepo. Or when maintaining a package set snapshot like Stackage or the python modules.
For single packages in nixpkgs, I now believe that this is going one step too far.
> And while sharing is nice, in practice ruby/node/rust projects very rarely share exactly the same dependency set.
Derivations don't need the exact same dependency set for sharing to be useful though, as long as there are some common dependencies.
IME sharing is very common even cross-projects, especially in nodejs where some dependencies are in almost every larger project.
I quickly checked duplication across gemset.nix in nixpkgs and this was the result: https://gist.github.com/adisbladis/730b982cc6b1a8013581529639c40ce0 and did a similar check for buildGoModule in https://github.com/NixOS/nix/issues/2270#issuecomment-508768325.
:+1: on adding data to what is essentially a belief right now.
I think you have to show that the derivation outputs are the same. For example, addressable-2.6.0 is used in 16 projects but has a dependency on public_suffix >= 2.0.2, < 4.0. There are 5 different versions of public_suffix in the gist, so potentially 5 different derivation outputs for the same addressable-2.6.0 gem. Basically, leaves are shared, but each level further up the dependency tree is exponentially less likely to be.
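The point about same name but different output can be checked with a small pure expression. This is only a sketch: the `mk` helper is made up for the illustration, and the two public_suffix version numbers are placeholders, not the ones from the gist. Nothing is built; only the store paths are compared.

```nix
let
  # Hypothetical helper: a trivial derivation that records its dependencies.
  mk = name: deps: derivation {
    inherit name deps;
    system = "x86_64-linux";
    builder = "/bin/sh";
    args = [ "-c" "echo ${name} > $out" ];
  };

  # Placeholder versions, both inside the >= 2.0.2, < 4.0 range.
  publicSuffixA = mk "public_suffix-3.0.0" [ ];
  publicSuffixB = mk "public_suffix-3.1.0" [ ];

  # Same gem, same version, different resolved dependency.
  addressableA = mk "addressable-2.6.0" [ publicSuffixA ];
  addressableB = mk "addressable-2.6.0" [ publicSuffixB ];
in
{
  sameName = addressableA.name == addressableB.name;
  sameOutput = addressableA.outPath == addressableB.outPath;
}
```

Evaluating this with `nix-instantiate --eval --strict` should report the names as equal and the output paths as different, since the output path of an input-addressed derivation depends on all of its inputs.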
That being said, the sharing also happens between multiple versions of nixpkgs. Having more granular derivations also makes it possible to minimize rebuilds on package updates, and to minimize downloads for the user.
To really know we would need a big differential equation that balances build times, evaluation times and download times.
Actually I was missing the last step:
In this scenario, there is no LPM. Nix has entirely replaced the LPM tooling. Nix is building each and every object of a project in its own derivation and composing them all together. This is the ultimate incremental rebuild, and the ultimate memory and Nix evaluation hog.
An example of such an implementation: https://github.com/nmattia/snack/
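To give a feel for what one derivation per object means, here is a toy sketch for a C project. It is not how snack works; the file names `main.c` and `util.c` are hypothetical, and header dependencies are ignored for brevity.

```nix
{ pkgs ? import <nixpkgs> { } }:

let
  # One derivation per source file: touching util.c leaves main.c.o untouched.
  compile = src: pkgs.runCommandCC "${baseNameOf src}.o" { } ''
    $CC -c ${src} -o $out
  '';

  objects = map compile [ ./main.c ./util.c ]; # hypothetical source files
in
# Final derivation: link the individually built objects together.
pkgs.runCommandCC "example" { } ''
  mkdir -p $out/bin
  $CC ${toString objects} -o $out/bin/example
''
```

Even in this tiny case Nix has to evaluate and instantiate one derivation per file plus one for linking, which hints at why the approach is heavy on evaluation time and memory for real projects.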
At that point you very much wish that Nix had an intensional store to minimize rebuilds.
The issue then becomes that you have to put generated nix files in git, and these can be very large. Do you have any ideas how this could be solved? I had some ideas about using git lfs or some content addressable storage, like ipfs.
@offlinehacker This could potentially be addressed by nix flakes (& splitting up nixpkgs into subsystems).
Also, there's one other option that's a variation of the fake registry option described above:
Recording/Caching http proxy
You redirect all requests of the LPM through a local http proxy. This proxy records all requests and transforms them in a way that can later be used for replay during the installation process. The problem is that the package manager not only loads tar archives and git repos, but also makes API requests to something like npm. You need to make response transformations that are specific to each package manager, but the whole service could be generalized with plugins.
During the installation process you start the proxy again, with the configuration generated in the first step as an input.
The benefit is that you no longer require a fake registry for every package manager, but have a more generalized solution.
@adisbladis in any case, even if you split the repo, you still pollute other repos with files that are basically large text blobs, but yeah, I agree that this would still help. The problem is we can't package some things because the generated files are too large, for example take a look here: https://github.com/NixOS/nixpkgs/pull/49082
> The issue then becomes that you have to put generated nix files in git, and these can be very large. Do you have any ideas how this could be solved? I had some ideas about using git lfs or some content addressable storage, like ipfs.

The best solution that I know of is to extend the nix capabilities to allow recursive nix calls. Recursive Nix is when nix is being called from inside a derivation. It would look a little bit like this:
    stdenv.mkDerivation {
      pname = "xxx";
      version = "1.2.3";
      src = fetchFromGitHub { ... };
      buildPhase = ''
        # inner nix-build, run from inside the outer derivation
        nix-build -I nixpkgs=${pkgs.path} ./default.nix
        # or ${./inner.nix} if upstream doesn't have a default.nix or we don't want to use it
      '';
      installPhase = ''
        # expose the result of the inner build as the output of the outer one
        ln -s $(readlink ./result) $out
      '';
    }
(obviously we would extract this pattern in a new pkgs.mkRecursive function)
The nice thing here is that import-from-derivation can be allowed in the inner build. It's not going to affect nix-env -qaP. And the lockfile is sourced directly from upstream instead of having to duplicate it in nixpkgs. And if upstream has already packaged the project with nix we can also defer that to them (except the meta and passthru attributes).
So overall it would make hydra builds a bit slower because nixpkgs has to be re-evaluated on each build. For the users, the nixpkgs evaluation becomes faster because the complicated IFD happens only at build time. If we start using nix files from upstream it might make refactoring of nixpkgs a bit harder. Package dependencies are harder to follow since they are not passed to the outer default.nix.
Recursive nix was also discussed in this RFC: https://github.com/NixOS/rfcs/pull/40
It goes even further because it also pre-generates derivations.
When there's something I can try and maybe an example of how to use it, I would love to start experimenting with it to get rid of all the yarn.nix files, where possible.
However, I do not see how this could solve the issue for example with ruby tooling, where the hashes are not included in the lockfile.
The topic of how to do language package managers came up in #78810 again.
@Mic92 described another issue that was not considered here, which is time and memory required for evaluating nixpkgs (like when doing nix-review).
I am wondering: Would this issue be solved by recursive nix?
By the way: This pattern could avoid the expression size explosion: https://github.com/NixOS/nixpkgs/pull/87258/files#diff-97ddd5942a260ac035c022c0c57de234R20
_Originally posted by @Mic92 in https://github.com/NixOS/nixpkgs/pull/78810#issuecomment-625808078_
I think this discussion should not be held in the Mastodon PR, because it is not specific to mastodon or even yarn2nix. The bundler tooling and some Go tooling work the same way and have the same issues.
I think this is just moving the problem from expression size / evaluation speed to hidden impurities (see https://github.com/NixOS/nix/issues/2270).
This is a general issue and I think it might even be good to create some kind of working group of people who are interested in finding a community consensus and solving this problem for all language package managers in the long term. I would certainly be interested in it.
I marked this as stale due to inactivity.