Precompile files are currently stored based only on the UUID of the package.
So if you change your project, it is likely that you will have to recompile everything, and then again when you swap back, etc.
This will be very annoying for people trying to use multiple projects, and people will likely just use one mega-project like before.
https://github.com/JuliaLang/julia/pull/26165 also removed any possibility for users to change the precompile path, so there is no way to work around this right now.
We should be smarter about how we save precompile files to reduce the amount of recompilation needed. A very simple system is to just use one precompile directory for each project, but that might be a bit wasteful, since it is theoretically possible to share compiled files between projects.
We'll need advice and input from @vtjnash on this one.
Could you consider #28518 when fixing it?
Re: implementation, I suppose you can de-duplicate the precompile cache by using a hash tree? What I mean by that is to generate the path of the precompile file using a hash that depends on its own `git-tree-sha1` (or version?) and the hashes of all of its dependencies. What I suggested in #28518 was to make it also depend on the package options (https://github.com/JuliaLang/Juleps/issues/38).
Chiming in that for me this would be pretty useful.

Argument: I have a `dev` shared environment. What is annoying is that I have to recompile my 'clean' environment whenever I work on the development packages and then switch back to my clean environment, even though none of the packages in the default environment have been touched.

At least an optional flag for new environments not to share the precompile cache would be awesome.
As different system images may contain different versions of packages, I suppose it makes sense for the cache path to depend on (say) the path of the system image as well? I think it also helps to decouple stdlib more from Julia core.
@StefanKarpinski I don't think implementing what I suggested above https://github.com/JuliaLang/julia/issues/27418#issuecomment-417826061 is difficult. Does this conceptually work?
```julia
function cache_path_slug(env::Pkg.Types.EnvCache, uuid::Base.UUID)
    info = Pkg.Types.manifest_info(env, uuid)
    crc = 0x00000000
    if haskey(info, "deps")
        for dep_uuid in sort(Base.UUID.(values(info["deps"])))
            slug = cache_path_slug(env, dep_uuid)
            crc = Base._crc32c(slug, crc)
        end
    end
    crc = Base._crc32c(uuid, crc)
    if haskey(info, "git-tree-sha1")
        crc = Base._crc32c(info["git-tree-sha1"], crc)
    end
    # crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
    return Base.slug(crc, 5)
end

cache_path_slug(Pkg.Types.EnvCache(), Base.identify_package("Compat").uuid)
```
(By "conceptually", I mean that I'm glossing over the fact that `Base` probably shouldn't be using `Pkg`. Also, the above function as-is, without memoization, may be bad for large dependency trees.)
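A memoized variant of the sketch above could look like the following (hypothetical; it still assumes the same `Pkg.Types` internals), so that shared subtrees of a large dependency graph are hashed only once:

```julia
using Pkg

# Hypothetical memoization cache: UUID → slug. A real implementation
# would need to invalidate this when the manifest changes.
const SLUG_CACHE = Dict{Base.UUID,String}()

function cache_path_slug_memo(env::Pkg.Types.EnvCache, uuid::Base.UUID)
    get!(SLUG_CACHE, uuid) do
        info = Pkg.Types.manifest_info(env, uuid)
        crc = 0x00000000
        if haskey(info, "deps")
            for dep_uuid in sort(Base.UUID.(collect(values(info["deps"]))))
                # Recurse first so a package's slug depends on its deps' slugs.
                crc = Base._crc32c(cache_path_slug_memo(env, dep_uuid), crc)
            end
        end
        crc = Base._crc32c(uuid, crc)
        if haskey(info, "git-tree-sha1")
            crc = Base._crc32c(info["git-tree-sha1"], crc)
        end
        Base.slug(crc, 5)
    end
end
```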
Some possible flaws I noticed: `Pkg.gc`.

Related: I was benchmarking julia master vs. a branch using two different directories & builds. The two compete against one another for ownership of the compiled package files.
FWIW, we found that using a different DEPOT_PATH for each frequently-used environment is a decent (if cumbersome) work-around until there's a fix.
That's what I was doing too but recently I ran into a case where, surprisingly, that didn't work. I was rushing and didn't have time to document it, but I will see if I can remember what was involved.
tangentially adjacent or interwoven?
every time I make a change to Julia source code in ArbNumerics, Pkg insists on regenerating all the C library files, _oblivious to the fact that nothing at all has occurred which benefits therefrom_
This is also the cause of https://github.com/timholy/Revise.jl/issues/205
yowza. could we please prioritize this with a milestone?
> We should be smarter about how we save precompile files to reduce the amount of recompilation needed. A very simple system is to just use one precompile directory for each project, but that might be a bit wasteful, since it is theoretically possible to share compiled files between projects.
Under what conditions can it be guaranteed that one or more precompile files are shareable? If we can nail down the varying inputs to precompilation, it should at least be possible to put in a hack to stop truly unnecessary precompilations, at least until a better mechanism is devised.
My home dir is typically shared by many different machines (OS/processor type). I would need a different build location for `*.ji` files for each machine.
`DEPOT_PATH` defines the location `/Users/monty/.julia`, so I would need `/Users/monty/.julia-redhat`, `/Users/monty/.julia-linux`, `/Users/monty/.julia-ubuntu14`, `/Users/monty/.julia-ubuntu16`, etc.
Is there a better way?
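One workaround (a sketch, not an officially supported mechanism) is to derive a machine-specific depot path and point `JULIA_DEPOT_PATH` at it from your shell profile before Julia starts; the path itself can be computed like this:

```julia
# Sketch: derive a per-machine depot path from the OS kernel and CPU
# architecture, so each machine gets its own *.ji cache tree.
# In practice you would export JULIA_DEPOT_PATH to this value in your
# shell profile *before* launching julia.
machine_tag = lowercase(string(Sys.KERNEL, "-", Sys.ARCH))
depot = joinpath(homedir(), ".julia-" * machine_tag)
println(depot)   # machine-dependent, e.g. ~/.julia-linux-x86_64
```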
I'm replying to @jpsamaroo's comment in this discourse thread here since this discussion belongs here rather than there. Please read my comment (and the follow-up) and @jpsamaroo's comment for the full context.
Therefore, I think initially we should focus on just precompiling each project which is loaded in isolation, before any further `activate`s occur.
I think it does not handle many common cases. For example, if you have `using Revise` in `startup.jl`, then you can't capture even the first `activate` in this scheme. Also, what do you do after the first `activate`? Switch to a `--compiled-modules=no` mode (I don't know if you can toggle this flag dynamically)? Since you also need to address the chicken-and-egg problem in this approach by adding a TOML parser in `Base` or a persistent cache (or something else) to get dependencies before locating the cache path (each a hard problem on its own), and since we know that this cannot capture many use-cases, I think it makes sense to implement the fully dynamic solution ("in-memory dependency tree") from the get-go.
But I actually don't know if it is such a bad idea as a first implementation. As switching projects triggers precompilation anyway ATM, it is an improvement if `julia` automatically turns off recompilation. Also, if people care about reproducibility, maybe they use `--project`/`JULIA_PROJECT` most of the time. In that case, full dynamism may not be required for precompilation. Also, a con of the fully dynamic solution is "GC" of `*.ji` files. It'll create more precompilation files than the static solution, and it's hard to know which files are needed or not.
I'd be interested in elaboration on this "in-memory dependency tree" and how it can solve the issue of dynamic activations. I only consider my "solution" a temporary improvement for certain common cases anyway, but you're definitely right that it might make other common cases worse instead of better.
I don't see why we can't just have one compile cache directory per exact stack of environments,
at least as a short-term solution.
I feel like this would generally lead to fewer than 3 compile caches per environment.
And sure, it might duplicate a bit of compile time, but it would be less than we have now.
And sure, it would use more hard-drive space, but hard-drive space is cheap.
Cheaper than my time that I spend waiting for compilation when I switch environments.
Probably would want some `gc compilecache all` to clear all compile caches, and maybe `gc compilecache dead` to clear all compile caches that we can no longer locate all Manifest.tomls for.
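The `dead` variant could be sketched like this (entirely hypothetical: it assumes each `.ji` file has a sidecar file recording the path of the Manifest.toml that produced it, which is not something Julia writes today):

```julia
# Hypothetical sketch of `gc compilecache dead`: remove cache files whose
# originating Manifest.toml can no longer be found. Assumes a sidecar
# "<file>.ji.manifest" recording that manifest's path (not a real Julia
# convention).
function gc_dead_caches(compiled_dir::AbstractString)
    removed = String[]
    for (root, _, files) in walkdir(compiled_dir)
        for f in files
            endswith(f, ".ji") || continue
            sidecar = joinpath(root, f * ".manifest")
            isfile(sidecar) || continue
            manifest = strip(read(sidecar, String))
            if !isfile(manifest)
                # The environment that produced this cache is gone; drop it.
                rm(joinpath(root, f); force=true)
                rm(sidecar; force=true)
                push!(removed, f)
            end
        end
    end
    return removed
end
```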
@oxinabox

> I don't see why we can't just have one compile cache directory per exact stack of environments, at least as a short-term solution.

I think it's not a crazy plan, provided that there is a mechanism to switch to a mode that acts like `--compiled-modules=no` when precompilation does not work.
To illustrate what I mean by "precompilation does not work", consider the following setup:

- Default (named) project `v1.2` with packages: `A`, `[email protected]` (package `C` of version 1.0)
- `custom_project` with packages: `B`, `[email protected]` (package `C` of version 1.1)

Further assume that packages `A` and `B` both only require `C >= 1.0`. (`custom_project` gets `[email protected]`, e.g., due to the timing at which it was created.)
If you do

```
julia> using A # loads [email protected]
pkg> activate custom_project
julia> using B
```

this Julia session (hereafter Session 1) loads `[email protected]`, while if you do

```
pkg> activate custom_project
julia> using A # loads [email protected]
julia> using B
```

then this Julia session (hereafter Session 2) loads `[email protected]`. Notice that at the point of `using B`, both sessions have exactly the same environment stack. However, if you want to precompile package `B`, you need to compile it with `[email protected]` in Session 1 and `[email protected]` in Session 2.
@jpsamaroo This is what I meant by "in-memory dependency tree." The information that `[email protected]` must be used in Session 1 and that `[email protected]` must be used in Session 2 exists only in the memory of each session. This information has to be passed to the subprocess compiling package `B`. Actually, "in-memory dependency tree" is misleading and I should have called it "in-memory manifest", which includes the list of exact package versions (or maybe rather the file path to the source code directory of the given version, `~/.julia/packages/$package_name/$version_slug/`).
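In data-structure terms, such an "in-memory manifest" could be as simple as a session-global map from package identity to the exact resolved source (hypothetical names, not existing Base API; the UUID below is made up for illustration):

```julia
# Hypothetical "in-memory manifest": for each package this session has
# loaded, record the exact version/path that was resolved, so it can be
# handed to the child process that precompiles a dependent package.
struct LoadedDep
    uuid::Base.UUID
    version::VersionNumber
    path::String   # e.g. ~/.julia/packages/<name>/<version_slug>/
end

const SESSION_MANIFEST = Dict{Base.PkgId,LoadedDep}()

# Example: Session 1 would record [email protected] here; Session 2, [email protected].
c_id = Base.PkgId(Base.UUID("00000000-0000-0000-0000-000000000001"), "C")
SESSION_MANIFEST[c_id] = LoadedDep(c_id.uuid, v"1.0.0", "/fake/path/C")
```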
This is all great thinking. Unfortunately, the current issue is just so much more mundane than all that. We actually already have all of that great "in-memory dependency tree" logic and stacks of caches and more! So what's the problem, since that's clearly not working for the default user experience? Well, at the end of the precompile step, it goes and garbage-collects the old files right away. So there's nary a chance for them to survive for even a brief moment to be found later and used. If it could just stop doing that until some later explicit step (like the brand-new `Pkg.gc()` operation), life would be much happier for everyone.
Right, that's a good point. But we do still need to ensure we know how to locate the previously-generated *.ji files deterministically in a manner that is guaranteed to load the correct ones. Currently it seems this issue is avoided by blowing everything away and starting from scratch the moment any little thing changes with respect to the conditions that generated the previous *.ji files.
> We actually already have all of that great "in-memory dependency tree" logic and stacks of caches and more!
@vtjnash Do you mind letting us know where it is implemented? The closest thing I could find was `Base._concrete_dependencies`, but it only records pairs of `PkgId` and `build_id`. IIUC, the actual dependencies are still recorded in the header of the cache file (together with their `build_id`s). That's great for integrity checking, but it looks to me like there are no dependencies (the list of upstream package UUIDs _and versions_ for each package) stored in memory.
@vtjnash It would be great if you could elucidate a little more concretely what needs to change inside of base; I don't quite follow precisely what needs to change. Clearly the naming of precompile files needs to change, and I think what you're saying is that we need a way to determine which precompile files are used and which are not used so that we don't just slowly fill up a disk with stale precompile caches?
Another perspective: there are situations where having user control over which precompile file gets loaded is desirable. Imagine a user wanting to distribute a docker container with Julia GPU packages pre-installed; the Julia GPU packages need to do some setup when they see a new generation of GPU hardware attached, and so right now in the docker container we are forced to set `JULIA_DEPOT_PATH=~/.julia_for_hardware_x`, precompile for all the different configurations in a for loop (with different hardware attached each time), then ship the whole thing to the user. (This is to avoid needing to precompile every time you launch the docker container.)

It would be much preferable if there were some kind of mechanism that allowed packages to expose a user-defined function that gets called to add some salt into the hash; an extremely coarse-grained version could be an environment variable `JULIA_CODELOAD_SALT=hardware-x`, which would shift ALL precompile files by the hash of that string (thereby saving on space compared to keeping multiple depots), but I could imagine finer-grained versions as well.
Of course, the problem of how to intelligently garbage collect these files remains.
Yes, it would be nice to integrate this with package options https://github.com/JuliaLang/Juleps/issues/38
Meanwhile, you can build a patched system image with which you can add arbitrary salt via an environment variable. This works because child processes (which precompile Julia packages) inherit environment variables. More precisely, here is the code snippet that does this (used in `jlm`; a similar trick is also used in PyJulia):

```julia
Base.eval(Base, quote
    function package_slug(uuid::UUID, p::Int=5)
        crc = _crc32c(uuid)
        crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
        crc = _crc32c(get(ENV, "JLM_PRECOMPILE_KEY", ""), crc)
        return slug(crc, p)
    end
end)
```

(You can get this system image by running `JuliaManager.compile_patched_sysimage("PATH/TO/NEW/sys.so")`.)
I would very much like functionality like this for our lab computers, since it would make it possible to have multiple precompiled versions of commonly used libraries. Attached is a simplistic patch that adds the same version slug that is used in `./packages/<name>/<slug>/`.
This was already implemented in 2019.
Stefan: is there documentation that specifies exactly what parts of the system should be placed where, or at least an example configuration for such a centralized read-only setup? Where does the compiled directory go, what about packages that the user explicitly wants to override, etc.?
> This was already implemented in 2019.
Actually, no; loading.jl uses three different slugs:

- `function package_slug(uuid::UUID, p::Int=5)` → PKG-SLUG, used for `cache_file_entry`
- `function version_slug(uuid::UUID, sha1::SHA1, p::Int=5)` → VER-SLUG, based on the package UUID and directory hash, used for `explicit_manifest_uuid_path`
- `project_precompile_slug`, as defined in `function compilecache_path(pkg::PkgId)::String` → PRJ-SLUG:

```julia
crc = _crc32c(something(Base.active_project(), ""))
crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
crc = _crc32c(unsafe_string(JLOptions().julia_bin), crc)
project_precompile_slug = slug(crc, 5)
```
These parts are then used to place package source code in `packages/<name>/<VER-SLUG>/` and precompiled code in `compiled/v<MAJOR>.<MINOR>/<name>/<PKG-SLUG>_<PRJ-SLUG>.ji` [the validity of the precompiled code is checked in `_require_from_serialized`].

With this scheme the number of precompiled files is kept low, since new versions of a precompiled package will overwrite the old one. There will also be sharing of compatible precompiled code between projects using the same packages, since all precompiled code starting with `<PKG-SLUG>_` is checked before a new precompilation is done. It is not a good scheme for a shared environment, though; I would rather suggest:

- `compiled_slug`, based on the data checked in `_require_from_serialized` → CMP-SLUG
- `packages/<name>/<VER-SLUG>/` [i.e. no change]
- `compiled/v<MAJOR>.<MINOR>/<name>/<VER-SLUG>_<CMP-SLUG>.ji`
BTW: the previous `loading.jl.patch` contained some bugs, so here we go again: julia-loading.jl.patch.txt