Precompile files are currently stored based only on the UUID of the package.
So if you change your project, it is likely that you will have to recompile everything, and then again when you swap back, etc.
This will be very annoying for people trying to use multiple projects, and people will likely just use one mega-project like before.
https://github.com/JuliaLang/julia/pull/26165 also removed any possibility for users to change the precompile path, so there is no way to work around this right now.
We should be smarter about how we save precompile files to reduce the amount of recompilation needed. A very simple system is to just use one precompile directory for each project, but that might be a bit wasteful, since it is theoretically possible to share compiled files between projects.
We'll need advice and input from @vtjnash on this one.
Could you consider #28518 when fixing it?
Re: implementation, I suppose you can de-duplicate the precompile cache by using a hash tree? What I mean by that is to generate the path of the precompile file using a hash that depends on its own `git-tree-sha1` (or version?) and the hashes of all of its dependencies. What I suggested in #28518 was to make it also depend on the package options (https://github.com/JuliaLang/Juleps/issues/38).
Chiming in that for me this would be pretty useful.

Argument: I have a `dev` shared environment. What is annoying is that I have to recompile my 'clean' environment whenever I work on the development packages and then switch back to my clean environment, even though none of the packages in the default environment have been touched.

At least an optional flag for new environments not to share the precompile cache would be awesome.
As different system images may contain different versions of packages, I suppose it makes sense for the cache path to depend on (say) the path of the system image as well? I think it also helps to decouple stdlib more from Julia core.
@StefanKarpinski I don't think implementing what I suggested above https://github.com/JuliaLang/julia/issues/27418#issuecomment-417826061 is difficult. Does this conceptually work?
```julia
function cache_path_slug(env::Pkg.Types.EnvCache, uuid::Base.UUID)
    info = Pkg.Types.manifest_info(env, uuid)
    crc = 0x00000000
    if haskey(info, "deps")
        for dep_uuid in sort(Base.UUID.(values(info["deps"])))
            slug = cache_path_slug(env, dep_uuid)
            crc = Base._crc32c(slug, crc)
        end
    end
    crc = Base._crc32c(uuid, crc)
    if haskey(info, "git-tree-sha1")
        crc = Base._crc32c(info["git-tree-sha1"], crc)
    end
    # crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
    return Base.slug(crc, 5)
end

cache_path_slug(Pkg.Types.EnvCache(), Base.identify_package("Compat").uuid)
```
(By "conceptually", I mean that I'm glossing over the fact that `Base` probably shouldn't be using `Pkg`. Also, the above function as-is, without memoization, may be bad for large dependency trees.)
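A memoized variant of the sketch above could look like the following (hypothetical; it still assumes the same `Pkg.Types` internals), so that shared subtrees of a large dependency graph are hashed only once:

```julia
using Pkg

# Hypothetical memoization cache: UUID → slug. A real implementation
# would need to invalidate this when the manifest changes.
const SLUG_CACHE = Dict{Base.UUID,String}()

function cache_path_slug_memo(env::Pkg.Types.EnvCache, uuid::Base.UUID)
    get!(SLUG_CACHE, uuid) do
        info = Pkg.Types.manifest_info(env, uuid)
        crc = 0x00000000
        if haskey(info, "deps")
            for dep_uuid in sort(Base.UUID.(collect(values(info["deps"]))))
                # Recurse first so a package's slug depends on its deps' slugs.
                crc = Base._crc32c(cache_path_slug_memo(env, dep_uuid), crc)
            end
        end
        crc = Base._crc32c(uuid, crc)
        if haskey(info, "git-tree-sha1")
            crc = Base._crc32c(info["git-tree-sha1"], crc)
        end
        Base.slug(crc, 5)
    end
end
```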
Some possible flaws I noticed: `Pkg.gc`.

Related: I was benchmarking julia master vs. a branch using two different directories & builds. The two compete against one another for ownership of the compiled package files.
FWIW, we found that using a different DEPOT_PATH for each frequently-used environment is a decent (if cumbersome) work-around until there's a fix.
That's what I was doing too but recently I ran into a case where, surprisingly, that didn't work. I was rushing and didn't have time to document it, but I will see if I can remember what was involved.
tangentially adjacent or interwoven?
every time I make a change to Julia source code in ArbNumerics, Pkg insists on regenerating all the C library files, _oblivious to the fact that nothing at all has occurred which benefits therefrom_
This is also the cause of https://github.com/timholy/Revise.jl/issues/205
yowza. could we please prioritize this with a milestone?
> We should be smarter about how we save precompile files to reduce the amount of recompilation needed. A very simple system is to just use one precompile directory for each project, but that might be a bit wasteful, since it is theoretically possible to share compiled files between projects.
Under what conditions can it be guaranteed that one or more precompile files are shareable? If we can nail down the varying inputs to precompilation, it should at least be possible to put in a hack to stop truly unnecessary precompilations, at least until a better mechanism is devised.
My home dir is typically shared by many different machines (OS/processor type). I would need a different build location for `*.ji` files for each machine.
`DEPOT_PATH` defines the location `/Users/monty/.julia`, so I would need `/Users/monty/.julia-redhat`, `/Users/monty/.julia-linux`, `/Users/monty/.julia-ubuntu14`, `/Users/monty/.julia-ubuntu16`, etc.
Is there a better way?
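One workaround (a sketch, not an officially supported mechanism) is to derive a machine-specific depot path and point `JULIA_DEPOT_PATH` at it from your shell profile before Julia starts; the path itself can be computed like this:

```julia
# Sketch: derive a per-machine depot path from the OS kernel and CPU
# architecture, so each machine gets its own *.ji cache tree.
# In practice you would export JULIA_DEPOT_PATH to this value in your
# shell profile *before* launching julia.
machine_tag = lowercase(string(Sys.KERNEL, "-", Sys.ARCH))
depot = joinpath(homedir(), ".julia-" * machine_tag)
println(depot)   # machine-dependent, e.g. ~/.julia-linux-x86_64
```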
I'm replying to @jpsamaroo's comment in this discourse thread here since this discussion belongs here rather than there. Please read my comment (and the follow-up) and @jpsamaroo's comment for the full context.
Therefore, I think initially we should focus on just precompiling each project which is loaded in isolation, before any further `activate`s occur.
I think it does not handle many common cases. For example, if you have `using Revise` in `startup.jl`, then you can't capture even the first `activate` in this scheme. Also, what do you do after the first `activate`? Switch to a `--compiled-modules=no` mode (I don't know if you can toggle this flag dynamically)? Since you also need to address the chicken-and-egg problem in this approach by adding a TOML parser in `Base` or a persistent cache (or something else) to get dependencies before locating the cache path (each a hard problem on its own), and since we know that this cannot capture many use-cases, I think it makes sense to implement the fully dynamic solution ("in-memory dependency tree") from the get-go.
But I actually don't know if it is such a bad idea as a first implementation. As switching projects triggers precompilation anyway ATM, it is an improvement if `julia` automatically turns off recompilation. Also, if people care about reproducibility, maybe they use `--project`/`JULIA_PROJECT` most of the time. In that case, full dynamism may not be required for precompilation. Also, a con of the fully dynamic solution is "GC" of `*.ji` files. It'll create more precompilation files than the static solution, and it's hard to know which files are needed or not.
I'd be interested in elaboration on this "in-memory dependency tree" and how it can solve the issue of dynamic activations. I only consider my "solution" a temporary improvement for certain common cases anyway, but you're definitely right that it might make other common cases worse instead of better.
I don't see why we can't just have one compile cache directory per exact stack of environments,
at least as a short-term solution.
I feel like this would generally lead to fewer than 3 compile caches per environment.
And sure, it might duplicate a bit of compile time, but it would be less than we have now.
And sure, it would use more hard-drive space, but hard-drive space is cheap.
Cheaper than my time that I spend waiting for compilation when I switch environments.
Probably would want some `gc compilecache all` to clear all compile caches, and maybe `gc compilecache dead` to clear all compile caches that we can no longer locate all Manifest.tomls for.
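The `dead` variant could be sketched like this (entirely hypothetical: it assumes each `.ji` file has a sidecar file recording the path of the Manifest.toml that produced it, which is not something Julia writes today):

```julia
# Hypothetical sketch of `gc compilecache dead`: remove cache files whose
# originating Manifest.toml can no longer be found. Assumes a sidecar
# "<file>.ji.manifest" recording that manifest's path (not a real Julia
# convention).
function gc_dead_caches(compiled_dir::AbstractString)
    removed = String[]
    for (root, _, files) in walkdir(compiled_dir)
        for f in files
            endswith(f, ".ji") || continue
            sidecar = joinpath(root, f * ".manifest")
            isfile(sidecar) || continue
            manifest = strip(read(sidecar, String))
            if !isfile(manifest)
                # The environment that produced this cache is gone; drop it.
                rm(joinpath(root, f); force=true)
                rm(sidecar; force=true)
                push!(removed, f)
            end
        end
    end
    return removed
end
```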
@oxinabox

> I don't see why we can't just have one compile cache directory per exact stack of environments, at least as a short-term solution.

I think it's not a crazy plan, provided that there is a mechanism to switch to a mode that acts like `--compiled-modules=no` when precompilation does not work.
To illustrate what I mean by "precompilation does not work", consider the following setup:

- Default (named) project `v1.2` with packages: `A`, `[email protected]` (package `C` of version 1.0)
- `custom_project` with packages: `B`, `[email protected]` (package `C` of version 1.1)

Further assume that packages `A` and `B` both only require `C >= 1.0`. (`custom_project` gets `[email protected]`, e.g., due to the timing at which it was created.)
If you do

```
julia> using A # loads [email protected]
pkg> activate custom_project
julia> using B
```

this Julia session (hereafter Session 1) loads `[email protected]`, while if you do

```
pkg> activate custom_project
julia> using A # loads [email protected]
julia> using B
```

then this Julia session (hereafter Session 2) loads `[email protected]`. Notice that at the point of `using B`, both sessions have exactly the same environment stack. However, if you want to precompile package `B`, you need to compile it with `[email protected]` in Session 1 and `[email protected]` in Session 2.
@jpsamaroo This is what I meant by "in-memory dependency tree." The information that `[email protected]` must be used in Session 1 and that `[email protected]` must be used in Session 2 exists only in the memory of each session. This information has to be passed to the subprocess compiling package `B`. Actually, "in-memory dependency tree" is misleading and I should have called it "in-memory manifest", which includes the list of exact package versions (or maybe rather the file path to the source code directory of the given version, `~/.julia/packages/$package_name/$version_slug/`).
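In data-structure terms, such an "in-memory manifest" could be as simple as a session-global map from package identity to the exact resolved source (hypothetical names, not existing Base API; the UUID below is made up for illustration):

```julia
# Hypothetical "in-memory manifest": for each package this session has
# loaded, record the exact version/path that was resolved, so it can be
# handed to the child process that precompiles a dependent package.
struct LoadedDep
    uuid::Base.UUID
    version::VersionNumber
    path::String   # e.g. ~/.julia/packages/<name>/<version_slug>/
end

const SESSION_MANIFEST = Dict{Base.PkgId,LoadedDep}()

# Example: Session 1 would record [email protected] here; Session 2, [email protected].
c_id = Base.PkgId(Base.UUID("00000000-0000-0000-0000-000000000001"), "C")
SESSION_MANIFEST[c_id] = LoadedDep(c_id.uuid, v"1.0.0", "/fake/path/C")
```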
This is all great thinking. Unfortunately, the current issue is just so much more mundane than all that. We actually already have all of that great "in-memory dependency tree" logic and stacks of caches and more! So what's the problem, since that's clearly not working for the default user experience? Well, at the end of the precompile step, it goes and garbage-collects the old files right away. So there's nary a chance for them to survive for even a brief moment to be found later and used. If it could just stop doing that until some later explicit step (like the brand-new `Pkg.gc()` operation), life would be much happier for everyone.
Right, that's a good point. But we do still need to ensure we know how to locate the previously-generated *.ji files deterministically in a manner that is guaranteed to load the correct ones. Currently it seems this issue is avoided by blowing everything away and starting from scratch the moment any little thing changes with respect to the conditions that generated the previous *.ji files.
> We actually already have all of that great "in-memory dependency tree" logic and stacks of caches and more!
@vtjnash Do you mind letting us know where it is implemented? The closest thing I could find was `Base._concrete_dependencies`, but it only records pairs of `PkgId` and `build_id`. IIUC, the actual dependencies are still recorded in the header of the cache file (together with their `build_id`s). That's great for integrity checking, but it looks to me like there are no dependencies (the list of upstream package UUIDs _and versions_ for each package) stored in memory.
@vtjnash It would be great if you could elucidate a little more concretely what needs to change inside of base; I don't quite follow precisely what needs to change. Clearly the naming of precompile files needs to change, and I think what you're saying is that we need a way to determine which precompile files are used and which are not used so that we don't just slowly fill up a disk with stale precompile caches?
Another perspective: there are situations where having user control over which precompile file gets loaded is desirable. Imagine a user wanting to distribute a docker container with Julia GPU packages pre-installed; the Julia GPU packages need to do some setup when they see a new generation of GPU hardware attached, and so right now in the docker container we are forced to set `JULIA_DEPOT_PATH=~/.julia_for_hardware_x`, precompile for all the different configurations in a for loop (with different hardware attached each time), then ship the whole thing to the user. (This is to avoid needing to precompile every time you launch the docker container.)

It would be much preferable if there were some kind of mechanism that allowed packages to expose a user-defined function that gets called to add some salt into the hash; an extremely coarse-grained version could be an environment variable `JULIA_CODELOAD_SALT=hardware-x`, which would shift ALL precompile files by the hash of that string (thereby saving on space compared to keeping multiple depots), but I could imagine finer-grained versions as well.
Of course, the problem of how to intelligently garbage collect these files remains.
Yes, it would be nice to integrate this with package options https://github.com/JuliaLang/Juleps/issues/38
Meanwhile, you can build a patched system image with which you can add arbitrary salt via an environment variable. This works because child processes (which precompile Julia packages) inherit environment variables. More precisely, here is the code snippet that does this (used in `jlm`; a similar trick is also used in PyJulia):

```julia
Base.eval(Base, quote
    function package_slug(uuid::UUID, p::Int=5)
        crc = _crc32c(uuid)
        crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
        crc = _crc32c(get(ENV, "JLM_PRECOMPILE_KEY", ""), crc)
        return slug(crc, p)
    end
end)
```

(You can get this system image by running `JuliaManager.compile_patched_sysimage("PATH/TO/NEW/sys.so")`.)
I would very much like functionality like this for our lab computers, since it would make it possible to have multiple precompiled versions of commonly used libraries. Attached is a simplistic patch that adds the same version slug that is used in `./packages/<name>/<slug>/`.
This was already implemented in 2019.
Stefan: is there documentation that specifies exactly what parts of the system should be placed where, or at least an example configuration for such a centralized read-only setup? Where does the compiled directory go, what about packages that the user explicitly wants to override, etc.?
> This was already implemented in 2019.
Actually, no; loading.jl uses three different slugs:

- `function package_slug(uuid::UUID, p::Int=5)` → PKG-SLUG, used for `cache_file_entry`
- `function version_slug(uuid::UUID, sha1::SHA1, p::Int=5)` → VER-SLUG, based on the package UUID and directory hash, used for `explicit_manifest_uuid_path`
- `project_precompile_slug`, as defined in `function compilecache_path(pkg::PkgId)::String` → PRJ-SLUG:

```julia
crc = _crc32c(something(Base.active_project(), ""))
crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
crc = _crc32c(unsafe_string(JLOptions().julia_bin), crc)
project_precompile_slug = slug(crc, 5)
```
These parts are then used to place package source code in `packages/<name>/<VER-SLUG>/` and precompiled code in `compiled/v<MAJOR>.<MINOR>/<name>/<PKG-SLUG>_<PRJ-SLUG>.ji` [the validity of the precompiled code is checked in `_require_from_serialized`].

With this scheme the number of precompiled files is kept low, since new versions of a precompiled package will overwrite the old one. There will also be sharing of compatible precompiled code between projects using the same packages, since all precompiled code starting with `<PKG-SLUG>_` is checked before a new precompilation is done. It is not a good scheme for a shared environment, though; I would rather suggest:

- `compiled_slug`, based on the data checked in `_require_from_serialized` → CMP-SLUG
- `packages/<name>/<VER-SLUG>/` [i.e. no change]
- `compiled/v<MAJOR>.<MINOR>/<name>/<VER-SLUG>_<CMP-SLUG>.ji`
BTW: the previous `loading.jl.patch` contained some bugs, so here we go again: julia-loading.jl.patch.txt