Julia: package interactions

Created on 12 Jan 2013 · 76 Comments · Source: JuliaLang/julia

Especially in the presence of multiple dispatch, there are situations where there exists glue code that you want to load only when using both of two packages. For example, the k-nearest neighbors algorithm makes perfect sense to apply just to plain old matrices – but of course one _also_ wants it to work for data frames, data matrices and various other containers of data. Currently the only way to make this work is to have the kNN package depend on DataFrames and add the appropriate DataFrame-specific methods. This is going to get out of hand very quickly.

I can think of two solutions. One way is to write the kNN code in a more generic fashion so that it isn't coupled with the DataFrames package but uses an interface for containers of data which DataFrames happens to provide. This is generally a good idea, but I kind of suspect that it may be rather hard to make work in all cases. The other way is to provide a mechanism for loading glue code only when both kNN and DataFrames are loaded.

packages

All 76 comments

One partial way to cope with this is to establish canonical types used as interfaces between packages: this is part of the reason that we created vector and matrix in DataFrame. Then you can write

knn(a::Any, b::Any) = knn(matrix(a), matrix(b))

The trouble is that so many methods will need to have DataFrames as the canonical type if the method is robust to missing data.

I think the only thing core julia can do to help with this situation is some kind of conditional loading (your second option).

Right, but I'm thinking of a very particular kind of conditional loading: require("kNN") when DataFrames has already been loaded or require("DataFrames") when kNN has already been loaded both trigger the loading of the following two files if they exist:

  • kNN/glue/DataFrames.jl
  • DataFrames/glue/kNN.jl

This arrangement allows you to provide glue code for a package to make it work nicely with as many other packages as you want, without any of the packages depending on each other. If you happen to load both, you get the appropriate glue; if you only load one or the other, then you don't.
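A minimal sketch of the lookup rule proposed above, written for current Julia. The names `glue_candidates` and `existing_glue`, and the `root` argument that maps a package name to its directory, are hypothetical; only the `glue/<Other>.jl` layout comes from the proposal.

```julia
# Hypothetical helper: given the package that was just loaded and the
# packages loaded before it, list the glue files that should be checked.
# `root` maps a package name to its directory (an assumption of this sketch).
function glue_candidates(new::String, loaded::Vector{String}, root::Function)
    paths = String[]
    for old in loaded
        # Look in both packages so that load order does not matter.
        push!(paths, joinpath(root(new), "glue", old * ".jl"))
        push!(paths, joinpath(root(old), "glue", new * ".jl"))
    end
    return paths
end

# Only the candidate files that actually exist would be loaded.
existing_glue(new, loaded, root) =
    filter(isfile, glue_candidates(new, loaded, root))
```

Calling `glue_candidates("kNN", ["DataFrames"], identity)` and `glue_candidates("DataFrames", ["kNN"], identity)` yields the same two paths, which is the order-independence the proposal is after.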

This seems like it will put a big burden on DataFrames, no?

It should be more like an optional dependency, so only one of those glue directories is needed.

Neither glue directory is required – they're only loaded _if they exist_. The main reason to look for both of them is so that the order in which requires occur doesn't affect what gets loaded. Afaict, "optional dependency" is an oxymoron.

Typically for a foundational package like DataFrames, the other packages will provide the glue.

What's wrong with a separate glue package?

Though see my last comment in #1809. If you need to _override_ (not just extend) the behavior of another module to achieve what you want, I guess

evalfile(fname::String, mod::Module) = eval(mod, parse(readall(fname))[1])

might be useful.

The issue with a separate glue package is that we support loading a third package when kNN or DataFrames is used, but not when kNN and DataFrames are both used, which is what you want for glue packages. Glue packages could modify existing code and they could be guaranteed to be loaded _after_ both of the packages they connect.

I suspect that all this points towards making requirements declarative rather than imperative.

Let me elaborate on that. I think I've figured out what "optional dependency" means: if A is an optional dependency of B then if A and B are both required, A should be loaded before B. If we can arrange for that to happen, we don't need a special glue mechanism since B can simply check for the presence of A when it's loading and execute "glue code" conditionally. However, it seems to me that this entire notion implies that requirements must be declarative since otherwise you can't know if A is _going_ to be required if B is loaded before A.

This looks related to my current situation: https://groups.google.com/forum/?hl=es&fromgroups=#!topic/julia-users/wwxKj0QoKzM

I've been thinking about this too... When one package depends on another, its load time is penalized. For example:

julia> @elapsed require("DataFrames")
3.3809280395507812       

The Benchmark package uses DataFrames [ https://github.com/johnmyleswhite/Benchmark.jl ]:

julia> @elapsed require("Benchmark") 
3.6832501888275146       

So the load time of this small package is huge because of the load time of DataFrames.

Maybe a better option is to compile only what is used, so that you don't compile the full DataFrames package if you only use one type and two methods from it.

When Julia becomes compilable, will these things happen?
Or maybe a check at run time could help avoid this, allowing conditional dependencies to be loaded only when they are needed?

You can't use only a few things from a package/module...

julia> using DataFrames.DataFrame
invalid using statement: name exists but does not refer to a module

Diego, your focus on how fast Julia programs load hints to me that you may be doing something wrong. Why is starting Julia such a bottleneck for you? That said, the package interactions thing is clearly an issue (hence me opening this issue).

You can only use a module as a module.

I usually run scripts, many times over. I know I can't avoid this without running everything inside Julia. The problem I see is not specific to Julia; it affects a lot of languages. For example, I wrote Python scripts using Bio, for myself and to share with co-workers in my group. Doing this, I noticed the load time of the modules. Yes, it's only seconds! But I run them in pipelines 188086 times (at one second each, that is a little more than two days spent just loading packages). I'm afraid of Julia going the same way. Maybe once it becomes compilable this won't be a problem... But at the moment I don't know whether I should design a faster-to-load package or not.

If the answer is to try to make packages a little faster to load (even for a compilable Julia)... interactions between them are an obstacle to making that possible.

Loading Julia programs will be fast when we have a compiler. Until then it will continue to be slow.

Fast enough that I don't have to worry about load times and package size? Or is it still good to try to make packages smaller?

It's always good to make things smaller. Honestly though, if you're starting a program 1 BILLION TIMES, you should really consider trying to run things in a single long-running process. Starting a C program that just exits is not instantaneous either.

Julia is a general-purpose language with good support for running external programs--perhaps consider using Julia as the glue for your pipelines?

I read about that, but I haven't had a chance to try it yet. I'm used to using bash. It sounds like a good idea; I'm going to try it ;)

Getting back to the point of this issue (excuse me for the noise)

I think it would be great to be able to define a method for a DataFrame without importing the package, for example. Declaring methods for types without importing all of their packages would be useful.

Are you proposing lazy loading of dependencies?

I don't know if lazy loading is the right expression... And I don't know if Stefan is saying the same thing with declarative instead of imperative.

I'm saying that if you are going to use k-means on a matrix, you don't need to load DataFrames.
But if you load DataFrames, you can use k-means on DataFrames if the method is defined.

Maybe it's more like the ability to define methods for types you haven't loaded, so that Julia can use them once those types are loaded.

Maybe lazy loading (load only when you need it) can be a good option too.

+1 for allowing loading glue when a specific set (pair) of packages is installed

It seems to me that this "glue plan" calls for introducing a "CONFLICTS" file for packages alongside "REQUIRES": suppose packages A and B have some glue code stored within A, then B gets updated in an incompatible way, and the glue code in A doesn't work with the new version of B. Since the two packages do not explicitly depend on each other, the packaging system would have no way to know this, unless told explicitly somehow. Maybe there are better ways to deal with this situation than introducing conflicts, but this is the easiest I can think of.

BTW I'd also like to have the "glue code" feature.

There is now a request to add a JSON serialiser for DataFrames. aviks/JSON.jl#10

However, given the relative sizes, and use-cases, of the two packages, I am loath to add a dependency on DataFrames to JSON. Any thoughts on a way out at present?

I don't see any reason not to put the dependency inside of DataFrames since we'll eventually want to have both readjson and writejson functions for serializing DataFrames.

That works of course, but wouldn't it be nice for all conversions to json be available via JSON.print(...) dispatching on the parameter?

EDIT: Scratch that, what was I thinking. For writing, JSON exports a print_json method. DataFrames can import it. For reading, it has to be a separate function.

You don't like the idea of readjson and writejson as function names? Those more explicit names would make it much easier for Package A and B to supplement those methods with new definitions that dispatch on the types defined in A and B. Reusing print seems slightly odd to me, because it's not just the case that you're doing multiple dispatch on the type of thing being printed: you've also added a new version of print that does a different type of printing than the standard print function. In the current setup, print(A) and JSON.print(A) do different things. For me, the virtue of multiple dispatch is that print(A) and print(B) do different things. When I want different actions on A, I'd prefer to call them print1(A) and print2(A).

@johnmyleswhite I was talking out of my hat. See my edit to the comment above.

The function in JSON is now called print_json. Happy to change that to writejson. No, I don't want to overload print.

Reading is a different beast, since the difference is in what the method returns, rather than what its parameters are.

Ah, I forgot that multiple dispatch is basically worthless for reading in formats. I think we can just define a print_json method in DataFrames then and leave JSON tightly focused.

I do like the idea of changing the names to something more in line with the newer Julia names, but let's open another issue for that to get a community discussion.

bump

What's the latest on "glue code" in the new Pkg?

There is no support, but I'm kind of thinking we need it.

Agreed.

Only partially related, but in my lab's repository I'm getting some value out of the following function:

# Load a specific feature from a package
function use(pkgname::String, filename::String)
    pkgloader = Base.find_in_path(pkgname)
    if !isfile(pkgloader)
        error("Package ", pkgname, " not found")
    end
    path, _ = splitdir(pkgloader)
    include(joinpath(path, filename))
end

That way you can cherry-pick particular files from packages, without loading the whole dang thing. One downside: this requires that package authors make their files more "atomic," most obviously getting import Base: ... set up correctly in each file rather than once for the whole package. I've not yet asked anyone else to do that; the main way I have used this was to load Grid's Counter type without loading the rest of Grid.

Also of potential interest (but again only partially relevant): Images has implemented its own version of lazy code-loading, to load file format-specific code only when a user reads or writes files of a particular type (e.g., the NRRD code gets imported only if the user reads or writes an NRRD image, etc).

I've also implemented some very simple logic in RandomMatrices to enable extra functionality if GSL is installed (the extra special functions are needed for some computations). It tries to run a GSL function at load time, catches any errors that arise (eg if libgsl is not installed), and sets a module-level variable if it is found.
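A rough sketch of that load-time probing pattern, written for current Julia. This is not the RandomMatrices code: the module name, `HAS_EXTRA`, `detect!`, and `special_value` are all hypothetical, and `probe` stands in for a call into the optional library.

```julia
module OptionalFeatureSketch

# Module-level flag recording whether the optional feature is usable.
const HAS_EXTRA = Ref(false)

# Run `probe` (a stand-in for calling into the optional library) and
# remember whether it succeeded. In a real package this would typically
# be called from the module's `__init__` function.
function detect!(probe::Function)
    HAS_EXTRA[] = try
        probe()
        true
    catch
        false
    end
    return HAS_EXTRA[]
end

# Functionality that needs the optional library checks the flag first.
function special_value(x)
    HAS_EXTRA[] || error("optional library not available")
    return 2x
end

end # module
```

The flag is a `Ref` rather than a plain global so that it can be set at load time while remaining a `const` binding.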

Taking a second look at what I've done, maybe the right thing to do is a load-time macro that selectively loads dummy methods in place of the actual methods when an optional dependency is absent.

Thanks for the mention on lazy loading Tim; I've actually been trying to find the optimal way to do this with Datetime and timezone data (only loading the data when a particular timezone is needed). I'll explore Images a little more for ideas.

@mlubin, this is the closest yet to the actual issue at hand.

Having a glue folder seems like a good solution, but just for the sake of throwing another idea out there, how about some kind of load hook?

Pkg.on_load("DataFrames") do
  # overload DataFrames methods
end

Then Pkg can trigger these, like event listeners, either at package load time or when "DataFrames" is loaded as appropriate.

Also, it might be useful to support glue that depends on multiple packages (e.g. glue/DataFrames+Gadfly.jl), and is only loaded once both are available.

That's a good idea. For this kind of thing, you really want as much power and flexibility as possible. If you can register to listen to package load events and see what has already been loaded, then I think you can do anything that one might possibly want to do.

+1 for something like this.

One other idea would be a merge command, that could combine commands in different modules with the same name into one. No idea how this would work though.

For the sake of another example, here's an approach I used:
https://github.com/JuliaStats/KernelDensity.jl/blob/master/src/KernelDensity.jl#L15
One problem with this is scoping: is there a way to control the module in which the glue code is run?

I like @one-more-minute's @require macro https://github.com/one-more-minute/Jewel.jl/blob/master/src/lazymod.jl#L1-L28

It listens for package loads (by overriding Base.require) and calls a registered handler function. A simple solution to this common problem.

Might also be helpful to see it in action. Personally I think there's a lot to be said for (a) not having N submodules/files/whatever and (b) keeping related code together in this way.

@one-more-minute I presume the block of code inside @require runs when that module is loaded? I like this.

Yup, that's it – or it will run immediately if the required module is already loaded.

The plan for glue modules:

  1. A module that glues A and B together is some module, say, AB which depends on A and B; it is not inside of module A or module B, it is an external module that depends on them.

  2. We need some mechanism for registering (with Julia) that whenever A and B are both loaded, the code for the glue module AB will also be loaded.

  3. Some way of making it possible for the code for AB to live inside of the repo of A or the repo of B; this will be addressed as part of Pkg3.

  4. The ability to express in package registries that if an environment includes A and B for certain versions of these packages, then it also depends on AB with compatibility constraints between all the versions of the three packages. Normal package resolution will then guarantee that when A and B are both present in an environment, AB will be as well.

I'm putting this issue on the 1.0 milestone instead of https://github.com/JuliaLang/julia/issues/6195, which is a particular implementation we're not doing, and https://github.com/JuliaLang/julia/pull/21743 which is another implementation we're not doing.
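To make item 2 of the plan concrete, here is a toy sketch in plain Julia of a registry that runs glue once every package in a set has been loaded, regardless of load order. Everything here (`GLUE`, `LOADED`, `RAN`, `register_glue`, `mark_loaded!`) is hypothetical and only models the triggering rule; the plan above leaves the actual mechanism to Julia and the package manager.

```julia
# Hypothetical registry: a set of package names maps to the glue to run
# once every package in that set has been loaded.
const GLUE = Dict{Set{String}, Function}()
const LOADED = Set{String}()
const RAN = Set{Set{String}}()

register_glue(f::Function, pkgs::String...) = (GLUE[Set(pkgs)] = f)

# Stand-in for "a package finished loading".
function mark_loaded!(pkg::String)
    push!(LOADED, pkg)
    for (needed, glue) in GLUE
        # Run each glue exactly once, as soon as all its packages are loaded.
        if issubset(needed, LOADED) && !(needed in RAN)
            push!(RAN, needed)
            glue()
        end
    end
    return nothing
end
```

With `register_glue("A", "B") do ... end`, the block runs after the second of `mark_loaded!("A")` and `mark_loaded!("B")`, whichever comes last, and never again afterwards.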

Just to point out – this will be a great replacement for Requires.jl, which is good at connecting two interfaces together. It won't help us with "backend-like" conditional dependencies of the kind that Plots has or Flux used to have, or that are coming up in GPU work, where package B provides an implementation but not an interface.

Of course, that might only be a trivial extension of this system, e.g. change the rule to "if A is loaded and B is installed / available".

It kind of does help even with that, e.g.:

using MLFramework, CuArrays
# when both are present `MLFramework_CuArrays` glue package is loaded

The glue package can do things like set the backend for MLFramework. However, it's a little weird that CuArrays might never be used but its mere loading affects the behavior of MLFramework. Wouldn't it be better for them both being loaded to introduce code that allows CuArrays to be selected as a backend for MLFramework, but still require MLFramework.backend(CuArrays) or something like that to actually activate the backend?

"if A is loaded and B is installed / available"

This is a bit vague. As I remarked on the triage call, "having a package decide what to do based on some nebulous, ill-defined condition of your hard drive seems like a very bad idea". @Keno suggested that "installed / available" could be taken to mean "in your project / manifest", which would be a somewhat reasonable possibility, but I think I'd prefer some kind of global configuration mechanism (which could be included in the project file).

Sure, that case isn't an issue because B is providing an interface (the CuArray type itself). I'm thinking more of cases that currently look like:

Flux.convert_to_mxnet(model)
Plots.use_gadfly_backend()

In these cases it's less obvious how to frame this as tying a function to a type. You could just dispatch on a reasonable type from the backend (e.g. backend(Gadfly.Plot), like your suggestion above), but this requires some coordination with backends to make it consistent. In Flux's case I could have done:

Flux.convert_model(TensorFlow.Tensor, model)
Flux.convert_model(MXNet.SymbolicNode, model) # Or MXNet.NDArray?

But aside from being unpredictable, this unnecessarily exposes an implementation detail of the backend that users wouldn't see otherwise. Again you have to coordinate and define an interface to fix that, which is what this was supposed to avoid in the first place.

This is nowhere near as common or important as the Requires.jl use case, I'm just raising the flag for reference.

I kind of favor a mixed approach. For example, in R dtplyr is the glue package for dplyr and data.table. It makes sense for the glue package to operate at this level since dplyr provides most of the data manipulation / wrangling for dataframes and data.table provides the dataframe struct. However, I would much rather have a StatsBase-like solution: one package provides a unified API / framework and each package can implement the necessary methods for its structs. For example, if I provide a child of StatsBase.RegressionModel with all applicable methods in the API, another package should be able to interact with any of my structs without any issues by operating at the abstract level. I fear that providing a glue solution would incentivize writing code at the struct level when it would be optimal to encourage methods at the abstract level.

Is there some visible development on this?

I believe at this point this can and needs to be added as a feature in 1.x.

so the 30-sec wait until display of first plot in Gadfly waits until then?

I don't believe that delay is related to conditional dependencies? Am I missing something?

Yes, Gadfly doesn't even have conditional dependencies. It doesn't use Requires.jl and it doesn't @eval using statements (which is Plots.jl's issue). Gadfly's first time to plot is completely orthogonal to this and just due to precompilation not capturing most of what users "think it would/should".

@ChrisRackauckas Just for the record, Gadfly renders via Compose and Compose has some infrastructure like https://github.com/GiovineItalia/Compose.jl/blob/master/src/Compose.jl#L30-L47, so you're technically right that Gadfly doesn't have conditional dependencies, but it depends on it ...

Interesting. I didn't know Compose did that. Then it is the same problem as Plots.jl. Why does it have to be lazy though? It's the lazy loading part that makes it difficult.

Maybe we see (compared to Plots.jl) some convergence here: similar problems bounded by the same constraints lead to similar solutions. Compose actually manages two backends, a homegrown SVG and a link to Cairo for other formats.

The idea for fixing Plots.jl backends is much simpler than having some kind of Base hook for conditional deps. Instead, it's to pull in the backends with using and work off of that syntax, i.e. using PlotsGR instead of gr() doing that kind of stuff, and then having it add new dispatches to core functions using some abstract type. I think that's a sane thing to do and it fixes the precompilation problem. It just requires a re-write of the backend code to do it.

"It just requires a re-write of the backend code to do it."

What do you mean by just? Isn't that just giving up on modularization (which is (imho) re-using code without changing it)?

No. It's putting the backend code into a separate package like PlotsGR and having that implement a documented function interface by implementing dispatch on a concrete subtype of some abstract backend dispatching type. It's more modular and allows more code re-use, at the cost of having to have the backend code in a separate repo. But if Pkg3 can handle separate submodules in the same package well (with precompilation), then it can be one repo.

Sorry, I'm lost. I thought that the backend code was already in a separate package (i.e. GR.jl). And in your example, isn't PlotsGR the abovementioned glue package? And when is the decision taken to execute/precompile PlotsGR?

When the user calls using PlotsGR. That would be how a backend is chosen, then the package's init call could set a global in Plots to make the backend choice reflected in the latest using. Then each plot call can have an optional argument passing through this global that says what the current backend is in terms of a type, and then core functions can be overloaded for specific backends by new dispatches in PlotsGR. So the decision to execute PlotsGR code is done when the user calls using PlotsGR, and the code to precompile are the new dispatches.
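A small sketch of the design described in this comment, using two toy modules in one file in place of real packages. The names `PlotsSketch`, `PlotsGRSketch`, `AbstractBackend`, `GRBackend`, `render`, and `set_backend!` are hypothetical and are not the Plots.jl API.

```julia
module PlotsSketch
    # Abstract type that every backend package subtypes.
    abstract type AbstractBackend end

    # Global holding the most recently selected backend.
    const CURRENT = Ref{Union{Nothing, AbstractBackend}}(nothing)
    set_backend!(b::AbstractBackend) = (CURRENT[] = b; b)

    # Core function that backend packages extend with new methods.
    function render end

    # Each plot call passes the current backend along as a value to dispatch on.
    plot(data; backend = CURRENT[]) = render(backend, data)
end

module PlotsGRSketch
    import ..PlotsSketch

    struct GRBackend <: PlotsSketch.AbstractBackend end

    # New dispatch on the concrete backend type.
    PlotsSketch.render(::GRBackend, data) = "GR: $(length(data)) points"

    # Loading this module makes it the current backend.
    __init__() = PlotsSketch.set_backend!(GRBackend())
end
```

The backend can also be passed explicitly, e.g. `PlotsSketch.plot([1, 2, 3]; backend = PlotsGRSketch.GRBackend())`, which avoids relying on the global at all.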

Now that we have Base.package_callbacks as a "blessed" interface, in my opinion https://github.com/MikeInnes/Requires.jl/pull/46 seems like a non-objectionable solution to the other half of this problem. If that gets merged then perhaps we can close this.

@timholy That sounds great. Could you explain a bit what that PR does? Does it fix all issues with the approach currently adopted by Requires?

At one time Requires did a lot of "sneaky stuff" (i.e., overwrite methods in Core and Base), but over time it has worked more harmoniously with base julia; in particular, the addition of Base.package_callbacks in 0.6 gave us an official interface for calling a function whenever a new package has been loaded. The interface for a package callback is f(id::Base.PkgId), thus passing information to the callback about which package just got loaded. The entire list of callbacks gets called every time you load a new package. You'll note a comment that the interface was marked as experimental, but it hasn't changed during the entire 0.7 cycle (lots of stuff about loading has changed, but not the package_callbacks interface) and since we're about to release I think we can consider it safe. At least Revise.jl uses the same interface, so Requires is not the only consumer.

All this is provided by Base; now, on to Requires. First, let me describe the state of Requires master, which is largely the work of @MikeInnes. Requires defines a single callback function and pushes it to Base.package_callbacks. Requires also maintains a Dict of thunks for use by its callback function; the Dict is indexed by PkgId, which does not require that the module itself exists (yet); the values stored in the Dict are just lists of functions (thunks) to call conditional on the loading of that package.
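As an illustration only, the bookkeeping described above can be modelled in a few lines. This is not the Requires source: `LISTENERS`, `listenpkg_sketch`, and `loadpkg_sketch` are hypothetical names, and dropping the thunks after they run is an assumption of this sketch. `Base.PkgId` and `Base.UUID` are the real Base types.

```julia
# Thunks to run once a given package is loaded, keyed by package identity.
# A PkgId can be constructed before the package's module exists.
const LISTENERS = Dict{Base.PkgId, Vector{Function}}()

# Register a thunk for a package (works with do-block syntax).
listenpkg_sketch(f::Function, pkg::Base.PkgId) =
    push!(get!(LISTENERS, pkg, Function[]), f)

# A single callback with the f(id::Base.PkgId) shape described above.
function loadpkg_sketch(pkg::Base.PkgId)
    for f in pop!(LISTENERS, pkg, Function[])
        f()
    end
    return nothing
end

# Registration with Julia itself would then be (left commented out here):
# push!(Base.package_callbacks, loadpkg_sketch)
```

Because the callback list is invoked for every package load, the callback has to look up the `PkgId` it is handed and do nothing for packages nobody is listening for.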

This Dict gets populated through @require calls. @require is a bit complicated in master, so now let me turn to my PR. In my PR, @require just does this (I've edited this heavily so it looks more like regular code):

julia> macroexpand(Main, quote
           @require JSON="682c06a0-de6a-54ab-a142-c8b1cf79cde6" include("morecode.jl")
       end)

        if !Requires.isprecompiling()
            Requires.listenpkg(Base.PkgId(Base.UUID("682c06a0-de6a-54ab-a142-c8b1cf79cde6"), "JSON")) do 
                Requires.withpath(@__DIR__) do 
                    Requires.err(@__MODULE__, "JSON") do 
                        const JSON = Base.require(JSON [682c06a0-de6a-54ab-a142-c8b1cf79cde6])
                        include("morecode.jl")
                    end
                end
            end
        end

All that basically does is register the following:

const JSON = Base.require(JSON [682c06a0-de6a-54ab-a142-c8b1cf79cde6])
include("morecode.jl")

to be executed (whenever JSON gets loaded) inside whichever module you used @require in (that's what the @__MODULE__ is about). The last touch is setting the path (via @__DIR__) for finding "morecode.jl". In my PR, this @require statement must occur inside the module's __init__ function, which means that we register this JSON-dependency at the time of module initialization.

In the master branch of Requires, @require does a little bit more stuff because it supports having a @require statement outside __init__; it basically stores all the @require calls in a module-global array __inits__ and then creates an __init__ function that iterates through the list and registers them. IMO this is a bad idea because it is exclusive with having a user-written __init__ function. (You can iterate over __inits__ yourself, but this appears to be undocumented, and the lack of error output about why it failed is problematic.) So I would describe this as the one dicey remaining feature in Requires, which is why I stripped it out. So on that branch Requires plays well with precompilation, custom initialization, and all the other fancy things we now know we need.

If that gets merged, I think it's fair to say that Requires is a clean and straightforward solution(*) to the problem of executing code that is dependent upon other modules having been loaded. That may not be the full list of ways we want to support interaction among packages, but it's the big one, and the one for which there aren't as good alternatives. Again, most of this progress has been from the work of @MikeInnes and those who designed the Base.package_callbacks interface; all I did was give this a nudge to fix a couple of bugs and strip out the last bit of problematic behavior.

Unfortunately, I don't think it's a deprecatable change, so it's pretty heavily breaking.


(*) some might object to monkeying with task_local_storage to set the path, but by my reading (I could be wrong) it's safe.

That sounds great, @timholy! In the future it might be good to make this entire business a little nicer to use and more official, but for now it sounds like we have everything we need. I support making the breaking change now so that Requires becomes a "clean and straightforward solution".

@timholy You mention above that this is a solution to one half of the problem; what would be the other?

@lobingera, meaning Base.package_callbacks makes it possible to do this correctly (it's "the backend") and Requires is what implements the specific logic ("the frontend").

Tim's PR is merged and tagged; I concur that it's a strong and stable solution for package negotiations.
