Pkg.jl 🚀 - Pkg + BinaryProvider

Okay, let's get started on the first bullet point of this list; defining a BinaryArtifact type within Pkg. We need to create a new datatype within Pkg that represents not a Julia package, but a BinaryArtifact, which is distinct in the following ways:

BinaryArtifacts are chosen not only by version, but also by runtime-reflected properties (CPU architecture, OS, libgfortran version, etc....)
Allow packages to list BinaryArtifacts as something they require, complete with version bounds.
Provide an interface for BinaryArtifacts to either "export code" or "bundle metadata". Things like "LibFoo.jll exports the abspath location of libfoo.so", or a wrapper function that sets environment variables before invoking Git.jll's bundled git.exe.

staticfloat on 14 Jan 2019

👍2

I guess we can create an AbstractDependency type with PackageSpec and BinaryArtifact as subtypes? Then we replace most current occurrences of PackageSpec with AbstractDependency.

00vareladavid on 14 Jan 2019

Is the idea to download a BinaryArtifact and then key into it with runtime information to determine what tarballs should be downloaded? Or is a BinaryArtifact the tarball itself?

00vareladavid on 14 Jan 2019

How about just calling it Dependency since we're not going to have Dependency <: AbstractDependency, we're going to have PackageSpec, BinaryArtifact <: Dependency.

StefanKarpinski on 14 Jan 2019

👍2

Ok, and theses types of nodes will be mostly indistinguishable until we hit what is currently build_versions. At which point, we key into them with runtime information(i.e. choose_download) to determine the exact tarball which needs to be set up. Is that roughly the plan?

00vareladavid on 14 Jan 2019

Sounds reasonable to me; I'd be happy to discuss this further and nail down more of an implementation plan during the Pkg call tomorrow?

staticfloat on 14 Jan 2019

👍5

Version constraints are against the version of the library, not the version of the thing that builds the library. But you want to be able to lock down a specific build of a library. But a specific build is completely platform-specific. There are some layers of versioning:

Artifact identity. The exact identity of the binary artifact that was used in a configuration. We want to record this or be able to reconstruct it somehow, but it's too specific.
Build script version. The version of the build scripts that produces that binary artifact. This will typically support multiple different platforms. This is probably what should be in the manifest.
Library version. The version of the library that the build script is building. This is what compatibility constraints should work with.

Is this correct and complete? The artifact identity should be completely determined by some "system properties" tuple that captures all the things that determine which artifact generated by a build script one needs. The end user mostly only needs to care about the library version, which is what determines its API and therefore usage. There might, however, be situations where one needs compatibility constraints on both the library version and the build script version: e.g. an older build was configured in some way that makes the resulting artifact unusable in certain ways.

StefanKarpinski on 15 Jan 2019

Does a given version of a build script always produce just a single version of a given library?

StefanKarpinski on 15 Jan 2019

How would this work with packages that use BinaryProvider but fall back to compiling from source if a working binary is not available (typically for less-popular Linux distros)? e.g. ZMQ or Blosc IIRC. You need some kind of optional-dependency support, it seems, or support for a source “platform”.

stevengj on 22 Jan 2019

For building from source, we will support it manually by allowing users to dev a jll package, then they just need to copy their .so files into that directory. This is analogous to allowing users to modify their .jl files within a dev'ed Julia package.

I do not think we should ever build from source automatically. Looking at ZMQ, it looks like you have full platform coverage; under what circumstances are you compiling?

staticfloat on 22 Jan 2019

👍1

Another example to add to Steven's list is SpecialFunctions, which falls back to BinDeps when a binary isn't available from BinaryProvider. Once upon a time that was used on FreeBSD, before we had FreeBSD support in BinaryProvider, but now I don't know when it's used aside from on demand on CI.

ararslan on 22 Jan 2019

Looking at ZMQ, it looks like you have full platform coverage; under what circumstances are you compiling?

We needed it on CentOS, for example (JuliaInterop/ZMQ.jl#176), because of JuliaPackaging/BinaryBuilder.jl#230.

There are an awful lot of Unix flavors out there, and it's nice to have a compilation fallback.

stevengj on 24 Jan 2019

Regardless of the many UNIX variations, the only things you really need are the right executable format and the right libc, which we can pretty much cover at this point.

StefanKarpinski on 24 Jan 2019

And the right libstdc++, which is apparently harder to cover.

(This was why I had to enable source builds for ZMQ and Blosc. Are we confident that this is fixed, or are we happy to go back to breaking installs for any package that calls a C++ library?)

stevengj on 24 Jan 2019

I think our libstdc++ problems should be largely solved now that https://github.com/JuliaPackaging/BinaryBuilder.jl/issues/253 has been merged. We now build with GCC 4.8.5 by default, using a libstdc++ version of 3.4.18, so we are guaranteed to work with anything at least newer than that. I'm not entirely sure it's possible to build Julia with GCC earlier than 4.8 at the moment, (the Julia README still says GCC 4.7+, but I'm pretty sure LLVM requires GCC 4.8+) so this seems like a pretty safe bet to me. I would be eager to hear how users are running Julia with a version of libstdc++ older than 3.4.18.

staticfloat on 24 Jan 2019

Should https://github.com/JuliaPackaging/BinaryBuilder.jl/issues/230 be closed then?

stevengj on 25 Jan 2019

Yes I think so.

staticfloat on 25 Jan 2019

I'm very supportive in managing the binary artifacts by Pkg. I'd just like to point out that the implementation of library loading should be flexible enough to include some strategy for AOT compilation and deployment (to a different computer). The app deployed to a different computer will have to load libraries from different locations and the hardcoding of paths in deps.jl makes this pretty difficult, see JuliaPackaging/BinaryProvider.jl#140. The best way would be either not have deps.jl at all or no need to store absolute path to the library.

phlavenk on 31 Jan 2019

👍1

Yes, that's the plan: you declare what you need, referring to it by platform-independent identity instead of generating it explicitly and then hardcoding its location, instead allowing Pkg to figure out the best way to get you what you need and telling you where it is.

StefanKarpinski on 4 Feb 2019

Progress! There is some code behind this post, and other things remain vaporware, with the aspiration of striking up some discussion on whether these are the aesthetics we want.

Building a builder repository results now in the tarballs (typically uploaded to a GitHub release like this one) as well as an Artifact.toml. These currently look something like this:

name = "JpegTurbo_jll"
uuid = "7e164b9a-ae9a-5a84-973f-661589e6cf70"
version = "2.0.1"

[artifacts.arm-linux-gnueabihf]
hash = "45674d19e63e562be8a794249825566f004ea194de337de615cb5cab059e9737"
url = "https://github.com/JuliaPackaging/Yggdrasil/releases/download/JpegTurbo-v2.0.1/JpegTurbo.v2.0.1.arm-linux-gnueabihf.tar.gz"

    [artifacts.arm-linux-gnueabihf.products]
    djpeg = "bin/djpeg"
    libjpeg = "lib/libjpeg.so"
    libturbojpeg = "lib/libturbojpeg.so"
    jpegtran = "bin/jpegtran"
    cjpeg = "bin/cjpeg"

[artifacts.i686-w64-mingw32]
hash = "c2911c98f9cadf3afe84224dfc509b9e483a61fd4095ace529f3ae18d2e68858"
url = "https://github.com/JuliaPackaging/Yggdrasil/releases/download/JpegTurbo-v2.0.1/JpegTurbo.v2.0.1.i686-w64-mingw32.tar.gz"

    [artifacts.i686-w64-mingw32.products]
    djpeg = "bin/djpeg.exe"
    libjpeg = "bin/libjpeg-62.dll"
    libturbojpeg = "bin/libturbojpeg.dll"
    jpegtran = "bin/jpegtran.exe"
    cjpeg = "bin/cjpeg.exe"
...

My plan is to embed this file into the Registry in the same way that Project.toml files are embedded right now. Artifacts will be analogous to Project.toml files with the following similarities/differences:
- They will contain Compat.toml, Deps.toml and Versions.toml entries, which will function exactly the same as a normal Registry entry, except that the downstream DAG of Artifacts can _only_ contain other Artifacts; an Artifact cannot depend on a general Julia package, so in that sense the dependency links are restricted somewhat.
- They will not contain Manifest.toml, Project.toml or Package.toml, only the afore-mentioned Artifact.toml. This is mostly for simplicity, I don't see why we need these, but I am aware that I may not be thinking this through completely.
Pkg is now binary platform-aware, by essentially gutting code from BinaryProvider to instead live inside of Pkg. This allows me to ask things like "what is the ABI-aware triplet of the currently-running host?" (you now get that by calling Pkg.triplet(Pkg.platform_abi_key())).
When the user expresses a dependency on one of these Artifact objects (e.g. through Pkg.add("LibFoo_jll")) it will get added to the dependency graph as usual, but when being concretized into a URL to be downloaded, an extra step of indirection is applied by reaching into the Artifact.toml's dictionary, finding dict["artifacts"][triplet(platform_abi_key())] and using the embedded entries as the url and hash to download and unpack into a directory somewhere.
After downloading and unpacking the binaries, Pkg will generate a wrapper Julia package that exposes an API to "get at" these files, so that client code (such as LibFoo.jl, the fictitious julia-code side of things) can use it in as natural a way as possible. Example generated Julia code:

# LibFoo_jll/src/LibFoo_jll.jl
# Autogenerated code, do not modify
module LibFoo_jll
using Libdl

# Chain other dependent jll packages here, as necessary
using LibBar_jll

# This is just the `artifacts` -> platform_key() -> `products` mappings embedded in `Artifact.toml` above
const libfoo = abspath(joinpath(@__DIR__, "..", "deps", "usr", "lib", "libfoo.so"))
const fooifier = abspath(joinpath(@__DIR__, "..", "deps", "usr", "bin", "fooifier"))

# This is critical, as it allows a dependency that `libfoo.so` has on `libbar.so` to be satisfied.
# It does mean that we pretty much never dlclose() things though.
handles = []
function __init__()
    # Explicitly link in library products so that we can construct a necessary dependency tree
    for lib_product in (libfoo,)
        push!(handles, Libdl.dlopen(lib_product))
    end
end
end

Example Julia package client code:

# LibFoo.jl/src/LibFoo.jl

import LibFoo_jll

function fooify(a, b)
    return ccall((:fooify, LibFoo_jll.libfoo), Cint, (Cint, Cint), a, b)
end
...

staticfloat on 8 Mar 2019

❤1 🎉1

I like it in general. I'll have to think for a bit about the structure of the artifacts file. There's a consistent compression scheme used by Deps.toml and Compat.toml; we'll want to use the same compression scheme for the artifact data in the registry which somewhat informs how you want to structure the data in the file as well.

Do you think I think we'll eventually want to teach ccall about libraries so that we can just write ccall(:libfoo, ...) and have it know to find the LibFoo shared library? That seems like the nicest interface to this possible—just declare the dependency in your project file and ccall it with the right name and everything just works.

StefanKarpinski on 8 Mar 2019

That seems like the nicest interface to this possible—just declare the dependency in your project file and ccall it with the right name and everything just works.

I am actively shying away from teaching Pkg/Base too much about dynamic libraries; it's a deep rabbit hole. In this proposal I'm even not baking in the platform-specific library searching awareness (e.g. "look for libraries in bin on windows, lib elsewhere). I want to keep Pkg as simple as possible.

On the other hand, I would like it if dlopen() was able to tell me, for instance, that trying to use libqt on a Linux system that doesn't have X11 installed already isn't going to work. It would know this because it would try to dlopen("libqt.so") and fail, and it would inspect the dependency tree and notice that libx11.so was not findable. This is all possible with not much new code written, but it does mean that we need to bring in things like ObjectFile.jl into Base, and that's a lot of code.

It would be nice if we could do things like search for packages that contain libfoo.so. That's actually one advantage to listing everything out in the Artifact.toml within the registry like that.

staticfloat on 8 Mar 2019

There's a consistent compression scheme used by Deps.toml and Compat.toml

I'm not entirely sure what you mean by this, but I will await your instruction. I have no strong opinions over the Artifact.toml organization, except for the vague feeling that I want to make it as small as possible to avoid bloating the registry and making things slow to download/install/parse/search.

staticfloat on 8 Mar 2019

After downloading and unpacking the binaries, Pkg will generate a wrapper Julia package that exposes an API to "get at" these files, so that client code (such as LibFoo.jl, the fictitious julia-code side of things) can use it in as natural a way as possible. Example generated Julia code:

const libfoo = abspath(joinpath(@__DIR__, "..", "deps", "usr", "lib", "libfoo.so"))
const fooifier = abspath(joinpath(@__DIR__, "..", "deps", "usr", "bin", "fooifier"))

This automatic wrapper generation with const assigning the absolute path is exactly the the thing that prevents AOT with deployment to a different computer. So during AOT PackageCompiler will need to modify every single artifact_wrapper_jlpackage to get rid of the baked-in absolute path.

If the code is auto-generated, why cannot this functionality be part of some function or macro-call that would open the handles and generate the const paths on-the-fly? In that case PackageCompiler could just pre-collect all the artifact to a "deployment depot" and let the dlopen reach for this "configurable" path. Or would redefine this const-path generator for the AOT build.

And is the constantness of the lib path really necessarily for efficient ccall?

phlavenk on 8 Mar 2019

👍4

Is there any idea for how to integrate non-BP artifacts/dependencies? e.g. Conda.jl, or software which requires separate installers?

Similarly, what about providing a mechanism for overriding BP choices, e.g. the infamous Arpack issue, or cluster-specific MPI implementations?

simonbyrne on 20 Mar 2019

For overriding choices like for Arpack, I think doing dev Arpack_jll, then just installing/copying/linking whatever libraries you want into ~/.julia/dev/Arpack_jll is the right solution.

staticfloat on 20 Mar 2019

I think it would be awesome if one could say something like pkg> use_system_libs Arpack_jll, and then wouldn't need to link anything manually into ~/.julia/dev/Arpack_jll but somehow it would just pick up whatever version of the binary dependency is installed on the system.

davidanthoff on 20 Mar 2019

I want to second what @phlavenk said about generating these hardcoded paths being precisely why packages that require a build step are currently non-relocatable, so let's avoid that.

StefanKarpinski on 20 Mar 2019

And is the constantness of the lib path really necessarily for efficient ccall?

According to Jameson, yes, it really is. Fortunately, we can work around this because by manually dlopen()'ing everything in __init__() we don't have to pass the absolute path in, we just need to pass in the SONAME of each library. To deal with this, I've added code in BB to ensure that the SONAME of a library (e.g. libjpeg.so.62 on Linux) is always openable (e.g. by ensuring that symlinks exist) and then always recording the SONAME as the name of the library in the Artifacts.toml, and using that as the const value that we pass into ccall. We don't need to worry about ambiguity errors here, because we will have already dlopen()'ed the correct value within __init__(), which we can do at runtime with dynamically calculated paths.

staticfloat on 21 Mar 2019

👍3

@staticfloat and I came up with a plan that we then ran by @KristofferC and everyone is on board with. It's one of those stupid simple designs that seems totally obvious and like the first thing we should have come up with, but that's how design works, so 🤷‍♂. Here goes explaining it.

The Artifacts File

The core addition is an Artifacts.toml file which lives next to the Project.toml and Manifest.toml file. When installing a project (usually a package but it would make sense for apps too) which has an artifacts file, Pkg will look through the file and install any artifacts which are relevant to the current platform. (There should probably also be a way to also install for other platforms for cases where one is using Pkg to setup pre-installed package setups in shared directories for multiple platforms.)

The format of the Artifacts.toml file is as follows:

[dataset-A]
hash.sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
url = "https://somedomain.com/path/to/dataset.csv"

[nlp-model-1]
hash.sha256 = "5dc925ffbda11f7e87f866351bf859ee7cbe8c0c7698c4201999c40085b4b980"
url = "https://server.com/nlp-model-1.onnx"

[[libfoo]]
hash.sha256 = "19e7370ab1819d45c6126d5017ba0889bd64869e1593f826c6075899fb1c0a38"
url = "https://server.com/libfoo/Linux-armv7l/libfoo-1.2.3.tar.gz"
sys.os = "Linux"
sys.arch = "armv7l"

[[libfoo]]
hash.sha256 = "95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92"
url = "https://server.com/libfoo/Windows-i686/libfoo-1.2.3.tar.gz"
sys.os = "Windows"
sys.arch = "i686"

[[libfoo]]
hash.sha256 = "b65f08c0e4d454e2ff9298c5529e512b1081d0eebf46ad6e3364574e0ca7a783"
url = "https://server.com/libfoo/macOS-x86_64/libfoo-1.2.3.tar.gz"
sys.os = "macOS"
sys.arch = "x86_64"

What this means is:

Each top-level key names an artifact
If the artifact name maps to a dictionary, the artifact is platform-independent
- The url value describes where to download the artifact from
- The hash is a dict of hash algorithms to hash values of the downloaded file
If the artifact name maps to an array of dictionaries, the artifact is platform-specific
- Each entry describes a variation of the artifact
- The sys entry determines which systems this variant applies to
- with keys like os, arch, libc, libstd++, etc.
- the first entry which matches the current system is the one that is selected
- The url value describes where to download the artifact variant from
- The hash is a dict of hash algorithms to hash values of the downloaded file

So, for example when a package with this Artifacts.toml file in its root is installed, Pkg will look at this file after installation and download three additional files into the ~/.julia/artifacts directory:

dataset-A
nlp-model-1
one of the libfoo variants based on the current OS and architecture

If there is no variant of some artifact that matches the current platform, then there is a package installation error, much as if downloading the package itself had failed. Inside of a package which has an artifacts file, one will be able to write something like artifact"dataset-A" to get a path to the downloaded dataset-A artifact. Similarly, artifact"libfoo" will provide the location of the variant of the libfoo artifact which matches the current platform.

Note that the url entries in artifacts file should be considered "advisory" not permanent: they give a location where the artifact may be found, but if it has moved, then the artifact may be found by some other means by its hashes. This is similar to how I've proposed adding advisory repo locations in manifest files in the discussion on https://github.com/JuliaLang/Pkg.jl/issues/635 (contrary to the original desire there to put the URLs in the project file, which I don't think we should do).

Usage by BinaryBuilder & BinaryProvider

BinaryBuilder will generate libfoo packages which provide the API to load and use the libfoo binary dependency. These are normal Julia packages except that they are generated rather than written by hand. They are versioned and registered like normal Julia packages and it is these versions which the package resolver reasons about. The resolver does not know or care about specific variants of artifacts—it just picks a version of libfoo from the registry and then installs it. Once a chosen version is installed, Pkg looks at the installed Artifacts.toml file inside of libfoo and will see a set of [[libfoo]] stanzas for all the variants of the artifact which are provided by this version of the libfoo package. It will install the first one that matches the current platform. The source of the libfoo package will use the artifact"libfoo" API to find the location of the library and load it. The end user is presented with a simple API where they just write using libfoo to load and use the libfoo library.

In this design the manifest file remains platform independent: it contains an entry for the libfoo package, which is platform-independent. The libfoo package is the only place that needs to concern itself with variants of the libfoo artifacts and where to find them. It also avoids putting all the platform-specific information about artifact variants into the manifest file, which would lead to a lot of bloat, especially since it would be repeated in each manifest that depends on a platform-specific artifact. Instead, this design avoids repeating that information at all—it all lives in one place, in the package which uses the artifact. We may, however, want to allow the resolver to reason about which platforms a particular BB package version supports. This could be exposed in the registry to allow the resolver to pick a version that support the desired (usually current) platform. What does not need to go into the registry, however, is the details of the platform-specific artifacts—it only needs to know which platforms a version of a package supports—once a version is chosen, it is only in the install phase that Pkg needs to know about where to get artifact variants.

StefanKarpinski on 12 Jun 2019

Other things we might want to support:

Unpacking of artifacts like tarballs and zip files.
Optional/lazy artifacts that aren't installed with the package by default but later, on demand.

StefanKarpinski on 12 Jun 2019

One thing that came up during the design discussion is: why have a separate Artifacts.toml file? Instead, one could have an [artifacts] section in the Project.toml file. There are a few reasons:

It potentially contains a lot of very verbose, messy data (lots of long hashes), so keeping it out of the main project file, which is fairly easy to read and edit is probably a good thing.
Deep nesting of TOML sections gets kind of ugly. If it was in the project file then the sections would be [artifacts.nlp-model-1] rather than just [nlp-model-1]. In a previous design iteration, the platform for artifact variants was in the section header, which made this more of an issue. In this iteration, the header is just the name of the artifact.
If feels (to me) like a different thing than what goes in the project file. But that's pretty subjective.

We could potentially support having an [artifacts] section in the project file OR a separate Artifacts.toml file. Maybe BinaryBuilder-generated packages will be the only ones that use platform-specific variants, in which case having this in the project file wouldn't be so bad since those would be machine-generated and not often looked at or modified by humans, whereas package that people actually write would tend to have platform-independent artifacts that aren't so verbose.

StefanKarpinski on 12 Jun 2019

This sounds awesome!

Couple of random thoughts:

I think this kind of design would work for many more scenarios than just BinaryBuilder stuff, right? For example, I think I could entirely get rid of the build.jl scripts in cases like this or this? If that was the case, it would be fantastic.

Where would artifacts (and extracted artifacts) be stored? Ideally not in the package folder, right? But in something like .julia/artifacts? That way if a package gets updated, but still needs the same artifact, the artifact wouldn't have to be redownloaded/extracted, right?

I like Artifacts.toml, and I wouldn't allow that stuff to also appear in Project.toml. I generally think for something like that it is better to not offer choices, it just gets confusing, and then one also needs to support all these different options in all the tools, and I just don't think it is worth the extra effort.

Could there be a "fallback" script option for an artifact that is invoked if there is no binary for the current platform? This could be optional. But one could imagine something like:

[[libfoo]]
build_script = "debs/build.jl"

If a section like that is present, and there is no binary for the current platform for libfoo, then this script runs. And then folks could still try to compile for exotic platforms in that script, or do something else.

And one question: I assume artifact acquisition would just happen during build?

davidanthoff on 12 Jun 2019

I think this kind of design would work for many more scenarios than just BinaryBuilder stuff, right? For example, I think I could entirely get rid of the build.jl scripts in cases like this or this? If that was the case, it would be fantastic.

Yes, this isn't BinaryBuilder-specific at all and would be perfect for doing that kind of thing without a build step. Our long-term goal is to get rid of deps/build.jl altogether and make packages completely immutable — install them and never change them. Of course artifacts will also be content-addressed and immutable. There's a theme here 😁

Where would artifacts (and extracted artifacts) be stored? Ideally not in the package folder, right? But in something like .julia/artifacts? That way if a package gets updated, but still needs the same artifact, the artifact wouldn't have to be redownloaded/extracted, right?

Yes, I mentioned ~/.julia/artifacts above but forgot to elaborate. It should be content-addressed and immutable, so the simplest version of this would be that an artifact with hash 95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92 would get installed at ~/.julia/artifacts/95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92. However, there a few issues with that:

It's a very long path name and some tools do not like long path names (Conda, I'm looking at you). That's why Pkg uses five-character slugs instead of the full UUID and tree hash—that's just way too long.
It doesn't include the name of the artifact, making it pretty annoying to navigate manually, which is a thing one does find oneself occasionally doing.

So the obvious solution is to use the name of the artifact with a slug derived from its content hash, so something like this:

~/.julia/artifacts/libfoo/Z94Fh

There are a few problems with that though:

Five characters may not be enough to avoid collisions of different versions of the same artifact, but it's probably good enough—you'd need to have about 30k versions of an artifact with that name installed in order to have a 50% chance of a collision. So maybe this isn't such a big problem.
Different packages may want to use an artifact with a different name. I'm not sure if we want to worry about that or not. We could add a name entry that can be different from the artifact section name and determines the part that goes before the slug.

Neither issue seems fatal, so I think that's probably what we should do.

I like Artifacts.toml, and I wouldn't allow that stuff to also appear in Project.toml. I generally think for something like that it is better to not offer choices, it just gets confusing, and then one also needs to support all these different options in all the tools, and I just don't think it is worth the extra effort.

I think you're probably right. There's one other reason for a separate file that just occurred to me—this file needs to be parsed at runtime in order to find artifacts, so we want it to be pretty simple and regular and not have to look at a lot of different options or variations. This is similar to how code loading needs to parse through the manifest file, so the scheme for finding code needs to be fast and simple for that. Similarly, artifact finding needs to be fast and simple and I think that suggests a separate file.

Could there be a "fallback" script option for an artifact that is invoked if there is no binary for the current platform?

That's certainly a possibility. I'd prefer to only do this if it turns out to be necessary, but it might.

I assume artifact acquisition would just happen during build?

I think it would happen before after installation and before build. After all, it shouldn't depend on the build at all since it's just a matter of installing some thing and putting them in the right place, and that way if there is a build step it can rely on artifacts already being present.

StefanKarpinski on 12 Jun 2019

👍2

Just to spell this out, the artifact loading process is that when Julia sees artifact"dataset-A" in the code of a package, it looks for Artifacts.toml in that package root and looks for a top-level stanza named dataset-A. Assuming that this is a table, i.e. [dataset-A], it then looks for the hash of the artifact in that table and then looks for $depot/artifacts/dataset-A/$(slug(hash)) and returns that path. For artifact"libfoo" where presumably one finds a series of [[libfoo]] stanzas instead, the process is one keeps looking through these and until one finds one where the sys.os and sys.arch and such selectors match the current system. Once a matching stanza is found, one again looks up $depot/artifacts/libfoo/$(slug(hash)).

StefanKarpinski on 12 Jun 2019

What if one wants to build a binary locally instead of using the one provided by BinaryBuilder? Would they need to edit the Artifacts.toml in order to be able to load their binaries?

giordano on 12 Jun 2019

Could there be a "fallback" script option for an artifact that is invoked if there is no binary for the current platform?

I don't like this, because it implicitly makes wherever this artifact should live mutable, and we don't like that. I think mutable state package state should be something else; perhaps some kind of "workspace" API that could define lifecycles for data that are longer than just the lifecycle of a project.

For most of these kinds of projects, what you truly want is a DSL to express data processing flows that allows for arbitrary steps (e.g. if input files A, B or C change, then regenerate D, which would then also cause E to be regenerated, etc...) similar to a Makefile. That would be best provided by a package that builds on top of a filesystem organizational system; perhaps provided by Pkg, or perhaps not and just manually set up by you. I think the need for those kinds of processes is outside the scope of Pkg.

staticfloat on 12 Jun 2019

It also occurs to me that unlike packages which are fairly self-documenting, artifacts are just blobs of data, so we may want to add an extra layer of information identifying artifacts. Maybe this:

~/.julia/artifacts/
    libfoo/
        Z94Fh/
            Artifact.toml # info about the artifact: full hash, origin URL, system info
            content/
                # actual files go here

StefanKarpinski on 12 Jun 2019

For most of these kinds of projects, what you truly want is a DSL to express data processing flows that allows for arbitrary steps (e.g. if input files A, B or C change, then regenerate D, which would then also cause E to be regenerated, etc...) similar to a Makefile.

We already have one half-baked, poorly documented make replacement in BinDeps, please let's not add another. I just want an "if binary isn't available, please run myfallback(installpath) …"

it implicitly makes wherever this artifact should live mutable

Why can't the fallback install to the same location?

stevengj on 12 Jun 2019

Why can't the fallback install to the same location?

Because we're trying to make this immutable and content-addressable. In particular, think about having platform-specific artifacts living together on a shared file system.

StefanKarpinski on 12 Jun 2019

Any kind of "run arbitrary code as a fallback to binaries not being available" is, in my mind, a step backwards. The reason I say that this introduces mutable state, is because if all we are doing is downloading and unpacking a tarball, that's a one-step process. Excluding the small possibility that something goes wrong mid-extraction (not something I have seen very often), the files are either there, or they are not. With an arbitrary code fallback, the state of the build directory very often causes problems when the build tries to be run a second time. I have to address an issue with SpecialFunctions.jl about once every ten days, where a user contacts me because something isn't working and it is always, without fail, due to a previous fallback invocation messing something up for future fallback invocations. Even worse, the number of things that can go wrong in that case are many, many times larger than the number of things that can go wrong when downloading and extracting something.

staticfloat on 12 Jun 2019

I am also not eager to support arbitrary execution fallbacks. The whole point of this BinaryBuilder endeavor is to make it so that all you ever have to do anywhere is unpack some files.

StefanKarpinski on 12 Jun 2019

I think the fallback would clearly be hardly used. Presumably almost all uses of build.jl would just disappear. But there _are_ platforms where there is no binary, and without a fallback there is currently no good story for those cases, as far as I can tell. So I have a hard time seeing "a step backwards", because I'm imagining that this would only kick in if the normal artifact procedure didn't work for a rare platform. At that point nothing works, so it is difficult to see how a fallback could make things worse.

davidanthoff on 12 Jun 2019

But there are platforms where there is no binary, and without a fallback there is currently no good story for those cases, as far as I can tell.

Specifically, what kinds of platforms are you speaking of?

At that point nothing works, so it is difficult to see how a fallback could make things worse.

In the SpecialFunctions case (not to harp on that package in particular, but just because it's the first example that comes to mind), all the complaints I get are from users who should be using the BB-built tarballs, but have somehow managed to force themselves to use the fallback. My most common piece of advice is to just delete the entire deps/usr directory, and when they try to build again, it all just works.

I think when you give package authors the ability to embed arbitrary Julia code into their build process, it is very difficult for them to avoid the temptation to use it to solve minor problems, which then transform into major problems. I don't blame them for trying to solve problems; I blame us for giving them inadequate tools.

Even after merging this, Pkg.build() will still work. If you try to install a package with an Artifact.toml that does not include your platform, there will simply be no artifact installed. You could write your own deps/build.jl to detect that situation and run whatever script you want then.

staticfloat on 12 Jun 2019

But there _are_ platforms where there is no binary, and without a fallback there is currently no good story for those cases, as far as I can tell.

The story is: add BinaryBuilder support for that platform.

At that point nothing works, so it is difficult to see how a fallback could make things worse.

Because now we have to support a complex, hardly used fallback mechanism...

Maybe we could do something like if no platform variant exists, just look for

~/.julia/artifacts/libfoo/fallback

and if that exists, use that instead of failing. I would want to leave it entirely up to the end user in such situations to figure out how to put something there that works though.

StefanKarpinski on 12 Jun 2019

Maybe we could do something like if no platform variant exists, just look for...

I think that will tie in with our answer to @giordano's question above as well:

What if one wants to build a binary locally instead of using the one provided by BinaryBuilder? Would they need to edit the Artifacts.toml in order to be able to load their binaries?

I think ideally no, what I would want is for you to do something like say pkg> dev LibFoo_jll, then go to ~/.julia/dev/LibFoo_jll/deps/usr, plop your libraries into the lib directory and call it good. But this is, of course, breaking the "warranty is void if broken" sticker; at that point you're on your own if the libraries do or don't work.

staticfloat on 12 Jun 2019

👍2

I mostly remember users from obscure large cluster systems?

Maybe we could do something like if no platform variant exists, just look for ~/.julia/artifacts/libfoo/fallback

I like that! I think that is easier to handle than deving things and then copying stuff over. Why require the extra dev step?

davidanthoff on 12 Jun 2019

I like that! I think that is easier to handle than deving things and then copying stuff over. Why require the extra dev step?

The dev step is necessary to denote to the resolver that you don't want this package to participate in things like pkg> upgrade events. You want the rest of Pkg to just ignore it and not touch it (at least for this environment; perhaps a different environment should use the BB-sourced binaries).

Additionally, we don't really want people adding/removing things from ~/.julia/packages or ~/.julia/artifacts, as those are "managed" by Pkg and should be considered read-only.

staticfloat on 12 Jun 2019

👍1

I have a great feeling about the direction the julia dependencies (immutability, central management by Pkg) are evolving. And I hold my thumbs for Tom Short's shoot on slimmed-down AOT static compilation -- JuliaLang/Julia#32273. My concern now is, if we have all the infrastructure in Pkg and have a "no-sysimg use-so-libs" static compilation, will it be only a matter of defining new "static" architecture (like "i686-so") and rerunning the BB for the static library generation? So at the end if I change my project to "static-library" architecture, the Pkg resolver will download all the "static-library" artifacts?

phlavenk on 13 Jun 2019

Key pointss from discussion on slack:

Artifact objects are only understood really by there maching _JLL packages.
Their identity

It is a strong assumption of the system that any file being distributed by the artifact system what handcrafted expressly for this purpose.
This is probably good for keeping scope constrained. (and it isn't like e.g. DataDeps, or BinDeps is going to stop working so...)
Thus there is no support for arbitary post processing.
It is either extract (which will likely only support tar.gz) or do not extract.
Do not extract is the default, but for anything created by BinaryBuilder, extract will be used.
There is no support for hash's other than SHA256.
(So can't use say MD5 hash's provided by others.)

The hash is also used for the identity of the artifact object from within the _jll package.
From outside the _jll package, then the _jll package itself is the identity of the object.
Thus when talking about version of a binary dependency (or data data versions),
one is talking about versions (and UUIDs) of _jll packages.
But when a _jll package is sorting out downloading data,
then it is saying to the server "Give me the object with this SHA256 hash".

Supporting multiple URLs for mirroring purposes may be a thing.
URLs are intended as "advisory" and are only used for unregistered packages.
During registration @StefanKarpinski wants to actually rehost the files somewhere else.
At least for the General registry.

oxinabox on 13 Jun 2019

During registration @StefanKarpinski wants to actually rehost the files somewhere else.
At least for the General registry.

I want to back them up somewhere so that even if the origin vanishes, we have a copy.

StefanKarpinski on 13 Jun 2019

👍2

I think @staticfloat's dev plan for built libraries makes the most sense. The point about wanting to keep ~/.julia/packages and ~/.julia/artifacts for Pkg-managed stuff alone is a good one.

StefanKarpinski on 13 Jun 2019

👍1

we may want a ] gc --things_with_artifacts option.

Like I avoid running ] gc in general because I like it when Pkg works on a plane.
and harddrive is cheap.
but some of these could get large, so I would want to remove things that are being cached but not used anymore.
maybe just ] gc is fine though.

Also if we have lazy ones.
The ability to delete all the lazy artifacts, even if the package is still being used
would be good.

(A oleasing things with DataDeps is the ability to do that. At one point my supervisor complained that we were running out of diskspace on the shared workstation. And so I just deleted everything knowing i would get back the things i still needed when I needed them. It was very satifisfying)

oxinabox on 13 Jun 2019

Default ] gc should clean up no-longer-used artifacts by default though, right? We could have options to gc to only clean package or only clean artifacts, e.g. gc --packages and gc --artifacts. Behavior would be: no flags = all, flags = only the give flags.

StefanKarpinski on 13 Jun 2019

Default ] gc should clean up no-longer-used artifacts by default though, right?

Yes.

The rest I am not so sure about.
Cleaning artifacts, that are not lazily installed, will just break packages.
Such that you can't go and use them in a new enviroment.
Right? because when an orphaned (or otherwise) package is connected to a new enviroments, it doesn't do ] build does it?

And removing orphaned packages without removing there artifacts seems pointless.
Like if you are freeing up disk space, by deleting the package, why would you not want to delete the far larger artifact?

oxinabox on 13 Jun 2019

Cleaning artifacts, that are not lazily installed, will just break packages.

If a package still uses that artifact, it will not be GC'ed.

Right? because when an orphaned (or otherwise) package is connected to a new enviroments, it doesn't do ] build does it?

This is all independent of ] build; installation of non-lazy artifacts happens at ] add time.

Like if you are freeing up disk space, by deleting the package, why would you not want to delete the far larger artifact?

I also do not see a situation where you would want to delete one and not the other.

staticfloat on 13 Jun 2019

Cleaning artifacts, that are not lazily installed, will just break packages.

What? No, you only clean an artifact once no installed packages refer to it. So you clean packages that no longer appear in any known manifest and then you clean artifacts that are no longer used by any packages. You could skip the actual deletion step for packages, but maybe that's not a good idea since then those package versions will not work since they'll have missing artifacts. So gc --packages would only clean out packages and skip the following artifact cleaning step. The gc --artifacts call would skip the package cleaning phase and just delete any artifacts that are no longer referenced by and installed packages (regardless of whether those packages would be cleaned out).

Also, if an artifact is missing, whether it's lazy or not, the artifact"dataset-A" call should probably go get the artifact if it is absent for whatever reason. Or maybe just print a message that someone should do ] instantiate to go get any missing artifacts. Yeah, that's probably better.

StefanKarpinski on 13 Jun 2019

You could skip the actual deletion step for packages, but maybe that's not a good idea since then those package versions will not work since they'll have missing artifacts.

Right, that is what I am saying will break.

I should have said _will result in those package versions being broken if they are add'd_
but we could build around that, by doing some checks when you add somethings that it is still constistant.
But then we might have the problem that for unpacked things we can't check the SHA256 any more.

oxinabox on 13 Jun 2019

I should have said will result in those package versions being broken if they are add'd

That's what I'm saying is wrong; you're saying "if I delete Foo that Bar depends on, then try to add Bar, it will fail because Foo is missing". That's not how Pkg works; when you want to install Bar, it will automatically install Foo because it knows that Foo is a dependency of Bar. That's how artifacts will work as well; all the installation happens at ] add time, not ] build time (we're explicitly moving away from being able to have mutable state; this means that everything needs to be installed by the time you finish the Pkg.add() operation).

staticfloat on 13 Jun 2019

👍1

Right, that is what I am saying will break.

Yeah, there's no good reason to support that.

I'm also having trouble coming up with realistic scenarios where you need to clean out packages but not artifacts or vice versa. But the operation proceeds in two fairly separate phases:

Figure out which packages are no longer reference by any manifests and delete them.
Figure out which artifacts are no longer reference by any installed packages and delete them.

You can do one or the other independently and not break things, or one then the other which should be the default and cleans up the most space.

StefanKarpinski on 13 Jun 2019

One thing that I really like about this new approach that occurred to me is that by not having artifacts inside of packages, it allows artifacts to live in different ~repos~ depots than packages do. So you could have a pre-installed system copy of an artifact that is used by one or more user-installed copies of a package. That's quite cool, and potentially useful, imo.

StefanKarpinski on 13 Jun 2019

That's not how Pkg works; when you want to install Bar, it will automatically install Foo because it knows that Foo is a dependency of Bar. That's how artifacts will work as well;

Right, ok, I had the picture wrong in my head.
I thought artifacts would not resolve just like packages.
(since they don't have UUIDs or versions)
but I guess there is indeed nothing stopping that.

One thing that I really like about this new approach that occurred to me is that by not having artifacts inside of packages, it allows artifacts to live in different repos than packages do. So you could have a pre-installed system copy of an artifact that is used by one or more user-installed copies of a package. That's quite cool, and potentially useful, imo.

That is nice. The DataDeps way of doing the same is a bit scary and unsafe. and kind encourages being unsafe (will probably have to change it eventually. I am now super sold on this whole naming things using their SHA); DataDeps just uses the name.
But because the artifacts are identified by SHA (On futher thought, I assume the even after unpacking the SHA is available? Because it will be used as a folder name?)

Ok cool things are much clearer now.

oxinabox on 13 Jun 2019

(since they don't have UUIDs or versions)

They don't need UUIDs or versions because they're content-addressed. You don't really care if one libfoo is "the same artifact" as a different libfoo—they're either the same data or they aren't.

it allows artifacts to live in different repos than packages do.

Oops, I meant "depots" not "repos".

But because the artifacts are identified by SHA (On futher thought, I assume the even after unpacking the SHA is available? Because it will be used as a folder name?)

This comment was about keeping metadata about artifacts around after they're installed so that you know what the SHA etc. was. I'm not really sure about how to structure the thing that goes at ~/.julia/artifacts/libfoo/$slug: you want the actual artifact content somewhere but you also want a bit of metadata about it. This is complicated by the possibility that it is sometimes just a single file and sometimes a folder that we've extracted from an archive. @oxinabox, @staticfloat, do you guys have any thoughts about the structure of these? What would the layout be?

StefanKarpinski on 13 Jun 2019

I'm removing the "speculative" label because this is getting pretty concrete at this point. Some updates from Slack discussion:

We should identify artifacts by their on-disk content, not the archive hash. After all, the former is the definitive thing that offers no wiggle room, whereas many different archives can produce the same on-disk archive. That means we should have a git-tree-sha1 field in each artifact stanza, much like we do in package manifest stanzas. We may want to think about how artifact stanzas mirror manifest stanzas in other ways as well.
As a corollary of the above, you can potentially have different ways of acquiring the same exact artifact—different download URLs, different archive hashes. I previously thought that we should keep metadata about artifacts somewhere with the artifact, but with this design change I'm not so sure. After all, the one true defining characteristic of an artifact is its tree hash and you can always recompute that from it on disk—and if the slug from that hash doesn't match, then you have a corrupted artifact that you shouldn't use anyway.
Maybe we want to keep a log of artifact downloads somewhere like ~/.julia/logs/artifact_usage.toml: a record of what package triggered the install of an artifact, whether it was already installed or not, where it would have been downloaded from, etc.
We still want to record a SHA256 hash of the downloaded, pre-extraction state of each artifact so that we can verify it before downloading it, but this is no longer how we identify it.
I'm still not fully clear on how we should do artifact variant selection. @staticfloat's platform string approach or my more verbose dict approach. This is one of the last things to be decided.

StefanKarpinski on 14 Jun 2019

I'm still not fully clear on how we should do artifact variant selection. @staticfloat's platform string approach or my more verbose dict approach. This is one of the last things to be decided.

The advantage of the dict approach is that it is more extensible should additional keys be required in future.

simonbyrne on 14 Jun 2019

I’m very willing to use a dict based approach. There’s no inherent advantage to the string format other than compactness (and ability to fit within a filename) but living within the Artifact.toml, if we have access to richer data structures we should just use them.

staticfloat on 14 Jun 2019

Great work on the design. I want to bring up a point about build variants, that I was thinking about. Curious about your thoughts.

If I understand correctly, the LibFoo_jll binary variant that is selected is based on its version and on system properties only. Is there any other way for the user to pick a different build, that is not full manual dev mode? Or should they create a separate LibFoo_with_x_enabled_jll and fork LibFoo.jl, and change the Artifact.toml to use LibFoo_with_x_enabled_jll instead? A concrete example is for instance SQLite with the R*tree module enabled, which perhaps does not make sense as a default, but could be requested specifically in Artifact.toml of a project or package. Although you'd probably still want to use it through the same julia wrapper package (SQLite.jl), which would need to know that you want to use a different variant of the binary. Similarly, we could make a default GDAL install small with only the most commonly needed formats, but allow a user to explicitly request a large full variant instead (issue ref). Right now I don't see a way to do that rather than deving everything and putting all artifacts in manually. Not sure how big of a can of worms this is though.

visr on 14 Jun 2019

So latest sketch of the way Artifacts.toml will look:

[dataset-A]
git-tree-sha1 = "e445efb1f3e2bffc06e349651f13729e6f7aeaaf"
basename = "dataset-A.csv"

  [dataset-A.download]
  sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
  url = [ # multiple URLs to try
      "https://server1.com/path/to/dataset.csv",
      "https://server2.com/path/to/dataset.csv",
  ]

[nlp-model-1]
git-tree-sha1 = "dccae443aeddea507583c348d8f082d5ed5c5e55"
basename = "nlp-model-1.onnx"

  [[nlp-model-1.download]] # multiple ways to download
  sha256 = "5dc925ffbda11f7e87f866351bf859ee7cbe8c0c7698c4201999c40085b4b980"
  url = "https://server1.com/nlp-model-1.onnx.gz"
  extract = "gzip" # decompress file

  [[nlp-model-1.download]]
  sha256 = "9f45411f32dcc332331ff244504ca12ee0b402e00795ab719612a46b7fb24216"
  url = "https://server2.com/nlp-model-1.onnx"

[[libfoo]]
git-tree-sha1 = "05d42b0044984825ae286ebb9e1fc38ed2cce80a"
os = "Linux"
arch = "armv7l"

  [libfoo.download]
  sha256 = "19e7370ab1819d45c6126d5017ba0889bd64869e1593f826c6075899fb1c0a38"
  url = "https://server.com/libfoo/Linux-armv7l/libfoo-1.2.3.tar.gz"
  extract = ["gzip", "tar"] # outermost first or last?

[[libfoo]]
git-tree-sha1 = "c2dc12a509eec2236e806569120e72058579ba19"
os = "Windows"
arch = "i686"

  [libfoo.download]
  sha256 = "95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92"
  url = "https://server.com/libfoo/Windows-i686/libfoo-1.2.3.zip"
  extract = "zip"

[[libfoo]]
git-tree-sha1 = "d633f5f44b06d810d75651a347cae945c3b7f23d"
os = "macOS"
arch = "x86_64"

  [libfoo.download]
  sha256 = "b65f08c0e4d454e2ff9298c5529e512b1081d0eebf46ad6e3364574e0ca7a783"
  url = "https://server.com/libfoo/macOS-x86_64/libfoo-1.2.3.xz"
  extract = ["xz", "tar"]

Some features of this sketch:

the git-tree-sha1 is the defining key of each artifact variant—it must be present
- this is the tree hash of the final extracted artifact as it appears on disk
- if there is already an artifact in the corresponding location, there is no need to reinstall the artifact
- multiple different packages can use the same artifact and describe different ways to get it—the only thing that matters is the bits on disk, which is what this is a hash of; as long as those are the same, just use it, it doesn't matter how it got there
an optional basename key can be given for an artifact variant
- if it is absent, the extracted artifact tree will be installed at ~/.julia/artifacts/$name/$slug
- if it is present, the extracted artifact tree will be installed at ~/.julia/artifacts/$name/$slug/$basename
- this is intended to handle situations where the name of the artifact file is significant to some consumer, e.g. a reader that expects a CSV file to have the .csv extension
- this could also be handled by putting the basename part inside of artifact, but there may be cases where we want to download artifacts as-is and therefore cannot control their structure
top-level keys in artifact stanzas with multiple variants are variant selectors, os, arch, etc.
each artifact variant has one or more download stanzas which describe a way to get it
- there can be one or more url values in a download stanza—this is just a shorthand for giving multiple identical download stanzas that only differ by URL since that will be a common case
- download stanzas have a sha256 entry, which gives the SHA256 hash of the downloaded file; this may be different for different download methods for the same artifact since it may be archived or compressed differently; this hash allows checking download correctness before extracting.
- download stanzas may have an extract entry which indicates how to extract the actual artifact tree from the download; it can be a string to indicate a single extraction step or an array of string to indicate a sequence of extraction steps; these can only be selected from a set of known extraction steps, e.g. tar, gz, bz2, xz, zip; by default, no extraction is performed

StefanKarpinski on 14 Jun 2019

I'm not so sure if the basename bit is necessary or a good idea. Maybe it isn't—it does mean that all users of an artifact must not only agree on the git-tree-sha1 but _also_ the basename, which gives me pause. Maybe this should be a feature of the download instead, e.g. prefix = "dataset-A.csv"?

StefanKarpinski on 14 Jun 2019

👍1

Part of the download seems right.
If it was a tarball containing a CSV with that name,
that was untarballed
then that should be the same as a CSV that was downloaded
and then lost its name (because Base.download does not know how to negotiate names or the webserver was bad)
and then had its name put back in by postprocessing.
(probably not prefix though, maybelocalfilename?)
It should be mutually exclusing with extract.
So it would be nice to express both extact and the setting of the name as values for a single option

Edit: Oh but we migth want to allowed .csv.gz and have that be extracted to a .csv.
Still putting this into the realm of postfetch feels right.

oxinabox on 14 Jun 2019

Maybe call it basename but put it in the download section and have it mean that the download will be extracted to ~/.julia/artifacts/$name/$slug/$basename. The thing that's git tree hashed is the entire tree at ~/.julia/artifacts/$name/$slug, which in that situation would be $basename and whatever it contains. Updated sketch with this scheme:

[dataset-A]
git-tree-sha1 = "e445efb1f3e2bffc06e349651f13729e6f7aeaaf"

  [dataset-A.download]
  basename = "dataset-A.csv"
  sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
  url = [ # multiple URLs to try
      "https://server1.com/path/to/dataset.csv",
      "https://server2.com/path/to/dataset.csv",
  ]

[nlp-model-1]
git-tree-sha1 = "dccae443aeddea507583c348d8f082d5ed5c5e55"

  [[nlp-model-1.download]] # multiple ways to download
  basename = "nlp-model-1.onnx"
  sha256 = "5dc925ffbda11f7e87f866351bf859ee7cbe8c0c7698c4201999c40085b4b980"
  url = "https://server1.com/nlp-model-1.onnx.gz"
  extract = "gzip" # decompress file

  [[nlp-model-1.download]]
  basename = "nlp-model-1.onnx"
  sha256 = "9f45411f32dcc332331ff244504ca12ee0b402e00795ab719612a46b7fb24216"
  url = "https://server2.com/nlp-model-1.onnx"

[[libfoo]]
git-tree-sha1 = "05d42b0044984825ae286ebb9e1fc38ed2cce80a"
os = "Linux"
arch = "armv7l"

  [libfoo.download]
  sha256 = "19e7370ab1819d45c6126d5017ba0889bd64869e1593f826c6075899fb1c0a38"
  url = "https://server.com/libfoo/Linux-armv7l/libfoo-1.2.3.tar.gz"
  extract = ["gzip", "tar"] # outermost first or last?

[[libfoo]]
git-tree-sha1 = "c2dc12a509eec2236e806569120e72058579ba19"
os = "Windows"
arch = "i686"

  [libfoo.download]
  sha256 = "95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92"
  url = "https://server.com/libfoo/Windows-i686/libfoo-1.2.3.zip"
  extract = "zip"

[[libfoo]]
git-tree-sha1 = "d633f5f44b06d810d75651a347cae945c3b7f23d"
os = "macOS"
arch = "x86_64"

  [libfoo.download]
  sha256 = "b65f08c0e4d454e2ff9298c5529e512b1081d0eebf46ad6e3364574e0ca7a783"
  url = "https://server.com/libfoo/macOS-x86_64/libfoo-1.2.3.xz"
  extract = ["xz", "tar"]

StefanKarpinski on 14 Jun 2019

I think we need more thought.

What is so base about basename about it?

It only should matter for things that are not tarballs or zips.
I am kinda think it shouldn't ever exist for other cases?

Or at least I am not sure what it will do in those cases.

Understanding more how in interacts with

extract = ["tar", "gz"]
Vs
For
extract = ["gz"] on a csv
Vs
extract = [] on a csv

Are we thinking that tarballs extract to become 1 folder and we the rename that folder?
Or are we thinking that tarballs become a collection of files?
I was thinking the latter, but now I think I am wrong?

oxinabox on 14 Jun 2019

basename is just the traditional Unix name for the last part of a path. A better scheme for this would be good.

StefanKarpinski on 15 Jun 2019

Idea: basename could be an extraction step, but I'm not sure how to express this. Rough attempt:

[dataset-A.download]
sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
url = [ # multiple URLs to try
   "https://server1.com/path/to/dataset.csv",
   "https://server2.com/path/to/dataset.csv",
]
extract = { rename: "dataset.csv" }

That's not quite right though since I don't think you can put a dict in an array.

StefanKarpinski on 18 Jun 2019

That is what I was saying.

oxinabox on 19 Jun 2019

👍1

Only took me four days for the same thing to occur to me 😁

StefanKarpinski on 19 Jun 2019

[dataset-A.download]
sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
url = [ # multiple URLs to try
   "https://server1.com/path/to/dataset.csv.gz",
   "https://server2.com/path/to/dataset.csv.gz",
]
postfetch.extract = ["gz"]
postfetch.rename ="dataset.csv"

With the stern rule rename always occurs after extract, and that omitting either results in identity/noop.

oxinabox on 19 Jun 2019

👍1

Having thought about this for a bit, I am uncomfortable with the coupling between rename and git-tree-sha1 (if you change rename you're going to need to change git-tree-sha1). I'm also uncomfortable with how rename doesn't make sense when dealing with a .tar.gz, since if you're going to extract a file, you kind of don't care what the .tar.gz file's filename was, and renaming something after extracting doesn't make sense in that case.

I think I would rather have extraction only be an option in the well-defined case; where we have a container (like a .tar.gz) and that file structure is stored within it; this would make extract and basename mutually exclusive; either you're extracting things, or you're downloading a single file. basename will still interact with git-tree-sha1, but I'm willing to forgive that.

For more complex usecases, I think I would rather push this off onto a more advanced Pkg concept, which I have helpfully written up a big "thing" about over here: https://github.com/JuliaLang/Pkg.jl/issues/1234 (whooo I got a staircase issue number! Lucky day!). Even if that's not something we want in Pkg, I still think restricting the flexibility here is going to help us keep a sane, simple design.

staticfloat on 20 Jun 2019

👍1

Making extract and basename mutually exclusive extraction options seems sane to me. Maybe in that case calling the option filename would be more obvious than basename which felt more applicable to both files and directories, but of course in the case of a directory, there’s no need for an option to control the name.

When it comes to extraction, we should be very strict about how extraction is allowed: it should only ever produce files under the target location. I know some archive formats allow other destinations, which we should make sure to prevent.

StefanKarpinski on 20 Jun 2019

Yeah, I like filename better as well.

When it comes to extraction, we should be very strict about how extraction is allowed: it should only ever produce files under the target location. I know some archive formats allow other destinations, which we should make sure to prevent.

I want to make sure that extraction can work everywhere, right now with .tar.gz we have pretty good support (since we bundle 7zip with Julia), if we allow people to download non-BB generated things, we may want to widen that to .zip and .tar.bz2 as well (which would also be pretty well supported). Beyond that, there is some desire for .tar.xz just because it compresses pretty well, but the long tail of distro support doesn't have our backs on that one quite yet. We could conceivably ship binaries of tar and xz for all platforms, add it as a lazy Artifact to Pkg itself (stored in a .tar.gz of course, haha) and then we'd be able to do it..... but for now, I argue let's just stick with a small subset of things we already know works.

staticfloat on 20 Jun 2019

👍1

Pkg.jl: Pkg + BinaryProvider

Most helpful comment

All 80 comments

The Artifacts File

Usage by BinaryBuilder & BinaryProvider

Related issues