Pkg.jl: Package files are sometimes truncated

Created on 4 Oct 2018 · 23 comments · Source: JuliaLang/Pkg.jl

Often we (Invenia) have run into an issue where, when installing packages on Julia 0.7 and 1.0, random files in a particular package are truncated at random points. Most often we've seen this with AWSSDK, which has many very large files, though the issue is not limited to that package. We noticed it because using AWSSDK errors with a report that, for example, a docstring or module is unterminated.

Some data points:

  • When we set the default concurrency in Context to 1 instead of 8, the problem disappears.

  • We know that the files are intact after the tarballs are extracted.

  • When we replace the mv from the extracted tarball location to the package tree location with Julia's cp, the problem persists, but if we replace it with a run(`cp ...`), the problem goes away.

  • We have only observed this on Debian Stretch runners on our internal GitLab CI, on both 32- and 64-bit systems, but more often on 32-bit. We have _not_ seen it on our Mac CI or on Amazon Linux.

  • We've seen one instance of it happening locally to @rofinn on Elementary OS (based on Ubuntu which is in turn based on Debian) when installing GitHub.jl.

  • The @asyncs in apply_versions are never @synced, and the Channels are never closed.

  • We use multiple registries: the public General registry and a private one for internal packages.
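The run(`cp ...`) workaround from the data points above can be sketched as follows. The function name is made up for this example and is not a Pkg.jl internal:

```julia
# Sketch (assumed name, not Pkg.jl's actual code): replace Julia's built-in
# cp, which copies through libuv, with the system's cp binary, which a
# libuv task switch cannot interrupt mid-copy.
function shellout_cp(src::AbstractString, dst::AbstractString)
    # -a preserves permissions and timestamps and recurses into directories
    run(`cp -a $src $dst`)
    return dst
end
```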

Labels: bug, upstream

All 23 comments

How large are the files? Any chance that https://github.com/JuliaLang/julia/issues/14574 is related?

The smallest file we've seen truncated was supposed to be 108,741 bytes, and was truncated at 65,536 bytes. Likely unrelated to that issue given that it's well under 2 GB.

When we replace the mv from the extracted tarball location to the package tree location with Julia's cp, the problem persists, but if we replace it with a run(cp ...), the problem goes away.

Seems like the most damning to me.

Since it only happens with multiple tasks, perhaps a task switch inside Julia's cp or mv corrupts the file somehow (doesn't finish the transfer)?

A potential repro is to start a bunch of tasks that each do a cp and check whether the files all arrive intact.

We tried that, this was our attempt to reproduce:

using Random

function demo(dest; num=10, size=100_000)
    files = 1:num
    num_concurrent_downloads = 8
    channel = Channel(num_concurrent_downloads)
    results = Channel(num_concurrent_downloads)
    @async begin
        for i in files
            put!(channel, "$(i).txt")
        end
        close(channel)  # without this, the consumer loops below never terminate
    end

    isdir(dest) || mkdir(dest)

    for _ in 1:num_concurrent_downloads
        @async begin
            for file in channel
                # Write a file of random content to the current directory
                open(file, "w+") do io
                    for _ in 1:size
                        write(io, randstring(1))
                    end
                end
                output = joinpath(dest, file)
                cp(file, output; force=true)
                put!(results, output)
            end
            println("Game over")
        end
    end

    for _ in eachindex(files)
        path = take!(results)
        file_size = lstat(path).size
        if file_size != size
            println("truncated: $path: $file_size")
        end
    end
end

and... does it reproduce?

No, we can't get it to reproduce the issue.

I can reproduce the issue on Debian Stretch using:

# Setup Julia
if [ "$(uname -m)" = "x86_64" ]; then
    curl -L https://julialang-s3.julialang.org/bin/linux/x64/1.0/julia-1.0.1-linux-x86_64.tar.gz > julia.tar.gz
else
    curl -L https://julialang-s3.julialang.org/bin/linux/x86/1.0/julia-1.0.1-linux-i686.tar.gz > julia.tar.gz
fi
tar xvf julia.tar.gz &> /dev/null
export PATH="$(pwd)/julia-1.0.1/bin:${PATH}"

# Install packages into a fresh depot
export JULIA_DEPOT_PATH="$(pwd)/depot"
rm -rf "$JULIA_DEPOT_PATH"
julia -e 'using Pkg; Pkg.add(["AWSSDK", "FilePaths"]); using AWSSDK'

It doesn't seem to be reproducible on Docker.

Didn't repro on my Mac.

I'll note that we only seem to be able to reproduce this when writing across filesystem boundaries. Pkg downloads and extracts in /tmp, then moves the files to where our CI builds happen in /mnt, which is on a separate volume. When we have it move within the same volume, it seems to work.

Seems like an upstream (Julia) bug ;)

We'll try to work around this by using TMPDIR to write to the same volume as our CI build.
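A hedged sketch of that workaround (the directory here is a placeholder for this example, not our real CI path): on Unix, Julia's tempdir() consults TMPDIR first, so pointing it at the build volume keeps Pkg's extract-then-move on a single filesystem.

```julia
# Placeholder path for this example; the real path would be wherever the
# CI build volume is mounted. With TMPDIR set, Pkg's temporary extraction
# happens on the same volume as the final package tree, so the move is a
# plain rename rather than a cross-filesystem copy.
builds_tmp = mkpath(joinpath(pwd(), "ci-tmp"))
ENV["TMPDIR"] = builds_tmp
```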

We haven't tried this yet, but it might be possible to reproduce this by creating separate temporary volumes in a 32-bit Debian VM and setting TMPDIR to point to one of them.

Seems like an upstream (Julia) bug ;)

Yeah perhaps. In particular it sounds like libuv.

https://github.com/JuliaLang/julia/issues/26907 was also when crossing file system borders.

When crossing filesystem boundaries, there's some logic to switch from e.g. rename() to cp() since you can't rename a file across a filesystem boundary. Seems like a natural place to look for a bad code path; but from what I understand most of that code is within libuv itself, and I have no idea why you'd only get the first 64K of a file.
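That fallback pattern looks roughly like the sketch below (an assumed shape for illustration, not the actual Base or Pkg code): rename within a filesystem, copy-then-delete across one.

```julia
# Sketch of the cross-filesystem move fallback (assumed shape, not the
# actual Base code): rename() fails across filesystem boundaries (EXDEV),
# so fall back to copy + delete -- the copy is the path where the
# truncation shows up.
function mv_with_fallback(src::AbstractString, dst::AbstractString)
    try
        Base.Filesystem.rename(src, dst)  # fast path, same filesystem only
    catch
        cp(src, dst; force = true)        # chunked libuv copy
        rm(src; recursive = true)
    end
    return dst
end
```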

Do the truncated lengths follow a particular pattern? E.g. does it look like you always get multiples of 16K or something like that?

Do the truncated lengths follow a particular pattern? E.g. does it look like you always get multiples of 16K or something like that?

That's very interesting. The most files we've seen cut off in a single run is 5, and when we recorded the truncated sizes, they were all multiples of 64K (and thus also 16K):

| File | Size (bytes) |
| :--- | :---: |
| AWSSDK/dAby7/src/EC2.jl | 65536 |
| AWSSDK/dAby7/src/GameLift.jl | 65536 |
| AWSSDK/dAby7/src/Greengrass.jl | 65536 |
| AWSSDK/dAby7/src/IAM.jl | 196608 |
| AWSSDK/dAby7/src/OpsWorks.jl | 131072 |

@vtjnash, sorry for the ping but perhaps you have an idea of what is going on here.

So our current working theory is that it seems like the libuv async stuff is interfering with the libuv copy operation on slower filesystems. So if the copy is doing it in chunks of a certain size (seems like 64K), it would make sense that a task switch might mean it doesn't transfer the remainder of the chunks.

That would explain:

  • The stochastic nature of it (the task switch needs to happen during a copy)
  • It happens to us on Debian (our runners happen to have slower disks)
  • When we reduce the concurrency to 1, the problem goes away (no task switch during copy)
  • Using the system's cp doesn't have this problem (libuv can't interrupt it on task switch)
  • Only files larger than 64K are truncated (as far as we've seen)
  • Files are truncated to multiples of 64K
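The sizes in the table above are exactly what a chunked copy would leave behind if it stopped early. The sketch below is illustrative only (it is not libuv's or Pkg.jl's actual copy code); it shows that halting a 64K-chunk copy after n chunks leaves a destination whose size is a multiple of 65,536 bytes:

```julia
# Illustrative sketch only, not libuv's implementation: copy `src` to `dst`
# in 64K chunks, optionally stopping early after `max_chunks` chunks to
# mimic a copy that is never resumed after a task switch.
const CHUNK = 65_536  # 64K, the apparent chunk size of the copies

function chunked_copy(src, dst; max_chunks = typemax(Int))
    open(src, "r") do input
        open(dst, "w") do output
            n = 0
            while !eof(input) && n < max_chunks
                write(output, read(input, CHUNK))
                n += 1
            end
        end
    end
    return filesize(dst)
end
```

Stopping after one chunk of a 108,741-byte file leaves exactly the 65,536 bytes observed in the smallest truncated file.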

That seems like a very likely analysis. So now the question is why the libuv copy work isn't getting continued. @vtjnash and @keno are the resident libuv experts. Any ideas, guys?

I've reported the issue upstream: https://github.com/JuliaLang/julia/issues/29944. Feel free to close this issue here.

I suppose we can close it, since we don't need to do anything here; once this is fixed upstream we'll get the fix automatically.
