We (Invenia) have often run into an issue where, when installing packages on Julia 0.7 and 1.0, random files in a particular package are truncated at random places. We've seen this most often with AWSSDK, which has a lot of very large files, though the issue is not limited to that package. We noticed it because `using AWSSDK` errors, reporting, for example, that a docstring or module is unterminated.
Some data points:
- When we set the default concurrency in `Context` to 1 instead of 8, the problem disappears.
- We know that the files are intact after the tarballs are extracted.
- When we replace the `mv` from the extracted tarball location to the package tree location with Julia's `cp`, the problem persists, but if we replace it with a ``run(`cp ...`)``, the problem goes away.
- We have only observed this on Debian Stretch runners on our internal GitLab CI, on both 32- and 64-bit systems, though more often on 32-bit. We have _not_ seen it on our Mac CI or on Amazon Linux.
- We've seen one instance of it happening locally to @rofinn on Elementary OS (based on Ubuntu, which is in turn based on Debian) when installing GitHub.jl.
- The `@async`s in `apply_versions` are never `@sync`ed, and the `Channel`s are never closed.
- We use multiple registries: the public General registry and a private one for internal packages.
How large are the files? Any chance that https://github.com/JuliaLang/julia/issues/14574 is related?
The smallest file we've seen truncated was supposed to be 108,741 bytes, and was truncated at 65,536 bytes. Likely unrelated to that issue given that it's well under 2 GB.
> When we replace the `mv` from the extracted tarball location to the package tree location with Julia's `cp`, the problem persists, but if we replace it with a ``run(`cp ...`)``, the problem goes away.

That seems like the most damning to me.
Since it only happens with multiple tasks, perhaps a task switch inside Julia's `cp` or `mv` corrupts the file somehow (doesn't finish the transfer)?
A potential repro is to start a bunch of tasks that do `cp` and see if the files all arrive as they should.
We tried that; this was our attempt to reproduce:
```julia
using Random

function demo(dest; num=10, size=100_000)
    files = 1:num
    num_concurrent_downloads = 8
    channel = Channel(num_concurrent_downloads)
    results = Channel(num_concurrent_downloads)
    @async begin
        for i in files
            file = "$(i).txt"
            put!(channel, file)
        end
        close(channel)  # close the channel so the worker loops can terminate
    end
    isdir(dest) || mkdir(dest)
    for i in 1:num_concurrent_downloads
        @async begin
            for file in channel
                # Write file to pwd
                open(file, "w+") do io
                    for i in 1:size
                        write(io, randstring(1))
                    end
                end
                output = joinpath(dest, file)
                cp(file, output; force=true)
                put!(results, output)
            end
            println("Game over")
        end
    end
    for i in eachindex(files)
        path = take!(results)
        file_size = lstat(path).size
        # if file_size != size
        println("$path: $file_size")
        # end
    end
end
```
and... does it reproduce?
No, we couldn't get it to reproduce the issue.
I can reproduce the issue on Debian Stretch using:
```sh
# Setup Julia
if [ "$(uname -m)" = "x86_64" ]; then
    curl -L https://julialang-s3.julialang.org/bin/linux/x64/1.0/julia-1.0.1-linux-x86_64.tar.gz > julia.tar.gz
else
    curl -L https://julialang-s3.julialang.org/bin/linux/x86/1.0/julia-1.0.1-linux-i686.tar.gz > julia.tar.gz
fi
tar xvf julia.tar.gz &> /dev/null
export PATH="$(pwd)/julia-1.0.1/bin:${PATH}"

# Install packages into a fresh depot
export JULIA_DEPOT_PATH="$(pwd)/depot"
rm -rf "$JULIA_DEPOT_PATH"
julia -e 'using Pkg; Pkg.add(["AWSSDK", "FilePaths"]); using AWSSDK'
```
It doesn't seem to be reproducible on Docker.
Didn't repro on my Mac.
I'll note that we only seem to be able to reproduce this when writing across filesystem boundaries. Pkg downloads and extracts in `/tmp`, then moves the files to where our CI builds happen in `/mnt`, which is on a separate volume. When we have it move within the same volume, it seems to work.
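For anyone wanting to check whether their own setup moves files across a filesystem boundary, one rough way (a sketch, assuming GNU coreutils `stat` as on Debian; the paths are just illustrative) is to compare the device IDs of the two directories:

```sh
# Compare the device IDs of the temp dir and the build dir.
# If they differ, a rename() between them fails with EXDEV,
# and the move has to fall back to a copy instead.
src_dev=$(stat -c %d /tmp)
dst_dev=$(stat -c %d .)
if [ "$src_dev" = "$dst_dev" ]; then
    echo "same filesystem: rename() suffices"
else
    echo "different filesystems: mv must copy"
fi
```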
Seems like an upstream (Julia) bug ;)
We'll try to work around this by setting `TMPDIR` so that Pkg writes to the same volume as our CI build.
We haven't tried this yet, but it might be possible to reproduce this by creating separate temporary volumes in a 32-bit Debian VM and setting `TMPDIR` to point to one of them.
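That separate-volume setup might look something like the following (an untested sketch: it assumes root inside the Debian VM, and the image size, paths, and mount point are made up for illustration):

```sh
# Create a small loopback ext4 volume and mount it, so that moves
# between it and the rest of the filesystem must cross a boundary.
dd if=/dev/zero of=/root/tmpvol.img bs=1M count=512
mkfs.ext4 -F /root/tmpvol.img
mkdir -p /mnt/tmpvol
mount -o loop /root/tmpvol.img /mnt/tmpvol

# Point TMPDIR at the new volume so Pkg extracts there, while the
# depot lives on the other filesystem; then install a large package.
export TMPDIR=/mnt/tmpvol
export JULIA_DEPOT_PATH=/root/depot
julia -e 'using Pkg; Pkg.add("AWSSDK"); using AWSSDK'
```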
> Seems like an upstream (Julia) bug ;)
Yeah, perhaps. In particular, it sounds like libuv.
https://github.com/JuliaLang/julia/issues/26907 also happened when crossing filesystem boundaries.
When crossing filesystem boundaries, there's some logic to switch from e.g. `rename()` to `cp()`, since you can't rename a file across a filesystem boundary. That seems like a natural place to look for a bad code path, but from what I understand most of that code is within libuv itself, and I have no idea why you'd only get the first 64K of a file.
Do the truncated lengths follow a particular pattern? E.g. does it look like you always get multiples of 16K or something like that?
> Do the truncated lengths follow a particular pattern? E.g. does it look like you always get multiples of 16K or something like that?
That's very interesting. The most files we've seen truncated in a single run is five, and when we recorded the truncated sizes, they were all multiples of 64K (and thus also of 16K):
| File | Size (bytes) |
| :--- | :---: |
| AWSSDK/dAby7/src/EC2.jl | 65536 |
| AWSSDK/dAby7/src/GameLift.jl | 65536 |
| AWSSDK/dAby7/src/Greengrass.jl | 65536 |
| AWSSDK/dAby7/src/IAM.jl | 196608 |
| AWSSDK/dAby7/src/OpsWorks.jl | 131072 |
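A quick sanity check (just the sizes from the table above, run through shell arithmetic) confirms that every truncated length is an exact multiple of 64 KiB:

```sh
# Each recorded size modulo 64 KiB; a remainder of 0 means the file
# was cut off exactly on a 64K chunk boundary.
for s in 65536 65536 65536 196608 131072; do
    echo "$s -> remainder $((s % 65536))"
done
```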
@vtjnash, sorry for the ping but perhaps you have an idea of what is going on here.
So our current working theory is that the libuv async machinery is interfering with the libuv copy operation on slower filesystems. If the copy is done in chunks of a certain size (seemingly 64K), it would make sense that a task switch could mean the remaining chunks never get transferred.
That would explain:

- `cp` doesn't have this problem (libuv can't interrupt it on task switch)

That seems like a very likely analysis. So now the question is why the libuv copy work isn't getting continued. @vtjnash and @keno are the resident libuv experts. Any ideas, guys?
I've reported the issue upstream: https://github.com/JuliaLang/julia/issues/29944. Feel free to close this issue here.
I suppose so, since we don't need to do anything here; once this is fixed upstream, we'll get the fix automatically.