I can simulate a hang with the following broken pkg server:

```julia
#!/usr/bin/env julia
using HTTP, Dates

const www_root = joinpath(@__DIR__, "storage")
mkpath(www_root)

HTTP.listen("0.0.0.0", 8123) do http
    r = http.message.target
    filepath = joinpath(www_root, lstrip(r, '/'))
    upstream = false # set to true to generate the local storage
    if !isfile(filepath) && upstream
        # @info "Missing from cache, fetching upstream" r
        tmp_filepath = filepath * ".tmp"
        mkpath(dirname(tmp_filepath))
        try
            open(tmp_filepath, "w") do io
                HTTP.get("https://eu-central.pkg.julialang.org$(r)"; response_stream=io)
            end
            mv(tmp_filepath, filepath)
        finally
            rm(tmp_filepath; force=true)
        end
    end
    if isfile(filepath)
        # @info "Serving file" filepath
        HTTP.setstatus(http, 200)
        HTTP.startwrite(http)
        bytes = read(filepath)
        if startswith(r, "/package") && rand() < 0.01
            @warn "[$(now())] Sleeping..."
            sleep(100)
            @warn "[$(now())] Sleeping done..."
            return
        end
        HTTP.write(http, bytes)
        return nothing
    end
    @error "[$(now())] Did not find resource" r
    HTTP.setstatus(http, 404)
    HTTP.startwrite(http)
    return nothing
end
```
Note this part:

```julia
if startswith(r, "/package") && rand() < 0.01
    @warn "[$(now())] Sleeping..."
    sleep(100)
    @warn "[$(now())] Sleeping done..."
    return
end
```
If you try to instantiate an environment with this PkgServer running, either of two things happens (if you happen to hit the sleep path for that instantiate):

```
^C┌ Error: curl_multi_remove_handle: 1
└ @ Downloads.Curl ~/julia16/usr/share/julia/stdlib/v1.6/Downloads/src/Curl/utils.jl:36
```
Might be a duplicate of or related to https://github.com/JuliaLang/Pkg.jl/issues/2287, but since I have not seen a common stacktrace I think it might be different...
I used this script to repeatedly instantiate:

```bash
#!/bin/bash
export JULIA_DEPOT_PATH=${PWD}
export JULIA_LOAD_PATH=${PWD}/Project.toml:@stdlib
export JULIA_PRECOMPILE_AUTO=0
export JULIA_PKG_SERVER="http://localhost:8123"
unset JULIA_PROJECT
export CI=true # avoid pretty-print

rm -rf ${JULIA_DEPOT_PATH}/packages
rm -rf ${JULIA_DEPOT_PATH}/artifacts
rm -rf ${JULIA_DEPOT_PATH}/compiled

julia -e 'using Pkg; Pkg.instantiate()'
```
So the issues are:
Is that correct?
Something like that
Might need to set a timeout on downloads.
That too. But it shouldn't hang just because the server returns nothing, right?
If the client thinks there's more coming, it might.
But why does it think that sometimes and sometimes not? (options 1 and 2 in the top comment)
To clarify: In option 1 the Pkg client bails the moment the server returns, but in option 2 it keeps waiting.
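One way the "keeps waiting" behavior could arise (a hedged sketch, not taken from the report, using a placeholder port and raw sockets instead of HTTP.jl): if the server's headers promise more body bytes than it ever sends, a client that trusts `Content-Length` blocks in `read` until the socket dies.

```julia
using Sockets

# Hypothetical server: claims 100 body bytes, delivers only 5, never closes.
port = 8124  # placeholder port
server = listen(ip"127.0.0.1", port)
@async begin
    sock = accept(server)
    write(sock, "HTTP/1.1 200 OK\r\nContent-Length: 100\r\n\r\nhello")
end

client = connect(ip"127.0.0.1", port)
while !isempty(readline(client)) end      # skip status line and headers
first_bytes = String(read(client, 5))     # "hello" arrives fine
reader = @async read(client, 95)          # the remaining 95 bytes never come
sleep(2)
@show istaskdone(reader)                  # false: still waiting, like option 2
close(client); close(server)
```

A client that instead stops at the end of the received data (or treats a short read as an error) would bail immediately, like option 1.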
Just a thought. Does this only happen when there's a progress meter?
Could it be that the write here is silently erroring, and the progress task stays open?
That's possible, but note that the @async is wrapped in a @sync, and it's an Experimental.@sync, which I believe should capture the error and propagate it to the caller; perhaps @vtjnash can confirm that's the behavior of Experimental.@sync.
Ah, I missed the import. The experimental version does seem to throw and return on error, as shown in the test below.
Given that no error message is thrown during the hang, it seems fair to conclude that an error in the write task is not the problem.
However, in this test the other async task does seem to stay active after the error, which may be unintended, even if it's not the issue at hand:
```julia
julia> Base.Experimental.@sync begin
           @async begin
               sleep(4)
               error()
           end
           @async while true
               sleep(1)
               println("here")
           end
       end
here
here
here
ERROR: hereTaskFailedException
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:705
  [2] wait
    @ ./task.jl:764 [inlined]
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:106
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:389
  [5] take!
    @ ./channels.jl:383 [inlined]
  [6] sync_end(c::Channel{Any})
    @ Base.Experimental ./experimental.jl:63
  [7] top-level scope
    @ experimental.jl:101

    nested task error:
    Stacktrace:
     [1] error()
       @ Base ./error.jl:42
     [2] macro expansion
       @ ./REPL[1]:4 [inlined]
     [3] (::var"#1#3")()
       @ Main ./task.jl:406

julia> here
here
here
```
@StefanKarpinski can you configure Downloads to use a reasonable combination of https://curl.se/libcurl/c/CURLOPT_LOW_SPEED_TIME.html and https://curl.se/libcurl/c/CURLOPT_LOW_SPEED_LIMIT.html ? Or can it be done here from Pkg?
We could add some reasonable defaults for those. What do you think would be good defaults? It can be done from here. The way you'd want to do it is create a global `Downloader` object to be used for all Pkg downloads and then set the `.easy_hook` field of that downloader object to a function like this (with `using LibCURL` so that this works):

```julia
const DOWNLOADER = Downloads.Downloader()
DOWNLOADER.easy_hook = (easy, _) -> begin
    @assert 0 == curl_easy_setopt(easy.handle, CURLOPT_LOW_SPEED_TIME, 123)
    @assert 0 == curl_easy_setopt(easy.handle, CURLOPT_LOW_SPEED_LIMIT, 234)
end
```

That would be a way to experiment with this. You can also just set the default downloader's easy hook if you're only experimenting, but multiple parties can't do that, so it's best if Pkg doesn't touch the default one.
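For reference, a minimal standalone sketch of such a dedicated `Downloader` (the limit values and the commented-out URL are placeholders, not a recommendation):

```julia
using Downloads, LibCURL

# Hypothetical downloader that aborts any transfer averaging under
# 1 byte/s for 20 consecutive seconds (placeholder values).
slow_abort = Downloads.Downloader()
slow_abort.easy_hook = (easy, _) -> begin
    @assert 0 == curl_easy_setopt(easy.handle, CURLOPT_LOW_SPEED_TIME, 20)
    @assert 0 == curl_easy_setopt(easy.handle, CURLOPT_LOW_SPEED_LIMIT, 1)
end

# Any call that passes `downloader = slow_abort` inherits these limits:
# Downloads.download("https://example.com/file"; downloader = slow_abort)
```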
Setting these speed limit options doesn't do anything helpful as far as I can tell. Here's the client code I used:

```julia
using Downloads, LibCURL

ENV["JULIA_PKG_SERVER"] = "localhost:8123"

Downloads.EASY_HOOK[] = (easy, _) -> begin
    @assert 0 == curl_easy_setopt(easy.handle, CURLOPT_LOW_SPEED_TIME, 20)
    @assert 0 == curl_easy_setopt(easy.handle, CURLOPT_LOW_SPEED_LIMIT, 1)
end

push!(empty!(DEPOT_PATH), mktempdir())
```

followed by, in Pkg REPL mode:

```
] activate --temp
] add JSON
```
Note that setting the speed limit to 0 actually turns it off. By comparison, if I set a hard timeout on the connection like this:

```julia
Downloads.EASY_HOOK[] = (easy, _) -> begin
    @assert 0 == curl_easy_setopt(easy.handle, CURLOPT_TIMEOUT, 20)
end
```

then it does time out after 20 seconds and move on to the fallback.
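For one-off tests, the same kind of hard cap can also be requested per call via the `timeout` keyword that `Downloads.request` accepts (a sketch; the URL is a placeholder and the code assumes nothing may be listening there, hence `throw = false`):

```julia
using Downloads

# `timeout` is a hard limit in seconds on the whole transfer, analogous to
# CURLOPT_TIMEOUT above: a stalled server makes the call fail instead of hang.
resp = Downloads.request("http://localhost:8123/registries";
                         timeout = 5, throw = false)
if resp isa Downloads.RequestError
    @info "request failed or timed out" resp.message
end
```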
Ok, I was doing a few things wrong here. Setting `ENV["JULIA_PKG_SERVER"] = "localhost:8123"` makes the client connect via HTTPS, which fails, but the error gets swallowed by Pkg's overuse of try/catch (the worst, we really need to fix this), so it manifested as a hang even though it was not actually stuck in the libcurl call. The server code was also wrong because it has `upstream = false` by default, so it just refuses to fetch anything. If I fix those things, then the low speed options actually have an effect and the install makes decent progress even with a _very_ flaky server that hangs on one in every ten downloads. I can make a patch to Downloads to set these options by default.
Probably fixed.