Since the number of threads can't be changed without restarting julia, I find this behavior quite surprising:
julia> @code_llvm Threads.nthreads()
; @ threadingconstructs.jl:19 within `nthreads'
define i64 @julia_nthreads_20335() {
top:
; ┌ @ pointer.jl:105 within `unsafe_load' @ pointer.jl:105
%0 = load i32, i32* inttoptr (i64 140333758273848 to i32*), align 8
; └
; ┌ @ boot.jl:707 within `Int64'
; │┌ @ boot.jl:626 within `toInt64'
%1 = sext i32 %0 to i64
; └└
ret i64 %1
}
In general, I think it would be useful to do something like
const NUM_THREADS = Int(unsafe_load(cglobal(:jl_n_threads, Cint)))
nthreads() = NUM_THREADS
so that branches / dispatches which depend on the number of threads can be eliminated.
Code doesn't have to be recompiled just because you restart julia.
I'm not sure I understand. Are you saying this causes problems with static compilation?
Making nthreads a constant is going in the wrong direction, so we should not have this. And yes, this causes problems with compilation.
I also don't believe there are many valuable cases that benefit from a static dispatch on nthreads.
Using @spawn or @threads induces a lot of overhead, so often one wants to do things like
if nthreads() > 1
    threaded_f(x)
else
    f(x)
end
There are cases where even the presence of a @spawn in a branch that doesn't get hit prevents optimizations of a function body, so in a tight kernel you often want to separate those out into separate function calls. Because nthreads() gets recomputed each call, the above if block won't get elided.
If we eventually do make machinery for changing the number of threads at runtime, it should behave like function redefinition and require you to hit the global scope for it to take effect.
Assuming your if is not in a loop, this won't matter. If it is, then you should move it out.
The single-thread overhead for @threads pretty much all comes from https://github.com/JuliaLang/julia/issues/15276, apart from the cases that are fixed in a PR already. Even if it is defined in your way, it won't help anything.
If we eventually do make machinery for changing the number of threads at runtime, it should behave like function redefinition and require you to hit the global scope for it to take effect.
Absolutely NOT.
We really do not need more reasons to incur large latencies recompiling everything.
It would be nice to have a way to dispatch single/multithreaded functions without incurring a runtime penalty. Is there a suggested way to do this? (How are other packages managing it?)
Instead of dispatching on the exact value of Threads.nthreads(), would it be easier on the compiler to allow us to just dispatch on Threads.nthreads() == 1 versus Threads.nthreads() != 1?
@DilumAluthge that would be sufficient, I think. Basically, I'd like a pain-free way to do
foo(a, b) = nthreads() > 1 ? foo(a, b, Val{true}()) : foo(a, b, Val{false}())
foo(a, b, ::Val{true}) = # threaded version
foo(a, b, ::Val{false}) = # unthreaded version
would it be easier on the compiler to allow us to just dispatch on Threads.nthreads() == 1 versus Threads.nthreads() != 1
No.
Is if nthreads() > 1 really so slow that it's a deal-breaker even for potentially-multithreaded code? I know @threads currently has too much overhead, but it's never going to have less overhead than simply checking nthreads(), so I don't see why that would be the limiting factor.
From a practical standpoint, and for your own planning purposes, I can tell you we are not going to implement this any time soon. If you really need to configure some code for threaded/non-threaded cases with zero overhead, I would use const threaded = true # or false.
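As a sketch of that suggestion (the constant name and the `process*` functions below are invented for illustration): because the flag is a typed global constant, the compiler can delete the dead branch entirely.

```julia
# Hypothetical package-level compile-time switch, per the advice above.
const THREADED = true  # flip to false for a single-threaded build

process_threaded(xs) = sum(xs)  # stand-in for a threaded implementation
process_serial(xs)   = sum(xs)  # stand-in for a serial implementation

# THREADED is a constant, so this branch folds away at compile time.
process(xs) = THREADED ? process_threaded(xs) : process_serial(xs)
```

The trade-off is that changing the flag requires editing and reloading the code, which is exactly why it sidesteps the recompilation concerns raised earlier in the thread.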
Sorry for prolonging the thread, but in thinking about this, would a SimpleTrait be a good way to handle things?
@traitdef IsThreaded
@traitimpl IsThreaded <- nthreads() > 1
In terms of an easy-to-write way to do things that you shouldn't do, sure. In terms of a long-term, acceptable and well-defined solution, no.
edit: and this is assuming there are some compile-time things involved. If there are none, then it might be well defined, but it will strictly be slower than a branch.
what's the most correct way to handle this (to dispatch based on threading capability)?
Don't. Trying to use dispatch is exactly what's wrong about it. A branch is always faster.
Now, if what you mean is an easier way to thread the information from the top level down to feed a branch a few function calls down, so that you don't need to rely on inlining for constprop to work, then you can use whatever singleton type to pass that information down, like Val{true}.
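That singleton-type pattern might look like the following sketch, where all function names (`outer`, `inner`, `kernel`) are invented for illustration: the threading decision is made once at the top, and every call below it specializes on the Val flag rather than re-checking nthreads().

```julia
using Base.Threads: nthreads

kernel(x, ::Val{true})  = 2x  # stand-in for a threaded implementation
kernel(x, ::Val{false}) = 2x  # stand-in for a serial implementation

# Intermediate layers just forward the flag; each gets compiled
# separately for Val{true} and Val{false}, so the branch inside
# is a compile-time constant there.
inner(x, threaded::Val) = kernel(x, threaded)

# One dynamic decision at the top of the call tree.
outer(x) = nthreads() > 1 ? inner(x, Val(true)) : inner(x, Val(false))
```

As noted later in the thread, the first call through each Val arm pays a compilation cost, but after that the flag costs nothing inside the tree.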
Is if nthreads() > 1 really so slow that it's a deal-breaker even for potentially-multithreaded code?
Not necessarily, but you're having to check every time since it's not constant (even though it effectively IS right now), right? Maybe it's a case of premature optimization, but every little bit helps in this code.
@yuyichao
Don't. Trying to use dispatch is exactly what's wrong about it. A branch is always faster.
So, something like this?
foo(a, b) = nthreads() > 1 ? foo_threaded(a, b) : foo_unthreaded(a, b)
Yes.
Yes, that approach is fine. Or if you need to remove all possible overhead, a global constant. Or, as Yichao said, you can specialize a whole call tree by passing Val(nthreads() > 1) and propagating it; however, in that case there is an initial slow dynamic call and compilation.
however in that case there is an initial slow dynamic call and compilation.
You could do if nthreads() > 1; foo(Val(true)) else foo(Val(false)) end to convince the optimizer, and that's a transformation that at least in principle could be done automatically...
It would be interesting to see a case where a single (perfectly predicted) branch has a measurable runtime cost relative to calling a function that you might want to run threaded.
It just seems to me that some sort of global "capabilities list" that all packages can access consistently might make sense. There's ComputationalResources.jl, which might do the trick, but it doesn't (yet) cover all cases that might be relevant for our work.
I think trying to depend on nthreads() creates a less composable ecosystem. _Ideally_, it would be better to depend on the input "problem size" so that you won't consume extra threads when the input is too small. This way, other parts of the program can use more threads when they need to. Of course, I understand that it's hard when your function is not data parallel.
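One possible sketch of that idea, with an invented cutoff and function name: choose the serial path whenever the input is small, regardless of how many threads happen to be available, so small calls leave the thread pool free for the rest of the program.

```julia
using Base.Threads: nthreads, threadid, @threads

# Hypothetical size threshold below which threading isn't worth it.
const SERIAL_CUTOFF = 10_000

function sum_squares(xs)
    # Decide based on problem size first, thread count second.
    if length(xs) < SERIAL_CUTOFF || nthreads() == 1
        return sum(x -> x^2, xs)
    end
    # Per-thread partial sums, combined at the end.
    partials = zeros(eltype(xs), nthreads())
    @threads for i in eachindex(xs)
        partials[threadid()] += xs[i]^2
    end
    return sum(partials)
end
```

The cutoff would need tuning per workload; the point is only that the branch condition involves the input, not just a global thread count.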