Julia: Threads.nthreads() not a compile-time constant.

Created on 27 Feb 2020 · 22 Comments · Source: JuliaLang/julia

Since the number of threads can't be changed without restarting julia, I find this behavior quite surprising:

julia> @code_llvm Threads.nthreads()
;  @ threadingconstructs.jl:19 within `nthreads'
define i64 @julia_nthreads_20335() {
top:
; ┌ @ pointer.jl:105 within `unsafe_load' @ pointer.jl:105
   %0 = load i32, i32* inttoptr (i64 140333758273848 to i32*), align 8
; └
; ┌ @ boot.jl:707 within `Int64'
; │┌ @ boot.jl:626 within `toInt64'
    %1 = sext i32 %0 to i64
; └└
  ret i64 %1
}

In general, I think it would be useful to do something like

const NUM_THREADS = Int(unsafe_load(cglobal(:jl_n_threads, Cint)))
nthreads() = NUM_THREADS

so that branches / dispatches which depend on the number of threads can be eliminated.
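
As a sketch of what the proposal would enable (pick_impl, threaded_impl, and serial_impl are illustrative names, not anything from Base): once NUM_THREADS is an ordinary Julia constant, a branch on it inside a compiled method can be folded away and the untaken side dropped.

const NUM_THREADS = Int(unsafe_load(cglobal(:jl_n_threads, Cint)))

# With NUM_THREADS a compile-time constant, the compiler keeps only one side
# of this branch when it compiles pick_impl.
pick_impl(x) = NUM_THREADS > 1 ? threaded_impl(x) : serial_impl(x)

threaded_impl(x) = x  # placeholder bodies; only the branch matters here
serial_impl(x)   = x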

Most helpful comment

We really do not need more reasons to incur large latencies recompiling everything.

All 22 comments

Code doesn't have to be recompiled just because you restart julia.

I'm not sure I understand. Are you saying this causes problems with static compilation?

Making nthreads a constant is going in the wrong direction, so we should not have this. And yes, this causes problems with compilation.

I also don't believe there are many valuable cases that benefit from static dispatch on nthreads.

Using @spawn or @threads induces a lot of overhead, so often one wants to do things like

if nthreads() > 1
    threaded_f(x)
else
    f(x)
end

There are cases where even the presence of a @spawn in a branch that doesn't get hit prevents optimization of a function body, so in a tight kernel you often want to separate those paths out into separate function calls. Because nthreads() gets recomputed on each call, the above if block won't get elided.
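
To make the "separate function calls" point concrete, here is a minimal sketch (serial_kernel!, threaded_kernel!, and process! are illustrative names, not from this issue): the @spawn lives in its own function so it cannot inhibit optimization of the serial kernel, and the nthreads() branch happens once, outside the hot loop.

using Base.Threads: @spawn, nthreads

# Tight serial kernel with no @spawn anywhere in its body.
function serial_kernel!(out, xs)
    @inbounds for i in eachindex(xs, out)
        out[i] = 2 * xs[i]
    end
    return out
end

# Threaded wrapper; the @spawn is isolated here.
function threaded_kernel!(out, xs)
    mid = length(xs) ÷ 2
    t = @spawn serial_kernel!(view(out, 1:mid), view(xs, 1:mid))
    serial_kernel!(view(out, mid+1:length(xs)), view(xs, mid+1:length(xs)))
    wait(t)
    return out
end

# One branch, outside the kernels.
process!(out, xs) = nthreads() > 1 ? threaded_kernel!(out, xs) : serial_kernel!(out, xs)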

If we eventually do make machinery for changing the number of threads at runtime, it should behave like function redefinition and require you to hit the global scope for it to take effect.

Assuming your if is not in a loop, this won't matter. If it is, then you should move it out.
The single-thread overhead for @threads pretty much all comes from https://github.com/JuliaLang/julia/issues/15276, apart from the cases that are fixed in a PR already. Even if it is defined your way, it won't help anything.

If we eventually do make machinery for changing the number of threads at runtime, it should behave like function redefinition and require you to hit the global scope for it to take effect.

Absolutely NOT.

We really do not need more reasons to incur large latencies recompiling everything.

It would be nice to have a way to dispatch single/multithreaded functions without incurring a runtime penalty. Is there a suggested way to do this? (How are other packages managing it?)

Instead of dispatching on the exact value of Threads.nthreads(), would it be easier on the compiler to allow us to just dispatch on Threads.nthreads() == 1 versus Threads.nthreads() != 1?

@DilumAluthge that would be sufficient, I think. Basically, I'd like a pain-free way to do

foo(a, b) = nthreads() > 1 ? foo(a, b, Val{true}()) : foo(a, b, Val{false}())
foo(a, b, ::Val{true}) = # threaded version
foo(a, b, ::Val{false}) = # unthreaded version

would it be easier on the compiler to allow us to just dispatch on Threads.nthreads() == 1 versus Threads.nthreads() != 1

No.

Is if nthreads() > 1 really so slow that it's a deal-breaker even for potentially-multithreaded code? I know @threads currently has too much overhead, but it's never going to have less overhead than simply checking nthreads(), so I don't see why that would be the limiting factor.

From a practical standpoint, and for your own planning purposes, I can tell you we are not going to implement this any time soon. If you really need to configure some code for threaded/non-threaded cases with zero overhead, I would use const threaded = true # or false.
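
A minimal sketch of that suggestion (run_all, threaded_run, and serial_run are made-up names; the point is only that the flag is a literal constant you edit, so the branch costs nothing):

const threaded = true  # or false; changing it means re-evaluating the code that uses it

run_all(xs) = threaded ? threaded_run(xs) : serial_run(xs)

threaded_run(xs) = sum(xs)  # placeholder bodies
serial_run(xs)   = sum(xs)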

Sorry for prolonging the thread, but in thinking about this, would a SimpleTrait be a good way to handle things?

@traitdef IsThreaded
@traitimpl IsThreaded <- nthreads() > 1

In terms of an easy-to-write way to do things that you shouldn't do, sure. In terms of a long-term, acceptable, and well-defined solution, no.

edit: and this assumes there's some compile-time specialization involved. If there's none, then it might be well defined, but it will strictly be slower than a branch.

What's the most correct way to handle this (to dispatch based on threading capability)?

Don't. Trying to use dispatch is exactly what's wrong about it. A branch is always faster.

Now, if what you mean is an easier way to thread the information from the top level down to feed a branch a few function calls down, so that you don't need to rely on inlining for constant propagation to work, then you can use whatever singleton type you like, such as Val{true}, to pass that information down.
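
A sketch of that pattern (outer and inner are illustrative names): the decision is made once at the top and carried down as a singleton type, so inside each specialization the branch is on a compile-time value.

# The only dynamic step is constructing Val(...) at the top.
outer(xs) = inner(xs, Val(Threads.nthreads() > 1))

function inner(xs, ::Val{threaded}) where {threaded}
    # `threaded` is a compile-time Bool in each method instance, so this
    # branch is resolved during specialization.
    return threaded ? sum(xs) : sum(xs)  # placeholder: threaded vs. serial work
end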

Is if nthreads() > 1 really so slow that it's a deal-breaker even for potentially-multithreaded code?

Not necessarily, but you're having to check every time since it's not constant (even though it effectively IS right now), right? Maybe it's a case of premature optimization, but every little bit helps in this code.

@yuyichao

Don't. Trying to use dispatch is exactly what's wrong about it. A branch is always faster.

So, something like this?

foo(a, b) = nthreads() > 1 ? foo_threaded(a, b) : foo_unthreaded(a, b)

Yes.

Yes, that approach is fine. Or if you need to remove all possible overhead, a global constant. Or as Yichao said you can specialize a whole call tree by passing Val(nthreads()>1) and propagating it --- however in that case there is an initial slow dynamic call and compilation.

however in that case there is an initial slow dynamic call and compilation.

You could do if nthreads() > 1; foo(Val(true)) else foo(Val(false)) end to convince the optimizer, and that's a transformation that at least in principle could be done automatically...
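
Written out as a sketch (foo stands for whatever Val-specialized entry point you already have, as in the earlier example):

if Threads.nthreads() > 1
    foo(Val(true))   # literal Val(true): the callee specialization is known statically
else
    foo(Val(false))  # literal Val(false): likewise
end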

It would be interesting to see a case where a single (perfectly predicted) branch has a measurable runtime cost compared to calling a function that you might want to run threaded.

It just seems to me that some sort of global "capabilities list" that all packages can access consistently might make sense. There's ComputationalResources.jl, which might do the trick, but it doesn't (yet) cover all cases that might be relevant for our work.

I think trying to depend on nthreads() creates a less composable ecosystem. _Ideally_, it would be better to depend on the input "problem size" so that you don't consume extra threads when the input is too small. That way, other parts of the program can use more threads when they need them. Of course, I understand that this is hard when your function is not data parallel.
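
A hedged sketch of that size-based alternative (mysum and basesize are illustrative, not an API from this thread): the serial/parallel decision depends on how much work there is rather than on nthreads(), which composes better when the caller is itself running on multiple threads.

# Recursive divide-and-conquer sum; small inputs stay serial regardless of nthreads().
function mysum(xs; basesize = 10_000)
    length(xs) <= basesize && return sum(xs)
    mid = length(xs) ÷ 2
    left  = Threads.@spawn mysum(view(xs, 1:mid); basesize = basesize)
    right = mysum(view(xs, mid+1:length(xs)); basesize = basesize)
    return fetch(left) + right
end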

