Consider the following code:
using Base.Threads
println("Number of threads = $(nthreads())")
x = rand(10^6)
y = zeros(10^6)
println("Warmup!")
for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
t1 = @elapsed for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2 = @elapsed @threads for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
println("Serial time = $t1")
println("Parallel time = $t2")
The output recorded on OSX is:
Number of threads = 4
Warmup!
Serial time = 0.239838418
Parallel time = 0.282324893
The output recorded on Linux is:
Number of threads = 4
Warmup!
Serial time = 0.227327406
Parallel time = 0.542067206
@JeffBezanson Julia is aware of the types of `x` and `y` before the loop executes. So do you have any idea where this regression is coming from?
Are you deliberately benchmarking what happens when you don't put it in a function? Note that the times inside a function are two orders of magnitude faster:
using Base.Threads, BenchmarkTools
function test1!(y, x)
@assert length(y) == length(x)
for i = 1:length(x)
y[i] = sin(x[i])^2 + cos(x[i])^2
end
y
end
function testn!(y, x)
@assert length(y) == length(x)
@threads for i = 1:length(x)
y[i] = sin(x[i])^2 + cos(x[i])^2
end
y
end
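For completeness, a sketch of how these might be timed with BenchmarkTools (the array sizes and the `@benchmark` calls below are my addition, not part of the original comment):

```julia
using Base.Threads, BenchmarkTools

x = rand(10^6)
y = zeros(10^6)

@benchmark test1!($y, $x)   # serial version defined above
@benchmark testn!($y, $x)   # threaded version defined above
```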
I have little experience playing with the threads feature, but one thing I've noticed is that my performance fluctuates a lot from session to session: on a laptop with 2 "real" cores and 2 threads, in some sessions I can get a nearly 2x improvement, while if I quit and restart julia I might get a 4x worsening. Quit & restart again, and perhaps I'm back to the 2x improvement. Weird.
Are OSX and Linux (Ubuntu?) installed on the same machine?
@timholy Of course, it's best not to use globals when benchmarking. However, this regression should not happen regardless of whether everything is in functions or not.
using Base.Threads
function driver()
println("Number of threads = $(nthreads())")
x = rand(10^6)
y = zeros(10^6)
println("Warmup!")
warmup(x, y)
t1 = test1(x, y)
t2 = test2(x, y)
println("Serial time = $t1")
println("Parallel time = $t2")
end
function warmup(x::Vector{Float64}, y::Vector{Float64})
for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
end
function test1(x::Vector{Float64}, y::Vector{Float64})
t1 = @elapsed for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t1
end
function test2(x::Vector{Float64}, y::Vector{Float64})
t2 = @elapsed @threads for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2
end
driver()
gives
Number of threads = 4
Warmup!
Serial time = 0.039676539
Parallel time = 0.048867776
And yes, threading performance is flaky at times. AFAIK, this is because of unresolved GC issues.
@pkofod I first saw this on OSX and then verified it on a different Linux machine. The point I was trying to make was not the numbers themselves, but the fact that there seems to be no scaling.
This seems to be an openlibm issue. Using the system libm here gives good performance scaling.
Also note that the types of `x` and `y` in the loop still have issues, but the dispatch cost seems to be much cheaper than the calculation.
Script I use for benchmarking
using Base.Threads
println("Number of threads = $(nthreads())")
# sin1(x::Float64) = ccall((:sin, Base.Math.libm), Float64, (Float64,), x)
# cos1(x::Float64) = ccall((:cos, Base.Math.libm), Float64, (Float64,), x)
sin1(x::Float64) = ccall(:sin, Float64, (Float64,), x)
cos1(x::Float64) = ccall(:cos, Float64, (Float64,), x)
function test1!(y, x)
# @assert length(y) == length(x)
for i = 1:length(x)
y[i] = sin1(x[i])^2 + cos1(x[i])^2
end
y
end
function testn!(y::Vector{Float64}, x::Vector{Float64})
# @assert length(y) == length(x)
Threads.@threads for i = 1:length(x)
y[i] = sin1(x[i])^2 + cos1(x[i])^2
end
y
end
n = 10^7
x = rand(n)
y = zeros(n)
@time test1!(y, x)
@time testn!(y, x)
@time test1!(y, x)
@time testn!(y, x)
With openlibm, i.e. `Base.Math.libm`:
Number of threads = 8
1.521594 seconds (4.77 k allocations: 512.504 KB)
0.338695 seconds (11.96 k allocations: 515.312 KB)
0.377355 seconds (4 allocations: 160 bytes)
0.392234 seconds (6 allocations: 224 bytes)
With glibc libm
Number of threads = 8
0.421319 seconds (4.66 k allocations: 508.051 KB)
0.246319 seconds (11.90 k allocations: 512.966 KB)
0.562959 seconds (4 allocations: 160 bytes)
0.098284 seconds (6 allocations: 224 bytes)
More typical timing (glibc libm and openlibm)
yyc2:~/projects/julia/master
yuyichao% JULIA_NUM_THREADS=4 ./julia --color=yes a.jl
Number of threads = 4
0.515686 seconds (4.66 k allocations: 508.051 KB)
0.159722 seconds (11.90 k allocations: 512.966 KB)
0.429442 seconds (4 allocations: 160 bytes)
0.145766 seconds (6 allocations: 224 bytes)
yyc2:~/projects/julia/master
yuyichao% JULIA_NUM_THREADS=4 ./julia --color=yes a.jl
Number of threads = 4
1.530746 seconds (4.77 k allocations: 512.504 KB)
0.445153 seconds (11.96 k allocations: 515.312 KB)
0.282715 seconds (4 allocations: 160 bytes)
0.418572 seconds (6 allocations: 224 bytes)
Interestingly, I can't reproduce this in C with either OpenMP or Cilk...
glibc seems slower than openlibm serially but scales with threading...
Hmm, interesting, on another machine I got:
yuyichao% JULIA_NUM_THREADS=4 ./julia ../a.jl
Number of threads = 4
libm_name = "libopenlibm"
3.561010 seconds
1.213387 seconds (20 allocations: 640 bytes)
yuyichao% JULIA_NUM_THREADS=4 ./julia ../a.jl
Number of threads = 4
libm_name = "libm"
2.449299 seconds
0.853167 seconds (20 allocations: 640 bytes)
using the following script:
using Base.Threads
println("Number of threads = $(nthreads())")
# const libm_name = "libopenlibm"
const libm_name = "libm"
@show libm_name
sin1(x::Float64) = ccall((:sin, libm_name), Float64, (Float64,), x)
cos1(x::Float64) = ccall((:cos, libm_name), Float64, (Float64,), x)
@noinline function test1!(y, x)
# @assert length(y) == length(x)
for i = 1:length(x)
y[i] = sin1(x[i])^2 + cos1(x[i])^2
end
y
end
@noinline function testn!(y::Vector{Float64}, x::Vector{Float64})
# @assert length(y) == length(x)
Threads.@threads for i = 1:length(x)
y[i] = sin1(x[i])^2 + cos1(x[i])^2
end
y
end
function run_tests()
n = 10^7
x = rand(n)
y = zeros(n)
test1!(y, x)
testn!(y, x)
@time for i in 1:10
test1!(y, x)
end
@time for i in 1:10
testn!(y, x)
end
end
run_tests()
So both of them scale, and glibc is always faster.
For the record, I'm using LLVM 3.8.0 on both machines. And the C code I use is:
/* #include <cilk/cilk.h> */
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
double (*sin1)(double) = NULL;
double (*cos1)(double) = NULL;
uint64_t get_time(void)
{
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
}
static inline double kernel(double x)
{
double s = sin1(x);
double c = cos1(x);
return s * s + c * c;
}
__attribute__((noinline)) void test1(double *y, double *x, size_t n)
{
asm volatile("":::"memory");
for (size_t i = 0;i < n;i++) {
y[i] = kernel(x[i]);
}
asm volatile("":::"memory");
}
/* __attribute__((noinline)) void testn(double *y, double *x, size_t n) */
/* { */
/* asm volatile("":::"memory"); */
/* cilk_for (size_t i = 0;i < n;i++) { */
/* y[i] = kernel(x[i]); */
/* } */
/* asm volatile("":::"memory"); */
/* } */
__attribute__((noinline)) void testn2(double *y, double *x, size_t n)
{
asm volatile("":::"memory");
#pragma omp parallel for
for (size_t i = 0;i < n;i++) {
y[i] = kernel(x[i]);
}
asm volatile("":::"memory");
}
__attribute__((noinline)) void run_tests(double *y, double *x, size_t n)
{
uint64_t t_start = get_time();
test1(y, x, n);
uint64_t t_end = get_time();
printf("time: %.3f ms\n", (t_end - t_start) / 1e6);
/* t_start = get_time(); */
/* testn(y, x, n); */
/* t_end = get_time(); */
/* printf("time: %.3f ms\n", (t_end - t_start) / 1e6); */
t_start = get_time();
testn2(y, x, n);
t_end = get_time();
printf("time: %.3f ms\n", (t_end - t_start) / 1e6);
}
int main()
{
void *hdl = dlopen("libm.so.6", RTLD_NOW);
/* void *hdl = dlopen("libopenlibm.so", RTLD_NOW); */
sin1 = (double(*)(double))dlsym(hdl, "sin");
cos1 = (double(*)(double))dlsym(hdl, "cos");
size_t n = 10000000;
double *x = malloc(sizeof(double) * n);
/* double *x = calloc(n, sizeof(double)); */
double *y = calloc(n, sizeof(double));
for (size_t i = 0;i < n;i++) {
x[i] = 1;
}
run_tests(y, x, n);
run_tests(y, x, n);
return 0;
}
(the Cilk Plus version is commented out for clang)
I see good scaling once there is enough data to process to overcome the setup costs.
I don't think the setup is the issue? It doesn't explain the difference between openlibm and the system libm (the system libm being faster with multiple threads), and there are already 10^6 elements in the test.
@vtjnash Can you post the problem size and perf vs. thread count? Because I do not see that, with openlibm at least. With glibc, I see some nominal scaling.
Ah, OK. Profiling is showing wide variation in the number of hits in the `sin` and `cos` functions. Do you see scaling with hyperthreading disabled? From the assembly, it looks like this problem is pretty solidly AVX-bound, so I wonder if this is some sort of resource contention in the processor. But the question is: which resource?
So this issue seems to be really system dependent. Here's my observation so far but I'm not sure if others have similar observations. The scripts I used can be found here if others want to reproduce.
TL;DR, the slowness I've seen on the two systems I've tested can be fully explained by a single thread slowdown that can mysteriously disappear due to certain operations on the thread.
Here are some of the main observations that lead to the conclusion above:
The single-thread slowdown can be "fixed" by calling `yield()` after running openlibm functions (the effect only appears once `yield` is called). Since it is not really possible to call `yield` on the worker threads, it's likely that the poor threading performance I see is just because the slowness is only "fixed" on the master thread and not on the workers.
The `yield()` call can be replaced with
@noinline function yield2()
current_task().state == :runnable || error()
end
which is not as reliable but can "fix" the performance on the master thread most of the time. This version is also thread safe (it doesn't use tasks or libuv) so it can be run on worker threads too. After running the function on each thread, the performance of the openlibm threaded version improved by the expected factor.
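For illustration, one way the "fix" might be run on every thread (a sketch under the assumption that the static `@threads` schedule hands exactly one iteration to each thread; this is my guess at the mechanics, not necessarily how the measurement above was done):

```julia
using Base.Threads

@noinline function yield2()
    current_task().state == :runnable || error()
end

# with a range of length nthreads(), each thread executes exactly one iteration,
# so yield2() runs once on every thread
@threads for i in 1:nthreads()
    yield2()
end
```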
The effect also goes away when the loop is moved to C or when the `ccall` is replaced by LLVM intrinsics, so it might be related to some detail of the execution history. However, `perf stat` suggests that the instruction count, branch misprediction rate, and cache miss rate are basically the same for the fast and slow versions, so it's unclear what CPU behavior has changed. The code does not move, so it's hard to believe this is related to code alignment either.
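For reference, a sketch of what the LLVM-intrinsic variant might look like (my guess at the replacement, not the exact code that was used):

```julia
# call the LLVM sin/cos intrinsics directly instead of ccall-ing into a libm
sin2(x::Float64) = ccall("llvm.sin.f64", llvmcall, Float64, (Float64,), x)
cos2(x::Float64) = ccall("llvm.cos.f64", llvmcall, Float64, (Float64,), x)
```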
I'm actually seeing the threaded performance scale worse than serial on one of my problems. Is this the same issue?
https://gist.github.com/ChrisRackauckas/6970aa6c3fa42c987b63dc9fe21c48fd
It's unclear what this problem is right now.
Perhaps this is the same issue showing up in #17751
Hoping we can fix this for 0.5.x.
I can replicate @ChrisRackauckas's issue on
JULIA_NUM_THREADS=10
Julia Version 0.5.0-rc0+150
Commit 389dc1c (2016-08-03 04:22 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: AMD Opteron 63xx class CPU
Chris's code isn't a good minimal reproducible example (MRE). Do you also see the issue with the code yuyichao posted earlier in this thread? It would be interesting to have your AMD results for that, to confirm that the OP's issue wasn't Intel-specific.
Running the code from https://github.com/JuliaLang/julia/issues/17395#issuecomment-232343762
(Was that the code you meant?)
I do get scaling.
with:
sin1(x::Float64) = ccall(:sin, Float64, (Float64,), x)
cos1(x::Float64) = ccall(:cos, Float64, (Float64,), x)
I get:
ubuntu@tlaloc:~$ export JULIA_NUM_THREADS=8
ubuntu@tlaloc:~$ julia a_base.jl
Number of threads = 8
0.705599 seconds (4.67 k allocations: 201.551 KB)
0.160424 seconds (11.90 k allocations: 510.830 KB)
0.649657 seconds (4 allocations: 160 bytes)
0.086682 seconds (6 allocations: 224 bytes)
and with
sin1(x::Float64) = ccall((:sin, Base.Math.libm), Float64, (Float64,), x)
cos1(x::Float64) = ccall((:cos, Base.Math.libm), Float64, (Float64,), x)
I have:
ubuntu@tlaloc:~$ export JULIA_NUM_THREADS=8
ubuntu@tlaloc:~$ julia a_alt.jl
Number of threads = 8
0.353342 seconds (4.82 k allocations: 207.410 KB)
0.116370 seconds (11.97 k allocations: 513.175 KB)
0.341241 seconds (4 allocations: 160 bytes)
0.057563 seconds (6 allocations: 224 bytes)
Running the code from https://github.com/JuliaLang/julia/issues/17395#issuecomment-233726845
I get good scaling
ubuntu@tlaloc:~$ julia compare.jl
Number of threads = 8
libopenlibm
3.518335 seconds
0.617270 seconds (20 allocations: 800 bytes)
libm
6.260513 seconds
1.036404 seconds (20 allocations: 800 bytes)
libopenlibm
3.487945 seconds
0.569065 seconds (20 allocations: 800 bytes)
ubuntu@tlaloc:~$ julia compare2.jl
Number of threads = 8
libopenlibm
t1 = 0.515692918
t2 = 0.515126275
Running: https://github.com/JuliaLang/julia/issues/17395#issue-165259305
ubuntu@tlaloc:~$ julia op.jl
Number of threads = 8
Warmup!
Serial time = 0.413255659
Parallel time = 1.558858472
I get worse performance with parallel than without (as did the OP).
and running https://github.com/JuliaLang/julia/issues/17395#issuecomment-232316297
ubuntu@tlaloc:~$ julia op_noglobal.jl
Number of threads = 8
Warmup!
Serial time = 0.042985187
Parallel time = 0.033117154
So, the same performance with both (as did the OP).
So all the results on my system seem more or less the same as the posted results.
So it is not an AMD vs. Intel difference.
I tracked down the issue in my code to be due to `eachindex(u)` not being strictly typed when inside `Base.@threads`. You can fix it by defining `local uidx::Base.OneTo{Int64} = eachindex(u)` and using that (you have to include the type annotation too). So yes, it is a different issue.
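A minimal sketch of that workaround (the function and array names below are placeholders of my choosing, not from Chris's code):

```julia
using Base.Threads

# annotate the index range so the closure created by @threads sees a concrete type
function scale_threaded!(du::Vector{Float64}, u::Vector{Float64})
    local uidx::Base.OneTo{Int} = eachindex(u)
    @threads for i in uidx
        du[i] = 2.0 * u[i]
    end
    return du
end
```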
Note that one of the tricks I've been using to work around https://github.com/JuliaLang/julia/issues/15276, which works pretty well for threaded loops, is to put just the loop itself in a separate function, and to use a `Ref` if you need to update a local variable.
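A minimal sketch of that trick (the names and the sin/cos kernel are stand-ins, not code from this thread):

```julia
using Base.Threads

# the threaded loop lives in its own function (a function barrier), and a Ref
# carries the scalar that the caller wants updated by the loop
function threaded_kernel!(y, x, total::Ref{Float64})
    @threads for i in 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    total[] = sum(y)
    return y
end

function outer(n)
    x = rand(n)
    y = zeros(n)
    total = Ref(0.0)
    threaded_kernel!(y, x, total)
    return total[]
end
```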
Hi, is there any progress on this issue, or any new workarounds that are not mentioned in this thread? I was porting some C++ code to Julia, but had to abandon the project since I got 1.5x scaling with 8 threads. I can also still reproduce the results mentioned earlier in this thread (where scaling close to 1.0 happens with 8 threads).
Edit: also, maybe it's just my ignorance, but `@threads` seems to slow down the code by a factor of 2 when calling `rand()`:
julia> versioninfo()
Julia Version 0.6.0-dev.674
Commit 8cb72ee (2016-09-16 12:29 UTC)
Platform Info:
System: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> using Base.Threads
julia> function rfill!(A)
for i in 1:length(A)
A[i] = rand()
end
end
julia> function rfill_mt!(A)
@threads for i in 1:length(A)
A[i] = rand()
end
end
julia> function driver()
A = zeros(64000000);
println("num threads: $(nthreads())")
rfill!(A);
rfill_mt!(A);
@time rfill!(A)
@time rfill_mt!(A)
end
julia> driver()
num threads: 8
0.222452 seconds
0.554084 seconds (2 allocations: 48 bytes)
I think rand isn't thread-safe, so that's not meaningful. The original bug here appears to be a hardware issue that could hit you at any time (even single threaded) and on any code (even if you used asm instead of C++).
For me at least it seems that the scaling is quite poor across a few benchmarks, not only the ones mentioned above. Also, it seems quite unfortunate that code using random numbers will run slower when used with multithreading.
That's just because what's in Base is not thread safe. But RNG.jl is trying to make some thread-safe ones I believe.
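For illustration, a common workaround is to give each thread its own RNG state (a sketch of my own, not Base's or RNG.jl's API; the sequential seeding is purely illustrative and not statistically careful):

```julia
using Base.Threads
# on newer Julia versions MersenneTwister lives in the Random stdlib

# one independently seeded RNG per thread, indexed by threadid()
function rfill_mt_safe!(A, rngs = [MersenneTwister(1000 + i) for i in 1:nthreads()])
    @threads for i in 1:length(A)
        A[i] = rand(rngs[threadid()])   # each thread touches only its own state
    end
    return A
end
```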
So there appear to be multiple issues here, but at least we've made some progress on the one I observed. To summarize, the issue I've seen is that there is a significant slowdown when calling `libopenlibm` functions on each thread unless some magic code is run on the thread after calling some openlibm functions. The issue does not appear with the system `libm`.
This issue is actually caused by the slowdown from mixing AVX and SSE instructions, and it is made more confusing by a glibc bug that will be fixed in the next release. The issue was mainly solved by going through the performance counters and seeing which ones differ significantly between the good and the bad versions; many thanks to @Keno and @kpamnany for advice on the right counters and processor manual to look at.
Explanation of each piece:
The magic code is just `vzeroupper`, or something else with the same effect. It puts the upper bits in a clean zero state and lets SSE instructions run fast.
`libopenlibm` uses PLT entries to call internal functions, which call the libc callback the first time to resolve the symbol address. This callback function needs to save and restore all registers that can be used to pass arguments, so it uses AVX instructions to save and restore ymm. This means that the registers will be dirty upon returning from any PLT entry the first time it is used, and when this happens in a loop with SSE instructions, the whole loop will be executed in the dirty state in the absence of a `vzeroupper` in the loop.
Fixing this bug will probably involve fixing openlibm to use VEX encoding/AVX instructions (dispatched at load or runtime). We might also want to skip the PLT for openlibm internal references.
@ranjanan Just trying to see if there's anything left here that's not related to mixing AVX and SSE. I also can't figure out (or find) whether you get good scaling using the system libm instead.
Which system are you testing on, and how new is its system libm? Can you try adding
Base.llvmcall("call void asm \"vzeroupper\", \"\"()\nret void", Void, Tuple{})
before each loop iteration? If you see good scaling with `vzeroupper` and a new enough system libm, then we should rename this issue to solving the SSE/AVX mixing issue, in OpenLibm to start, though it might apply to other libs too.
There are two other issues in this thread https://github.com/JuliaLang/julia/issues/17395#issuecomment-238149050 and https://github.com/JuliaLang/julia/issues/17395#issuecomment-247727559. Those should probably be separate issues if they are still not solved.
@yuyichao:
Just to clarify things, I'm running the benchmark codes again right now. On running code from https://github.com/JuliaLang/julia/issues/17395#issuecomment-232316297, which is the original benchmark code, I get:
Number of threads = 4
Warmup!
Serial time = 0.020891164
Parallel time = 0.034003818
which shows no scaling.
On running code from https://github.com/JuliaLang/julia/issues/17395#issuecomment-232385075,
I get:
Number of threads = 4
libm_name = "libm"
2.056310 seconds
0.741611 seconds (20 allocations: 640 bytes)
Number of threads = 4
libm_name = "libopenlibm"
2.193386 seconds
0.930211 seconds (20 allocations: 640 bytes)
Interestingly, both of these show ~2.4x scaling. But I suppose the difference between the two benchmark codes is that there is 100x more work in the second one (arrays of size 10^7, run 10 times).
Now addressing both your points:
I wrote these tests on my Mac primarily, where I run macOS Sierra. I am on an Intel(R) Core(TM) i5-5257U CPU, which supports SSE4.1/4.2 and AVX 2.0 according to this. Also, by default, I'm using the glibc version that is shipped with Julia v0.5.0 on Mac.
Now, I added that `Base.llvmcall` to every loop iteration, so the threaded code becomes this:
function test2(x::Vector{Float64}, y::Vector{Float64})
t2 = @elapsed @threads for i = 1:10^6
Base.llvmcall("call void asm \"vzeroupper\", \"\"()\nret void", Void, Tuple{})
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2
end
Unfortunately, this does not make any difference to the benchmark results of https://github.com/JuliaLang/julia/issues/17395#issuecomment-232316297.
I hope this summary helps.
> On running code from #17395 (comment), which is the original benchmark code, I get:
Can you run it twice? I get:
julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.012560089
Parallel time = 0.019910309
julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.013349351
Parallel time = 0.004878786
There is additional warm-up necessary for the threaded case (inference / compilation of the callback, I imagine).
> I wrote these tests on my Mac primarily, where I run macOS Sierra. I am on an Intel(R) Core(TM) i5-5257U CPU, which supports SSE4.1/4.2 and AVX 2.0 according to this. Also, by default, I'm using the glibc version that is shipped with Julia v0.5.0 on Mac.
We don't ship a libc (and the one on macOS isn't glibc); that glibc question was about the Linux machine. According to Jameson, the macOS libc PLT callback uses XSAVE and shouldn't have this problem (though other things can still put it in a dirty state). Given that you don't see a difference adding `vzeroupper`, you are probably not hitting the same issue as me.
Ah, you are right:
```julia
julia> include("thread2.jl")
Number of threads = 4
Warmup!
Serial time = 0.024926309
Parallel time = 0.038902679

julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.023512779
Parallel time = 0.010175957
```
> There is additional warm-up necessary for the threaded case (inference / compilation of the callback, I imagine).
I modified my warmup code to include a threaded loop now.
```julia
function warmup(x::Vector{Float64}, y::Vector{Float64})
    for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    @threads for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
end
```
But this doesn't seem to help:
julia> include("thread2.jl") # `driver()` called inside
Number of threads = 4
Warmup!
Serial time = 0.034189366
Parallel time = 0.029591109
Of course, if I call `driver()` a second time, it shows the expected speedup. Do you know why this is happening?
You need to run the exact same loop. The warm up time is the compilation of the threading callback.
In this particular case, it is the exact same loop. Compare the threaded loop in the warmup code in the comment above with the following:
function test2(x::Vector{Float64}, y::Vector{Float64})
t2 = @elapsed @threads for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2
end
I guess it also has to do with the compilation time of `test2`, which contains the threaded loop. That is why if you run `test2` twice, or call it with smaller arrays first for warmup, you can get the speedup you want on the second run.
In which case, I suppose my warmup function isn't really doing anything, because calling `test1` and `test2` the first time has compilation overhead too, maybe even more so in the case of `test2` because of the compilation of the threading callback.
> In this particular case, it is the exact same loop
It needs to be the same code on exactly the same line. What matters is the closure identity.
The warmup function shouldn't be needed anyway, as long as the code is type stable. You only need to actually call something for warmup in a function if that part is type unstable. In this case, the call to the closure isn't visible to the compiler, so that's what needs to be warmed up.
x-ref: Discourse thread where users report bad scaling with `PROFILE_JL_THREADING = 1` (the default) due to their systems falling back from TSC to HPET:
https://discourse.julialang.org/t/thread-overhead-variability-across-machines/7320/5
@yuyichao
My code is of the form
function F(g::AbstractGraph{T}) where T <: Integer
n::Int64 = nthreads()
cur::Vector{T} = zeros(T, n)
next::Vector{T} = zeros(T, n)
@threads for i in S
DO SOMETHING with next and cur
end
next , cur = cur::Vector{T}, next::Vector{T}
end
Replacing the last line with:
for t in 1:n
next[t] , cur[t] = cur[t], next[t]
end
Fixes the type instability.
Is there any way to swap the array pointers instead of the contents (Code 1 instead of Code 2) while maintaining type stability?
@yuyichao
My code is of the form
function F(g::AbstractGraph{T}) where T <: Integer
level::T = one(T)
@threads for i in S
READ ONLY ACCESS TO level
end
level += one(T)
end
Removing the last line `level += one(T)` fixes the type instability.
Using `level = Ref(one(T))` also fixes this problem.
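A runnable sketch of the `Ref` variant (the graph type and loop body from the snippet above are replaced by stand-ins):

```julia
using Base.Threads

function count_levels(n::Integer)
    level = Ref(1)
    seen = zeros(Int, n)
    @threads for i in 1:n
        seen[i] = level[]      # read-only access inside the threaded loop
    end
    level[] += 1               # updating through the Ref keeps the captured variable concrete
    return seen
end
```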
@SohamTamba these cases look more like #15276 and #24688 and can be worked around by using a let block.
function F(g::AbstractGraph{T}) where T <: Integer
level = one(T)
let level=level
@threads for i in S
READ ONLY ACCESS TO level
end
end
level += one(T)
end
Is there anything left for us to do here?
Sorry, forgot to reply to this.
That worked @vchuravy
Thanks