Consider the following code:
using Base.Threads
println("Number of threads = $(nthreads())")
x = rand(10^6)
y = zeros(10^6)
println("Warmup!")
for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
t1 = @elapsed for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2 = @elapsed @threads for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
println("Serial time = $t1")
println("Parallel time = $t2")
The output recorded on OSX is:
Number of threads = 4
Warmup!
Serial time = 0.239838418
Parallel time = 0.282324893
The output recorded on Linux is:
Number of threads = 4
Warmup!
Serial time = 0.227327406
Parallel time = 0.542067206
@JeffBezanson Julia is aware of the types of `x` and `y` before the loop executes. So do you have any idea where this regression is coming from?
Are you deliberately benchmarking what happens when you don't put it in a function? Note that the times inside a function are two orders of magnitude faster:
using Base.Threads, BenchmarkTools
function test1!(y, x)
@assert length(y) == length(x)
for i = 1:length(x)
y[i] = sin(x[i])^2 + cos(x[i])^2
end
y
end
function testn!(y, x)
@assert length(y) == length(x)
@threads for i = 1:length(x)
y[i] = sin(x[i])^2 + cos(x[i])^2
end
y
end
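For completeness, a sketch of how these might be timed with BenchmarkTools (the array sizes and the `@benchmark` calls below are my addition, not part of the original comment):

```julia
using Base.Threads, BenchmarkTools

x = rand(10^6)
y = zeros(10^6)

@benchmark test1!($y, $x)   # serial version defined above
@benchmark testn!($y, $x)   # threaded version defined above
```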
I have little experience playing with the threads feature, but one thing I've noticed is that my performance fluctuates a lot from session to session: on a laptop with 2 "real" cores and 2 threads, in some sessions I can get a nearly 2x improvement, while if I quit and restart julia I might get a 4x worsening. Quit & restart again, and perhaps I'm back to the 2x improvement. Weird.
Are OSX and Linux (Ubuntu?) installed on the same machine?
@timholy Of course, it's best not to use globals when benchmarking. However, this regression should not happen regardless of whether everything is in functions or not.
using Base.Threads
function driver()
println("Number of threads = $(nthreads())")
x = rand(10^6)
y = zeros(10^6)
println("Warmup!")
warmup(x, y)
t1 = test1(x, y)
t2 = test2(x, y)
println("Serial time = $t1")
println("Parallel time = $t2")
end
function warmup(x::Vector{Float64}, y::Vector{Float64})
for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
end
function test1(x::Vector{Float64}, y::Vector{Float64})
t1 = @elapsed for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t1
end
function test2(x::Vector{Float64}, y::Vector{Float64})
t2 = @elapsed @threads for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2
end
driver()
gives
Number of threads = 4
Warmup!
Serial time = 0.039676539
Parallel time = 0.048867776
And yes, threading performance is flaky at times. AFAIK, this is because of unresolved GC issues.
@pkofod I first saw this on OSX and then verified it on a different Linux machine. The point I was trying to make was not the numbers themselves, but the fact that there seems to be no scaling.
This seems to be an openlibm issue. Using the system libm here gives good performance scaling.
Also note that the types of `x` and `y` in the loop still have issues, but the dispatch cost seems to be much cheaper than the calculation.
Script I use for benchmarking
using Base.Threads
println("Number of threads = $(nthreads())")
# sin1(x::Float64) = ccall((:sin, Base.Math.libm), Float64, (Float64,), x)
# cos1(x::Float64) = ccall((:cos, Base.Math.libm), Float64, (Float64,), x)
sin1(x::Float64) = ccall(:sin, Float64, (Float64,), x)
cos1(x::Float64) = ccall(:cos, Float64, (Float64,), x)
function test1!(y, x)
# @assert length(y) == length(x)
for i = 1:length(x)
y[i] = sin1(x[i])^2 + cos1(x[i])^2
end
y
end
function testn!(y::Vector{Float64}, x::Vector{Float64})
# @assert length(y) == length(x)
Threads.@threads for i = 1:length(x)
y[i] = sin1(x[i])^2 + cos1(x[i])^2
end
y
end
n = 10^7
x = rand(n)
y = zeros(n)
@time test1!(y, x)
@time testn!(y, x)
@time test1!(y, x)
@time testn!(y, x)
With openlibm, i.e. `Base.Math.libm`:
Number of threads = 8
1.521594 seconds (4.77 k allocations: 512.504 KB)
0.338695 seconds (11.96 k allocations: 515.312 KB)
0.377355 seconds (4 allocations: 160 bytes)
0.392234 seconds (6 allocations: 224 bytes)
With glibc libm
Number of threads = 8
0.421319 seconds (4.66 k allocations: 508.051 KB)
0.246319 seconds (11.90 k allocations: 512.966 KB)
0.562959 seconds (4 allocations: 160 bytes)
0.098284 seconds (6 allocations: 224 bytes)
More typical timing (glibc libm and openlibm)
yyc2:~/projects/julia/master
yuyichao% JULIA_NUM_THREADS=4 ./julia --color=yes a.jl
Number of threads = 4
0.515686 seconds (4.66 k allocations: 508.051 KB)
0.159722 seconds (11.90 k allocations: 512.966 KB)
0.429442 seconds (4 allocations: 160 bytes)
0.145766 seconds (6 allocations: 224 bytes)
yyc2:~/projects/julia/master
yuyichao% JULIA_NUM_THREADS=4 ./julia --color=yes a.jl
Number of threads = 4
1.530746 seconds (4.77 k allocations: 512.504 KB)
0.445153 seconds (11.96 k allocations: 515.312 KB)
0.282715 seconds (4 allocations: 160 bytes)
0.418572 seconds (6 allocations: 224 bytes)
Interestingly, I can't reproduce this in C with either OpenMP or Cilk...
glibc seems slower than openlibm serially but scales with threading...
Hmm, interesting, on another machine I got:
yuyichao% JULIA_NUM_THREADS=4 ./julia ../a.jl
Number of threads = 4
libm_name = "libopenlibm"
3.561010 seconds
1.213387 seconds (20 allocations: 640 bytes)
yuyichao% JULIA_NUM_THREADS=4 ./julia ../a.jl
Number of threads = 4
libm_name = "libm"
2.449299 seconds
0.853167 seconds (20 allocations: 640 bytes)
using the following script:
using Base.Threads
println("Number of threads = $(nthreads())")
# const libm_name = "libopenlibm"
const libm_name = "libm"
@show libm_name
sin1(x::Float64) = ccall((:sin, libm_name), Float64, (Float64,), x)
cos1(x::Float64) = ccall((:cos, libm_name), Float64, (Float64,), x)
@noinline function test1!(y, x)
# @assert length(y) == length(x)
for i = 1:length(x)
y[i] = sin1(x[i])^2 + cos1(x[i])^2
end
y
end
@noinline function testn!(y::Vector{Float64}, x::Vector{Float64})
# @assert length(y) == length(x)
Threads.@threads for i = 1:length(x)
y[i] = sin1(x[i])^2 + cos1(x[i])^2
end
y
end
function run_tests()
n = 10^7
x = rand(n)
y = zeros(n)
test1!(y, x)
testn!(y, x)
@time for i in 1:10
test1!(y, x)
end
@time for i in 1:10
testn!(y, x)
end
end
run_tests()
So both of them scale, and glibc is always faster.
For the record, I'm using LLVM 3.8.0 on both machines. And the C code I use is:
/* #include <cilk/cilk.h> */
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
double (*sin1)(double) = NULL;
double (*cos1)(double) = NULL;
uint64_t get_time(void)
{
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
}
static inline double kernel(double x)
{
double s = sin1(x);
double c = cos1(x);
return s * s + c * c;
}
__attribute__((noinline)) void test1(double *y, double *x, size_t n)
{
asm volatile("":::"memory");
for (size_t i = 0;i < n;i++) {
y[i] = kernel(x[i]);
}
asm volatile("":::"memory");
}
/* __attribute__((noinline)) void testn(double *y, double *x, size_t n) */
/* { */
/* asm volatile("":::"memory"); */
/* cilk_for (size_t i = 0;i < n;i++) { */
/* y[i] = kernel(x[i]); */
/* } */
/* asm volatile("":::"memory"); */
/* } */
__attribute__((noinline)) void testn2(double *y, double *x, size_t n)
{
asm volatile("":::"memory");
#pragma omp parallel for
for (size_t i = 0;i < n;i++) {
y[i] = kernel(x[i]);
}
asm volatile("":::"memory");
}
__attribute__((noinline)) void run_tests(double *y, double *x, size_t n)
{
uint64_t t_start = get_time();
test1(y, x, n);
uint64_t t_end = get_time();
printf("time: %.3f ms\n", (t_end - t_start) / 1e6);
/* t_start = get_time(); */
/* testn(y, x, n); */
/* t_end = get_time(); */
/* printf("time: %.3f ms\n", (t_end - t_start) / 1e6); */
t_start = get_time();
testn2(y, x, n);
t_end = get_time();
printf("time: %.3f ms\n", (t_end - t_start) / 1e6);
}
int main()
{
void *hdl = dlopen("libm.so.6", RTLD_NOW);
/* void *hdl = dlopen("libopenlibm.so", RTLD_NOW); */
sin1 = (double(*)(double))dlsym(hdl, "sin");
cos1 = (double(*)(double))dlsym(hdl, "cos");
size_t n = 10000000;
double *x = malloc(sizeof(double) * n);
/* double *x = calloc(n, sizeof(double)); */
double *y = calloc(n, sizeof(double));
for (size_t i = 0;i < n;i++) {
x[i] = 1;
}
run_tests(y, x, n);
run_tests(y, x, n);
return 0;
}
(the Cilk Plus version is commented out for clang)
I see good scaling once there is enough data to process to overcome the setup costs.
I don't think the setup is the issue? It doesn't explain the difference between openlibm and the system libm (the system libm being faster with multiple threads), and there are already 10^6 elements in the test.
@vtjnash Can you post the problem size and perf vs. thread count? Because I do not see that, with openlibm at least. With glibc, I see some nominal scaling.
Ah, OK. Profiling is showing wide variation in the number of hits in the `sin` and `cos` functions. Do you see scaling with hyperthreading disabled? From the assembly, it looks like this problem is pretty solidly AVX-bound, so I wonder if this is some sort of resource contention in the processor. But the question is: which resource?
So this issue seems to be really system dependent. Here's my observation so far but I'm not sure if others have similar observations. The scripts I used can be found here if others want to reproduce.
TL;DR, the slowness I've seen on the two systems I've tested can be fully explained by a single thread slowdown that can mysteriously disappear due to certain operations on the thread.
Here are some of the main observations that lead to the conclusion above:
The single-thread slowdown can be "fixed" by calling `yield()` after running openlibm functions (the effect only appears once `yield` is called). Since it is not really possible to call `yield` on the worker threads, it's likely that the poor threading performance I see is just because the slowness is only "fixed" on the master thread and not on the workers.
The `yield()` call can be replaced with
@noinline function yield2()
current_task().state == :runnable || error()
end
which is not as reliable but can "fix" the performance on the master thread most of the time. This version is also thread safe (it doesn't use tasks or libuv) so it can be run on worker threads too. After running the function on each thread, the performance of the openlibm threaded version improved by the expected factor.
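For illustration, one way the "fix" might be run on every thread (a sketch under the assumption that the static `@threads` schedule hands exactly one iteration to each thread; this is my guess at the mechanics, not necessarily how the measurement above was done):

```julia
using Base.Threads

@noinline function yield2()
    current_task().state == :runnable || error()
end

# with a range of length nthreads(), each thread executes exactly one iteration,
# so yield2() runs once on every thread
@threads for i in 1:nthreads()
    yield2()
end
```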
The effect also goes away when the loop is moved to C or when the `ccall` is replaced by LLVM intrinsics, so it might be related to some detail of the execution history. However, `perf stat` suggests that the instruction count, branch misprediction rate, and cache miss rate are basically the same for the fast and slow versions, so it's unclear what CPU behavior has changed. The code does not move, so it's hard to believe this is related to code alignment either.
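For reference, a sketch of what the LLVM-intrinsic variant might look like (my guess at the replacement, not the exact code that was used):

```julia
# call the LLVM sin/cos intrinsics directly instead of ccall-ing into a libm
sin2(x::Float64) = ccall("llvm.sin.f64", llvmcall, Float64, (Float64,), x)
cos2(x::Float64) = ccall("llvm.cos.f64", llvmcall, Float64, (Float64,), x)
```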
I'm actually seeing the threaded performance scale worse than serial on one of my problems. Is this the same issue?
https://gist.github.com/ChrisRackauckas/6970aa6c3fa42c987b63dc9fe21c48fd
It's unclear what this problem is right now.
Perhaps this is the same issue showing up in #17751
Hoping we can fix this for 0.5.x.
I can replicate @ChrisRackauckas's issue on
JULIA_NUM_THREADS=10
Julia Version 0.5.0-rc0+150
Commit 389dc1c (2016-08-03 04:22 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: AMD Opteron 63xx class CPU
Chris's code isn't a good minimal reproducible example (MRE). Do you also see the issue with the code yuyichao posted earlier in this thread? It would be interesting to have your AMD results for that, to confirm that the OP's issue wasn't Intel-specific.
Running the code from https://github.com/JuliaLang/julia/issues/17395#issuecomment-232343762
(Was that the code you meant?)
I do get scaling.
with:
sin1(x::Float64) = ccall(:sin, Float64, (Float64,), x)
cos1(x::Float64) = ccall(:cos, Float64, (Float64,), x)
I get:
ubuntu@tlaloc:~$ export JULIA_NUM_THREADS=8
ubuntu@tlaloc:~$ julia a_base.jl
Number of threads = 8
0.705599 seconds (4.67 k allocations: 201.551 KB)
0.160424 seconds (11.90 k allocations: 510.830 KB)
0.649657 seconds (4 allocations: 160 bytes)
0.086682 seconds (6 allocations: 224 bytes)
and with
sin1(x::Float64) = ccall((:sin, Base.Math.libm), Float64, (Float64,), x)
cos1(x::Float64) = ccall((:cos, Base.Math.libm), Float64, (Float64,), x)
I have:
ubuntu@tlaloc:~$ export JULIA_NUM_THREADS=8
ubuntu@tlaloc:~$ julia a_alt.jl
Number of threads = 8
0.353342 seconds (4.82 k allocations: 207.410 KB)
0.116370 seconds (11.97 k allocations: 513.175 KB)
0.341241 seconds (4 allocations: 160 bytes)
0.057563 seconds (6 allocations: 224 bytes)
Running the code from https://github.com/JuliaLang/julia/issues/17395#issuecomment-233726845
I get good scaling
ubuntu@tlaloc:~$ julia compare.jl
Number of threads = 8
libopenlibm
3.518335 seconds
0.617270 seconds (20 allocations: 800 bytes)
libm
6.260513 seconds
1.036404 seconds (20 allocations: 800 bytes)
libopenlibm
3.487945 seconds
0.569065 seconds (20 allocations: 800 bytes)
ubuntu@tlaloc:~$ julia compare2.jl
Number of threads = 8
libopenlibm
t1 = 0.515692918
t2 = 0.515126275
Running: https://github.com/JuliaLang/julia/issues/17395#issue-165259305
ubuntu@tlaloc:~$ julia op.jl
Number of threads = 8
Warmup!
Serial time = 0.413255659
Parallel time = 1.558858472
I get worse performance with parallel than without (as did the OP).
and running https://github.com/JuliaLang/julia/issues/17395#issuecomment-232316297
ubuntu@tlaloc:~$ julia op_noglobal.jl
Number of threads = 8
Warmup!
Serial time = 0.042985187
Parallel time = 0.033117154
So, the same performance with both (as did the OP).
So all the results on my system seem more or less the same as the posted results.
So it is not an AMD vs. Intel difference.
I tracked down the issue in my code to be due to `eachindex(u)` not being strictly typed when inside `Base.@threads`. You can fix it by defining `local uidx::Base.OneTo{Int64} = eachindex(u)` and using that (you have to include the type annotation too). So yes, it is a different issue.
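A minimal sketch of that workaround (the function and array names below are placeholders of my choosing, not from Chris's code):

```julia
using Base.Threads

# annotate the index range so the closure created by @threads sees a concrete type
function scale_threaded!(du::Vector{Float64}, u::Vector{Float64})
    local uidx::Base.OneTo{Int} = eachindex(u)
    @threads for i in uidx
        du[i] = 2.0 * u[i]
    end
    return du
end
```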
Note that one of the tricks I've been using to work around https://github.com/JuliaLang/julia/issues/15276, which works pretty well for threaded loops, is to put just the loop itself in a separate function, and to use a `Ref` if you need to update a local variable.
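A minimal sketch of that trick (the names and the sin/cos kernel are stand-ins, not code from this thread):

```julia
using Base.Threads

# the threaded loop lives in its own function (a function barrier), and a Ref
# carries the scalar that the caller wants updated by the loop
function threaded_kernel!(y, x, total::Ref{Float64})
    @threads for i in 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    total[] = sum(y)
    return y
end

function outer(n)
    x = rand(n)
    y = zeros(n)
    total = Ref(0.0)
    threaded_kernel!(y, x, total)
    return total[]
end
```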
Hi, is there any progress on this issue, or any new workarounds that are not mentioned in this thread? I was porting some C++ code to Julia, but had to abandon the project since I got 1.5x scaling with 8 threads. I can also still reproduce the results mentioned earlier in this thread (where scaling close to 1.0 happens with 8 threads).
Edit: also, maybe it's just my ignorance, but `@threads` seems to slow down the code by a factor of 2 when calling `rand()`:
julia> versioninfo()
Julia Version 0.6.0-dev.674
Commit 8cb72ee (2016-09-16 12:29 UTC)
Platform Info:
System: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> using Base.Threads
julia> function rfill!(A)
for i in 1:length(A)
A[i] = rand()
end
end
julia> function rfill_mt!(A)
@threads for i in 1:length(A)
A[i] = rand()
end
end
julia> function driver()
A = zeros(64000000);
println("num threads: $(nthreads())")
rfill!(A);
rfill_mt!(A);
@time rfill!(A)
@time rfill_mt!(A)
end
julia> driver()
num threads: 8
0.222452 seconds
0.554084 seconds (2 allocations: 48 bytes)
I think rand isn't thread-safe, so that's not meaningful. The original bug here appears to be a hardware issue that could hit you at any time (even single threaded) and on any code (even if you used asm instead of C++).
For me at least it seems that the scaling is quite poor across a few benchmarks, not only the ones mentioned above. Also, it seems quite unfortunate that code using random numbers will run slower when used with multithreading.
That's just because what's in Base is not thread safe. But RNG.jl is trying to make some thread-safe ones I believe.
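For illustration, a common workaround is to give each thread its own RNG state (a sketch of my own, not Base's or RNG.jl's API; the sequential seeding is purely illustrative and not statistically careful):

```julia
using Base.Threads
# on newer Julia versions MersenneTwister lives in the Random stdlib

# one independently seeded RNG per thread, indexed by threadid()
function rfill_mt_safe!(A, rngs = [MersenneTwister(1000 + i) for i in 1:nthreads()])
    @threads for i in 1:length(A)
        A[i] = rand(rngs[threadid()])   # each thread touches only its own state
    end
    return A
end
```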
So there appear to be multiple issues here, but at least we've made some progress on the one I observed. To summarize, the issue I've seen is that there is a significant slowdown when calling `libopenlibm` functions on each thread unless some magic code is run on the thread after calling some openlibm functions. The issue does not appear with the system `libm`.
This issue is actually caused by the slowdown from mixing AVX and SSE instructions, and it is made more confusing by a glibc bug that will be fixed in the next release. The issue was mainly solved by going through the performance counters and seeing which ones differ significantly between the good and the bad versions; many thanks to @Keno and @kpamnany for advice on the right counters and processor manual to look at.
Explanation of each piece:
The magic code is just `vzeroupper`, or something else with the same effect. It puts the upper bits in a clean zero state and lets SSE instructions run fast.
`libopenlibm` uses PLT entries to call internal functions, which call the libc callback the first time to resolve the symbol address. This callback function needs to save and restore all registers that can be used to pass arguments, so it uses AVX instructions to save and restore ymm. This means that the registers will be dirty upon returning from any PLT entry the first time it is used, and when this happens in a loop with SSE instructions, the whole loop will be executed in the dirty state in the absence of a `vzeroupper` in the loop.
Fixing this bug will probably involve fixing openlibm to use VEX encoding/AVX instructions (dispatched at load or runtime). We might also want to skip the PLT for openlibm internal references.
@ranjanan Just trying to see if there's anything left here that's not related to mixing AVX and SSE. I also can't figure out (or find) whether you get good scaling using the system libm instead.
Which system are you testing on, and how new is its system libm? Can you try adding
Base.llvmcall("call void asm \"vzeroupper\", \"\"()\nret void", Void, Tuple{})
before each loop iteration? If you see good scaling with `vzeroupper` and a new enough system libm, then we should rename this issue to solving the SSE/AVX mixing issue, in OpenLibm to start, though it might apply to other libs too.
There are two other issues in this thread https://github.com/JuliaLang/julia/issues/17395#issuecomment-238149050 and https://github.com/JuliaLang/julia/issues/17395#issuecomment-247727559. Those should probably be separate issues if they are still not solved.
@yuyichao:
Just to clarify things, I'm running the benchmark codes again right now. On running code from https://github.com/JuliaLang/julia/issues/17395#issuecomment-232316297, which is the original benchmark code, I get:
Number of threads = 4
Warmup!
Serial time = 0.020891164
Parallel time = 0.034003818
which shows no scaling.
On running code from https://github.com/JuliaLang/julia/issues/17395#issuecomment-232385075,
I get:
Number of threads = 4
libm_name = "libm"
2.056310 seconds
0.741611 seconds (20 allocations: 640 bytes)
Number of threads = 4
libm_name = "libopenlibm"
2.193386 seconds
0.930211 seconds (20 allocations: 640 bytes)
Interestingly, both of these show ~2.4x scaling. But I suppose the difference between the two benchmark codes is that there is 100x more work in the second one (arrays of size 10^7, run 10 times).
Now addressing both your points:
I wrote these tests on my Mac primarily, where I run macOS Sierra. I am on an Intel(R) Core(TM) i5-5257U CPU, which supports SSE4.1/4.2 and AVX 2.0 according to this. Also, by default, I'm using the glibc version that is shipped with Julia v0.5.0 on Mac.
Now, I added that `Base.llvmcall` to every loop iteration, so the threaded code becomes this:
function test2(x::Vector{Float64}, y::Vector{Float64})
t2 = @elapsed @threads for i = 1:10^6
Base.llvmcall("call void asm \"vzeroupper\", \"\"()\nret void", Void, Tuple{})
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2
end
Unfortunately, this does not make any difference to the benchmark results of https://github.com/JuliaLang/julia/issues/17395#issuecomment-232316297.
I hope this summary helps.
> On running code from #17395 (comment), which is the original benchmark code, I get:
Can you run it twice? I get:
julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.012560089
Parallel time = 0.019910309
julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.013349351
Parallel time = 0.004878786
There is additional warm-up necessary for the threaded case (inference / compilation of the callback, I imagine).
> I wrote these tests on my Mac primarily, where I run macOS Sierra. I am on an Intel(R) Core(TM) i5-5257U CPU, which supports SSE4.1/4.2 and AVX 2.0 according to this. Also, by default, I'm using the glibc version that is shipped with Julia v0.5.0 on Mac.
We don't ship a libc (and the one on macOS isn't glibc); that glibc question was about the Linux machine. According to Jameson, the macOS libc PLT callback uses XSAVE and shouldn't have this problem (though other things can still put it in a dirty state). Given that you don't see a difference adding `vzeroupper`, you are probably not hitting the same issue as me.
Ah, you are right:
```julia
julia> include("thread2.jl")
Number of threads = 4
Warmup!
Serial time = 0.024926309
Parallel time = 0.038902679

julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.023512779
Parallel time = 0.010175957
```
> There is additional warm-up necessary for the threaded case (inference / compilation of the callback, I imagine).
I modified my warmup code to include a threaded loop now.
```julia
function warmup(x::Vector{Float64}, y::Vector{Float64})
    for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    @threads for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
end
```
But this doesn't seem to help:
julia> include("thread2.jl") # `driver()` called inside
Number of threads = 4
Warmup!
Serial time = 0.034189366
Parallel time = 0.029591109
Of course, if I call `driver()` a second time, it shows the expected speedup. Do you know why this is happening?
You need to run the exact same loop. The warm up time is the compilation of the threading callback.
In this particular case, it is the exact same loop. Compare the threaded loop in the warmup code in the comment above with the following:
function test2(x::Vector{Float64}, y::Vector{Float64})
t2 = @elapsed @threads for i = 1:10^6
y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2
end
I guess it also has to do with the compilation time of `test2`, which contains the threaded loop. That is why if you run `test2` twice, or call it with smaller arrays first for warmup, you can get the speedup you want on the second run.
In which case, I suppose my warmup function isn't really doing anything, because calling `test1` and `test2` the first time has compilation overhead too, maybe even more so in the case of `test2` because of the compilation of the threading callback.
> In this particular case, it is the exact same loop
It needs to be the same code on exactly the same line. What matters is the closure identity.
The warmup function shouldn't be needed anyway, as long as the code is type stable. You only need to actually call something for warmup in a function if that part is type unstable. In this case, the call to the closure isn't visible to the compiler, so that's what needs to be warmed up.
x-ref: Discourse thread where users report bad scaling with `PROFILE_JL_THREADING = 1` (the default) due to their systems falling back from TSC to HPET:
https://discourse.julialang.org/t/thread-overhead-variability-across-machines/7320/5
@yuyichao
My code is of the form
function F(g::AbstractGraph{T}) where T <: Integer
n::Int64 = nthreads()
cur::Vector{T} = zeros(T, n)
next::Vector{T} = zeros(T, n)
@threads for i in S
DO SOMETHING with next and cur
end
next , cur = cur::Vector{T}, next::Vector{T}
end
Replacing the last line with:
for t in 1:n
next[t] , cur[t] = cur[t], next[t]
end
Fixes the type instability.
Is there any way to swap the array pointers instead of the contents (Code 1 instead of Code 2) while maintaining type stability?
@yuyichao
My code is of the form
function F(g::AbstractGraph{T}) where T <: Integer
level::T = one(T)
@threads for i in S
READ ONLY ACCESS TO level
end
level += one(T)
end
Removing the last line `level += one(T)` fixes the type instability.
Using `level = Ref(one(T))` also fixes this problem.
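A runnable sketch of the `Ref` variant (the graph type and loop body from the snippet above are replaced by stand-ins):

```julia
using Base.Threads

function count_levels(n::Integer)
    level = Ref(1)
    seen = zeros(Int, n)
    @threads for i in 1:n
        seen[i] = level[]      # read-only access inside the threaded loop
    end
    level[] += 1               # updating through the Ref keeps the captured variable concrete
    return seen
end
```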
@SohamTamba these cases look more like #15276 and #24688 and can be worked around by using a let block.
function F(g::AbstractGraph{T}) where T <: Integer
level = one(T)
let level=level
@threads for i in S
READ ONLY ACCESS TO level
end
end
level += one(T)
end
Is there anything left for us to do here?
Sorry, forgot to reply to this.
That worked @vchuravy
Thanks