On AVX-512 systems, calling `extrema(x)` is slower than calling `maximum` and then `minimum`. It looks like we need to port the techniques from https://github.com/JuliaLang/julia/pull/30320 over to `extrema` to get it to vectorize in a similar manner.
@mbauman Can you provide the test case you used? I ran the following code:
```julia
arrs = []
for i in 1:100000
    insert!(arrs, 1, rand(1000))
end

println("running extrema")
@time map(arrs) do arr
    extrema(arr)
end

println("running minimum, maximum")
@time map(arrs) do arr
    (minimum(arr), maximum(arr))
end
```
and got this output:
```
running extrema
  0.520760 seconds (301.98 k allocations: 14.811 MiB)
running minimum, maximum
  0.741328 seconds (279.30 k allocations: 13.814 MiB)
```
Julia version: 1.1.0
Yes, that test case reproduces the behavior on master for me. It won't on 1.1.0, as the PR that improved `maximum` and `minimum` landed after 1.1.0. Note that you'll also need a relatively recent CPU to see the difference.
I got a slower `extrema` on my Core i5 machine:

```
running extrema
  314.316 ms (100003 allocations: 4.58 MiB)
running minimum, maximum
  224.977 ms (100003 allocations: 4.58 MiB)
```
I would like to try working on this, and it would be great to hear what kind of patch is expected. I was thinking that maybe `mapreduce_impl` could be turned into a generated function that can handle multiple functions simultaneously. Does that sound good?
One question I have is about the unrolling. Ideally this should be adapted to each architecture and data type. How much of this should be left to LLVM, and how much handled in Julia? Should we prioritize generating code that LLVM can vectorize? The alternative I imagine would be controlling parameters like the unrolling factor depending on the architecture.

Thinking a bit more about this, the best of both worlds would be if this function could simply be implemented as a general fold-left and still produce good vectorized machine code by relying on LLVM's automatic vectorization. Should we concentrate our efforts on making a generic fold work well in more cases, or should we first make sure that a few specific functions have high-performance implementations?
I was thinking of the much simpler option of copy-pasting the techniques used in `maximum`/`minimum` (here: https://github.com/JuliaLang/julia/blob/2b49d26b4bf06d607c746923d83fee91ac6bda7a/base/reduce.jl#L481-L518) into the body of the `extrema` implementation (here: https://github.com/JuliaLang/julia/blob/2b49d26b4bf06d607c746923d83fee91ac6bda7a/base/operators.jl#L471-L483).
Revamping mapreduce to support multiple functions is a much bigger task (and I'm not sure we want to add that complication to the already complex mapreduce machinery, either), and while improving LLVM's automatic vectorization is always a worthy goal, it's particularly challenging in cases like these.
I am still trying to understand why the specialization of `mapreduce_impl` was needed in the first place, and what the parallels between `minimum` and `extrema` are. I am reading the history of the file and I cannot make complete sense of it.
The generic `mapreduce_impl` contains a `@simd for` loop and, from what I can tell, produces vectorized code for `op=max`. The only reason I see for the specialized function is the short-circuiting. Are we interested in having this for `extrema` too? Also, these specialized functions use a `while` loop instead of `@simd for`. Is this on purpose?
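For context, here is a minimal sketch of the short-circuiting technique as I understand it from #30320 (hedged; Base's actual `mapreduce_impl` differs in details). The point of the `while` over chunks is that the NaN check runs between chunks, so the inner loop stays branch-free:

```julia
nanmax(x, y) = ifelse(isnan(x) || x > y, x, y)  # propagates NaN, like max

function chunked_maximum(A::Vector{Float64})
    v1 = v2 = v3 = v4 = @inbounds A[1]   # four independent accumulators
    chunk = 256
    i = 2
    while i + chunk - 1 <= length(A)
        @inbounds for j in i:4:(i + chunk - 4)
            v1 = nanmax(v1, A[j])
            v2 = nanmax(v2, A[j + 1])
            v3 = nanmax(v3, A[j + 2])
            v4 = nanmax(v4, A[j + 3])
        end
        i += chunk
        # x == x is false only for NaN; once an accumulator is NaN the
        # result can no longer change, so stop scanning early.
        (v1 == v1) & (v2 == v2) & (v3 == v3) & (v4 == v4) ||
            return nanmax(nanmax(v1, v2), nanmax(v3, v4))
    end
    v = nanmax(nanmax(v1, v2), nanmax(v3, v4))
    @inbounds for j in i:length(A)       # scalar tail
        v = nanmax(v, A[j])
    end
    return v
end
```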
This idea I had at first of supporting multiple functions doesn't really make sense, sorry. I am now thinking of something much simpler, which would be to implement this function as a fold like:

```julia
myextrema(arr) = Base.mapfoldl(identity, ((mi,ma),v)->(min(mi,v),max(ma,v)), arr[2:end], init=(arr[1],arr[1]))
```

We can of course specialize it if that makes sense. What I see now is that this doesn't seem to vectorize, and I wonder if just refactoring `mapfoldl_impl` would do it. That is what seems worth investigating to me, since it might also improve other functions based on it.
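As an aside, a variant of that fold that avoids allocating the `arr[2:end]` copy (my own illustration; presumably it does not vectorize either, for the same reasons):

```julia
# Iterators.drop skips the first element lazily instead of copying.
myextrema2(arr) = foldl(((mi, ma), v) -> (min(mi, v), max(ma, v)),
                        Iterators.drop(arr, 1); init = (arr[1], arr[1]))
```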
It doesn't seem to matter for this specific function, though, because from what I can tell the current `extrema` is already vectorized. So I suppose in the end this is just about using `_min_fast` and `_max_fast`?
I quickly tried using the `_min_fast` functions and saw no change. And the machine code already shows the `vcmpordsd` instruction. Maybe this will really require something like that unrolling?
Note that just seeing vector instructions in `@code_native` output doesn't actually mean the processor is using the parallelism of the SIMD unit; it can also just be using SIMD instructions on single values. For example, take a look at `@code_native max(1.0, 2.0)`: that'll show a whole slew of "SIMD" instructions (including that `vcmpordsd`) operating on a single datum. It's much easier to look at `@code_llvm` to see if things are vectorizing.
For example, you can see that `extrema` actually does vectorize nicely for integers:
```llvm
julia> @code_llvm extrema(rand(Int, 1000))
# ... contains things like:

vector.ph:                                ; preds = %L40.lr.ph.L40.lr.ph.split_crit_edge
  %n.vec = and i64 %27, -32
  %ind.end = or i64 %n.vec, 1
  %ind.end26 = or i64 %n.vec, 2
  %minmax.ident.splatinsert = insertelement <8 x i64> undef, i64 %24, i32 0
  %minmax.ident.splat = shufflevector <8 x i64> %minmax.ident.splatinsert, <8 x i64> undef, <8 x i32> zeroinitializer
  %minmax.ident.splatinsert59 = insertelement <8 x i64> undef, i64 %24, i32 0
  %minmax.ident.splat60 = shufflevector <8 x i64> %minmax.ident.splatinsert59, <8 x i64> undef, <8 x i32> zeroinitializer
  br label %vector.body

vector.body:                              ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %48, %vector.body ]
# ...
```
Key things to see are the `vector.body` label and all the `<8 x i64>` vectors. If you look at `@code_llvm extrema(rand(1000))` with floating-point values, though, you'll see no `vector.body`, no `<N x double>` vectors, and lots of plain old `double` comparisons.
`vcmpordsd` isn't a SIMD instruction; `sd` means "single double". If it were named with `pd` for "packed double", it would be SIMD.
Alright, thanks! I need to get more used to reading the `@code_llvm` output. What is a good reference for Intel instructions? And what does the `v` mean in this case?

Anyway, I guess it all means we really need to change this loop somehow to enable the vectorization. I will try something later.
I'm not an assembly expert, but here's my understanding: the `v` prefix marks a VEX (vector extensions) encoded instruction. It's working on the `xmm` register, which is a register in the SIMD hardware. SIMD units have scalar arithmetic because shuffling data back to the other registers can be costly; I imagine it's implemented by doing the same packed operation and just discarding the other lanes.

I find LLVM IR eminently more understandable and readable, and the giant language reference page is a good resource.
OK, it took me long enough to understand the whole thing and go through my grief process. It seems LLVM will really refuse to vectorize this unless we do the unrolling. I have prepared a first draft that shows a speedup. It is still missing many details, like the chunked traversal.

I hope I can prepare a PR by tomorrow, but it would be great to hear any comments. In particular, would this use of `@ntuple` be fine?
```julia
julia> using BenchmarkTools

julia> using Test

julia> N = 2^16;

julia> mylist = rand(N);

julia> minornan(x,y) = ifelse(isnan(x) || x < y, x, y)
minornan (generic function with 1 method)

julia> maxornan(x,y) = ifelse(isnan(x) || x > y, x, y)
maxornan (generic function with 1 method)

julia> function myext(itr)
           V = 8
           @inbounds begin
               vmin = Base.Cartesian.@ntuple 8 j->itr[j]
               vmax = Base.Cartesian.@ntuple 8 j->itr[j]
               @simd for i in V:V:length(itr)-V
                   vmin = Base.Cartesian.@ntuple 8 j->minornan(vmin[j], itr[i+j])
                   vmax = Base.Cartesian.@ntuple 8 j->maxornan(vmax[j], itr[i+j])
               end
           end
           (minimum(vmin), maximum(vmax))
       end
myext (generic function with 1 method)

julia> minmaxextrema(aa) = (minimum(aa), maximum(aa))
minmaxextrema (generic function with 1 method)

julia> @btime extrema(xx) setup = (xx=rand(N));
  193.335 μs (1 allocation: 32 bytes)

julia> @btime minmaxextrema(xx) setup = (xx=rand(N));
  77.339 μs (1 allocation: 32 bytes)

julia> @btime myext(xx) setup = (xx=rand(N));
  27.389 μs (1 allocation: 32 bytes)

julia> @test all(extrema(xx) == myext(xx) for xx in [rand(N) for _ in 1:100])
Test Passed
```
I was checking how things are going on the LLVM side, and I see the relevant patch was merged months ago and made it into 8.0. Wouldn't this be a good candidate for the Julia LLVM patches, then? It wasn't before because the LLVM patch was still open. I'm proposing to fix the issue with a Julia LLVM patch plus updated `extrema` etc. implementations that can rely on LLVM vectorization.
I tried including the tiny D53680 LLVM patch in Julia; I'm not sure whether I did something wrong, but nothing came of it.
I would really like to understand better what could be happening, because I am not so sure there is just some small thing lacking in LLVM 6.0 that may soon go away. Testing with clang 6.0.0 I can produce vectorized code as long as fast-math mode is used, nicely adapted to e.g. AVX-512, and it all works fine for integers. In Julia it works fine with integers, and with floats for `+`, but I can't get vectorization of `min` even with fast mode, or even doing just an `ifelse(x < y, x, y)` without caring about NaNs.
I tried using some LLVM intrinsics to see if they had any effect, like `maxnum(x, y) = ccall("llvm.maxnum.f64", llvmcall, Float64, (Float64, Float64), x, y)`, but nothing changed. I am thinking maybe I could try other, more complex LLVM IR until automatic vectorization happens.
Wow, nice investigations! When I tagged this as looking for help, I was thinking it'd be a simple copy-paste of the technique used in #30320, but you've definitely taken it to the next level. This is great.
On the Julia side, yes, I think it'd be just fine to use the Cartesian macros since they're also used elsewhere in the multidimensional.jl file. We had been slowly replacing them as they're a bit hard to read, but they're here for now. You could also just write out (or get the macro to write out) the expression literally:
```julia
julia> @macroexpand Base.Cartesian.@ntuple 8 j->minornan(vmin[j], itr[i+j])
:((minornan(vmin[1], itr[i + 1]), minornan(vmin[2], itr[i + 2]), minornan(vmin[3], itr[i + 3]), minornan(vmin[4], itr[i + 4]), minornan(vmin[5], itr[i + 5]), minornan(vmin[6], itr[i + 6]), minornan(vmin[7], itr[i + 7]), minornan(vmin[8], itr[i + 8])))
```
A few other notes: you'll need to grab the final elements from the iterator (when its length is not an even multiple of `V`), and this implementation will only work well for 1-based `IndexLinear` `AbstractArray`s. You can generalize it to offset arrays with `firstindex` and `lastindex`, and then you can restrict it via dispatch to only apply to `::AbstractArray`.
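A hedged sketch of those suggestions combined (illustrative only, not a final implementation; it reuses `minornan`/`maxornan` from the snippet above and assumes the array has at least `V` elements):

```julia
function myext_general(itr::AbstractArray{<:AbstractFloat})
    V = 8
    lo, hi = firstindex(itr), lastindex(itr)         # offset-array friendly
    @inbounds begin
        vmin = ntuple(j -> itr[lo + j - 1], Val(8))  # seed lanes with the first V elements
        vmax = vmin
        i = lo + V - 1
        while i + V <= hi                            # unrolled main loop
            vmin = ntuple(j -> minornan(vmin[j], itr[i + j]), Val(8))
            vmax = ntuple(j -> maxornan(vmax[j], itr[i + j]), Val(8))
            i += V
        end
        mi, ma = minimum(vmin), maximum(vmax)
        for k in (i + 1):hi                          # tail when length is not a multiple of V
            mi = minornan(mi, itr[k])
            ma = maxornan(ma, itr[k])
        end
    end
    (mi, ma)
end
```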
I'm not as familiar with the LLVM side of things, but it may indeed be worth investigating what version 8 might bring. I don't know how far we are from version 8 compatibility — 7 was just added a few weeks ago. That patch might not stand alone as a patch to version 6.
This code vectorizes on clang, using just comparisons and fast mode: https://godbolt.org/z/tF8SSH
```c
float mymin(float x, float y) {
    return (x < y) ? x : y;
}

float maximum(float *itr) {
    float v1 = 0.0;
    for (int i = 0; i < 6553600; i += 1) {
        v1 = mymin(v1, itr[i]);
    }
    return v1;
}
```
This (an addition) vectorizes in Julia even without fast mode (the fast flag is still being set in the IR, from what I understand):
```julia
myadd(x, y) =
    Base.llvmcall("""
    %3 = fadd fast double %1, %0
    ret double %3
    """, Float64, Tuple{Float64,Float64}, x, y)
```
This does not vectorize in Julia even with fast mode:
```julia
mymin(x, y) =
    Base.llvmcall("""
    %3 = fcmp fast olt double %0, %1
    %4 = select i1 %3, double %0, double %1
    ret double %4
    """, Float64, Tuple{Float64,Float64}, x, y)
```
I did the following test today: first I produced logs from the LLVM passes for four functions, calculating the sum and the min of a vector of integers or of floats (with fast-math). It is possible to see that the functions are very similar up to the point where the loop vectorizer runs, when it fails for the min on floats, as expected. I've put the logs here.
I can't make much out of it, but maybe someone has an idea about the point where we see the first difference, at line 174, where the float version reads

```llvm
br label %L88, !dbg !116
```

and the int version reads

```llvm
%min.iters.check = icmp ult i64 %42, 16, !dbg !115
br i1 %min.iters.check, label %scalar.ph, label %vector.ph, !dbg !115
```
After that I also generated dumps from clang, where the vectorization happens. The dumped IR can be run with `llvmcall` after slight modifications, and then we find that vectorization _still doesn't happen_ for floats in Julia. I guess that means the Julia IR is fine and clang isn't doing anything special there, so it may have something to do with the way LLVM is configured by Julia?
```julia
using BenchmarkTools
using Test

mymin(itr::Vector{Float32}) =
    Base.llvmcall("""
    %ptr = inttoptr i64 %0 to float*
    %2 = load float, float* %ptr, align 4
    br label %5

    ; <label>:3:                        ; preds = %5
    %4 = phi float [ %11, %5 ]
    ret float %4

    ; <label>:5:                        ; preds = %5, %1
    %6 = phi i64 [ 1, %1 ], [ %12, %5 ]
    %7 = phi float [ %2, %1 ], [ %11, %5 ]
    %8 = getelementptr inbounds float, float* %ptr, i64 %6
    %9 = load float, float* %8, align 4
    %10 = fcmp fast olt float %7, %9
    %11 = select i1 %10, float %7, float %9
    %12 = add nuw nsw i64 %6, 1
    %13 = icmp eq i64 %12, 65536
    br i1 %13, label %3, label %5
    """, Float32, Tuple{Ptr{Float32}}, pointer(itr))

mymin(itr::Vector{Int32}) =
    Base.llvmcall("""
    %ptr = inttoptr i64 %0 to i32*
    %2 = load i32, i32* %ptr, align 4
    br label %5

    ; <label>:3:                        ; preds = %5
    %4 = phi i32 [ %11, %5 ]
    ret i32 %4

    ; <label>:5:                        ; preds = %5, %1
    %6 = phi i64 [ 1, %1 ], [ %12, %5 ]
    %7 = phi i32 [ %2, %1 ], [ %11, %5 ]
    %8 = getelementptr inbounds i32, i32* %ptr, i64 %6
    %9 = load i32, i32* %8, align 4
    %10 = icmp slt i32 %7, %9
    %11 = select i1 %10, i32 %7, i32 %9
    %12 = add nuw nsw i64 %6, 1
    %13 = icmp eq i64 %12, 65536
    br i1 %13, label %3, label %5
    """, Int32, Tuple{Ptr{Int32}}, pointer(itr))

data = rand(Float32, 2^16)
# data = rand(Int32, 2^16)
@test mymin(data) == minimum(data)
@code_native mymin(data)
```
EDIT: running this code with `JULIA_LLVM_ARGS="-pass-remarks-analysis=loop-vectorize -pass-remarks=vector" julia -O3 --math-mode=fast testext.jl` (following comments on #30933), I get for float:

```
remark: /home/user/simpleext.jl:4:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
```

and for int:

```
remark: /home/user/simpleext.jl:27:0: vectorized loop (vectorization width: 8, interleaved count: 4)
```
I have finally seen the problem while working strictly outside Julia: I produced the LLVM IR for the int and float versions, and I was not able to produce vectorized code with `opt -fp-contract=fast -O3 -mem2reg -S min_f32_clang.ll`, even though it worked for int.
https://gist.github.com/nlw0/eb50a88845eba90b6b859154d2f2383d
I finally started looking at the LLVM analysis log for the loop vectorizer via `JULIA_LLVM_ARGS="--debug-only=loop-vectorize" ./usr/bin/julia-debug --math-mode=fast`, using this test code:
```julia
function myfun(itr)
    @inbounds begin
        v = itr[1]
        @simd for y in itr[2:end]
            @fastmath v = ifelse(v < y, v, y)
        end
        v
    end
end

# a = collect(0x00000001:0x00010000)
a = collect(1f0:1f4)
myfun(a)
```
Straight away, what you can see is that `vcat(1f0:1f4)` also does not get vectorized, while the integer version does, as does a similar function in clang, which also appears to do a lot more unrolling. I take it as a clue that there really may be an underlying issue worthy of attention. (EDIT: actually not yet tested on clang using the same LLVM.) (EDIT2: actually it's different for floats because apparently the loop size is not pre-computed in this case, and I don't know what the proper way to do that would be.)
I've run Julia under gdb and made a trace of the moment the loop vectorizer diverges between the integer and float versions of the `minimum` loop. The loop starts with two phi nodes, and vectorization apparently fails when, at the second node, the test `if (RecurrenceDescriptor::isReductionPHI(...` fails for floats while it succeeds for ints. Not sure what any of that means, but maybe someone has an idea. The relevant code is the function `canVectorizeInstrs` in https://github.com/llvm/llvm-project/blob/llvmorg-6.0.1/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
https://gist.github.com/nlw0/58ed9fda8e8944a9cb5e5a20f6038fcf
I'm writing a lot of stuff here, only some of which makes actual sense, so apologies for the noise. I believe I am finally getting close to the root cause of this issue, though.
The LLVM loop vectorizer expects the function to carry a "no-nans-fp-math" attribute. The test is done here: https://github.com/llvm/llvm-project/blob/d359f2096850c68b708bc25a7baca4282945949f/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L5324

As far as I can tell, Julia never sets this, and fast-math mode only affects instructions down at the leaves of the expression trees. This is why there is no optimization even though fast mode is active. I believe this is where clang sets this function attribute: https://github.com/llvm-mirror/clang/blob/31c307cf96ef1b6a4d0c542d18ebfd7a307ae9b0/lib/CodeGen/CGCall.cpp#L1747-L1748

I tried hacking this attribute in, but there seems to be something else still missing. I hope I can make some more progress soon.
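For reference, the kind of hack being described might look like this (a hedged C++ sketch, not Julia's actual `jitlayers.cpp` code; `setNoNaNsOnModule` is a made-up name):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

// Force the string attribute the loop vectorizer checks onto every
// function before the optimization pipeline runs; fast-math flags on
// individual instructions are not enough here.
void setNoNaNsOnModule(llvm::Module &M) {
    for (llvm::Function &F : M)
        F.addFnAttr("no-nans-fp-math", "true");
}
```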
OK, there was still just something missing in order to see it in `@code_llvm`, but it is actually working! I have now managed to take this:
```julia
function mymin(itr)
    @inbounds begin
        v = itr[1]
        @simd for y in itr[2:65536]
            @fastmath v = ifelse(v < y, v, y)
        end
        v
    end
end
```
And have `@code_native` show this:

```assembly
vminps %ymm1, %ymm0, %ymm0
vminps %ymm2, %ymm0, %ymm0
vminps %ymm3, %ymm0, %ymm0
```
This is the nasty patch: it basically hacks the `Function` objects in `jitlayers.cpp` to set `"no-nans-fp-math"` on all of them. `@code_llvm` doesn't work (I assume it generates the LLVM objects through a different route), but we can actually see the vectorized IR in the LLVM log:
```llvm
*** IR Dump After Loop Vectorization ***
; Function Attrs: sspstrong
define float @julia_myfun_15967(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) #0 !dbg !5 {
top:
%1 = call %jl_value_t*** @julia.ptls_states()
call void @llvm.dbg.value(metadata %jl_value_t addrspace(10)* null, metadata !17, metadata !DIExpression(DW_OP_deref)), !dbg !18
call void @llvm.dbg.value(metadata %jl_value_t addrspace(10)* %0, metadata !17, metadata !DIExpression(DW_OP_deref)), !dbg !18
%2 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*, !dbg !19
%3 = bitcast %jl_value_t addrspace(11)* %2 to %jl_array_t addrspace(11)*, !dbg !19
%4 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %3, i64 0, i32 1, !dbg !19
%5 = load i64, i64 addrspace(11)* %4, align 8, !dbg !19, !tbaa !29
%6 = icmp slt i64 %5, 1, !dbg !33
br i1 %6, label %L19, label %L10.lr.ph, !dbg !36
L10.lr.ph: ; preds = %top
%7 = bitcast %jl_value_t addrspace(11)* %2 to float addrspace(13)* addrspace(11)*
%8 = load float addrspace(13)*, float addrspace(13)* addrspace(11)* %7, align 8, !tbaa !37, !nonnull !4
%min.iters.check = icmp ult i64 %5, 32, !dbg !39
br i1 %min.iters.check, label %scalar.ph, label %vector.ph, !dbg !39
vector.ph: ; preds = %L10.lr.ph
%n.mod.vf = urem i64 %5, 32, !dbg !39
%n.vec = sub i64 %5, %n.mod.vf, !dbg !39
br label %vector.body, !dbg !39
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !40
%vec.phi = phi <8 x float> [ <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>, %vector.ph ], [ %30, %vector.body ]
%vec.phi10 = phi <8 x float> [ <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>, %vector.ph ], [ %31, %vector.body ]
%vec.phi11 = phi <8 x float> [ <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>, %vector.ph ], [ %32, %vector.body ]
%vec.phi12 = phi <8 x float> [ <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>, %vector.ph ], [ %33, %vector.body ]
%broadcast.splatinsert = insertelement <8 x i64> undef, i64 %index, i32 0, !dbg !40
%broadcast.splat = shufflevector <8 x i64> %broadcast.splatinsert, <8 x i64> undef, <8 x i32> zeroinitializer, !dbg !40
%induction = add <8 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, !dbg !40
%induction7 = add <8 x i64> %broadcast.splat, <i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>, !dbg !40
%induction8 = add <8 x i64> %broadcast.splat, <i64 16, i64 17, i64 18, i64 19, i64 20, i64 21, i64 22, i64 23>, !dbg !40
%induction9 = add <8 x i64> %broadcast.splat, <i64 24, i64 25, i64 26, i64 27, i64 28, i64 29, i64 30, i64 31>, !dbg !40
%9 = add i64 %index, 0, !dbg !40
%10 = add i64 %index, 8, !dbg !40
%11 = add i64 %index, 16, !dbg !40
%12 = add i64 %index, 24, !dbg !40
%13 = getelementptr inbounds float, float addrspace(13)* %8, i64 %9, !dbg !43
%14 = getelementptr inbounds float, float addrspace(13)* %8, i64 %10, !dbg !43
%15 = getelementptr inbounds float, float addrspace(13)* %8, i64 %11, !dbg !43
%16 = getelementptr inbounds float, float addrspace(13)* %8, i64 %12, !dbg !43
%17 = getelementptr float, float addrspace(13)* %13, i32 0, !dbg !43
%18 = bitcast float addrspace(13)* %17 to <8 x float> addrspace(13)*, !dbg !43
%wide.load = load <8 x float>, <8 x float> addrspace(13)* %18, align 4, !dbg !43, !tbaa !48
%19 = getelementptr float, float addrspace(13)* %13, i32 8, !dbg !43
%20 = bitcast float addrspace(13)* %19 to <8 x float> addrspace(13)*, !dbg !43
%wide.load13 = load <8 x float>, <8 x float> addrspace(13)* %20, align 4, !dbg !43, !tbaa !48
%21 = getelementptr float, float addrspace(13)* %13, i32 16, !dbg !43
%22 = bitcast float addrspace(13)* %21 to <8 x float> addrspace(13)*, !dbg !43
%wide.load14 = load <8 x float>, <8 x float> addrspace(13)* %22, align 4, !dbg !43, !tbaa !48
%23 = getelementptr float, float addrspace(13)* %13, i32 24, !dbg !43
%24 = bitcast float addrspace(13)* %23 to <8 x float> addrspace(13)*, !dbg !43
%wide.load15 = load <8 x float>, <8 x float> addrspace(13)* %24, align 4, !dbg !43, !tbaa !48
%25 = fcmp fast olt <8 x float> %vec.phi, %wide.load, !dbg !51
%26 = fcmp fast olt <8 x float> %vec.phi10, %wide.load13, !dbg !51
%27 = fcmp fast olt <8 x float> %vec.phi11, %wide.load14, !dbg !51
%28 = fcmp fast olt <8 x float> %vec.phi12, %wide.load15, !dbg !51
%29 = extractelement <8 x i1> %25, i32 0, !dbg !51
%30 = select <8 x i1> %25, <8 x float> %vec.phi, <8 x float> %wide.load, !dbg !51
%31 = select <8 x i1> %26, <8 x float> %vec.phi10, <8 x float> %wide.load13, !dbg !51
%32 = select <8 x i1> %27, <8 x float> %vec.phi11, <8 x float> %wide.load14, !dbg !51
%33 = select <8 x i1> %28, <8 x float> %vec.phi12, <8 x float> %wide.load15, !dbg !51
%index.next = add i64 %index, 32, !dbg !40
%34 = icmp eq i64 %index.next, %n.vec, !dbg !40
br i1 %34, label %middle.block, label %vector.body, !dbg !40, !llvm.loop !54
```
With that, I suppose we can start thinking of a proper solution, and since this involves code generation, and maybe decisions about what `@fastmath` means, I imagine some core developers with experience in this part of Julia would probably like to make a call here?
I did some tests in my hacked Julia compiler, benchmarking `maximum` and `extrema` using `@btime mymaximum(d) setup = (d = randn(Float32, 2^20))`, and I got these results:
| func | time |
| --- | --- |
| `maximum` stdlib | 626.831 μs |
| `maximum` simple loop | 79.141 μs |
| `extrema` stdlib | 6.945 ms |
| `extrema` simple loop | 160.655 μs |
Hi, during the development of my package I encountered a problem with `extrema`:

```julia
signal1 = rand(10000000)
signal2 = rand(10000000)
```
Now I want to find the minimum and maximum in one pass:
```julia
julia> @time Iterators.flatten((signal1, signal2)) |> extrema
  0.243334 seconds (20.00 M allocations: 610.352 MiB, 18.44% gc time)
(3.8733525054013285e-8, 0.9999999173562155)
```
I expect this not to allocate at all, but it allocates much more (610 MiB) than the size of both arrays summed (152 MiB):
```julia
julia> @time [signal1;signal2;] |> extrema
  0.201026 seconds (7 allocations: 152.588 MiB, 24.41% gc time)
(3.8733525054013285e-8, 0.9999999173562155)
```
At first I thought it was a problem with the flatten iterator, but that looks fine to me and causes problems only with the `extrema` function:
```julia
julia> @time Iterators.flatten((signal1, signal2)) |> sum
  0.022848 seconds (8 allocations: 256 bytes)
1.0000864996770173e7

julia> @time Iterators.flatten((signal1, signal2)) |> minimum
  0.050592 seconds (8 allocations: 256 bytes)
3.8733525054013285e-8
```
When I define my own extrema implementation, the problem does not occur:
```julia
function my_extrema(itr)
    vmin = vmax = first(itr)
    for item in itr
        vmax = max(item, vmax)
        vmin = min(item, vmin)
    end
    return (vmin, vmax)
end
```

```julia
julia> @time Iterators.flatten((signal1, signal2)) |> my_extrema
  0.074841 seconds (7 allocations: 240 bytes)
(3.8733525054013285e-8, 0.9999999173562155)
```
I'm not sure whether this is related to this issue, but clearly something is wrong with this function. I cannot work out what, and it happens only with a flatten iterator over a tuple of arrays.
Can you file a new issue? The problem seems much more serious than this one if you can get better performance just with a simpler implementation.
@nalimilan, I think I am close to fixing that, so please give me some time and I will open a pull request instead of an issue, or at least an issue with more info.
Much more complicated than I expected, and it probably needs to be fixed after code lowering: https://github.com/JuliaLang/julia/issues/34385
Here you go with the simple and fast implementation I'm using routinely (when NaNs are not a concern, remove the extra `isnan` test):
```julia
function min_max(I)
    a = b = I[1]
    for i = 2:length(I)
        isnan(I[i]) && continue   # skip NaNs entirely
        if I[i] > b
            b = I[i]
        elseif I[i] < a
            a = I[i]
        end
    end
    a, b
end
```
Now let's benchmark the functions:
```julia
julia> x = rand(Float32, 1000000);

julia> @btime minimum($x)
  430.751 μs (0 allocations: 0 bytes)
1.1920929f-6

julia> @btime maximum($x)
  435.949 μs (0 allocations: 0 bytes)
0.9999995f0

julia> @btime min_max($x)
  955.375 μs (0 allocations: 0 bytes)
(1.1920929f-6, 0.9999995f0)

julia> x[200000:700000] .= NaN;

julia> @btime min_max($x)
  707.323 μs (0 allocations: 0 bytes)
(1.7881393f-6, 0.9999995f0)
```

Without the `isnan` test:

```julia
julia> @btime min_max($x)
  680.122 μs (0 allocations: 0 bytes)
(1.7881393f-6, 0.9999995f0)
```
> Hi, during the development of my package I encountered a problem with `extrema` [...]
Your extrema function works; however, from an algorithm optimization perspective, why should you scan the array twice to get the min and max?
The problem I pointed out is not about this particular calculation, but rather the excessive memory allocation that manifests when using `extrema` with a flatten iterator, and potentially in other cases that follow a similar iteration pattern.