mapreduce
can be much slower than the equivalent for
-loop. A small (real-life) example:
function dsum(A::Matrix)
z = zero(A[1,1])
n = Base.LinAlg.checksquare(A)
B = Vector{typeof(z)}(n)
@inbounds for j in 1:n
B[j] = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
end
B
end
function dfor(A::Matrix)
z = zero(A[1,1])
n = Base.LinAlg.checksquare(A)
B = Vector{typeof(z)}(n)
@inbounds for j in 1:n
d = z
for k in 1:j
d += A[j,k]*A[k,j]
end
B[j] = d
end
B
end
A = randn(127,127)
time(median(@benchmark dsum(A)))/time(median(@benchmark dfor(A)))
gives me a performance ratio of about x50 on Julia 0.5, juliabox.com. I think this could be because the for
-loop can be automatically simd
, and the mapreduce isn't? When A = randn(N,N)
and N
is 16
, the gap is around x75, and for N = 10000
, the gap is around x25. Replacing the array access A[j,k]
with A[rand(1:size(A,1)),rand(1:size(A,2))]
destroys the performance on both, but the ratio becomes x1.
simd
the reason why one is x50 faster?mapreduce
underlies sum
, so this could be a popular trap that isn't currently mentioned(Benchmarking mapreduce
versus for
-loops without array access, I still see a x2 performance gap. E.g. mapreduce(identity, +, 0, i for i in 1:n)
versus the equivalent integer-summing for
loop. It looks like this gap used to be smaller? Worth another benchmark in CI?)
Looks like dup of https://github.com/JuliaLang/julia/issues/15276.
Doing:
function dsum(A::Matrix)
z = zero(A[1,1])
n = Base.LinAlg.checksquare(A)
B = Vector{typeof(z)}(n)
@inbounds for j::Int in 1:n
B[j] = _help(A, j, z)
end
B
end
_help(A, j, z) = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
gives
julia> time(median(@benchmark dsum(A)))/time(median(@benchmark dfor(A)))
1.0013213312412255
You can see the problem by @code_warntype
and looking for Core.Box
.
+1 for putting a benchmark of this in BaseBenchmarks; a PR there would be great.
Closing as dup of #15276.
Most helpful comment
+1 for putting a benchmark of this in BaseBenchmarks; a PR there would be great.