Julia: sum|mapreduce versus unrolled for-loop. performance disparity

Created on 8 Feb 2017 · 3Comments · Source: JuliaLang/julia

mapreduce can be much slower than the equivalent for-loop. A small (real-life) example:

function dsum(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)

    @inbounds for j in 1:n
        B[j] = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
    end
    B
end
function dfor(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)

    @inbounds for j in 1:n
        d = z
        for k in 1:j
            d += A[j,k]*A[k,j]
        end
        B[j] = d
    end
    B
end

A = randn(127,127)
time(median(@benchmark dsum(A)))/time(median(@benchmark dfor(A)))

gives me a performance ratio of about x50 on Julia 0.5, juliabox.com. I think this could be because the for-loop can be automatically simd, and the mapreduce isn't? When A = randn(N,N) and N is 16, the gap is around x75, and for N = 10000, the gap is around x25. Replacing the array access A[j,k] with A[rand(1:size(A,1)),rand(1:size(A,2))] destroys the performance on both, but the ratio becomes x1.

Is simd the reason why one is x50 faster?
Should this be described in Performance Tips? mapreduce underlies sum, so this could be a popular trap that isn't currently mentioned
Would this be a useful benchmark? on nanosoldier?
Could the performance gap be smaller?

(Benchmarking mapreduce versus for-loops without array access, I still see a x2 performance gap. E.g. mapreduce(identity, +, 0, i for i in 1:n) versus the equivalent integer-summing for loop. It looks like this gap used to be smaller? Worth another benchmark in CI?)

performance

Source

felixrehren

Most helpful comment

+1 for putting a benchmark of this in BaseBenchmarks; a PR there would be great.

stevengj on 10 Feb 2017

👍2

All 3 comments

Looks like dup of https://github.com/JuliaLang/julia/issues/15276.

Doing:

function dsum(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)
    @inbounds for j::Int in 1:n
        B[j] = _help(A, j, z)
    end
    B
end

_help(A, j, z) = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)

gives

julia> time(median(@benchmark dsum(A)))/time(median(@benchmark dfor(A)))
1.0013213312412255

You can see the problem by @code_warntype and looking for Core.Box.

KristofferC on 8 Feb 2017

👍1

+1 for putting a benchmark of this in BaseBenchmarks; a PR there would be great.

stevengj on 10 Feb 2017

👍2

Closing as dup of #15276.

KristofferC on 5 Apr 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Another illegal instruction

tkoolen · 3Comments

add special display for ≈ test failures

StefanKarpinski · 3Comments

+(x::T,y::T) where {T} not parsing correctly

musm · 3Comments

split() enhancement, keep_splitter

i-apellaniz · 3Comments

isposdef() is incorrect

wilburtownsend · 3Comments