Julia: likely vectorization discrepancy between julia and clang triple-nested-loop gemms

Created on 30 Sep 2018 · 2 comments · Source: JuliaLang/julia

A performance discrepancy between julia and clang pji-ordered triple-nested-loop gemm implementations is evident in https://github.com/Sacha0/TripleNestedLoopDemo.jl. Though I haven't had the bandwidth to check yet, I suspect the discrepancy comes from vectorization differences. Repro code:

function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
        @inbounds C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

gemm_pji_ccode = """
void gemm_pji_c(double* C, double* A, double* B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";

using Libdl
const CGemmLib = tempname()
open(`clang -fPIC -O3 -xc -shared -o $(CGemmLib * "." * Libdl.dlext) -`, "w") do f
    print(f, gemm_pji_ccode) 
end

gemm_pji_c!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}) =
    (ccall(("gemm_pji_c", CGemmLib), Cvoid,
            (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Cint, Cint, Cint),
            C, A, B, size(C, 1), size(A, 2), size(C, 2)); return C)

using Test
let
    m, n, k = 48*3, 48*2, 48
    C = rand(m, n)
    A = rand(m, k)
    B = rand(k, n)
    Cref = A * B
    @test gemm_pji!(fill!(C, 0), A, B) ≈ Cref
    @test gemm_pji_c!(fill!(C, 0), A, B) ≈ Cref
end

using BenchmarkTools
mnk = 48;
A = rand(mnk, mnk);
B = rand(mnk, mnk);
C = rand(mnk, mnk);
@benchmark gemm_pji!($C, $A, $B)
@benchmark gemm_pji_c!($C, $A, $B)

yielding

julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.311 μs (0.00% GC)
  median time:      66.397 μs (0.00% GC)
  mean time:        67.214 μs (0.00% GC)
  maximum time:     244.067 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.923 μs (0.00% GC)
  median time:      32.973 μs (0.00% GC)
  mean time:        34.007 μs (0.00% GC)
  maximum time:     133.430 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

i.e. almost precisely a factor of two discrepancy. Best!

Labels: performance, simd


All 2 comments

I was working on #31442, where the first idea was to go the same way as #30320, where vectorization was activated by unrolling the loop in chunks of 4 steps. I am now thinking this is actually an issue in the code generation that is preventing vectorization, and it deserves more attention because it may affect many other functions. Only now did I think of looking for other similar open issues, and I found this one.

One hypothesis raised before is that this could be something down in LLVM, but your test here using clang is a very compelling argument that it must be something at a slightly higher level. I suspect we would see the same result if we ran this test with maximum.

I imagine either there is something missing in the non-optimized IR that Julia produces, or, if this really is something missing in LLVM, then clang must be doing some automatic vectorization by itself, in which case Julia would probably have to start doing the same. I don't think that is likely, though.

Other existing vectorization issues seem to be #29441, #28331 and #30290, apart from the two I mentioned and this one here. Could there be a common cause for some of them? And can anyone give an idea of how one could go about working on these issues? For instance, what is the best way to produce the non-optimized IR, to change the code generation, and to test just the LLVM optimizations given the initial IR?

EDIT: Answering my own question, read the fine manual... https://docs.julialang.org/en/v1.1/devdocs/llvm/

Looking briefly at this, the function vectorized just fine (4x vectorization, 4x unroll), but we have a rather costly alias check that we perform every 16 iterations.

Adding an @aliasscope annotation we get to:

julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.339 μs (0.00% GC)
  median time:      15.321 μs (0.00% GC)
  mean time:        15.621 μs (0.00% GC)
  maximum time:     92.682 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.485 μs (0.00% GC)
  median time:      29.610 μs (0.00% GC)
  mean time:        29.633 μs (0.00% GC)
  maximum time:     148.273 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Interestingly, the restrict annotation didn't help the C code:

import Base.Experimental: Const, @aliasscope

function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
          @aliasscope @inbounds C[i, j] += Const(A)[i, p] * Const(B)[p, j]
    end
    return C
end

gemm_pji_ccode = """
void gemm_pji_c(double* restrict C, double* restrict A, double* restrict B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";

So more investigation is needed, but it is not just a case of failed vectorization.
