Julia: Unnecessary GC root for getfield of SSA immutable object

Created on 8 Mar 2016 · 20Comments · Source: JuliaLang/julia

Test case:

immutable Wrap{T}
  x::T
end

function f(a::Wrap)
  s = 0.0
  for i = 1:length(a.x)
    @inbounds s += a.x[i]
  end
  s
end

@code_llvm f(Wrap(rand(2,2)))

Inner loop code in v0.4:

  %"#s1.0" = phi i64 [ %13, %L ], [ 1, %L.preheader ]
  %s.0 = phi double [ %17, %L ], [ 0.000000e+00, %L.preheader ]
  %13 = add i64 %"#s1.0", 1
  %14 = add i64 %"#s1.0", -1
  %15 = getelementptr double* %12, i64 %14
  %16 = load double* %15, align 8
  %17 = fadd double %s.0, %16
  %18 = icmp eq i64 %"#s1.0", %7
  br i1 %18, label %L3, label %L

Inner loop code on master:

  %s.02 = phi double [ %22, %if ], [ 0.000000e+00, %if.preheader ]
  %"#s1.01" = phi i64 [ %15, %if ], [ 1, %if.preheader ]
  %15 = add i64 %"#s1.01", 1
  %16 = load %jl_value_t*, %jl_value_t** %8, align 8
  store %jl_value_t* %16, %jl_value_t** %3, align 8
  %17 = add i64 %"#s1.01", -1
  %18 = bitcast %jl_value_t* %16 to double**
  %19 = load double*, double** %18, align 8
  %20 = getelementptr double, double* %19, i64 %17
  %21 = load double, double* %20, align 8
  %22 = fadd double %s.02, %21
  %23 = icmp eq i64 %"#s1.01", %12
  br i1 %23, label %L2.loopexit, label %if

It seems to be rooting a.x on every iteration where we didn't before. This is one of the issues affecting #15146.

codegen performance regression

Source

JeffBezanson

Most helpful comment

We are now at performance parity for [ i/10 for i=1:n ]. Excellent progress. :tada:

JeffBezanson on 13 May 2016

🎉4

All 20 comments

Yeah, I've also just seen this for array stores.

yuyichao on 8 Mar 2016

On master the IR becames

if:                                               ; preds = %if.preheader, %if
  %s.04 = phi double [ %24, %if ], [ 0.000000e+00, %if.preheader ]
  %"#temp#.03" = phi i64 [ %17, %if ], [ 1, %if.preheader ]
  %17 = add i64 %"#temp#.03", 1
  store %jl_value_t* %0, %jl_value_t** %4, align 8
  %18 = load %jl_value_t*, %jl_value_t** %10, align 8
  store %jl_value_t* %18, %jl_value_t** %5, align 8
  %19 = add i64 %"#temp#.03", -1
  %20 = bitcast %jl_value_t* %18 to double**
  %21 = load double*, double** %20, align 8
  %22 = getelementptr double, double* %21, i64 %19
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L2.loopexit, label %if

So we are rooting a at every iteration too and this starts to affect containers of bitstype too. Ref https://github.com/JuliaLang/julia/issues/15717

yuyichao on 31 Mar 2016

I think the inner loop is free of gc frame store now.

if:                                               ; preds = %if.lr.ph, %if
  %s.04 = phi double [ 0.000000e+00, %if.lr.ph ], [ %24, %if ]
  %"#temp#.03" = phi i64 [ 1, %if.lr.ph ], [ %20, %if ]
  %20 = add i64 %"#temp#.03", 1
  %21 = add i64 %"#temp#.03", -1
  %22 = getelementptr double, double* %18, i64 %21
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L.L2_crit_edge, label %if

However, the function still has two gc slot and it shouldn't need any.

yuyichao on 5 May 2016

Oh, very nice. I wonder what fixed that. Anyway this might be enough to get #7258 off the ground.

JeffBezanson on 5 May 2016

😄2

I believe it's https://github.com/JuliaLang/julia/pull/13463. Commenting the tbaa decoration out results in

if:                                               ; preds = %if.preheader, %if
  %s.04 = phi double [ %24, %if ], [ 0.000000e+00, %if.preheader ]
  %"#temp#.03" = phi i64 [ %17, %if ], [ 1, %if.preheader ]
  %17 = add i64 %"#temp#.03", 1
  %18 = load %jl_value_t*, %jl_value_t** %10, align 8
  store %jl_value_t* %18, %jl_value_t** %4, align 8
  %19 = add i64 %"#temp#.03", -1
  %20 = bitcast %jl_value_t* %18 to double**
  %21 = load double*, double** %20, align 8
  %22 = getelementptr double, double* %21, i64 %19
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L2.loopexit, label %if

This is also why the GC slots are still there, the metadata merely allows llvm to move it out of the loop, not deleting it...

yuyichao on 5 May 2016

Ok, I've tried #7258 again. Here's where we are for this test case:

g(n) = [ i/10 for i=1:n ]

The inner loop:

if:                                               ; preds = %if.lr.ph, %if
  %i.09 = phi i64 [ %2, %if.lr.ph ], [ %15, %if ]
  %st.08 = phi i64 [ %3, %if.lr.ph ], [ %9, %if ]
  %9 = add i64 %st.08, 1
  %10 = sitofp i64 %st.08 to double
  %11 = fdiv double %10, 1.000000e+01
  %12 = add i64 %i.09, -1
  %13 = load double*, double** %8, align 8
  %14 = getelementptr double, double* %13, i64 %12
  store double %11, double* %14, align 8
  %15 = add i64 %i.09, 1
  %16 = load i64, i64* %4, align 8
  %17 = icmp eq i64 %st.08, %16
  br i1 %17, label %L5.loopexit, label %if

There are two extra loads, one for the array pointer and one for the range endpoint. So there is still a ~2x performance gap.

JeffBezanson on 6 May 2016

Does the ~~store~~ load gets moved out of the loop with -O3?

yuyichao on 6 May 2016

Interesting; with -O3 the array pointer load is hoisted, but the extra i64 load is still there.

JeffBezanson on 6 May 2016

I noticed a similar effect in #15717 (https://github.com/JuliaLang/julia/pull/13463#issuecomment-204204985) The load of the end point is interesting. Maybe check Base.code_llvm(STDOUT, g, Tuple{Int}, false, true) to see why does LLVM think it is not ok to move it out? (Unless LLVM does this iteratively and don't want to move too many loads out at once....)

yuyichao on 6 May 2016

Here's the full output:

; ModuleID = 'collect_to!'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

%jl_value_t = type { %jl_value_t* }
%Generator.62 = type { [0 x i1], %UnitRange }
%UnitRange = type { i64, i64 }

define %jl_value_t* @"julia_collect_to!_55312"(%jl_value_t*, %Generator.62*, i64, i64) #0 {
top:
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !22, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !31, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !32, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* %0, i64 0, metadata !22, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %Generator.62* %1, i64 0, metadata !23, metadata !39), !dbg !38
  call void @llvm.dbg.value(metadata i64 %2, i64 0, metadata !24, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %3, i64 0, metadata !25, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %2, i64 0, metadata !26, metadata !37), !dbg !38
  %4 = getelementptr inbounds %Generator.62, %Generator.62* %1, i64 0, i32 1, i32 1, !dbg !40
  %5 = load i64, i64* %4, align 8, !dbg !40
  %6 = add i64 %5, 1, !dbg !40
  %7 = icmp eq i64 %6, %3, !dbg !40
  br i1 %7, label %L5, label %if.lr.ph, !dbg !40

if.lr.ph:                                         ; preds = %top
  %8 = bitcast %jl_value_t* %0 to double**, !dbg !41
  %.pre = load double*, double** %8, align 8, !dbg !41, !tbaa !42
  br label %if, !dbg !40

L5.loopexit:                                      ; preds = %if
  br label %L5, !dbg !46

L5:                                               ; preds = %L5.loopexit, %top
  ret %jl_value_t* %0, !dbg !46

if:                                               ; preds = %if.lr.ph, %if
  %i.09 = phi i64 [ %2, %if.lr.ph ], [ %14, %if ]
  %st.08 = phi i64 [ %3, %if.lr.ph ], [ %9, %if ]
  %9 = add i64 %st.08, 1, !dbg !47
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !35, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !36, metadata !37), !dbg !38
  %10 = sitofp i64 %st.08 to double, !dbg !51
  %11 = fdiv double %10, 1.000000e+01, !dbg !51
  call void @llvm.dbg.value(metadata double %11, i64 0, metadata !27, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !25, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i1 true, i64 0, metadata !33, metadata !37), !dbg !38
  %12 = add i64 %i.09, -1, !dbg !41
  %13 = getelementptr double, double* %.pre, i64 %12, !dbg !41
  store double %11, double* %13, align 8, !dbg !41, !tbaa !52
  %14 = add i64 %i.09, 1, !dbg !53
  call void @llvm.dbg.value(metadata i64 %14, i64 0, metadata !26, metadata !37), !dbg !38
  %15 = load i64, i64* %4, align 8, !dbg !40
  %16 = icmp eq i64 %st.08, %15, !dbg !40
  br i1 %16, label %L5.loopexit, label %if, !dbg !40
}

define %jl_value_t* @"jlcall_collect_to!_55312"(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = call %jl_value_t*** @jl_get_ptls_states()
  %4 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %5, align 8, !tbaa !54
  %6 = bitcast [3 x %jl_value_t*]* %4 to i64*
  store i64 2, i64* %6, align 8, !tbaa !54
  %7 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 1
  %8 = bitcast %jl_value_t*** %3 to i64*
  %9 = load i64, i64* %8, align 8
  %10 = bitcast %jl_value_t** %7 to i64*
  store i64 %9, i64* %10, align 8, !tbaa !54
  store %jl_value_t** %.sub, %jl_value_t*** %3, align 8
  %11 = load %jl_value_t*, %jl_value_t** %1, align 8
  %12 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  %13 = bitcast %jl_value_t** %12 to %Generator.62**
  %14 = load %Generator.62*, %Generator.62** %13, align 8
  %15 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
  %16 = bitcast %jl_value_t** %15 to i64**
  %17 = load i64*, i64** %16, align 8
  %18 = load i64, i64* %17, align 8
  %19 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 3
  %20 = bitcast %jl_value_t** %19 to i64**
  %21 = load i64*, i64** %20, align 8
  %22 = load i64, i64* %21, align 8
  %23 = call %jl_value_t* @"julia_collect_to!_55312"(%jl_value_t* %11, %Generator.62* %14, i64 %18, i64 %22) #0
  store %jl_value_t* %23, %jl_value_t** %5, align 8, !tbaa !54
  %24 = load i64, i64* %10, align 8, !tbaa !54
  store i64 %24, i64* %8, align 8, !tbaa !54
  ret %jl_value_t* %23
}

; Function Attrs: nounwind readnone
declare %jl_value_t*** @jl_get_ptls_states() #1

; Function Attrs: nounwind readnone
declare void @llvm.dbg.declare(metadata, metadata, metadata) #1

; Function Attrs: nounwind readnone
declare void @llvm.dbg.value(metadata, i64, metadata, metadata) #1

attributes #0 = { "no-frame-pointer-elim"="true" }
attributes #1 = { nounwind readnone }

!llvm.module.flags = !{!0, !1}
!llvm.dbg.cu = !{!2}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 1, !"Debug Info Version", i32 3}
!2 = distinct !DICompileUnit(language: DW_LANG_C89, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: 1, enums: !4, subprograms: !5)
!3 = !DIFile(filename: "array.jl", directory: ".")
!4 = !{}
!5 = !{!6}
!6 = !DISubprogram(name: "collect_to!", linkageName: "julia_collect_to!_55328", scope: null, file: !3, type: !7, isLocal: false, isDefinition: true, isOptimized: true, function: %jl_value_t* (%jl_value_t*, %Generator.62*, i64, i64)* @"julia_collect_to!_55312", variables: !19)
!7 = !DISubroutineType(types: !8)
!8 = !{!9, !13, !18, !18}
!9 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !10, size: 64, align: 64)
!10 = !DICompositeType(tag: DW_TAG_structure_type, name: "jl_value_t", file: !11, line: 71, align: 64, elements: !12)
!11 = !DIFile(filename: "julia.h", directory: "")
!12 = !{!9}
!13 = !DICompositeType(tag: DW_TAG_structure_type, name: "Generator", size: 128, align: 64, elements: !14, runtimeLang: DW_LANG_Julia)
!14 = !{!15, !16}
!15 = !DICompositeType(tag: DW_TAG_structure_type, name: "#a", align: 8, elements: !4, runtimeLang: DW_LANG_Julia)
!16 = !DICompositeType(tag: DW_TAG_structure_type, name: "UnitRange", size: 128, align: 64, elements: !17, runtimeLang: DW_LANG_Julia)
!17 = !{!18, !18}
!18 = !DIBasicType(name: "Int64", size: 64, align: 64, encoding: DW_ATE_unsigned)
!19 = !{!20, !22, !23, !24, !25, !26, !27, !29, !30, !31, !32, !33, !35, !36, !29}
!20 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "#self#", arg: 1, scope: !6, file: !3, line: 267, type: !21)
!21 = !DICompositeType(tag: DW_TAG_structure_type, name: "#collect_to!", elements: !4, runtimeLang: DW_LANG_Julia)
!22 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "dest", arg: 2, scope: !6, file: !3, line: 267, type: !9)
!23 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "itr", arg: 3, scope: !6, file: !3, line: 267, type: !13)
!24 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "offs", arg: 4, scope: !6, file: !3, line: 267, type: !18)
!25 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "st", arg: 5, scope: !6, file: !3, line: 267, type: !18)
!26 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "i", scope: !6, file: !3, line: 267, type: !18)
!27 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "el", scope: !6, file: !3, line: 267, type: !28)
!28 = !DIBasicType(name: "Float64", size: 64, align: 64, encoding: DW_ATE_unsigned)
!29 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "#temp#", scope: !6, file: !3, line: 267, type: !18)
!30 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "S", scope: !6, file: !3, line: 267, type: !9)
!31 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "R", scope: !6, file: !3, line: 267, type: !9)
!32 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "new", scope: !6, file: !3, line: 267, type: !9)
!33 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "#temp#", scope: !6, file: !3, line: 267, type: !34)
!34 = !DIBasicType(name: "Bool", size: 1, align: 8, encoding: DW_ATE_unsigned)
!35 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "v", scope: !6, file: !3, line: 267, type: !18)
!36 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "s2", scope: !6, file: !3, line: 267, type: !18)
!37 = !DIExpression()
!38 = !DILocation(line: 267, scope: !6)
!39 = !DIExpression(DW_OP_deref)
!40 = !DILocation(line: 268, column: 1, scope: !6)
!41 = !DILocation(line: 272, column: 1, scope: !6)
!42 = !{!"jtbaa_arrayptr", !43}
!43 = !{!"jtbaa_array", !44}
!44 = !{!"jtbaa_value", !45}
!45 = !{!"jtbaa"}
!46 = !DILocation(line: 282, column: 1, scope: !6)
!47 = !DILocation(line: 21, column: 1, scope: !48, inlinedAt: !50)
!48 = !DILexicalBlockFile(scope: !6, file: !49, discriminator: 0)
!49 = !DIFile(filename: "generator.jl", directory: ".")
!50 = !DILocation(line: 269, column: 1, scope: !6)
!51 = !DILocation(line: 22, column: 1, scope: !48, inlinedAt: !50)
!52 = !{!"jtbaa_user", !45}
!53 = !DILocation(line: 273, column: 1, scope: !6)
!54 = !{!"jtbaa_gcframe", !45}

JeffBezanson on 6 May 2016

What type is %Generator.62? Is it mutable? It looks like we are somehow missing a tbaa_user (if it's mutable) or tbaa_immut (if it's immutable) for the load from %4.

yuyichao on 6 May 2016

Reproduced with

f(i) = i / 10
immutable Gen
    iter::UnitRange{Int}
end
Base.start(g::Gen) = start(g.iter)
Base.done(g::Gen, s) = done(g.iter, s)
function Base.next(g::Gen, s)
    v, s2 = next(g.iter, s)
    f(v), s2
end
gen = Gen(1:10)
function collect_to2!{T}(dest::AbstractArray{T}, itr)
    i = 1
    st = start(itr)
    while !done(itr, st)
        el, st = next(itr, st)
        @inbounds dest[i] = el
        i += 1
    end
    return dest
end
@code_llvm collect_to2!(Vector{Float64}(10), gen)
@show collect_to2!(Vector{Float64}(10), gen)

seems to be unrelated to GC root (the above function doesn't have one) Might be related to https://github.com/JuliaLang/julia/issues/13104

yuyichao on 6 May 2016

@yuyichao is this completely fixed by #16230 or is more work needed?

tkelman on 12 May 2016

It still need work

yuyichao on 12 May 2016

We are now at performance parity for [ i/10 for i=1:n ]. Excellent progress. :tada:

JeffBezanson on 13 May 2016

🎉4

The next case is 2d, which is much harder as it uses a Product iterator. Here's a test case:

G = ((i+j)/1000 for i=1:2000, j=1:2000)

@code_llvm Base.collect_to!(zeros(2000,2000), G, 1, start(G))

Here's the inner loop:

if:                                               ; preds = %if.lr.ph, %if
  %i.07 = phi i64 [ %2, %if.lr.ph ], [ %35, %if ]
  call void @julia_next_54517({ double, { i64, i64, %Nullable.27, i8 } }* nonnull sret %4, %Generator.58* %1, { i64, i64, %Nullable.27, i8 }* nonnull %st) #0
  %26 = load i64, i64* %20, align 8
  %27 = load i64, i64* %21, align 8
  %28 = load i64, i64* %22, align 8
  %29 = load i8, i8* %23, align 8
  %30 = load i64, i64* %24, align 8
  %31 = load i8, i8* %25, align 8
  store i64 %27, i64* %6, align 8
  store i64 %28, i64* %8, align 8
  store i8 %29, i8* %10, align 8
  store i64 %30, i64* %12, align 8
  store i8 %31, i8* %14, align 8
  %32 = add i64 %i.07, -1
  %33 = getelementptr double, double* %19, i64 %32
  %34 = bitcast double* %33 to i64*
  store i64 %26, i64* %34, align 8
  %35 = add i64 %i.07, 1
  %36 = and i8 %31, 1
  %37 = icmp eq i8 %36, 0
  br i1 %37, label %if, label %L4.loopexit

Somewhat amazingly, this is only 30% slower than 0.4. Adding @_inline_meta for Generator iteration removes the function call and gets us within ~10%. However, another interesting feature of the above code is that it gets _much slower_ (almost 2x) with -O3, since it gets "vectorized" but the vectors are constantly stored to the stack (at least I think that's the problem). In the better, fully-inlined version -O3 doesn't cause a problem.

JeffBezanson on 13 May 2016

Somewhat amazingly, this is only 30% slower than 0.4. Adding @_inline_meta for Generator iteration removes the function call and gets us within ~10%.

I assume the current version is basically fully inlined so if the performance is only different by 10% with inlining, this seems to be a inlining problem?

yuyichao on 13 May 2016

Well, I'm happy to add the extra @inline declaration, but from there a 10% slowdown is still not good and is probably due to the messy product iterator state.

JeffBezanson on 13 May 2016

The consensus right now seems to be track these in a single issue. So move to https://github.com/JuliaLang/julia/issues/15369#issuecomment-224953393

yuyichao on 9 Jun 2016

(Given that the performance issue for the inner loop isn't there anymore)

yuyichao on 9 Jun 2016

Was this page helpful?

0 / 5 - 0 ratings