Julia: Unnecessary GC root for getfield of SSA immutable object

Created on 8 Mar 2016  路  20Comments  路  Source: JuliaLang/julia

Test case:

immutable Wrap{T}
  x::T
end

function f(a::Wrap)
  s = 0.0
  for i = 1:length(a.x)
    @inbounds s += a.x[i]
  end
  s
end

@code_llvm f(Wrap(rand(2,2)))

Inner loop code in v0.4:

  %"#s1.0" = phi i64 [ %13, %L ], [ 1, %L.preheader ]
  %s.0 = phi double [ %17, %L ], [ 0.000000e+00, %L.preheader ]
  %13 = add i64 %"#s1.0", 1
  %14 = add i64 %"#s1.0", -1
  %15 = getelementptr double* %12, i64 %14
  %16 = load double* %15, align 8
  %17 = fadd double %s.0, %16
  %18 = icmp eq i64 %"#s1.0", %7
  br i1 %18, label %L3, label %L

Inner loop code on master:

  %s.02 = phi double [ %22, %if ], [ 0.000000e+00, %if.preheader ]
  %"#s1.01" = phi i64 [ %15, %if ], [ 1, %if.preheader ]
  %15 = add i64 %"#s1.01", 1
  %16 = load %jl_value_t*, %jl_value_t** %8, align 8
  store %jl_value_t* %16, %jl_value_t** %3, align 8
  %17 = add i64 %"#s1.01", -1
  %18 = bitcast %jl_value_t* %16 to double**
  %19 = load double*, double** %18, align 8
  %20 = getelementptr double, double* %19, i64 %17
  %21 = load double, double* %20, align 8
  %22 = fadd double %s.02, %21
  %23 = icmp eq i64 %"#s1.01", %12
  br i1 %23, label %L2.loopexit, label %if

It seems to be rooting a.x on every iteration where we didn't before. This is one of the issues affecting #15146.

codegen performance regression

Most helpful comment

We are now at performance parity for [ i/10 for i=1:n ]. Excellent progress. :tada:

All 20 comments

Yeah, I've also just seen this for array stores.

On master the IR becames

if:                                               ; preds = %if.preheader, %if
  %s.04 = phi double [ %24, %if ], [ 0.000000e+00, %if.preheader ]
  %"#temp#.03" = phi i64 [ %17, %if ], [ 1, %if.preheader ]
  %17 = add i64 %"#temp#.03", 1
  store %jl_value_t* %0, %jl_value_t** %4, align 8
  %18 = load %jl_value_t*, %jl_value_t** %10, align 8
  store %jl_value_t* %18, %jl_value_t** %5, align 8
  %19 = add i64 %"#temp#.03", -1
  %20 = bitcast %jl_value_t* %18 to double**
  %21 = load double*, double** %20, align 8
  %22 = getelementptr double, double* %21, i64 %19
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L2.loopexit, label %if

So we are rooting a at every iteration too and this starts to affect containers of bitstype too. Ref https://github.com/JuliaLang/julia/issues/15717

I think the inner loop is free of gc frame store now.

if:                                               ; preds = %if.lr.ph, %if
  %s.04 = phi double [ 0.000000e+00, %if.lr.ph ], [ %24, %if ]
  %"#temp#.03" = phi i64 [ 1, %if.lr.ph ], [ %20, %if ]
  %20 = add i64 %"#temp#.03", 1
  %21 = add i64 %"#temp#.03", -1
  %22 = getelementptr double, double* %18, i64 %21
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L.L2_crit_edge, label %if

However, the function still has two gc slot and it shouldn't need any.

Oh, very nice. I wonder what fixed that. Anyway this might be enough to get #7258 off the ground.

I believe it's https://github.com/JuliaLang/julia/pull/13463. Commenting the tbaa decoration out results in

if:                                               ; preds = %if.preheader, %if
  %s.04 = phi double [ %24, %if ], [ 0.000000e+00, %if.preheader ]
  %"#temp#.03" = phi i64 [ %17, %if ], [ 1, %if.preheader ]
  %17 = add i64 %"#temp#.03", 1
  %18 = load %jl_value_t*, %jl_value_t** %10, align 8
  store %jl_value_t* %18, %jl_value_t** %4, align 8
  %19 = add i64 %"#temp#.03", -1
  %20 = bitcast %jl_value_t* %18 to double**
  %21 = load double*, double** %20, align 8
  %22 = getelementptr double, double* %21, i64 %19
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L2.loopexit, label %if

This is also why the GC slots are still there, the metadata merely allows llvm to move it out of the loop, not deleting it...

Ok, I've tried #7258 again. Here's where we are for this test case:

g(n) = [ i/10 for i=1:n ]

The inner loop:

if:                                               ; preds = %if.lr.ph, %if
  %i.09 = phi i64 [ %2, %if.lr.ph ], [ %15, %if ]
  %st.08 = phi i64 [ %3, %if.lr.ph ], [ %9, %if ]
  %9 = add i64 %st.08, 1
  %10 = sitofp i64 %st.08 to double
  %11 = fdiv double %10, 1.000000e+01
  %12 = add i64 %i.09, -1
  %13 = load double*, double** %8, align 8
  %14 = getelementptr double, double* %13, i64 %12
  store double %11, double* %14, align 8
  %15 = add i64 %i.09, 1
  %16 = load i64, i64* %4, align 8
  %17 = icmp eq i64 %st.08, %16
  br i1 %17, label %L5.loopexit, label %if

There are two extra loads, one for the array pointer and one for the range endpoint. So there is still a ~2x performance gap.

Does the store load gets moved out of the loop with -O3?

Interesting; with -O3 the array pointer load is hoisted, but the extra i64 load is still there.

I noticed a similar effect in #15717 (https://github.com/JuliaLang/julia/pull/13463#issuecomment-204204985) The load of the end point is interesting. Maybe check Base.code_llvm(STDOUT, g, Tuple{Int}, false, true) to see why does LLVM think it is not ok to move it out? (Unless LLVM does this iteratively and don't want to move too many loads out at once....)

Here's the full output:

; ModuleID = 'collect_to!'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

%jl_value_t = type { %jl_value_t* }
%Generator.62 = type { [0 x i1], %UnitRange }
%UnitRange = type { i64, i64 }

define %jl_value_t* @"julia_collect_to!_55312"(%jl_value_t*, %Generator.62*, i64, i64) #0 {
top:
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !22, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !31, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !32, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* %0, i64 0, metadata !22, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %Generator.62* %1, i64 0, metadata !23, metadata !39), !dbg !38
  call void @llvm.dbg.value(metadata i64 %2, i64 0, metadata !24, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %3, i64 0, metadata !25, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %2, i64 0, metadata !26, metadata !37), !dbg !38
  %4 = getelementptr inbounds %Generator.62, %Generator.62* %1, i64 0, i32 1, i32 1, !dbg !40
  %5 = load i64, i64* %4, align 8, !dbg !40
  %6 = add i64 %5, 1, !dbg !40
  %7 = icmp eq i64 %6, %3, !dbg !40
  br i1 %7, label %L5, label %if.lr.ph, !dbg !40

if.lr.ph:                                         ; preds = %top
  %8 = bitcast %jl_value_t* %0 to double**, !dbg !41
  %.pre = load double*, double** %8, align 8, !dbg !41, !tbaa !42
  br label %if, !dbg !40

L5.loopexit:                                      ; preds = %if
  br label %L5, !dbg !46

L5:                                               ; preds = %L5.loopexit, %top
  ret %jl_value_t* %0, !dbg !46

if:                                               ; preds = %if.lr.ph, %if
  %i.09 = phi i64 [ %2, %if.lr.ph ], [ %14, %if ]
  %st.08 = phi i64 [ %3, %if.lr.ph ], [ %9, %if ]
  %9 = add i64 %st.08, 1, !dbg !47
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !35, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !36, metadata !37), !dbg !38
  %10 = sitofp i64 %st.08 to double, !dbg !51
  %11 = fdiv double %10, 1.000000e+01, !dbg !51
  call void @llvm.dbg.value(metadata double %11, i64 0, metadata !27, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !25, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i1 true, i64 0, metadata !33, metadata !37), !dbg !38
  %12 = add i64 %i.09, -1, !dbg !41
  %13 = getelementptr double, double* %.pre, i64 %12, !dbg !41
  store double %11, double* %13, align 8, !dbg !41, !tbaa !52
  %14 = add i64 %i.09, 1, !dbg !53
  call void @llvm.dbg.value(metadata i64 %14, i64 0, metadata !26, metadata !37), !dbg !38
  %15 = load i64, i64* %4, align 8, !dbg !40
  %16 = icmp eq i64 %st.08, %15, !dbg !40
  br i1 %16, label %L5.loopexit, label %if, !dbg !40
}

define %jl_value_t* @"jlcall_collect_to!_55312"(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = call %jl_value_t*** @jl_get_ptls_states()
  %4 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %5, align 8, !tbaa !54
  %6 = bitcast [3 x %jl_value_t*]* %4 to i64*
  store i64 2, i64* %6, align 8, !tbaa !54
  %7 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 1
  %8 = bitcast %jl_value_t*** %3 to i64*
  %9 = load i64, i64* %8, align 8
  %10 = bitcast %jl_value_t** %7 to i64*
  store i64 %9, i64* %10, align 8, !tbaa !54
  store %jl_value_t** %.sub, %jl_value_t*** %3, align 8
  %11 = load %jl_value_t*, %jl_value_t** %1, align 8
  %12 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  %13 = bitcast %jl_value_t** %12 to %Generator.62**
  %14 = load %Generator.62*, %Generator.62** %13, align 8
  %15 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
  %16 = bitcast %jl_value_t** %15 to i64**
  %17 = load i64*, i64** %16, align 8
  %18 = load i64, i64* %17, align 8
  %19 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 3
  %20 = bitcast %jl_value_t** %19 to i64**
  %21 = load i64*, i64** %20, align 8
  %22 = load i64, i64* %21, align 8
  %23 = call %jl_value_t* @"julia_collect_to!_55312"(%jl_value_t* %11, %Generator.62* %14, i64 %18, i64 %22) #0
  store %jl_value_t* %23, %jl_value_t** %5, align 8, !tbaa !54
  %24 = load i64, i64* %10, align 8, !tbaa !54
  store i64 %24, i64* %8, align 8, !tbaa !54
  ret %jl_value_t* %23
}

; Function Attrs: nounwind readnone
declare %jl_value_t*** @jl_get_ptls_states() #1

; Function Attrs: nounwind readnone
declare void @llvm.dbg.declare(metadata, metadata, metadata) #1

; Function Attrs: nounwind readnone
declare void @llvm.dbg.value(metadata, i64, metadata, metadata) #1

attributes #0 = { "no-frame-pointer-elim"="true" }
attributes #1 = { nounwind readnone }

!llvm.module.flags = !{!0, !1}
!llvm.dbg.cu = !{!2}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 1, !"Debug Info Version", i32 3}
!2 = distinct !DICompileUnit(language: DW_LANG_C89, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: 1, enums: !4, subprograms: !5)
!3 = !DIFile(filename: "array.jl", directory: ".")
!4 = !{}
!5 = !{!6}
!6 = !DISubprogram(name: "collect_to!", linkageName: "julia_collect_to!_55328", scope: null, file: !3, type: !7, isLocal: false, isDefinition: true, isOptimized: true, function: %jl_value_t* (%jl_value_t*, %Generator.62*, i64, i64)* @"julia_collect_to!_55312", variables: !19)
!7 = !DISubroutineType(types: !8)
!8 = !{!9, !13, !18, !18}
!9 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !10, size: 64, align: 64)
!10 = !DICompositeType(tag: DW_TAG_structure_type, name: "jl_value_t", file: !11, line: 71, align: 64, elements: !12)
!11 = !DIFile(filename: "julia.h", directory: "")
!12 = !{!9}
!13 = !DICompositeType(tag: DW_TAG_structure_type, name: "Generator", size: 128, align: 64, elements: !14, runtimeLang: DW_LANG_Julia)
!14 = !{!15, !16}
!15 = !DICompositeType(tag: DW_TAG_structure_type, name: "#a", align: 8, elements: !4, runtimeLang: DW_LANG_Julia)
!16 = !DICompositeType(tag: DW_TAG_structure_type, name: "UnitRange", size: 128, align: 64, elements: !17, runtimeLang: DW_LANG_Julia)
!17 = !{!18, !18}
!18 = !DIBasicType(name: "Int64", size: 64, align: 64, encoding: DW_ATE_unsigned)
!19 = !{!20, !22, !23, !24, !25, !26, !27, !29, !30, !31, !32, !33, !35, !36, !29}
!20 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "#self#", arg: 1, scope: !6, file: !3, line: 267, type: !21)
!21 = !DICompositeType(tag: DW_TAG_structure_type, name: "#collect_to!", elements: !4, runtimeLang: DW_LANG_Julia)
!22 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "dest", arg: 2, scope: !6, file: !3, line: 267, type: !9)
!23 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "itr", arg: 3, scope: !6, file: !3, line: 267, type: !13)
!24 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "offs", arg: 4, scope: !6, file: !3, line: 267, type: !18)
!25 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "st", arg: 5, scope: !6, file: !3, line: 267, type: !18)
!26 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "i", scope: !6, file: !3, line: 267, type: !18)
!27 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "el", scope: !6, file: !3, line: 267, type: !28)
!28 = !DIBasicType(name: "Float64", size: 64, align: 64, encoding: DW_ATE_unsigned)
!29 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "#temp#", scope: !6, file: !3, line: 267, type: !18)
!30 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "S", scope: !6, file: !3, line: 267, type: !9)
!31 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "R", scope: !6, file: !3, line: 267, type: !9)
!32 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "new", scope: !6, file: !3, line: 267, type: !9)
!33 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "#temp#", scope: !6, file: !3, line: 267, type: !34)
!34 = !DIBasicType(name: "Bool", size: 1, align: 8, encoding: DW_ATE_unsigned)
!35 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "v", scope: !6, file: !3, line: 267, type: !18)
!36 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "s2", scope: !6, file: !3, line: 267, type: !18)
!37 = !DIExpression()
!38 = !DILocation(line: 267, scope: !6)
!39 = !DIExpression(DW_OP_deref)
!40 = !DILocation(line: 268, column: 1, scope: !6)
!41 = !DILocation(line: 272, column: 1, scope: !6)
!42 = !{!"jtbaa_arrayptr", !43}
!43 = !{!"jtbaa_array", !44}
!44 = !{!"jtbaa_value", !45}
!45 = !{!"jtbaa"}
!46 = !DILocation(line: 282, column: 1, scope: !6)
!47 = !DILocation(line: 21, column: 1, scope: !48, inlinedAt: !50)
!48 = !DILexicalBlockFile(scope: !6, file: !49, discriminator: 0)
!49 = !DIFile(filename: "generator.jl", directory: ".")
!50 = !DILocation(line: 269, column: 1, scope: !6)
!51 = !DILocation(line: 22, column: 1, scope: !48, inlinedAt: !50)
!52 = !{!"jtbaa_user", !45}
!53 = !DILocation(line: 273, column: 1, scope: !6)
!54 = !{!"jtbaa_gcframe", !45}

What type is %Generator.62? Is it mutable? It looks like we are somehow missing a tbaa_user (if it's mutable) or tbaa_immut (if it's immutable) for the load from %4.

Reproduced with

f(i) = i / 10
immutable Gen
    iter::UnitRange{Int}
end
Base.start(g::Gen) = start(g.iter)
Base.done(g::Gen, s) = done(g.iter, s)
function Base.next(g::Gen, s)
    v, s2 = next(g.iter, s)
    f(v), s2
end
gen = Gen(1:10)
function collect_to2!{T}(dest::AbstractArray{T}, itr)
    i = 1
    st = start(itr)
    while !done(itr, st)
        el, st = next(itr, st)
        @inbounds dest[i] = el
        i += 1
    end
    return dest
end
@code_llvm collect_to2!(Vector{Float64}(10), gen)
@show collect_to2!(Vector{Float64}(10), gen)

seems to be unrelated to GC root (the above function doesn't have one) Might be related to https://github.com/JuliaLang/julia/issues/13104

@yuyichao is this completely fixed by #16230 or is more work needed?

It still need work

We are now at performance parity for [ i/10 for i=1:n ]. Excellent progress. :tada:

The next case is 2d, which is much harder as it uses a Product iterator. Here's a test case:

G = ((i+j)/1000 for i=1:2000, j=1:2000)

@code_llvm Base.collect_to!(zeros(2000,2000), G, 1, start(G))

Here's the inner loop:

if:                                               ; preds = %if.lr.ph, %if
  %i.07 = phi i64 [ %2, %if.lr.ph ], [ %35, %if ]
  call void @julia_next_54517({ double, { i64, i64, %Nullable.27, i8 } }* nonnull sret %4, %Generator.58* %1, { i64, i64, %Nullable.27, i8 }* nonnull %st) #0
  %26 = load i64, i64* %20, align 8
  %27 = load i64, i64* %21, align 8
  %28 = load i64, i64* %22, align 8
  %29 = load i8, i8* %23, align 8
  %30 = load i64, i64* %24, align 8
  %31 = load i8, i8* %25, align 8
  store i64 %27, i64* %6, align 8
  store i64 %28, i64* %8, align 8
  store i8 %29, i8* %10, align 8
  store i64 %30, i64* %12, align 8
  store i8 %31, i8* %14, align 8
  %32 = add i64 %i.07, -1
  %33 = getelementptr double, double* %19, i64 %32
  %34 = bitcast double* %33 to i64*
  store i64 %26, i64* %34, align 8
  %35 = add i64 %i.07, 1
  %36 = and i8 %31, 1
  %37 = icmp eq i8 %36, 0
  br i1 %37, label %if, label %L4.loopexit

Somewhat amazingly, this is only 30% slower than 0.4. Adding @_inline_meta for Generator iteration removes the function call and gets us within ~10%. However, another interesting feature of the above code is that it gets _much slower_ (almost 2x) with -O3, since it gets "vectorized" but the vectors are constantly stored to the stack (at least I think that's the problem). In the better, fully-inlined version -O3 doesn't cause a problem.

Somewhat amazingly, this is only 30% slower than 0.4. Adding @_inline_meta for Generator iteration removes the function call and gets us within ~10%.

I assume the current version is basically fully inlined so if the performance is only different by 10% with inlining, this seems to be a inlining problem?

Well, I'm happy to add the extra @inline declaration, but from there a 10% slowdown is still not good and is probably due to the messy product iterator state.

The consensus right now seems to be track these in a single issue. So move to https://github.com/JuliaLang/julia/issues/15369#issuecomment-224953393

(Given that the performance issue for the inner loop isn't there anymore)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

yurivish picture yurivish  路  3Comments

helgee picture helgee  路  3Comments

TotalVerb picture TotalVerb  路  3Comments

StefanKarpinski picture StefanKarpinski  路  3Comments

manor picture manor  路  3Comments