There is a strange slowdown when using copy in a for loop.
using BenchmarkTools
f1()=randn(14,14,1024,32)
f2()=(for i in 1:1; randn(14,14,1024,32); end)
f3()=copy(randn(14,14,1024,32))
f4()=(for i in 1:1; copy(randn(14,14,1024,32)); end)
for f in (f1,f2,f3,f4); println(@benchmark $f()); end
# julia 1.0 output
Trial(55.839 ms)
Trial(55.836 ms)
Trial(81.300 ms)
Trial(148.378 ms)
# julia 0.6.4 output:
Trial(68.128 ms)
Trial(65.547 ms)
Trial(92.005 ms)
Trial(91.002 ms)
Need to be careful that this isn't a benchmarking artifact, for example:
julia> for f in (f1,f2,f3,f4); @btime $f(); end
57.959 ms (8 allocations: 49.00 MiB)
57.947 ms (8 allocations: 49.00 MiB)
73.382 ms (10 allocations: 98.00 MiB)
73.248 ms (10 allocations: 98.00 MiB)
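One way to double-check for benchmarking artifacts (hedged sketch: the samples/evals values here are arbitrary) is to pin BenchmarkTools' evaluation parameters so all four functions are measured the same way, one evaluation per sample:
using BenchmarkTools
f1() = randn(14,14,1024,32)
f4() = (for i in 1:1; copy(randn(14,14,1024,32)); end)
for f in (f1, f4)
    println(@benchmark $f() samples=20 evals=1)
end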
Thanks, I had thought the same, but I don't think so. If you look at this comment and the previous one from the referenced issue you will see:
https://github.com/denizyuret/Knet.jl/issues/345#issuecomment-417711416
Ekin
A simple loop also fails to show any difference:
julia> @time for i in 1:10 f4() end
0.767406 seconds (100 allocations: 980.004 MiB, 10.57% gc time)
julia> @time for i in 1:10 f3() end
0.772148 seconds (100 allocations: 980.004 MiB, 11.71% gc time)
Did you try with adding length(x) to the end of each function as commented in the previous link?
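For reference, the length(x) trick mentioned (returning something derived from the array so the allocation cannot be elided) would look roughly like this; a hedged sketch, the exact form in the linked comment may differ:
f3b() = (x = copy(randn(14,14,1024,32)); length(x))
f4b() = (s = 0; for i in 1:1; x = copy(randn(14,14,1024,32)); s += length(x); end; s)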
Please show the two full function definitions that give a consistent time difference when using @time.
Here is the best replication I can manage without pulling in all of Knet. Both fslow and ffast should be doing the same thing, but one is twice as fast as the other.
mutable struct MyArray{T,N}; a; d; end
MyArray{T,N}(::UndefInitializer, d::NTuple{N,Int}) where {T,N} = MyArray{T,N}(Array{T}(undef,d), d)
MyArray{T,N}(x::AbstractArray{S,N}) where {T,N,S} = copyto!(MyArray{T,N}(undef,size(x)), convert(Array{T,N},x))
Base.copyto!(a::MyArray,x)=copyto!(a.a,x)
fslow() = for i in 1:10 # takes 1.62s
a = randn(14,14,1024,32)
x = MyArray{Float32,4}(a)
end
ffast() = for i in 1:10 # takes 0.86s
a = convert(Array{Float32,4},randn(14,14,1024,32))
x = MyArray{Float32,4}(a)
end
Comparing the code_warntype of the two functions (thanks @shashi) the difference seems to be shuffling a few instructions. I still don't see how this can make a difference. What next step of analysis would you suggest?
slow:
13. randn Float64 array0
19. alloc Float32 array1
20. create MyArray with array1
26. alloc Float32 array2
27. copyto! array2 <- array0
28. copyto! MyArray <- array2
fast:
13. randn Float64 array0
19. alloc Float32 array1
20. copyto! array1 <- array0
26. alloc Float32 array2
27. create MyArray with array2
28. copyto! MyArray <- array1
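(For anyone reproducing this: the listings above were distilled by hand from output along these lines, assuming the MyArray definitions above are loaded.)
using InteractiveUtils  # provides @code_warntype in Julia 1.0
@code_warntype fslow()
@code_warntype ffast()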
There seems to be difference in the gc() behavior:
julia> @time fslow()
1.640096 seconds (160 allocations: 980.006 MiB, 47.69% gc time)
julia> @time ffast()
0.944209 seconds (160 allocations: 980.006 MiB, 16.97% gc time)
This indeed seems to be regarding the GC:
Some cleaned up Profile traces:
ffast:
540 /Users/kristoffer/julia/src/task.c:268; start_task
540 /Users/kristoffer/julia/src/./julia.h:1537; jl_apply
540 ./task.jl:259; (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})()
540 .../usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117; macro expansion
540 .../usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
540 ./boot.jl:319; eval(::Module, ::Any)
540 /Users/kristoffer/julia/src/builtins.c:622; jl_toplevel_eval_in
540 /Users/kristoffer/julia/src/toplevel.c:807; jl_toplevel_eval_flex
540 /Users/kristoffer/julia/src/gf.c:1831; jl_fptr_trampoline
540 ...re/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
468 ./REPL[11]:2; ffast()
72 ./REPL[11]:3; ffast()
72 ./REPL[3]:1; Type
72 ./REPL[2]:1; Type
72 ./boot.jl:411; Type
72 ./boot.jl:406; Type
72 /Users/kristoffer/julia/src/array.c:413; jl_new_array
72 /Users/kristoffer/julia/src/array.c:158; _new_array
72 /Users/kristoffer/julia/src/array.c:112; _new_array_
72 .../kristoffer/julia/src/./julia_internal.h:274; jl_gc_alloc
72 /Users/kristoffer/julia/src/gc.c:954; jl_gc_pool_alloc
72 /Users/kristoffer/julia/src/gc.c:2633; jl_gc_collect
72 /Users/kristoffer/julia/src/gc.c:2468; _jl_gc_collect
39 /Users/kristoffer/julia/src/gc.c:1815; gc_mark_loop
25 /Users/kristoffer/julia/src/gc.c:1551; gc_mark_scan_obj8
10 /Users/kristoffer/julia/src/gc.c:1425; gc_try_setmark
14 /Users/kristoffer/julia/src/gc.c:1428; gc_try_setmark
12 /Users/kristoffer/julia/src/gc.c:540; gc_setmark_tag
fslow
814 /Users/kristoffer/julia/src/task.c:268; start_task
814 /Users/kristoffer/julia/src/./julia.h:1537; jl_apply
814 ./task.jl:259; (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})()
814 .../usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117; macro expansion
814 .../usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
814 ./boot.jl:319; eval(::Module, ::Any)
814 /Users/kristoffer/julia/src/builtins.c:622; jl_toplevel_eval_in
814 /Users/kristoffer/julia/src/toplevel.c:807; jl_toplevel_eval_flex
814 /Users/kristoffer/julia/src/gf.c:1831; jl_fptr_trampoline
814 ...re/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
439 ./REPL[5]:3; fslow
439 ./REPL[3]:1; MyArray{Float32,4}(::Array{Float64,4})
342 ./REPL[2]:1; Type
342 ./boot.jl:411; Type
342 ./boot.jl:406; Type
342 /Users/kristoffer/julia/src/array.c:413; jl_new_array
342 /Users/kristoffer/julia/src/array.c:158; _new_array
342 /Users/kristoffer/julia/src/array.c:112; _new_array_
342 ...kristoffer/julia/src/./julia_internal.h:274; jl_gc_alloc
342 /Users/kristoffer/julia/src/gc.c:954; jl_gc_pool_alloc
342 /Users/kristoffer/julia/src/gc.c:2633; jl_gc_collect
335 /Users/kristoffer/julia/src/gc.c:2468; _jl_gc_collect
55 /Users/kristoffer/julia/src/gc.c:1804; gc_mark_loop
54 /Users/kristoffer/julia/src/gc.c:1516; gc_mark_scan_objarray
44 /Users/kristoffer/julia/src/gc.c:1425; gc_try_setmark
186 /Users/kristoffer/julia/src/gc.c:1815; gc_mark_loop
37 /Users/kristoffer/julia/src/gc.c:1547; gc_mark_scan_obj8
137 /Users/kristoffer/julia/src/gc.c:1551; gc_mark_scan_obj8
85 /Users/kristoffer/julia/src/gc.c:1425; gc_try_setmark
43 /Users/kristoffer/julia/src/gc.c:1428; gc_try_setmark
39 /Users/kristoffer/julia/src/gc.c:540; gc_setmark_tag
@yuyichao perhaps you have an idea what is going on here.
So we did a little bit of digging. IIUC, default_collect_interval controls, via should_collect, whether we should run a GC:
default_collect_interval = 5600*1024*sizeof(void*)
So Deniz's allocations, at 6272*1024*sizeof(void*) each, end up just a bit bigger than our default collect interval, triggering a GC collection on every iteration.
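Spelled out (a sketch that just redoes the arithmetic in Julia; the constants come from the gc.c values quoted above, assuming a 64-bit build):
elems       = 14 * 14 * 1024 * 32               # elements per randn array
alloc_bytes = elems * sizeof(Float64)           # 6272*1024*8 = 51_380_224 bytes
interval    = 5600 * 1024 * sizeof(Ptr{Cvoid})  # 45_875_200 bytes on 64-bit
@show alloc_bytes > interval                    # true: every allocation crosses it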
To add to the mystery:
fveryfast() = for i in 1:10
v = rand(14, 14, 1024, 32)
a = convert(Array{Float32,4}, v)
x = MyArray{Float32,4}(a)
end
julia> @btime fveryfast()
299.005 ms (110 allocations: 980.00 MiB)
julia> @btime ffast()
681.678 ms (160 allocations: 980.01 MiB)
Does everything run at the same speed with the GC turned off?
With GC off:
julia> @time fveryfast()
0.390459 seconds (114 allocations: 980.005 MiB)
julia> @time fslow()
0.669810 seconds (164 allocations: 980.006 MiB)
julia> @time ffast()
0.658035 seconds (164 allocations: 980.006 MiB)
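(For reference, a sketch of how numbers like these can be obtained with the Julia 1.0 API; the trailing GC.gc() releases the memory accumulated while collection was off:)
GC.gc()            # start from a clean heap
GC.enable(false)   # disable collection
@time fveryfast()
@time fslow()
@time ffast()
GC.enable(true)
GC.gc()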
ffast calls randn, fveryfast calls rand.
In ffast the conversion happens before the MyArray constructor is called, so the constructor needs less code and falls within the inlining threshold. In fslow, the MyArray constructor is not inlined. This seems to contribute to the difference.
Oh duh. Thanks for pointing that out.
I am currently testing:
mutable struct MyArray{T,N}; a; d; end
@inline MyArray{T,N}(::UndefInitializer, d::NTuple{N,Int}) where {T,N} = MyArray{T,N}(Array{T}(undef,d), d)
@inline MyArray{T,N}(x::AbstractArray{S,N}) where {T,N,S} = copyto!(MyArray{T,N}(undef,size(x)), convert(Array{T,N},x))
@inline Base.copyto!(a::MyArray,x)=copyto!(a.a,x)
fslow() = for i in 1:10 # takes 1.62s
a = randn(14,14,1024,32)
x = MyArray{Float32,4}(a)
end
ffast() = for i in 1:10 # takes 0.86s
a = convert(Array{Float32,4},randn(14,14,1024,32))
x = MyArray{Float32,4}(a)
end
to remove the inlining part.
I am also messing around with perf
perf record -g -k 1 ~/builds/julia/julia perfgc.jl
perf inject --jit -i perf.data -o perf.jit.data
perf report --call-graph -G -i perf.jit.data
perf script -i perf.jit.data | FlameGraph/stackcollapse-perf.pl > out.perf-folded
FlameGraph/flamegraph.pl out.perf-folded > perf-kernel.svg
with Flamegraph
- 44.43% 0.00% julia [.] japi1_fslow_-2142026221
- 44.43% japi1_fslow_-2142026221
+ 22.82% japi1_randn_-2142026219
- 19.32% jl_new_array
- 19.31% _new_array_
jl_gc_alloc
- jl_gc_pool_alloc
- 19.31% jl_gc_collect
- 19.04% _jl_gc_collect
18.81% gc_mark_loop
- 29.08% 0.00% julia [.] japi1_ffast_-2142026210
- japi1_ffast_-2142026210
+ 21.94% japi1_randn_-2142026219
- 5.23% jl_new_array
- 5.23% _new_array_
- 5.23% jl_gc_alloc
- jl_gc_pool_alloc
- 5.23% jl_gc_collect
- 4.71% _jl_gc_collect
4.34% gc_mark_loop
0.52% gc_sweep_sysimg
1.91% __memmove_avx_unaligned_erms
With the code posted here, I get
julia> @timev ffast();
1.513723 seconds (154 allocations: 980.006 MiB, 8.81% gc time)
elapsed time (ns): 1513722814
gc time (ns): 133397448
bytes allocated: 1027610400
pool allocs: 124
malloc() calls: 30
GC pauses: 30
full collections: 2
Putting @noinline on the MyArray constructors, I get
julia> @timev ffast();
2.282502 seconds (154 allocations: 980.006 MiB, 23.95% gc time)
elapsed time (ns): 2282502347
gc time (ns): 546578266
bytes allocated: 1027610400
pool allocs: 124
malloc() calls: 30
GC pauses: 30
full collections: 10
So the un-inlining adds full collections, but doesn't change anything else.
We think this can be solved by nulling roots in the compiler more aggressively. Changes to GC heuristics might also be warranted --- something seems off about repeatedly doing full collections on each iteration after failing to free anything (edit: it is freeing the arrays, but the pattern of allocations is still triggering full collections).
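As a rough illustration of what more aggressive root-nulling would buy (a hypothetical manual workaround, not the compiler change itself; assumes the MyArray definitions from earlier):
function fslow_nulled()
    for i in 1:10
        a = randn(14,14,1024,32)
        x = MyArray{Float32,4}(a)
        a = nothing  # drop the last root so the Float64 buffer is collectable
                     # before the next iteration's big allocation
    end
end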
We checked on 0.6 yesterday, and there both versions are equally slow and do the same number of full collections.
Talked to Jeff yesterday and I believe he should now have a patch to fix the full collection heuristic.
We have a patch that might fix the full collection issue (PR incoming) but there is still a remaining performance difference that AFAICT comes down to this:
function fslow()
for i = 1:10
a = zeros(Float64,14,14,1024,32)
b = Array{Float32}(undef, (14,14,1024,32))
c = Array{Float32}(undef, (14,14,1024,32))
copyto!(c, a)
copyto!(b, c)
end
end
function ffast()
for i = 1:10
a = zeros(Float64,14,14,1024,32)
b = Array{Float32}(undef, (14,14,1024,32))
copyto!(b, a)
c = Array{Float32}(undef, (14,14,1024,32))
copyto!(c, b)
end
end
This simplifies the allocation and copying events that happen in the examples. In the slow version, all three arrays need to be alive at the same time, whereas in the fast version one array can be freed sooner. Keno and I hypothesize that this lets the same space be reused, making more efficient use of memory.
Print the pointer of the array to verify that? (and maybe use a noinline function to minimize effect on the root placement)
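A possible shape for that check (hypothetical helper showptrs; the @noinline keeps the printing from perturbing root placement):
@noinline showptrs(a, b, c) = (@show pointer(a); @show pointer(b); @show pointer(c); nothing)

function ffast_ptrs()
    for i = 1:2
        a = zeros(Float64,14,14,1024,32)
        b = Array{Float32}(undef, (14,14,1024,32))
        copyto!(b, a)
        c = Array{Float32}(undef, (14,14,1024,32))
        copyto!(c, b)
        showptrs(a, b, c)
    end
end
ffast_ptrs()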
Fast:
pointer(a) = Ptr{Float64} @0x00007f38f2efe040
pointer(b) = Ptr{Float32} @0x000000000409bac0
pointer(c) = Ptr{Float32} @0x000000000591bb00
pointer(a) = Ptr{Float64} @0x00007f38f2efe040
pointer(b) = Ptr{Float32} @0x000000000409bac0
pointer(c) = Ptr{Float32} @0x000000000591bb00
Slow:
pointer(a) = Ptr{Float64} @0x00007f38f2efe040
pointer(b) = Ptr{Float32} @0x000000000409bac0
pointer(c) = Ptr{Float32} @0x000000000591bb00
pointer(a) = Ptr{Float64} @0x00007f38efdfd040
pointer(b) = Ptr{Float32} @0x000000000409bac0
pointer(c) = Ptr{Float32} @0x000000000591bb00
So yes, in the fast code the same space is reused for a on each iteration, and in the slow code it isn't.
Thank you @JeffBezanson @yuyichao. My original code, which includes similar (apparently not identical) structures to the abstracted code above, runs at the same speed on 238c6cd5ff399fc007856385d2e8704f69be68dd and 6354405908703fb379cfc24897512bc6c4ec8c74. GC time is generally around ~1%. We could close the issue, but it is better to keep it open for a couple of days to test different cases.
Should this be closed?
https://github.com/JuliaLang/julia/pull/29290 is not merged.