There is a strange slowdown when using copy in a for loop.
using BenchmarkTools
f1()=randn(14,14,1024,32)
f2()=(for i in 1:1; randn(14,14,1024,32); end)
f3()=copy(randn(14,14,1024,32))
f4()=(for i in 1:1; copy(randn(14,14,1024,32)); end)
for f in (f1,f2,f3,f4); println(@benchmark $f()); end
# julia 1.0 output
Trial(55.839 ms)
Trial(55.836 ms)
Trial(81.300 ms)
Trial(148.378 ms)
# julia 0.6.4 output:
Trial(68.128 ms)
Trial(65.547 ms)
Trial(92.005 ms)
Trial(91.002 ms)
Need to be careful that this isn't a benchmarking artifact, for example:
julia> for f in (f1,f2,f3,f4); @btime $f(); end
57.959 ms (8 allocations: 49.00 MiB)
57.947 ms (8 allocations: 49.00 MiB)
73.382 ms (10 allocations: 98.00 MiB)
73.248 ms (10 allocations: 98.00 MiB)
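One way to double-check for benchmarking artifacts (hedged sketch: the samples/evals values here are arbitrary) is to pin BenchmarkTools' evaluation parameters so all four functions are measured the same way, one evaluation per sample:
using BenchmarkTools
f1() = randn(14,14,1024,32)
f4() = (for i in 1:1; copy(randn(14,14,1024,32)); end)
for f in (f1, f4)
    println(@benchmark $f() samples=20 evals=1)
end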
Thanks, I had thought the same, but I don't think so. If you look at this comment and the previous one from the referenced issue you will see:
https://github.com/denizyuret/Knet.jl/issues/345#issuecomment-417711416
Ekin
A simple loop also fails to show any difference:
julia> @time for i in 1:10 f4() end
0.767406 seconds (100 allocations: 980.004 MiB, 10.57% gc time)
julia> @time for i in 1:10 f3() end
0.772148 seconds (100 allocations: 980.004 MiB, 11.71% gc time)
Did you try with adding length(x) to the end of each function as commented in the previous link?
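For reference, the length(x) trick mentioned (returning something derived from the array so the allocation cannot be elided) would look roughly like this; a hedged sketch, the exact form in the linked comment may differ:
f3b() = (x = copy(randn(14,14,1024,32)); length(x))
f4b() = (s = 0; for i in 1:1; x = copy(randn(14,14,1024,32)); s += length(x); end; s)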
Please show the two full function definitions that give a consistent time difference when using @time.
Here is the best replication I can manage without pulling in all of Knet. Both fslow and ffast should be doing the same thing, but one is twice as fast as the other.
mutable struct MyArray{T,N}; a; d; end
MyArray{T,N}(::UndefInitializer, d::NTuple{N,Int}) where {T,N} = MyArray{T,N}(Array{T}(undef,d), d)
MyArray{T,N}(x::AbstractArray{S,N}) where {T,N,S} = copyto!(MyArray{T,N}(undef,size(x)), convert(Array{T,N},x))
Base.copyto!(a::MyArray,x)=copyto!(a.a,x)
fslow() = for i in 1:10 # takes 1.62s
a = randn(14,14,1024,32)
x = MyArray{Float32,4}(a)
end
ffast() = for i in 1:10 # takes 0.86s
a = convert(Array{Float32,4},randn(14,14,1024,32))
x = MyArray{Float32,4}(a)
end
Comparing the code_warntype of the two functions (thanks @shashi) the difference seems to be shuffling a few instructions. I still don't see how this can make a difference. What next step of analysis would you suggest?
slow:
13. randn Float64 array0
19. alloc Float32 array1
20. create MyArray with array1
26. alloc Float32 array2
27. copyto! array2 <- array0
28. copyto! MyArray <- array2
fast:
13. randn Float64 array0
19. alloc Float32 array1
20. copyto! array1 <- array0
26. alloc Float32 array2
27. create MyArray with array2
28. copyto! MyArray <- array1
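(For anyone reproducing this: the listings above were distilled by hand from output along these lines, assuming the MyArray definitions above are loaded.)
using InteractiveUtils  # provides @code_warntype in Julia 1.0
@code_warntype fslow()
@code_warntype ffast()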
There seems to be difference in the gc() behavior:
julia> @time fslow()
1.640096 seconds (160 allocations: 980.006 MiB, 47.69% gc time)
julia> @time ffast()
0.944209 seconds (160 allocations: 980.006 MiB, 16.97% gc time)
This indeed seems to be regarding the GC:
Some cleaned up Profile traces:
ffast:
540 /Users/kristoffer/julia/src/task.c:268; start_task
540 /Users/kristoffer/julia/src/./julia.h:1537; jl_apply
540 ./task.jl:259; (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})()
540 .../usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117; macro expansion
540 .../usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
540 ./boot.jl:319; eval(::Module, ::Any)
540 /Users/kristoffer/julia/src/builtins.c:622; jl_toplevel_eval_in
540 /Users/kristoffer/julia/src/toplevel.c:807; jl_toplevel_eval_flex
540 /Users/kristoffer/julia/src/gf.c:1831; jl_fptr_trampoline
540 ...re/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
468 ./REPL[11]:2; ffast()
72 ./REPL[11]:3; ffast()
72 ./REPL[3]:1; Type
72 ./REPL[2]:1; Type
72 ./boot.jl:411; Type
72 ./boot.jl:406; Type
72 /Users/kristoffer/julia/src/array.c:413; jl_new_array
72 /Users/kristoffer/julia/src/array.c:158; _new_array
72 /Users/kristoffer/julia/src/array.c:112; _new_array_
72 .../kristoffer/julia/src/./julia_internal.h:274; jl_gc_alloc
72 /Users/kristoffer/julia/src/gc.c:954; jl_gc_pool_alloc
72 /Users/kristoffer/julia/src/gc.c:2633; jl_gc_collect
72 /Users/kristoffer/julia/src/gc.c:2468; _jl_gc_collect
39 /Users/kristoffer/julia/src/gc.c:1815; gc_mark_loop
25 /Users/kristoffer/julia/src/gc.c:1551; gc_mark_scan_obj8
10 /Users/kristoffer/julia/src/gc.c:1425; gc_try_setmark
14 /Users/kristoffer/julia/src/gc.c:1428; gc_try_setmark
12 /Users/kristoffer/julia/src/gc.c:540; gc_setmark_tag
fslow
814 /Users/kristoffer/julia/src/task.c:268; start_task
814 /Users/kristoffer/julia/src/./julia.h:1537; jl_apply
814 ./task.jl:259; (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})()
814 .../usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117; macro expansion
814 .../usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
814 ./boot.jl:319; eval(::Module, ::Any)
814 /Users/kristoffer/julia/src/builtins.c:622; jl_toplevel_eval_in
814 /Users/kristoffer/julia/src/toplevel.c:807; jl_toplevel_eval_flex
814 /Users/kristoffer/julia/src/gf.c:1831; jl_fptr_trampoline
814 ...re/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
439 ./REPL[5]:3; fslow
439 ./REPL[3]:1; MyArray{Float32,4}(::Array{Float64,4})
342 ./REPL[2]:1; Type
342 ./boot.jl:411; Type
342 ./boot.jl:406; Type
342 /Users/kristoffer/julia/src/array.c:413; jl_new_array
342 /Users/kristoffer/julia/src/array.c:158; _new_array
342 /Users/kristoffer/julia/src/array.c:112; _new_array_
342 ...kristoffer/julia/src/./julia_internal.h:274; jl_gc_alloc
342 /Users/kristoffer/julia/src/gc.c:954; jl_gc_pool_alloc
342 /Users/kristoffer/julia/src/gc.c:2633; jl_gc_collect
335 /Users/kristoffer/julia/src/gc.c:2468; _jl_gc_collect
55 /Users/kristoffer/julia/src/gc.c:1804; gc_mark_loop
54 /Users/kristoffer/julia/src/gc.c:1516; gc_mark_scan_objarray
44 /Users/kristoffer/julia/src/gc.c:1425; gc_try_setmark
186 /Users/kristoffer/julia/src/gc.c:1815; gc_mark_loop
37 /Users/kristoffer/julia/src/gc.c:1547; gc_mark_scan_obj8
137 /Users/kristoffer/julia/src/gc.c:1551; gc_mark_scan_obj8
85 /Users/kristoffer/julia/src/gc.c:1425; gc_try_setmark
43 /Users/kristoffer/julia/src/gc.c:1428; gc_try_setmark
39 /Users/kristoffer/julia/src/gc.c:540; gc_setmark_tag
@yuyichao perhaps you have an idea what is going on here.
So we did a little bit of digging. IIUC, default_collect_interval controls, via should_collect, whether we should run a GC:
default_collect_interval = 5600*1024*sizeof(void*)
So Deniz's allocations, at 6272*1024*sizeof(void*) each, end up just a bit bigger than our default collect interval, triggering a GC collection on every iteration.
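Spelled out (a sketch that just redoes the arithmetic in Julia; the constants come from the gc.c values quoted above, assuming a 64-bit build):
elems       = 14 * 14 * 1024 * 32               # elements per randn array
alloc_bytes = elems * sizeof(Float64)           # 6272*1024*8 = 51_380_224 bytes
interval    = 5600 * 1024 * sizeof(Ptr{Cvoid})  # 45_875_200 bytes on 64-bit
@show alloc_bytes > interval                    # true: every allocation crosses it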
To add to the mystery:
fveryfast() = for i in 1:10
v = rand(14, 14, 1024, 32)
a = convert(Array{Float32,4}, v)
x = MyArray{Float32,4}(a)
end
julia> @btime fveryfast()
299.005 ms (110 allocations: 980.00 MiB)
julia> @btime ffast()
681.678 ms (160 allocations: 980.01 MiB)
Does everything run at the same speed with the GC turned off?
With GC off:
julia> @time fveryfast()
0.390459 seconds (114 allocations: 980.005 MiB)
julia> @time fslow()
0.669810 seconds (164 allocations: 980.006 MiB)
julia> @time ffast()
0.658035 seconds (164 allocations: 980.006 MiB)
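(For reference, a sketch of how numbers like these can be obtained with the Julia 1.0 API; the trailing GC.gc() releases the memory accumulated while collection was off:)
GC.gc()            # start from a clean heap
GC.enable(false)   # disable collection
@time fveryfast()
@time fslow()
@time ffast()
GC.enable(true)
GC.gc()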
ffast calls randn, fveryfast calls rand.
In ffast the conversion happens before the MyArray constructor is called, so the constructor needs less code and falls within the inlining threshold. In fslow, the MyArray constructor is not inlined. This seems to contribute to the difference.
Oh duh. Thanks for pointing that out.
I am currently testing:
mutable struct MyArray{T,N}; a; d; end
@inline MyArray{T,N}(::UndefInitializer, d::NTuple{N,Int}) where {T,N} = MyArray{T,N}(Array{T}(undef,d), d)
@inline MyArray{T,N}(x::AbstractArray{S,N}) where {T,N,S} = copyto!(MyArray{T,N}(undef,size(x)), convert(Array{T,N},x))
@inline Base.copyto!(a::MyArray,x)=copyto!(a.a,x)
fslow() = for i in 1:10 # takes 1.62s
a = randn(14,14,1024,32)
x = MyArray{Float32,4}(a)
end
ffast() = for i in 1:10 # takes 0.86s
a = convert(Array{Float32,4},randn(14,14,1024,32))
x = MyArray{Float32,4}(a)
end
to remove the inlining part.
I am also messing around with perf
perf record -g -k 1 ~/builds/julia/julia perfgc.jl
perf inject --jit -i perf.data -o perf.jit.data
perf report --call-graph -G -i perf.jit.data
perf script -i perf.jit.data | FlameGraph/stackcollapse-perf.pl > out.perf-folded
FlameGraph/flamegraph.pl out.perf-folded > perf-kernel.svg
with Flamegraph
- 44.43% 0.00% julia [.] japi1_fslow_-2142026221
- 44.43% japi1_fslow_-2142026221
+ 22.82% japi1_randn_-2142026219
- 19.32% jl_new_array
- 19.31% _new_array_
jl_gc_alloc
- jl_gc_pool_alloc
- 19.31% jl_gc_collect
- 19.04% _jl_gc_collect
18.81% gc_mark_loop
- 29.08% 0.00% julia [.] japi1_ffast_-2142026210
- japi1_ffast_-2142026210
+ 21.94% japi1_randn_-2142026219
- 5.23% jl_new_array
- 5.23% _new_array_
- 5.23% jl_gc_alloc
- jl_gc_pool_alloc
- 5.23% jl_gc_collect
- 4.71% _jl_gc_collect
4.34% gc_mark_loop
0.52% gc_sweep_sysimg
1.91% __memmove_avx_unaligned_erms
With the code posted here, I get
julia> @timev ffast();
1.513723 seconds (154 allocations: 980.006 MiB, 8.81% gc time)
elapsed time (ns): 1513722814
gc time (ns): 133397448
bytes allocated: 1027610400
pool allocs: 124
malloc() calls: 30
GC pauses: 30
full collections: 2
Putting @noinline on the MyArray constructors, I get
julia> @timev ffast();
2.282502 seconds (154 allocations: 980.006 MiB, 23.95% gc time)
elapsed time (ns): 2282502347
gc time (ns): 546578266
bytes allocated: 1027610400
pool allocs: 124
malloc() calls: 30
GC pauses: 30
full collections: 10
So the un-inlining adds full collections, but doesn't change anything else.
We think this can be solved by nulling roots in the compiler more aggressively. Changes to GC heuristics might also be warranted --- something seems off about repeatedly doing full collections on each iteration after failing to free anything (edit: it is freeing the arrays, but the pattern of allocations is still triggering full collections).
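As a rough illustration of what more aggressive root-nulling would buy (a hypothetical manual workaround, not the compiler change itself; assumes the MyArray definitions from earlier):
function fslow_nulled()
    for i in 1:10
        a = randn(14,14,1024,32)
        x = MyArray{Float32,4}(a)
        a = nothing  # drop the last root so the Float64 buffer is collectable
                     # before the next iteration's big allocation
    end
end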
We checked on 0.6 yesterday, and there both versions are equally slow and do the same number of full collections.
Talked to Jeff yesterday and I believe he should now have a patch to fix the full collection heuristic.
We have a patch that might fix the full collection issue (PR incoming) but there is still a remaining performance difference that AFAICT comes down to this:
function fslow()
for i = 1:10
a = zeros(Float64,14,14,1024,32)
b = Array{Float32}(undef, (14,14,1024,32))
c = Array{Float32}(undef, (14,14,1024,32))
copyto!(c, a)
copyto!(b, c)
end
end
function ffast()
for i = 1:10
a = zeros(Float64,14,14,1024,32)
b = Array{Float32}(undef, (14,14,1024,32))
copyto!(b, a)
c = Array{Float32}(undef, (14,14,1024,32))
copyto!(c, b)
end
end
This simplifies the allocation and copying events that happen in the examples. In the slow version, all three arrays need to be alive at the same time, whereas in the fast version one array can be freed sooner. Keno and I hypothesize that this lets the same space be reused, making more efficient use of memory.
Print the pointer of the array to verify that? (and maybe use a noinline function to minimize effect on the root placement)
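A possible shape for that check (hypothetical helper showptrs; the @noinline keeps the printing from perturbing root placement):
@noinline showptrs(a, b, c) = (@show pointer(a); @show pointer(b); @show pointer(c); nothing)

function ffast_ptrs()
    for i = 1:2
        a = zeros(Float64,14,14,1024,32)
        b = Array{Float32}(undef, (14,14,1024,32))
        copyto!(b, a)
        c = Array{Float32}(undef, (14,14,1024,32))
        copyto!(c, b)
        showptrs(a, b, c)
    end
end
ffast_ptrs()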
Fast:
pointer(a) = Ptr{Float64} @0x00007f38f2efe040
pointer(b) = Ptr{Float32} @0x000000000409bac0
pointer(c) = Ptr{Float32} @0x000000000591bb00
pointer(a) = Ptr{Float64} @0x00007f38f2efe040
pointer(b) = Ptr{Float32} @0x000000000409bac0
pointer(c) = Ptr{Float32} @0x000000000591bb00
Slow:
pointer(a) = Ptr{Float64} @0x00007f38f2efe040
pointer(b) = Ptr{Float32} @0x000000000409bac0
pointer(c) = Ptr{Float32} @0x000000000591bb00
pointer(a) = Ptr{Float64} @0x00007f38efdfd040
pointer(b) = Ptr{Float32} @0x000000000409bac0
pointer(c) = Ptr{Float32} @0x000000000591bb00
So yes, in the fast code the same space is reused for a on each iteration, and in the slow code it isn't.
Thank you @JeffBezanson @yuyichao. My original code, which includes similar (apparently not identical) structures to the abstracted code above, runs at the same speed on 238c6cd5ff399fc007856385d2e8704f69be68dd and 6354405908703fb379cfc24897512bc6c4ec8c74. GC time is generally around ~1%. We could close the issue, but it is better to keep it open for a couple of days to test different cases.
Should this be closed?
https://github.com/JuliaLang/julia/pull/29290 is not merged.