In the REPL of v"1.1.0-DEV.499" (and similarly on v0.6.4 and v1.0.1),
x = [rand(1:99,4) for i=1:1_000_000];
@time y=Set.(x);
the first or second calculation is the fastest, but subsequent runs are much slower.
julia> @time y=Set.(x);
0.813826 seconds (5.20 M allocations: 491.006 MiB, 37.97% gc time)
julia> @time y=Set.(x);
0.388067 seconds (5.00 M allocations: 480.652 MiB)
julia> for i=1:20; @time y=Set.(x); end
0.822679 seconds (5.00 M allocations: 480.652 MiB, 45.58% gc time)
2.128358 seconds (5.00 M allocations: 480.652 MiB, 78.48% gc time)
1.834704 seconds (5.00 M allocations: 480.652 MiB, 78.35% gc time)
1.808520 seconds (5.00 M allocations: 480.652 MiB, 80.03% gc time)
1.728256 seconds (5.00 M allocations: 480.652 MiB, 79.76% gc time)
1.709768 seconds (5.00 M allocations: 480.652 MiB, 79.21% gc time)
1.678675 seconds (5.00 M allocations: 480.652 MiB, 79.24% gc time)
1.675938 seconds (5.00 M allocations: 480.652 MiB, 79.19% gc time)
1.662018 seconds (5.00 M allocations: 480.652 MiB, 78.57% gc time)
1.625075 seconds (5.00 M allocations: 480.652 MiB, 78.64% gc time)
1.607812 seconds (5.00 M allocations: 480.652 MiB, 78.16% gc time)
1.662902 seconds (5.00 M allocations: 480.652 MiB, 78.32% gc time)
1.621104 seconds (5.00 M allocations: 480.652 MiB, 77.27% gc time)
1.587675 seconds (5.00 M allocations: 480.652 MiB, 77.37% gc time)
1.657706 seconds (5.00 M allocations: 480.652 MiB, 73.21% gc time)
1.540019 seconds (5.00 M allocations: 480.652 MiB, 76.71% gc time)
1.529854 seconds (5.00 M allocations: 480.652 MiB, 76.85% gc time)
1.494510 seconds (5.00 M allocations: 480.652 MiB, 76.72% gc time)
1.508598 seconds (5.00 M allocations: 480.652 MiB, 76.28% gc time)
1.527927 seconds (5.00 M allocations: 480.652 MiB, 76.66% gc time)
Julia's memory consumption stays below 2 GiB, much less than the system's total memory.
It's normal that times vary between runs, as the GC may run at any time. Use BenchmarkTools to average it out.
Using BenchmarkTools shows the slowdown as well:
julia> using BenchmarkTools
julia> x = [rand(1:99,4) for i=1:1_000_000];
julia> @btime Set.($x);
588.088 ms (5000002 allocations: 480.65 MiB)
julia> @btime Set.($x);
796.988 ms (5000002 allocations: 480.65 MiB)
julia> @btime Set.($x);
992.914 ms (5000002 allocations: 480.65 MiB)
julia> @btime Set.($x);
1.247 s (5000002 allocations: 480.65 MiB)
julia> @btime Set.($x);
1.516 s (5000002 allocations: 480.65 MiB)
julia> @btime Set.($x);
1.511 s (5000002 allocations: 480.65 MiB)
A non-optimized Python comparison shows deterministic timings and is faster than the Julia runs above (except for the second run, which took 0.388 s):
import random
a = []
for i in range(0, 1_000_000):
    a.append([random.randint(1, 99) for j in range(0, 4)])

def myset(a):
    b = []
    for i in range(0, len(a)):
        b.append(set(a[i]))
    return b
%timeit b=myset(a)
478 ms ± 46.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit b=myset(a)
486 ms ± 49.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit b=myset(a)
464 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit b=myset(a)
453 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit b=myset(a)
453 ms ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Just looking at your reported times, and removing the GC part:
julia> 0.82*0.55
0.451
julia> 2.12*0.22
0.46640000000000004
So there is no real slowdown, apart from the fact that the cost of garbage-collecting earlier runs is attributed to later runs. On the other hand, this nicely demonstrates the price of garbage-collected languages (as opposed to ref-counted ones): if the code is structured such that generational GC doesn't help and the heuristics fail, we get a complexity-class-sized slowdown:
julia> using BenchmarkTools
julia> x = [rand(1:99,4) for i=1:100_000]; y=Set.(x);
julia> @btime broadcast!($Set, $y, $x);
58.723 ms (500000 allocations: 47.30 MiB)
julia> x = [rand(1:99,4) for i=1:200_000]; y=Set.(x);
julia> @btime broadcast!($Set, $y, $x);
159.804 ms (1000000 allocations: 94.60 MiB)
julia> x = [rand(1:99,4) for i=1:400_000]; y=Set.(x);
julia> @btime broadcast!($Set, $y, $x);
587.759 ms (2000000 allocations: 189.21 MiB)
julia> x = [rand(1:99,4) for i=1:800_000]; y=Set.(x);
julia> @btime broadcast!($Set, $y, $x);
2.197 s (4000000 allocations: 378.42 MiB)
julia> x = [rand(1:99,4) for i=1:1_600_000]; y=Set.(x);
julia> @btime broadcast!($Set, $y, $x);
7.958 s (8000000 allocations: 756.84 MiB)
For large N, we get quadratic runtime, as expected: each GC run takes time linear in N, and the number of GC runs is also linear in N. Python, as a refcounted language, will see linear runtime (I think its cycle detection is also quadratic, but it is triggered more rarely). Thanks for this nice example; maybe we should put it into the performance-pitfalls section (because you should never write code like this: many tiny, interlinked, GC-visible objects that are rolled over with a period long enough to stymie generational GC).
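The linear-runtime claim for Python can be probed directly with a scaled-down sketch of the earlier `myset` timing (sizes reduced from 1_000_000 so it runs in seconds; the exact numbers are machine-dependent, but the cost should roughly double as n doubles):

```python
import random
import time

def myset(a):
    # Same conversion as the earlier Python snippet, as a comprehension.
    return [set(row) for row in a]

# Time the conversion at a few doubling sizes; on a refcounted runtime
# the cost should grow roughly linearly with n.
for n in (50_000, 100_000, 200_000):
    a = [[random.randint(1, 99) for _ in range(4)] for _ in range(n)]
    t0 = time.perf_counter()
    b = myset(a)
    dt = time.perf_counter() - t0
    print(f"n={n}: {dt * 1000:.1f} ms")
```

This only checks the Python side, of course; the quadratic behaviour above is specific to the GC heuristics being stressed.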
I think something is suboptimal about our GC heuristics here. GC does not inherently require an asymptotic slowdown; collection can always be made a constant fraction of runtime.
On Julia v1.3.1 I can no longer reproduce the slowdown:
julia> using BenchmarkTools
julia> x = [rand(1:99,4) for i=1:1_000_000];
julia> for i=1:20; @time y=Set.(x); end
0.552832 seconds (5.00 M allocations: 480.652 MiB, 41.23% gc time)
0.321580 seconds (5.00 M allocations: 480.652 MiB)
0.515798 seconds (5.00 M allocations: 480.652 MiB, 42.49% gc time)
0.323023 seconds (5.00 M allocations: 480.652 MiB)
0.543687 seconds (5.00 M allocations: 480.652 MiB, 41.90% gc time)
0.345157 seconds (5.00 M allocations: 480.652 MiB)
0.531513 seconds (5.00 M allocations: 480.652 MiB, 42.90% gc time)
0.322940 seconds (5.00 M allocations: 480.652 MiB)
0.533765 seconds (5.00 M allocations: 480.652 MiB, 44.11% gc time)
0.316955 seconds (5.00 M allocations: 480.652 MiB)
0.547226 seconds (5.00 M allocations: 480.652 MiB, 43.37% gc time)
0.306540 seconds (5.00 M allocations: 480.652 MiB)
0.314869 seconds (5.00 M allocations: 480.652 MiB)
0.521128 seconds (5.00 M allocations: 480.652 MiB, 42.65% gc time)
0.317711 seconds (5.00 M allocations: 480.652 MiB)
0.453577 seconds (5.00 M allocations: 480.652 MiB, 34.68% gc time)
0.450270 seconds (5.00 M allocations: 480.652 MiB, 32.99% gc time)
0.316003 seconds (5.00 M allocations: 480.652 MiB)
0.494622 seconds (5.00 M allocations: 480.652 MiB, 39.59% gc time)
0.317511 seconds (5.00 M allocations: 480.652 MiB)
julia> @btime Set.($x);
358.550 ms (5000002 allocations: 480.65 MiB)
julia> @btime Set.($x);
323.221 ms (5000002 allocations: 480.65 MiB)
julia> @btime Set.($x);
326.482 ms (5000002 allocations: 480.65 MiB)
julia> @btime Set.($x);
322.516 ms (5000002 allocations: 480.65 MiB)
Closing the issue. PR https://github.com/JuliaLang/julia/pull/32556 reportedly helps with this a bit; maybe there were other improvements as well.