There seems to be a problem when <=
is used in a while
loop, where the entire function containing the while loop is >5x slower than the while loop that uses <
.
I think that this might be preventing optimizations on inline iterators as well, I can give more details on that as well if needed.
I used the following benchmark functions to identify the problem:
using BenchmarkTools
function while_vect_test_ne(vec::Vector{T}) where T
ret=T(0)
L=length(vec)
i=1
while i != (L+1)
@inbounds v=vec[i]
ret+=v
i += 1;
end
return ret
end
function while_vect_test_lt(vec::Vector{T}) where T
ret=T(0)
L=length(vec)::Int
i=1::Int
while i < (L+1)
@inbounds v=vec[i]
ret+=v
i += 1;
end
return ret
end
#this is the bad one
function while_vect_test_lte(vec::Vector{T}) where T
ret=T(0)
L=length(vec)::Int
i=1::Int
while i <= L
@inbounds v=vec[i]
ret+=v
i += 1;
end
return ret
end
vec=collect(1:1000)
@benchmark while_vect_test_ne(vec)
@benchmark while_vect_test_lt(vec)
@benchmark while_vect_test_lte(vec)
Results
julia> @benchmark while_vect_test_ne(vec)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 66.301 ns (0.00% GC)
median time: 70.757 ns (0.00% GC)
mean time: 86.534 ns (9.57% GC)
maximum time: 55.662 μs (99.82% GC)
--------------
samples: 10000
evals/sample: 976
julia> @benchmark while_vect_test_lt(vec)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 66.695 ns (0.00% GC)
median time: 71.407 ns (0.00% GC)
mean time: 94.916 ns (10.02% GC)
maximum time: 67.625 μs (99.83% GC)
--------------
samples: 10000
evals/sample: 968
julia> @benchmark while_vect_test_lte(vec)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 416.070 ns (0.00% GC)
median time: 436.558 ns (0.00% GC)
mean time: 509.744 ns (2.33% GC)
maximum time: 119.429 μs (99.48% GC)
--------------
samples: 10000
evals/sample: 199
Note that the two conditions are not strictly identical because of integer wrap-around.
when L==typemax(Int)
, L+1
will overflow causing a infinite loop. LLVM (mostly SCEV) will determine that the loop-bounds are unknown and not vectorize this loop.
x-ref: #31011
the "right way" of writing the second loop is:
function while_vect_test_lte(vec::Vector{T}) where T
ret=T(0)
L=length(vec)::Int
i=0::Int
while i < L
i += 1
@inbounds v=vec[i]
ret+=v
end
return ret
end
The weird part is that I would expect the non-vectorized versions to be slower, not faster.
Updating
function while_vect_test_lt(vec::Vector{T}) where T
ret=T(0)
L=length(vec)::Int
i=0::Int
while i < L
i += 1;
@inbounds v=vec[i]
ret+=v
end
return ret
end
Yields:
julia> @benchmark while_vect_test_lt(vec)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 65.307 ns (0.00% GC)
median time: 69.715 ns (0.00% GC)
mean time: 84.524 ns (8.87% GC)
maximum time: 55.671 μs (99.78% GC)
--------------
samples: 10000
evals/sample: 976
Both the first and second loop are vectorized and are fast. The third loop is not vectorized and is slow.
Then isn't the explanation backwards since those are the two that compare with L+1
?
If L+1
overflows the first loop will loop until i
overflows and the second one will loop 0 times.
Oh duh, they both terminate while i ≤ L
is always true so it's an infinite loop.
Not sure there's anything to do here: Julia and LLVM seem to be handling this correctly.
I agree that is the semantics that we have and LLVM handles them correctly. I will note that using add nsw
(no-signed wrap) is an alternative, but avoiding the overflow is better in general.
@ndinsmore you mentioned:
I think that this might be preventing optimizations on inline iterators as well, I can give more details on that as well if needed.
Do you have more information on that?
Also, note that if you use a range as is idiomatic, then this has good performance:
function vect_test_range(vec::Vector{T}) where T
ret = T(0)
L = length(vec)
for i = 1:L
@inbounds v = vec[i]
ret += v
end
return ret
end
Unfortunately, the even more idiomatic version just doing iteration is less fast:
function vect_test_iterate(vec::Vector{T}) where T
ret = T(0)
for v in vec
ret += v
end
return ret
end
Do we have an issue tracking the iteration protocol being able to automatically @inbounds
things? cc @Keno, @mbauman
All my timings:
@belapsed(while_vect_test_ne($vec)) = 3.098590130916415e-8
@belapsed(while_vect_test_lt($vec)) = 3.104028197381672e-8
@belapsed(while_vect_test_lte($vec)) = 2.8979411764705885e-7
@belapsed(vect_test_range($vec)) = 3.093655589123867e-8
@belapsed(vect_test_iterate($vec)) = 5.7927624872579005e-8
Looks the same to me?
julia> @btime vect_test_range(vec)
538.303 ns (1 allocation: 16 bytes)
50005000
julia> @btime vect_test_iterate(vec)
545.455 ns (1 allocation: 16 bytes)
50005000
https://github.com/JuliaLang/julia/pull/27386 should have fixed that.
I'm getting consistently different timings:
julia> @btime vect_test_range(vec)
47.356 ns (1 allocation: 16 bytes)
500500
julia> @btime vect_test_iterate(vec)
73.695 ns (1 allocation: 16 bytes)
500500
julia> @btime vect_test_range($vec)
31.103 ns (0 allocations: 0 bytes)
500500
julia> @btime vect_test_iterate($vec)
57.703 ns (0 allocations: 0 bytes)
500500
Will build a fresh master and see if anything changes...
For small arrays I see some difference but it seems it goes away when increasing the size a bit:
julia> vec=collect(1:1000);
julia> @btime vect_test_range(vec)
42.380 ns (1 allocation: 16 bytes)
500500
julia> @btime vect_test_iterate(vec)
66.700 ns (1 allocation: 16 bytes)
500500
julia> vec=collect(1:5000);
julia> @btime vect_test_range(vec)
283.272 ns (1 allocation: 16 bytes)
12502500
julia> @btime vect_test_iterate(vec)
283.392 ns (1 allocation: 16 bytes)
12502500
Ok, that's interesting. Not sure how much we should be bothered by that... I am a little.
So how would you write a non-infinite loop using <=
? That seems like a huge performance pitfall.
This isn't a Julia-specific problem, it's a general "integers overflow" problem. C avoids this by making integer overflow UB, so the compiler can do anything it wants, but we don't want to introduce UB. You could check that L < typemax(T)
beforehand which allows LLVM to prove that it won't overflow.
At a higher level, why are you writing while loops rather than iterating over a range?
@vchuravy
I think that this might be preventing optimizations on inline iterators as well, I can give more details on that as well if needed.
Do you have more information on that?
This is less compelling than it was when I first discovered it and is potential coincidental.
So it started with an obsession the performance of iteration on dicts:
Starting with a test function like:
using BenchmarkTools
function iterate_test_func(iteration_function,f,iter,N=1)
ret=0
for n in 1:N
ret=0
@inbounds next = iteration_function(iter)
while next !== nothing
(v, state) = next
ret+=f(v)
@inbounds next = iteration_function(iter, state)
end
end
return ret
end
d=Dict(v=>v for v in 1:1000)
@benchmark iterate_test_func(iterate,v->v,values(d))
@benchmark iterate_test_func(iterate,v->v.second,d)
results
julia> @benchmark iterate_test_func(iterate,v->v,values(d))
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 2
--------------
minimum time: 4.223 μs (0.00% GC)
median time: 4.953 μs (0.00% GC)
mean time: 5.478 μs (0.00% GC)
maximum time: 125.694 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark iterate_test_func(iterate,v->v.second,d)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 4.522 μs (0.00% GC)
median time: 5.210 μs (0.00% GC)
mean time: 5.675 μs (0.00% GC)
maximum time: 135.197 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
Then to see if iterate could be faster I wrote something like:
(Which I don't think is an infinate loop)
Base.@propagate_inbounds @inline function iterate_slim(v::T, i = v.dict.idxfloor) where T <: Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}}
d=v.dict
slots=d.slots
L = length(slots)
while i <= L
@inbounds if slots[i] == 0x1
@inbounds return ((T <: Base.KeySet) ? d.keys[i] : d.vals[i], i+1)
end
i += 1
end
return nothing
end
julia> @benchmark iterate_test_func(iterate_slim,v->v,values(d),10)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 2
--------------
minimum time: 36.138 μs (0.00% GC)
median time: 39.947 μs (0.00% GC)
mean time: 42.563 μs (0.00% GC)
maximum time: 827.562 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Now knowing about the pitfall of <=: (Very small code change)
Base.@propagate_inbounds @inline function iterate_slim_lt(v::T, i = v.dict.idxfloor-1) where T <: Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}}
d=v.dict
slots=d.slots
L = length(slots)
while i < L
i += 1
@inbounds if slots[i] == 0x1
@inbounds return ((T <: Base.KeySet) ? d.keys[i] : d.vals[i], i+1)
end
end
return nothing
end
julia> @benchmark iterate_test_func(iterate_slim_lt,v->v,values(d),1)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 2
--------------
minimum time: 2.257 μs (0.00% GC)
median time: 2.696 μs (0.00% GC)
mean time: 3.100 μs (0.00% GC)
maximum time: 132.669 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
Finally at one point I was trying to understand entitlement so I wrote:
(Which again I believe is a non-infinite loop). The coisidence being that this version using a <=
is about the same order of magnitude as the vanilla iterate.
function slow_entitlement_test_func(f,iter::Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}},N=1)
ret=0
@inbounds for n in 1:N
ret=0
d=iter.dict
slots=d.slots
L=length(slots)
vals=d.vals
i=d.idxfloor
while i <= L
@inbounds if slots[i] === 0x1
@inbounds v=vals[i]
ret+=f(v)
end
i+=1
end
end
return ret
end
julia> @benchmark slow_entitlement_test_func(v->v,values(d))
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 2
--------------
minimum time: 6.702 μs (0.00% GC)
median time: 7.512 μs (0.00% GC)
mean time: 8.123 μs (0.00% GC)
maximum time: 128.705 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
Now get rid of <=
we get blazing a fast loop:
(Which I am still trying to understand why doesn't the iterate_slim_lt
version above get optimize to nearly the same code.)
function entitlement_test_func(f,iter::Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}},N=1)
ret=0
@inbounds for n in 1:N
ret=0
d=iter.dict
slots=d.slots
L=length(slots)
vals=d.vals
for i = d.idxfloor:L
@inbounds if slots[i] === 0x1
@inbounds v=vals[i]
ret+=f(v)
end
i+=1
end
end
return ret
end
julia> @benchmark entitlement_test_func(v->v,values(d))
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 2
--------------
minimum time: 979.353 ns (0.00% GC)
median time: 1.005 μs (0.00% GC)
mean time: 1.067 μs (0.00% GC)
maximum time: 8.688 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 17
@StefanKarpinski I think that iteration through a Dict
, pairs or values should be 10X faster on the order of entitlement above.
I think that the function-blocking the optimization is:
function skip_deleted(h::Dict, i)
L = length(h.slots)
@inbounds while i<=L && !isslotfilled(h,i)
i += 1
end
return i
end
But have yet been able to be able to prove that.
Makes sense. It could be changed to
function skip_deleted(h::Dict, i)
L = length(h.slots)
@inbounds while true
if i >= L || isslotfilled(h, i)
return i
end
i += 1
end
end
That should be provably non-infinite since i
will eventually be at least as large as L
.
The || !isslotfilled(h,i)
and the early return should kill any vectorization attempts anyway, no?
This gives a 2x improvement:
julia> @benchmark iterate_test_func(iterate,v->v,values(d),1)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 2
--------------
minimum time: 4.329 μs (0.00% GC)
median time: 4.949 μs (0.00% GC)
mean time: 5.340 μs (0.00% GC)
maximum time: 119.739 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark iterate_test_func(iterate,v->v.second,d,1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 5.056 μs (0.00% GC)
median time: 6.266 μs (0.00% GC)
mean time: 6.722 μs (0.00% GC)
maximum time: 446.593 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 6
julia> function Base.skip_deleted(h::Dict, i)
L = length(h.slots)
for i= i:L
@inbounds if Base.isslotfilled(h,i)
return i
end
end
return L+1
end
julia> @benchmark iterate_test_func(iterate,v->v,values(d),1)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 2
--------------
minimum time: 2.623 μs (0.00% GC)
median time: 2.839 μs (0.00% GC)
mean time: 3.083 μs (0.00% GC)
maximum time: 25.621 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark iterate_test_func(iterate,v->v.second,d,1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 3.832 μs (0.00% GC)
median time: 5.146 μs (0.00% GC)
mean time: 5.316 μs (0.00% GC)
maximum time: 51.718 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
is the if statement in the following killing vectorization?
function Base.skip_deleted(h::Dict, i)
L = length(h.slots)
@inbounds while i<=L
if Base.isslotfilled(h,i)
return i
end
i += 1
end
return i
end
Because isn't it provably non-infinite?
It provides no real benefit.
@StefanKarpinski @KristofferC @vchuravy
Do you have any idea why the following loop is so much slower(8x) than the for loop? when I dug in it came down to the <=
, but isn't this a non-infinite loop?
function slow_entitlement_test_func(f,iter::Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}},N=1)
ret=0
@inbounds for n in 1:N
ret=0
d=iter.dict
slots=d.slots
L=length(slots)
vals=d.vals
i=d.idxfloor
while i <= L
#for i = d.idxfloor:L
@inbounds if slots[i] === 0x1
@inbounds v=vals[i]
ret+=f(v)
end
i+=1 # Also comment out for the for loop
end
end
return ret
end
This actually the source of this whole issue.
For the performance, results see above.
What happens if i=L=typemax(Int64)
? You will enter the loop, and then i
will overflow and wrap around.
@vchuravy is there a way to tell the optimizer that it can't happen even if it appears it could?
I tried @assert L < typemax(Int)
, because in this case seeing as is a linear array and it could not happen since most 64 bit systems limit memory to something less the 2^64 spaces.
While I do not recommend this, there is always a way.
function unsafe_inc(i::Int64)
Base.llvmcall("""
%i = add nsw i64 %0, 1
ret i64 %i
""", Int64, Tuple{Int64}, i)
end
function slow_entitlement_test_func(f,iter::Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}},N=1)
ret=0
@inbounds for n in 1:N
ret=0
d=iter.dict
slots=d.slots
L=length(slots)
vals=d.vals
i=d.idxfloor
while i <= L
#for i = d.idxfloor:L
@inbounds if slots[i] === 0x1
@inbounds v=vals[i]
ret+=f(v)
end
i = unsafe_inc(i)
end
end
return ret
end
Since this is not a bug in Julia, but due to the defined semantics I am going to close this issue.
Please continue discussion on discourse.julialang.org
Most helpful comment
What happens if
i=L=typemax(Int64)
? You will enter the loop, and theni
will overflow and wrap around.