Opening an issue as suggested here: https://discourse.julialang.org/t/with-missings-julia-is-slower-than-r/11838/17?u=rdeits
Sorting a Float64 array in-place is quite fast and non-allocating:
julia> @btime sort!(y) setup=(y = rand(100)) evals=1
2.414 μs (0 allocations: 0 bytes)
but sorting an array of Union{Float64, Missing}
in-place allocates a new copy and is about 2x slower, regardless of the input size:
julia> @btime sort!(y2) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing)) evals=1
4.157 μs (2 allocations: 624 bytes)
julia> versioninfo()
Julia Version 0.7.0-beta.5
Commit 948b088f17 (2018-06-24 17:50 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Not really a regression, right? Since this was previously definitely not efficient.
The difference is that sort!(v::Vector{Float64})
calls fpsort!(v)
, which partitions v
like
[ ≤ -0.0 | ≥ +0.0 | NaN ]
and then uses Core.Intrinsics.slt_int
for comparison on the first and second part.
Maybe we could treat missing
in general like we treat NaN
w/ floats?
What I mean is: when calling sort!(v::Vector{Union{T,Missing}})
, first partition v
as
[ non-missing | missing ]
and then call sort!(::Vector{T})
on the first block. Is something like this possible? Can you convince the compiler that the first bit of the vector consists of T
's, not of Union{Missing,T}
's?
As people have spotted on the Discourse thread, a first issue is that merge sort is used instead of quick sort. This explains the allocations, but using quick sort "only" reduces the time by one third. Probably still worth doing though, see https://github.com/JuliaLang/julia/pull/27789.
julia> @btime sort!(y, alg=QuickSort) setup=(y = rand(100));
460.335 ns (0 allocations: 0 bytes)
julia> @btime sort!(y, alg=MergeSort) setup=(y = rand(100));
665.809 ns (3 allocations: 608 bytes)
julia> @btime sort!(y2, alg=QuickSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
1.861 μs (0 allocations: 0 bytes)
julia> @btime sort!(y2, alg=MergeSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
1.918 μs (2 allocations: 624 bytes)
What I mean is: when calling sort!(v::Vector{Union{T,Missing}}), first partition v as
[ non-missing | missing ]
and then call sort!(::Vector{T}) on the first block. Is something like this possible? Can you convince the compiler that the first bit of the vector consists of T's, not of Union{Missing,T}'s?
Good catch @haampie! I guess we could try extending nans2end!
to also move missing
values to the end, which will allow using the same algorithm. The compiler can be convinced
using type assertions, but AFAIK they have some cost, so maybe more subtle solutions would be needed (like accessing directly the underlying data).
But is there a safe way to do so? Reinterpreting the vector will not work:
> reinterpret(Float64, Vector{Union{Missing,Float64}}(rand(100)))
ERROR: ArgumentError: cannot reinterpret `Union{Missing, Float64}` `Float64`, type `Union{Missing, Float64}` is not a bits type
No, we'd need to access the data part of the array, which is a hidden implementation detail. But I hope we can avoid that.
I've just tested a quick hack like this:
@@ -1021,7 +1021,7 @@ right(o::Perm) = Perm(right(o.order), o.data)
lt(::Left, x::T, y::T) where {T<:Floats} = slt_int(y, x)
lt(::Right, x::T, y::T) where {T<:Floats} = slt_int(x, y)
-isnan(o::DirectOrdering, x::Floats) = (x!=x)
+isnan(o::DirectOrdering, x::Union{Floats,Missing}) = (x === missing || x!=x)
@@ -1082,7 +1082,7 @@ end
fpsort!(v::AbstractVector, a::Sort.PartialQuickSort, o::Ordering) =
sort!(v, first(axes(v,1)), last(axes(v,1)), a, o)
-sort!(v::AbstractVector{<:Floats}, a::Algorithm, o::DirectOrdering) = fpsort!(v,a,o)
+sort!(v::AbstractVector{<:Union{Floats, Missing}, a::Algorithm, o::DirectOrdering) = fpsort!(v,a,o)
AFAICT it gives correct results for normal values, but mixes NaN
and missing
according to their order of appearance (this should be fixed and shouldn't be too costly). Performance is already quite good:
julia> @btime sort!(y2, alg=QuickSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
796.571 ns (0 allocations: 0 bytes)
but mixes
NaN
andmissing
according to their order of appearance (this should be fixed and shouldn't be too costly)
The correct behavior based on isless
seems to be sorting the NaN
s first in a block in appearance order and then the missing
s in a block at the very end also in order of appearance.
I've prepared a PR at https://github.com/JuliaLang/julia/pull/27817. It's about twice faster than the current QuickSort
, which is itself faster by 30% than the current MergeSort
default. That makes us about 50% slower than R AFAICT, so there's still room for improvement but it's quite reasonable. One should also test with different proportions of missing values.
What about larger vectors? I tried it with ~1000 elements, and that seemed to give the largest performance gap.
You mean compared with R? Actually I could have mentioned that "50% slower" referred to a comparison for a vector with 10M entries (as in the original Discourse post), to limit the variability of measurements.
I opened this issue https://github.com/JuliaLang/julia/issues/27831 with a question about the reinterpret-trick, cause I feel that's the way to tackle this problem in general.
So, the consensus of #27831 seems that something nice & performant is not yet around, but something quick & dirty will work:
@inline function partition_missing!(v::Vector{Union{T,Missing}}) where {T}
i, j = 1, length(v)
@inbounds while true
while i < j && v[i] !== missing; i += 1; end
while i < j && v[j] === missing; j -= 1; end
i >= j && break
v[i], v[j] = v[j], v[i]
i += 1; j -= 1;
end
@inbounds return v[i] === missing ? i - 1 : i
end
function my_sort!(v::Vector{Union{T,Missing}}) where {T}
m = partition_missing!(v)
w = unsafe_wrap(Array, Ptr{T}(pointer(v)), m)
sort!(w)
v
end
### test
using Test
function test_things()
@test issorted(my_sort!([missing, 3, 2, 10, missing, 4]))
@test issorted(my_sort!(Vector{Union{Int,Missing}}([missing, missing])))
@test issorted(my_sort!(Vector{Union{Int,Missing}}([4,2,6,2,9])))
end
### bench
using BenchmarkTools
using Random
function bench(T::Type = Int, n = 1_000)
y = rand(T, n)
vec = ifelse.(y .< 0.9, y, missing)
bench_new = @benchmark my_sort!(z) setup = (z = copy($vec))
bench_curr = @benchmark sort!(z) setup = (z = copy($vec))
bench_notmissing = @benchmark sort!(z) setup = (z = copy($y))
bench_new, bench_curr, bench_notmissing
end
gives
julia> a, b, c = bench(Int, 10_000)
(Trial(352.888 μs), Trial(1.700 ms), Trial(532.544 μs))
julia> a, b, c = bench(Float64, 10_000)
(Trial(514.307 μs), Trial(2.488 ms), Trial(530.676 μs))
So, sorting a vector with missing data is actually faster than sorting a vector without missing values :). Probably has to do with the fact that we don't have to sort 10% of the data when moving the missing
values to the end.
Fascinating! ;-)
That approach sounds promising to me, maybe it can handle missing values in general and avoid the need for https://github.com/JuliaLang/julia/pull/27817. One question is to find what types T
are safe with that approach, something @vtjnash and @quinnj can tell. Then you need to identify where in the complex dispatch tree of the sorting code this method can be introduced
I don’t think we really need a completely general solution to this right away, it would be fine to hack it in for common important bits types like ints and floats.
That's not the right predicate, but the simplest is maybe Base.isbitstype(nonmissingtype(eltype(x)))
I don’t think we really need a completely general solution to this right away, it would be fine to hack it in for common important bits types like ints and floats.
Sure. What I was wondering is whether we need special code in fpsort
as in #27817 or whether a common solution can apply to both Int
and Float64
(and probably others).
My guess is that the above might not be noticeably slower than #27817, maybe even faster?
Yes, it should be similar, though moving NaN
s and missing
s to the end in a single pass, and separating them only in a second pass is probably slightly more efficient when they are the minority. Likely not worth the additional code if we can have a common solution for several element types.
@haampie Do you want to give a try to the general approach you presented above? It would be interesting to know whether it can make https://github.com/JuliaLang/julia/pull/27817 unnecessary or not.
Yes, I'm sorry for not pursuing this right away. I'll give it a shot this weekend!
Most helpful comment
So, the consensus of #27831 seems that something nice & performant is not yet around, but something quick & dirty will work:
gives
So, sorting a vector with missing data is actually faster than sorting a vector without missing values :). Probably has to do with the fact that we don't have to sort 10% of the data when moving the
missing
values to the end.