Julia: `sort!(::Vector{Union{Float64, Missing}})` allocates unexpectedly in v0.7-beta

Created on 25 Jun 2018  Â·  22Comments  Â·  Source: JuliaLang/julia

Opening an issue as suggested here: https://discourse.julialang.org/t/with-missings-julia-is-slower-than-r/11838/17?u=rdeits

Sorting a Float64 array in-place is quite fast and non-allocating:

julia> @btime sort!(y) setup=(y = rand(100)) evals=1
  2.414 μs (0 allocations: 0 bytes)

but sorting an array of Union{Float64, Missing} in-place allocates a new copy and is about 2x slower, regardless of the input size:

julia> @btime sort!(y2) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing)) evals=1
  4.157 μs (2 allocations: 624 bytes)
julia> versioninfo()
Julia Version 0.7.0-beta.5
Commit 948b088f17 (2018-06-24 17:50 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
missing data performance

Most helpful comment

So, the consensus of #27831 seems that something nice & performant is not yet around, but something quick & dirty will work:

@inline function partition_missing!(v::Vector{Union{T,Missing}}) where {T}
    i, j = 1, length(v)
    @inbounds while true
        while i < j && v[i] !== missing; i += 1; end
        while i < j && v[j] === missing; j -= 1; end
        i >= j && break
        v[i], v[j] = v[j], v[i]
        i += 1; j -= 1;
    end
    @inbounds return v[i] === missing ? i - 1 : i
end

function my_sort!(v::Vector{Union{T,Missing}}) where {T}
    m = partition_missing!(v)
    w = unsafe_wrap(Array, Ptr{T}(pointer(v)), m)
    sort!(w)
    v
end

### test
using Test
function test_things()
    @test issorted(my_sort!([missing, 3, 2, 10, missing, 4]))
    @test issorted(my_sort!(Vector{Union{Int,Missing}}([missing, missing])))
    @test issorted(my_sort!(Vector{Union{Int,Missing}}([4,2,6,2,9])))
end

### bench

using BenchmarkTools
using Random

function bench(T::Type = Int, n = 1_000)
    y = rand(T, n)
    vec = ifelse.(y .< 0.9, y, missing)
    bench_new = @benchmark my_sort!(z) setup = (z = copy($vec))
    bench_curr = @benchmark sort!(z) setup = (z = copy($vec))
    bench_notmissing = @benchmark sort!(z) setup = (z = copy($y))
    bench_new, bench_curr, bench_notmissing
end

gives

julia> a, b, c = bench(Int, 10_000)
(Trial(352.888 μs), Trial(1.700 ms), Trial(532.544 μs))

julia> a, b, c = bench(Float64, 10_000)
(Trial(514.307 μs), Trial(2.488 ms), Trial(530.676 μs))

So, sorting a vector with missing data is actually faster than sorting a vector without missing values :). Probably has to do with the fact that we don't have to sort 10% of the data when moving the missing values to the end.

All 22 comments

Not really a regression, right? Since this was previously definitely not efficient.

The difference is that sort!(v::Vector{Float64}) calls fpsort!(v), which partitions v like

[ ≤ -0.0 | ≥ +0.0 | NaN ]

and then uses Core.Intrinsics.slt_int for comparison on the first and second part.

Maybe we could treat missing in general like we treat NaN w/ floats?

What I mean is: when calling sort!(v::Vector{Union{T,Missing}}), first partition v as

[ non-missing | missing ]

and then call sort!(::Vector{T}) on the first block. Is something like this possible? Can you convince the compiler that the first bit of the vector consists of T's, not of Union{Missing,T}'s?

As people have spotted on the Discourse thread, a first issue is that merge sort is used instead of quick sort. This explains the allocations, but using quick sort "only" reduces the time by one third. Probably still worth doing though, see https://github.com/JuliaLang/julia/pull/27789.

julia> @btime sort!(y, alg=QuickSort) setup=(y = rand(100));
  460.335 ns (0 allocations: 0 bytes)

julia> @btime sort!(y, alg=MergeSort) setup=(y = rand(100));
  665.809 ns (3 allocations: 608 bytes)

julia> @btime sort!(y2, alg=QuickSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.861 μs (0 allocations: 0 bytes)

julia> @btime sort!(y2, alg=MergeSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.918 μs (2 allocations: 624 bytes)

What I mean is: when calling sort!(v::Vector{Union{T,Missing}}), first partition v as

[ non-missing | missing ]

and then call sort!(::Vector{T}) on the first block. Is something like this possible? Can you convince the compiler that the first bit of the vector consists of T's, not of Union{Missing,T}'s?

Good catch @haampie! I guess we could try extending nans2end! to also move missing values to the end, which will allow using the same algorithm. The compiler can be convinced using type assertions, but AFAIK they have some cost, so maybe more subtle solutions would be needed (like accessing directly the underlying data).

But is there a safe way to do so? Reinterpreting the vector will not work:

> reinterpret(Float64, Vector{Union{Missing,Float64}}(rand(100)))
ERROR: ArgumentError: cannot reinterpret `Union{Missing, Float64}` `Float64`, type `Union{Missing, Float64}` is not a bits type

No, we'd need to access the data part of the array, which is a hidden implementation detail. But I hope we can avoid that.

I've just tested a quick hack like this:

@@ -1021,7 +1021,7 @@ right(o::Perm) = Perm(right(o.order), o.data)
 lt(::Left, x::T, y::T) where {T<:Floats} = slt_int(y, x)
 lt(::Right, x::T, y::T) where {T<:Floats} = slt_int(x, y)

-isnan(o::DirectOrdering, x::Floats) = (x!=x)
+isnan(o::DirectOrdering, x::Union{Floats,Missing}) = (x === missing || x!=x)
@@ -1082,7 +1082,7 @@ end
 fpsort!(v::AbstractVector, a::Sort.PartialQuickSort, o::Ordering) =
     sort!(v, first(axes(v,1)), last(axes(v,1)), a, o)

-sort!(v::AbstractVector{<:Floats}, a::Algorithm, o::DirectOrdering) = fpsort!(v,a,o)
+sort!(v::AbstractVector{<:Union{Floats, Missing}, a::Algorithm, o::DirectOrdering) = fpsort!(v,a,o)

AFAICT it gives correct results for normal values, but mixes NaN and missing according to their order of appearance (this should be fixed and shouldn't be too costly). Performance is already quite good:

julia> @btime sort!(y2, alg=QuickSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  796.571 ns (0 allocations: 0 bytes)

but mixes NaN and missing according to their order of appearance (this should be fixed and shouldn't be too costly)

The correct behavior based on isless seems to be sorting the NaNs first in a block in appearance order and then the missings in a block at the very end also in order of appearance.

I've prepared a PR at https://github.com/JuliaLang/julia/pull/27817. It's about twice faster than the current QuickSort, which is itself faster by 30% than the current MergeSort default. That makes us about 50% slower than R AFAICT, so there's still room for improvement but it's quite reasonable. One should also test with different proportions of missing values.

What about larger vectors? I tried it with ~1000 elements, and that seemed to give the largest performance gap.

You mean compared with R? Actually I could have mentioned that "50% slower" referred to a comparison for a vector with 10M entries (as in the original Discourse post), to limit the variability of measurements.

I opened this issue https://github.com/JuliaLang/julia/issues/27831 with a question about the reinterpret-trick, cause I feel that's the way to tackle this problem in general.

So, the consensus of #27831 seems that something nice & performant is not yet around, but something quick & dirty will work:

@inline function partition_missing!(v::Vector{Union{T,Missing}}) where {T}
    i, j = 1, length(v)
    @inbounds while true
        while i < j && v[i] !== missing; i += 1; end
        while i < j && v[j] === missing; j -= 1; end
        i >= j && break
        v[i], v[j] = v[j], v[i]
        i += 1; j -= 1;
    end
    @inbounds return v[i] === missing ? i - 1 : i
end

function my_sort!(v::Vector{Union{T,Missing}}) where {T}
    m = partition_missing!(v)
    w = unsafe_wrap(Array, Ptr{T}(pointer(v)), m)
    sort!(w)
    v
end

### test
using Test
function test_things()
    @test issorted(my_sort!([missing, 3, 2, 10, missing, 4]))
    @test issorted(my_sort!(Vector{Union{Int,Missing}}([missing, missing])))
    @test issorted(my_sort!(Vector{Union{Int,Missing}}([4,2,6,2,9])))
end

### bench

using BenchmarkTools
using Random

function bench(T::Type = Int, n = 1_000)
    y = rand(T, n)
    vec = ifelse.(y .< 0.9, y, missing)
    bench_new = @benchmark my_sort!(z) setup = (z = copy($vec))
    bench_curr = @benchmark sort!(z) setup = (z = copy($vec))
    bench_notmissing = @benchmark sort!(z) setup = (z = copy($y))
    bench_new, bench_curr, bench_notmissing
end

gives

julia> a, b, c = bench(Int, 10_000)
(Trial(352.888 μs), Trial(1.700 ms), Trial(532.544 μs))

julia> a, b, c = bench(Float64, 10_000)
(Trial(514.307 μs), Trial(2.488 ms), Trial(530.676 μs))

So, sorting a vector with missing data is actually faster than sorting a vector without missing values :). Probably has to do with the fact that we don't have to sort 10% of the data when moving the missing values to the end.

Fascinating! ;-)

That approach sounds promising to me, maybe it can handle missing values in general and avoid the need for https://github.com/JuliaLang/julia/pull/27817. One question is to find what types T are safe with that approach, something @vtjnash and @quinnj can tell. Then you need to identify where in the complex dispatch tree of the sorting code this method can be introduced

I don’t think we really need a completely general solution to this right away, it would be fine to hack it in for common important bits types like ints and floats.

That's not the right predicate, but the simplest is maybe Base.isbitstype(nonmissingtype(eltype(x)))

I don’t think we really need a completely general solution to this right away, it would be fine to hack it in for common important bits types like ints and floats.

Sure. What I was wondering is whether we need special code in fpsort as in #27817 or whether a common solution can apply to both Int and Float64 (and probably others).

My guess is that the above might not be noticeably slower than #27817, maybe even faster?

Yes, it should be similar, though moving NaNs and missings to the end in a single pass, and separating them only in a second pass is probably slightly more efficient when they are the minority. Likely not worth the additional code if we can have a common solution for several element types.

@haampie Do you want to give a try to the general approach you presented above? It would be interesting to know whether it can make https://github.com/JuliaLang/julia/pull/27817 unnecessary or not.

Yes, I'm sorry for not pursuing this right away. I'll give it a shot this weekend!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

yurivish picture yurivish  Â·  3Comments

StefanKarpinski picture StefanKarpinski  Â·  3Comments

Keno picture Keno  Â·  3Comments

felixrehren picture felixrehren  Â·  3Comments

omus picture omus  Â·  3Comments