Julia: `sort!(::Vector{Union{Float64, Missing}})` allocates unexpectedly in v0.7-beta

Created on 25 Jun 2018 · 22Comments · Source: JuliaLang/julia

Opening an issue as suggested here: https://discourse.julialang.org/t/with-missings-julia-is-slower-than-r/11838/17?u=rdeits

Sorting a Float64 array in-place is quite fast and non-allocating:

julia> @btime sort!(y) setup=(y = rand(100)) evals=1
  2.414 μs (0 allocations: 0 bytes)

but sorting an array of Union{Float64, Missing} in-place allocates a new copy and is about 2x slower, regardless of the input size:

julia> @btime sort!(y2) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing)) evals=1
  4.157 μs (2 allocations: 624 bytes)

julia> versioninfo()
Julia Version 0.7.0-beta.5
Commit 948b088f17 (2018-06-24 17:50 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

missing data performance

Source

rdeits

👍2

Most helpful comment

So, the consensus of #27831 seems that something nice & performant is not yet around, but something quick & dirty will work:

@inline function partition_missing!(v::Vector{Union{T,Missing}}) where {T}
    i, j = 1, length(v)
    @inbounds while true
        while i < j && v[i] !== missing; i += 1; end
        while i < j && v[j] === missing; j -= 1; end
        i >= j && break
        v[i], v[j] = v[j], v[i]
        i += 1; j -= 1;
    end
    @inbounds return v[i] === missing ? i - 1 : i
end

function my_sort!(v::Vector{Union{T,Missing}}) where {T}
    m = partition_missing!(v)
    w = unsafe_wrap(Array, Ptr{T}(pointer(v)), m)
    sort!(w)
    v
end

### test
using Test
function test_things()
    @test issorted(my_sort!([missing, 3, 2, 10, missing, 4]))
    @test issorted(my_sort!(Vector{Union{Int,Missing}}([missing, missing])))
    @test issorted(my_sort!(Vector{Union{Int,Missing}}([4,2,6,2,9])))
end

### bench

using BenchmarkTools
using Random

function bench(T::Type = Int, n = 1_000)
    y = rand(T, n)
    vec = ifelse.(y .< 0.9, y, missing)
    bench_new = @benchmark my_sort!(z) setup = (z = copy($vec))
    bench_curr = @benchmark sort!(z) setup = (z = copy($vec))
    bench_notmissing = @benchmark sort!(z) setup = (z = copy($y))
    bench_new, bench_curr, bench_notmissing
end

gives

julia> a, b, c = bench(Int, 10_000)
(Trial(352.888 μs), Trial(1.700 ms), Trial(532.544 μs))

julia> a, b, c = bench(Float64, 10_000)
(Trial(514.307 μs), Trial(2.488 ms), Trial(530.676 μs))

So, sorting a vector with missing data is actually faster than sorting a vector without missing values :). Probably has to do with the fact that we don't have to sort 10% of the data when moving the missing values to the end.

haampie on 28 Jun 2018

❤2 🎉2 👍2

All 22 comments

Not really a regression, right? Since this was previously definitely not efficient.

StefanKarpinski on 26 Jun 2018

👍2

The difference is that sort!(v::Vector{Float64}) calls fpsort!(v), which partitions v like

[ ≤ -0.0 | ≥ +0.0 | NaN ]

and then uses Core.Intrinsics.slt_int for comparison on the first and second part.

Maybe we could treat missing in general like we treat NaN w/ floats?

What I mean is: when calling sort!(v::Vector{Union{T,Missing}}), first partition v as

[ non-missing | missing ]

and then call sort!(::Vector{T}) on the first block. Is something like this possible? Can you convince the compiler that the first bit of the vector consists of T's, not of Union{Missing,T}'s?

haampie on 26 Jun 2018

👍1

As people have spotted on the Discourse thread, a first issue is that merge sort is used instead of quick sort. This explains the allocations, but using quick sort "only" reduces the time by one third. Probably still worth doing though, see https://github.com/JuliaLang/julia/pull/27789.

julia> @btime sort!(y, alg=QuickSort) setup=(y = rand(100));
  460.335 ns (0 allocations: 0 bytes)

julia> @btime sort!(y, alg=MergeSort) setup=(y = rand(100));
  665.809 ns (3 allocations: 608 bytes)

julia> @btime sort!(y2, alg=QuickSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.861 μs (0 allocations: 0 bytes)

julia> @btime sort!(y2, alg=MergeSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.918 μs (2 allocations: 624 bytes)

nalimilan on 26 Jun 2018

What I mean is: when calling sort!(v::Vector{Union{T,Missing}}), first partition v as

[ non-missing | missing ]

and then call sort!(::Vector{T}) on the first block. Is something like this possible? Can you convince the compiler that the first bit of the vector consists of T's, not of Union{Missing,T}'s?

Good catch @haampie! I guess we could try extending nans2end! to also move missing values to the end, which will allow using the same algorithm. The compiler can be convinced using type assertions, but AFAIK they have some cost, so maybe more subtle solutions would be needed (like accessing directly the underlying data).

nalimilan on 26 Jun 2018

But is there a safe way to do so? Reinterpreting the vector will not work:

> reinterpret(Float64, Vector{Union{Missing,Float64}}(rand(100)))
ERROR: ArgumentError: cannot reinterpret `Union{Missing, Float64}` `Float64`, type `Union{Missing, Float64}` is not a bits type

haampie on 26 Jun 2018

No, we'd need to access the data part of the array, which is a hidden implementation detail. But I hope we can avoid that.

nalimilan on 26 Jun 2018

I've just tested a quick hack like this:

@@ -1021,7 +1021,7 @@ right(o::Perm) = Perm(right(o.order), o.data)
 lt(::Left, x::T, y::T) where {T<:Floats} = slt_int(y, x)
 lt(::Right, x::T, y::T) where {T<:Floats} = slt_int(x, y)

-isnan(o::DirectOrdering, x::Floats) = (x!=x)
+isnan(o::DirectOrdering, x::Union{Floats,Missing}) = (x === missing || x!=x)
@@ -1082,7 +1082,7 @@ end
 fpsort!(v::AbstractVector, a::Sort.PartialQuickSort, o::Ordering) =
     sort!(v, first(axes(v,1)), last(axes(v,1)), a, o)

-sort!(v::AbstractVector{<:Floats}, a::Algorithm, o::DirectOrdering) = fpsort!(v,a,o)
+sort!(v::AbstractVector{<:Union{Floats, Missing}, a::Algorithm, o::DirectOrdering) = fpsort!(v,a,o)

AFAICT it gives correct results for normal values, but mixes NaN and missing according to their order of appearance (this should be fixed and shouldn't be too costly). Performance is already quite good:

julia> @btime sort!(y2, alg=QuickSort) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  796.571 ns (0 allocations: 0 bytes)

nalimilan on 26 Jun 2018

but mixes NaN and missing according to their order of appearance (this should be fixed and shouldn't be too costly)

The correct behavior based on isless seems to be sorting the NaNs first in a block in appearance order and then the missings in a block at the very end also in order of appearance.

StefanKarpinski on 26 Jun 2018

👍1

I've prepared a PR at https://github.com/JuliaLang/julia/pull/27817. It's about twice faster than the current QuickSort, which is itself faster by 30% than the current MergeSort default. That makes us about 50% slower than R AFAICT, so there's still room for improvement but it's quite reasonable. One should also test with different proportions of missing values.

nalimilan on 27 Jun 2018

What about larger vectors? I tried it with ~1000 elements, and that seemed to give the largest performance gap.

haampie on 27 Jun 2018

You mean compared with R? Actually I could have mentioned that "50% slower" referred to a comparison for a vector with 10M entries (as in the original Discourse post), to limit the variability of measurements.

nalimilan on 27 Jun 2018

I opened this issue https://github.com/JuliaLang/julia/issues/27831 with a question about the reinterpret-trick, cause I feel that's the way to tackle this problem in general.

haampie on 27 Jun 2018

So, the consensus of #27831 seems that something nice & performant is not yet around, but something quick & dirty will work:

@inline function partition_missing!(v::Vector{Union{T,Missing}}) where {T}
    i, j = 1, length(v)
    @inbounds while true
        while i < j && v[i] !== missing; i += 1; end
        while i < j && v[j] === missing; j -= 1; end
        i >= j && break
        v[i], v[j] = v[j], v[i]
        i += 1; j -= 1;
    end
    @inbounds return v[i] === missing ? i - 1 : i
end

function my_sort!(v::Vector{Union{T,Missing}}) where {T}
    m = partition_missing!(v)
    w = unsafe_wrap(Array, Ptr{T}(pointer(v)), m)
    sort!(w)
    v
end

### test
using Test
function test_things()
    @test issorted(my_sort!([missing, 3, 2, 10, missing, 4]))
    @test issorted(my_sort!(Vector{Union{Int,Missing}}([missing, missing])))
    @test issorted(my_sort!(Vector{Union{Int,Missing}}([4,2,6,2,9])))
end

### bench

using BenchmarkTools
using Random

function bench(T::Type = Int, n = 1_000)
    y = rand(T, n)
    vec = ifelse.(y .< 0.9, y, missing)
    bench_new = @benchmark my_sort!(z) setup = (z = copy($vec))
    bench_curr = @benchmark sort!(z) setup = (z = copy($vec))
    bench_notmissing = @benchmark sort!(z) setup = (z = copy($y))
    bench_new, bench_curr, bench_notmissing
end

gives

julia> a, b, c = bench(Int, 10_000)
(Trial(352.888 μs), Trial(1.700 ms), Trial(532.544 μs))

julia> a, b, c = bench(Float64, 10_000)
(Trial(514.307 μs), Trial(2.488 ms), Trial(530.676 μs))

haampie on 28 Jun 2018

❤2 🎉2 👍2

Fascinating! ;-)

That approach sounds promising to me, maybe it can handle missing values in general and avoid the need for https://github.com/JuliaLang/julia/pull/27817. One question is to find what types T are safe with that approach, something @vtjnash and @quinnj can tell. Then you need to identify where in the complex dispatch tree of the sorting code this method can be introduced

nalimilan on 28 Jun 2018

Base.isbitsunion(T)

quinnj on 28 Jun 2018

I don’t think we really need a completely general solution to this right away, it would be fine to hack it in for common important bits types like ints and floats.

StefanKarpinski on 28 Jun 2018

That's not the right predicate, but the simplest is maybe Base.isbitstype(nonmissingtype(eltype(x)))

vtjnash on 28 Jun 2018

👍1

I don’t think we really need a completely general solution to this right away, it would be fine to hack it in for common important bits types like ints and floats.

Sure. What I was wondering is whether we need special code in fpsort as in #27817 or whether a common solution can apply to both Int and Float64 (and probably others).

nalimilan on 28 Jun 2018

My guess is that the above might not be noticeably slower than #27817, maybe even faster?

haampie on 28 Jun 2018

Yes, it should be similar, though moving NaNs and missings to the end in a single pass, and separating them only in a second pass is probably slightly more efficient when they are the minority. Likely not worth the additional code if we can have a common solution for several element types.

nalimilan on 28 Jun 2018

@haampie Do you want to give a try to the general approach you presented above? It would be interesting to know whether it can make https://github.com/JuliaLang/julia/pull/27817 unnecessary or not.

nalimilan on 10 Jul 2018

Yes, I'm sorry for not pursuing this right away. I'll give it a shot this weekend!

haampie on 11 Jul 2018

❤1

Was this page helpful?

0 / 5 - 0 ratings