Julia: BitVector has slow lexicographic compare due to storage order

Created on 31 Aug 2018 · 6Comments · Source: JuliaLang/julia

julia> using BenchmarkTools
julia> a=BitVector(rand(Bool,1<<10)); b=copy(a);
julia> @btime ==($a,$b)
  13.865 ns (0 allocations: 0 bytes)
true
julia> @btime <($a,$b)
  2.252 μs (0 allocations: 0 bytes)
false
julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)

The immediate problem is that lexicographic comparison falls back to AbstractArray. However, it is not clear how to write a fast (chunk-wise) comparison with the current layout, because the most significant bit of the bitvector (wrt julia's convention for vector comparison) is stored in the least significant bit of the unterlying UInt64 (wrt hardware integer comparison convention). See:

julia> a=BitVector(rand(Bool,1<<6));
julia> a
64-element BitArray{1}:
  true
  true
 false
 false
  true
     ⋮
  true
 false
  true
 false

julia> bitstring(a.chunks[1])
"0101011000101101010100010100000110100011100101000000101011110011"

I'd propose to switch the layout of BitVector to match the machine convention. Then, one would need to also replace trailing_zeros by leading_zeros for iterations, but this has just as good hardware support. Or is there a strong reason for the current layout?

arrays performance

Source

chethega

Most helpful comment

Here is the full code, shall I make a pull request?

function bvcmp(a::BitVector, b::BitVector)
    # the last chunk of a and b might not be fully populated
    # assume those unused bits are zero, is this correct?
    n = min(length(a.chunks), length(b.chunks))
    @inbounds for i in 1:n
        x, y = a.chunks[i], b.chunks[i]
        if x != y
            # compute differing bits
            d = xor(x, y)
            # compute a mask where only the least significant bit is set
            lsb = d & -d;
            # a < b <=> x & lsb == 0 and y & lsb == 1

            # on x86 `y & lsb` can be fused into a `test y, lsb` instruction :-)
            return y & lsb != 0 ? -1 : 1
        end
    end
    return cmp(length(a), length(b))
end

bvisless(va, vb) = bvcmp(va, vb) < 0

Benchmarks: (on 0.7.0 linux x64)

julia> @btime cmp($(falses(1)), $(falses(1)))
  9.601 ns (0 allocations: 0 bytes)
0

julia> @btime bvcmp($(falses(1)), $(falses(1)))
  4.815 ns (0 allocations: 0 bytes)
0

julia> @btime cmp($(falses(10000)), $(falses(10000)))
  23.969 μs (0 allocations: 0 bytes)
0

julia> @btime bvcmp($(falses(10000)), $(falses(10000)))
  120.681 ns (0 allocations: 0 bytes)
0

Roughly 200x as fast :-)

simonfxr on 2 Sep 2018

❤2 👍2

All 6 comments

Changing the storage order is a big change that I'd guess lots of folks rely upon and would probably be breaking. Not sure about other tradeoffs here, but in this specific case we could use a bit-twiddling algorithm to flip the bits and compare on the result. That'd likely be much faster than doing up to 64 bit-extractions.

mbauman on 31 Aug 2018

Darn, should have brought this up before 1.0. I guess one can then use the following, which is at least branchfree ans should be faster than reversing:

julia> function rev_isless(a::UInt64, b::UInt64)
       bt = trailing_zeros(xor(a,b))
       return  0!= (0x01 & (a>>bt))
       end

chethega on 31 Aug 2018

Testing the least significant bit can be done without extracting it directly and building a mask instead:

rev_isless(a::UInt, b::UInt) = begin
    d = xor(x, y)
    lsb = d & -d;
    return (y & lsb) != 0
end

simonfxr on 2 Sep 2018

Here is the full code, shall I make a pull request?

function bvcmp(a::BitVector, b::BitVector)
    # the last chunk of a and b might not be fully populated
    # assume those unused bits are zero, is this correct?
    n = min(length(a.chunks), length(b.chunks))
    @inbounds for i in 1:n
        x, y = a.chunks[i], b.chunks[i]
        if x != y
            # compute differing bits
            d = xor(x, y)
            # compute a mask where only the least significant bit is set
            lsb = d & -d;
            # a < b <=> x & lsb == 0 and y & lsb == 1

            # on x86 `y & lsb` can be fused into a `test y, lsb` instruction :-)
            return y & lsb != 0 ? -1 : 1
        end
    end
    return cmp(length(a), length(b))
end

bvisless(va, vb) = bvcmp(va, vb) < 0

Benchmarks: (on 0.7.0 linux x64)

julia> @btime cmp($(falses(1)), $(falses(1)))
  9.601 ns (0 allocations: 0 bytes)
0

julia> @btime bvcmp($(falses(1)), $(falses(1)))
  4.815 ns (0 allocations: 0 bytes)
0

julia> @btime cmp($(falses(10000)), $(falses(10000)))
  23.969 μs (0 allocations: 0 bytes)
0

julia> @btime bvcmp($(falses(10000)), $(falses(10000)))
  120.681 ns (0 allocations: 0 bytes)
0

Roughly 200x as fast :-)

simonfxr on 2 Sep 2018

❤2 👍2

A PR with performance improvements and perhaps some extra tests thrown in there is always welcome :)

KristofferC on 3 Sep 2018

Yes please PR that! I don't think lexicographic comparison of BitVectors is important enough to change the layout, but it's very much worth optimizing with that code.