Julia: push!/pop! always does a ccall

Created on 4 Dec 2017  ·  12 comments  ·  Source: JuliaLang/julia

When looking over my code_native output, I recently saw that every push! costs me a call into the runtime library. This appears excessive, at least for the common case where the push can be accommodated without reallocating.
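For reference, this is roughly what Base does (a paraphrase, not the literal Base definition; the helper name is illustrative, but :jl_array_grow_end is the actual pre-1.11 runtime entry point): the growth step is a ccall on every single push!:

```julia
# Paraphrase of Base's push! for a Vector (pre-1.11 array layout).
# The ccall into the runtime is what shows up in code_native.
function basepush!(a::Vector{T}, item::T) where T
    ccall(:jl_array_grow_end, Cvoid, (Any, UInt), a, 1)  # runtime call, not inlined
    @inbounds a[end] = item
    return a
end
```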

So, let's compare push! and the fastpush!

First, let's look at what a vector actually is:

struct array_T
    data::Ptr{Void}       # pointer to the element data
    length::UInt64        # number of elements
    flags::UInt16         # how/shared/etc. bit flags
    elsize::UInt16        # element size in bytes
    offset::UInt32        # for 1-d arrays: elements before the start
    nrows::UInt64
    ncols_maxsz::UInt64   # for 1-d arrays: allocated capacity (maxsize)
    stuff1::UInt64        # rest of the 64-byte cache line
    stuff2::UInt64
    stuff3::UInt64
end

This gives us the fastpush:

@inline function fastpush!(A::Vector{T}, val::T) where T
    if isbits(T)
        arr = unsafe_load(convert(Ptr{array_T}, pointer_from_objref(A)))
        if arr.length + arr.offset < arr.ncols_maxsz
            # room left in the existing allocation: bump length and nrows in place
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length+1, 2)
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length+1, 4)
            @inbounds A[arr.length+1] = val
        else
            push!(A, val)  # slow path: needs to reallocate
        end
    else
        push!(A, val)  # non-bits element types take the generic path
    end
    A
end

And let's run a race:

function pushtest(A,n)
    resize!(A, 0);
    for i = 1:n push!(A,i) end
    nothing
end

function fastpushtest(A,n)
    resize!(A, 0)
    for i = 1:n fastpush!(A,i) end
    nothing
end

function settest(A,n)
    resize!(A, n)
    for i = 1:n  A[i]=i end
    nothing
end

yielding:

Nm = 100_000_000; A = Vector{Int}(Nm); resize!(A, 0);
#after warmup:

@time settest(A,Nm); @time settest(A,Nm);
  0.356040 seconds (4 allocations: 160 bytes)
  0.326640 seconds (4 allocations: 160 bytes)

@time pushtest(A,Nm); @time pushtest(A,Nm);
  0.899436 seconds (4 allocations: 160 bytes)
  0.895555 seconds (4 allocations: 160 bytes)

@time fastpushtest(A,Nm); @time fastpushtest(A,Nm);
  0.577715 seconds (4 allocations: 160 bytes)
  0.606671 seconds (4 allocations: 160 bytes)

Hence, a 30% speedup is possible if we manage to get the fastpath inlined.

It would perhaps be nice if it were somehow possible to make LLVM aware of Julia's runtime library. For example, compile it to LLVM IR (using clang or dragonegg?), call into a nice library of LLVM code including optimization-relevant metadata, and possibly inline current ccall targets.

Not sure whether this performance problem is worth trying to fix (and how much more expensive it is if we lose important registers on the call). It is easy to avoid, though, by implementing the checks by hand (for tight loops that need to push a lot). In a perfect world, the optimizer would reduce to only a single check for unrolled pushing loops. In my applications, I do enough other things between push!/pop! that this ends up being benign.

PS. This is related to https://github.com/JuliaLang/julia/pull/24901, which removes the bounds check from push!/pop!.

I also posted this on discourse.julialang.org earlier. The fastpush! is deliberately not a pull request, because it is dirty and architecture-dependent.

Labels: arrays, performance

All 12 comments

I think the broader solution here would be to reimplement Array in pure Julia. It'd be messy, but certainly do-able. The hardest part would be factoring out what needs to still exist in array.c, mostly from a GC perspective and what can be moved to julia. Certainly all the array growing/shrinking code could be moved to julia, but we would still need something like a jl_buffer_t type in src that the Julia Array type would have as a data field.

That would definitely be very cool. The most interesting thing would be using the 32 bytes remaining in Vector to store very small vectors inline (so that the data and the array_T are in the same cache line).

The second most interesting thing would be to make non-resizeable arrays reliably be one cache-line before their data, for the net effect that type-inferred array access can skip one level of indirection (this placement is already relatively reliable, but afaik neither used for prefetching nor for removal of indirection).

@quinnj Would we really need a jl_buffer_t in src?
Couldn't we have a LowLevelBuffer{T} in Julia,
which is just a thin wrapper around a Ptr{T}, a size::Int, and a ccall to malloc?

something like:

mutable struct LowLevelBuffer{T}
    nelements::Int
    data::Ptr{T}
    function LowLevelBuffer{T}(nelements) where {T}
        nbytes = nelements * sizeof(T)
        data = Ptr{T}(Libc.malloc(nbytes))
        this = new(nelements, data)
        finalizer(this) do buff
            Libc.free(buff.data)  # release the C allocation when GC'd
        end
        return this
    end
end

Base.endof(buff::LowLevelBuffer) = buff.nelements
Base.unsafe_getindex(buff::LowLevelBuffer, ii::Integer) = unsafe_load(buff.data, ii)
Base.unsafe_setindex!(buff::LowLevelBuffer, x, ii::Integer) = unsafe_store!(buff.data, x, ii)

# define getindex and setindex by adding checks to the unsafe versions
# and the rest
# don't define too much as got to leave some work for `Array`

Or would it be too hard to make that performant?
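As a quick sanity check, here is a minimal self-contained variant of the sketch above (renamed RawBuffer, with illustrative getat/setat! accessors, to avoid clashing with anything; the GC knows nothing about this malloc'd memory, which is part of the problem discussed below):

```julia
# Minimal self-contained variant of the LowLevelBuffer idea (illustrative only)
mutable struct RawBuffer{T}
    nelements::Int
    data::Ptr{T}
    function RawBuffer{T}(nelements) where {T}
        data = Ptr{T}(Libc.malloc(nelements * sizeof(T)))
        this = new(nelements, data)
        finalizer(b -> Libc.free(b.data), this)  # free the C allocation on collection
        return this
    end
end

# Unchecked element access, mirroring the unsafe_* definitions above
getat(b::RawBuffer, i::Integer) = unsafe_load(b.data, i)
setat!(b::RawBuffer, x, i::Integer) = unsafe_store!(b.data, x, i)

buf = RawBuffer{Int}(4)
setat!(buf, 42, 1)
getat(buf, 1)  # 42
```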

There's quite a bit more involved here; the julia GC has no awareness of your Libc.malloc memory. We also need to make sure the GC is aware of any boxed elements. I'm not well-versed in the GC internals to say exactly what a reasonable split would be between what's defined in src/ vs. Julia itself.

Or would it be too hard to make that performant?

Yes, the simple answer would be "because malloc is so very slow".

Upon reconsideration, push! is really bad in practice, too. Consider:

@inline function fpush!(vec, val, pos)
    if pos >= length(vec)
        resize!(vec, 10 + 2*length(vec))  # amortized growth, done by hand
    end
    @inbounds vec[pos+1] = val
    return pos + 1
end

function filter_f(pred, A)
    res = Vector{eltype(A)}()
    pos = 0
    @inbounds for a in A 
        if pred(a)
            pos=fpush!(res, a, pos)
        end
    end
    resize!(res, pos)
    res
end

And then

A = rand(10^6); pred(x) = (x<0.5);

@btime res = filter(pred, A);
  12.272 ms (19 allocations: 5.00 MiB)

@btime res = filter_f(pred, A);
  8.468 ms (17 allocations: 6.25 MiB)

Introducing the PushVector:

mutable struct PushVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    v::A
    l::Int
end

PushVector{T}() where {T} = PushVector(Vector{T}(undef, 4), 0)

Base.length(v::PushVector) = v.l
Base.size(v::PushVector) = (v.l,)
@inline function Base.getindex(v::PushVector, i)
    @boundscheck checkbounds(v, i)
    @inbounds v.v[i]
end

function Base.push!(v::PushVector, i)
    v.l += 1
    if v.l > length(v.v)
        resize!(v.v, v.l * 2)
    end
    v.v[v.l] = i
    return v
end

finish!(v::PushVector) = resize!(v.v, v.l)

using BenchmarkTools


function pushit(v)
    for i in 1:10^4
        push!(v, i)
    end
end
@btime begin
    p = PushVector{Int64}()
    pushit(p)
    finish!(p)
end
# 24.601 µs (13 allocations: 192.58 KiB)

@btime begin
    p = Vector{Int64}()
    pushit(p)
end
# 71.176 µs (14 allocations: 256.70 KiB)

Mostly a joke

This is a user question, but here I immediately reach the people who can answer it:

In LinearMaps.jl I use resize! on a temporary array when combining several linear maps. However, this fails if the user applies a linear map which reshapes the incoming or outgoing vector. Therefore, we would like a cheap way to detect whether an array can be resized, without triggering the error inside a try ... catch block.

Is this the way to go:

if Sys.WORD_SIZE == 64
 _flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 9)
elseif Sys.WORD_SIZE == 32
 _flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 5)
else
 error("Unknown word size")
end
_isshared(A::Array) = _flags(A) & 0x4000 != 0x0000

I don't have a 32 bit machine, so I don't know if the position 5 is correct in that case. Do both the pointer to the data and the length of the array become a 32 bit field? And should I worry about the flags.how == 3 case if I allocated the array myself using the Array{T}(undef) constructor and just passed it along to other functions?

Do both the pointer to the data and the length of the array become a 32 bit field?

Mostly yes. But I think the proper way would be
_flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 1 + sizeof(Csize_t)>>1 + sizeof(Ptr{Cvoid})>>1)

But then, there is the compile-time option STORE_ARRAY_LEN. I don't know how to check for that at runtime.

And should I worry about the flags.how == 3 case ...?

I don't know, sorry. But is the exception-route so bad in your case?

No, don't do that. None of the code above is well defined.

The fun of hacking around :-). Any other suggestion for accessing the isshared flag from within Julia? I assume there is some overhead to using try ... catch versus checking a simple flag?
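For what it's worth, a sketch of the exception route (an assumption on my part: a grow-then-shrink probe, relying on the fact that growing an array whose data is shared throws on pre-1.11 Julia; the function name is made up):

```julia
# Hypothetical probe: try to grow the vector by one and immediately shrink it back.
# On Julia versions where Arrays carry a "shared data" flag, the grow step
# throws for non-resizable arrays, so we catch that instead of reading flags.
function isresizable(A::Vector)
    n = length(A)
    try
        resize!(A, n + 1)  # may throw "cannot resize array with shared data"
        resize!(A, n)      # restore the original length
        return true
    catch
        return false
    end
end

isresizable([1, 2, 3])  # true
```

This briefly mutates the array, so it is not safe under concurrent access, but it avoids touching undocumented internals.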

FWIW, I packaged the workaround of @KristofferC (with his kind permission) as

https://github.com/tpapp/PushVectors.jl

because I found I was copy-pasting it to various pieces of code.

I intend to obsolete/retire it when this issue is fixed, but in the meantime it may be useful.
