When looking over my code_native output, I recently noticed that every push! costs me a call into the runtime library. This appears... excessive, at least for the standard case where the push can be accommodated without copying.
So, let's compare push! and a hand-rolled fastpush!.
First, let's look at what a vector actually is:
struct array_T
    data::Ptr{Void}
    length::UInt64
    flags::UInt16
    elsize::UInt16
    offset::UInt32
    nrows::UInt64
    ncols_maxsz::UInt64
    stuff1::UInt64
    stuff2::UInt64
    stuff3::UInt64
end
This gives us the fastpush:
@inline function fastpush!(A::Vector{T}, val::T) where T
    if isbits(T)
        arr = unsafe_load(convert(Ptr{array_T}, pointer_from_objref(A)))
        if arr.length + arr.offset < arr.ncols_maxsz
            # fast path: bump the length (word 2) and nrows (word 4) fields in place
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length + 1, 2)
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length + 1, 4)
            @inbounds A[arr.length + 1] = val
        else
            push!(A, val)   # slow path: needs reallocation
        end
    else
        push!(A, val)       # non-isbits elements take the normal path
    end
    A
end
And let's run a race:
function pushtest(A, n)
    resize!(A, 0)
    for i = 1:n push!(A, i) end
    nothing
end

function fastpushtest(A, n)
    resize!(A, 0)
    for i = 1:n fastpush!(A, i) end
    nothing
end

function settest(A, n)
    resize!(A, n)
    for i = 1:n A[i] = i end
    nothing
end
yielding:
Nm = 100_000_000; A = Vector{Int}(Nm); resize!(A, 0);
#after warmup:
@time settest(A,Nm); @time settest(A,Nm);
0.356040 seconds (4 allocations: 160 bytes)
0.326640 seconds (4 allocations: 160 bytes)
@time pushtest(A,Nm); @time pushtest(A,Nm);
0.899436 seconds (4 allocations: 160 bytes)
0.895555 seconds (4 allocations: 160 bytes)
@time fastpushtest(A,Nm); @time fastpushtest(A,Nm);
0.577715 seconds (4 allocations: 160 bytes)
0.606671 seconds (4 allocations: 160 bytes)
Hence, a 30% speedup is possible if we manage to get the fast path inlined.
It would perhaps be nice if it were somehow possible to make LLVM aware of Julia's runtime library. For example: compile it to LLVM IR (using clang or dragonegg?), call into a nice library of LLVM code that includes optimization-relevant metadata, and possibly inline current ccall functions.
Not sure whether this performance problem is worth trying to fix (and how much more expensive the call becomes if we lose important registers across it). It is easy to avoid, though, by implementing the checks by hand (for tight loops that need to push a lot). In a perfect world, the optimizer would reduce unrolled pushing loops to a single check. In my applications, I do enough other work between push!/pop! calls that this ends up being benign.
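To make the "implementing the checks by hand" remark concrete, here is a minimal, portable sketch (my own illustration, not the fastpush! above; collect_squares is a hypothetical example function): grow a preallocated buffer geometrically, track the logical length in a local variable, and trim once at the end.

```julia
# Hand-rolled push checks for a tight loop: the length bump and the
# capacity branch stay local, and resize! is only hit on reallocation.
function collect_squares(n::Int)
    buf = Vector{Int}(undef, 8)            # preallocated backing storage
    len = 0                                # logical length, kept in a local
    for i in 1:n
        if len == length(buf)
            resize!(buf, 2 * length(buf))  # slow path: amortized O(1) growth
        end
        len += 1
        @inbounds buf[len] = i * i
    end
    resize!(buf, len)                      # trim to the logical length
    return buf
end
```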
PS. This is related to https://github.com/JuliaLang/julia/pull/24901, which removes the bounds check from push!/pop!.
I also posted this on discourse.julialang.org earlier. The fastpush is deliberately not a pull request, because it is dirty and architecture-dependent.
I think the broader solution here would be to reimplement Array in pure Julia. It'd be messy, but certainly doable. The hardest part would be factoring out what needs to still exist in array.c, mostly from a GC perspective, and what can be moved to Julia. Certainly all the array growing/shrinking code could be moved to Julia, but we would still need something like a jl_buffer_t type in src that the Julia Array type would have as a data field.
That would definitely be very cool. The most interesting thing would be using the 32 bytes remaining in Vector to store very small vectors inline (so that the data and the array_T sit in the same cache line).
The second most interesting thing would be to make non-resizable arrays reliably sit one cache line before their data, with the net effect that type-inferred array access can skip one level of indirection (this placement is already relatively reliable, but afaik it is used neither for prefetching nor for removing the indirection).
@quinnj Would we really need a jl_buffer_t in src? Couldn't we have a LowLevelBuffer{T} in Julia, which is just a thin wrapper around a Ptr{T}, a size::Int, and a ccall to malloc?
something like:
mutable struct LowLevelBuffer{T}
    nelements::Int
    data::Ptr{T}
    function LowLevelBuffer{T}(nelements) where {T}
        nbytes = nelements * sizeof(T)
        data = Ptr{T}(Libc.malloc(nbytes))
        this = new(nelements, data)
        finalizer(this) do buff
            Libc.free(buff.data)   # hand the memory back when the GC collects us
        end
        return this
    end
end
Base.endof(buff::LowLevelBuffer) = buff.nelements
Base.unsafe_getindex(buff::LowLevelBuffer, ii::Integer) = unsafe_load(buff.data, ii)
Base.unsafe_setindex!(buff::LowLevelBuffer, x, ii::Integer) = unsafe_store!(buff.data, x, ii)
# define getindex and setindex! by adding checks to the unsafe versions
# and the rest
# don't define too much, as we've got to leave some work for `Array`
Or would it be too hard to make that performant?
There's quite a bit more involved here; the Julia GC has no awareness of your Libc.malloc memory. We also need to make sure the GC is aware of any boxed elements. I'm not well-versed enough in the GC internals to say exactly what a reasonable split would be between what's defined in src/ vs. Julia itself.
Or would it be too hard to make that performant?
Yes, the simple answer would be "because malloc is so very slow".
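For a feel of what the Libc route entails, here is a minimal sketch (my own; raw_buffer_roundtrip is a hypothetical name): every buffer becomes a raw malloc/free pair that Julia's GC and pooled allocator know nothing about. Timing this against ordinary Vector allocation (e.g. with @time) illustrates the cost.

```julia
# Allocate a raw Int buffer with Libc.malloc, touch every slot through
# unsafe_store!/unsafe_load, and free it again by hand.
function raw_buffer_roundtrip(n::Int)
    p = Libc.malloc(n * sizeof(Int))
    p == C_NULL && error("malloc failed")
    q = convert(Ptr{Int}, p)
    for i in 1:n
        unsafe_store!(q, i, i)   # no bounds checks, no GC awareness
    end
    s = 0
    for i in 1:n
        s += unsafe_load(q, i)
    end
    Libc.free(p)                 # forget this and the memory leaks
    return s
end
```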
Upon reconsideration, push! really is bad in practice, too. Consider:
@inline function fpush!(vec, val, pos)
    if pos >= length(vec)
        resize!(vec, 10 + 2 * length(vec))   # geometric growth, amortized O(1)
    end
    @inbounds vec[pos+1] = val
    return pos + 1
end
function filter_f(pred, A)
    res = Vector{eltype(A)}()
    pos = 0
    @inbounds for a in A
        if pred(a)
            pos = fpush!(res, a, pos)
        end
    end
    resize!(res, pos)   # trim the overallocation
    res
end
And then
A = rand(10^6); pred(x) = (x<0.5);
@btime res = filter(pred, A);
12.272 ms (19 allocations: 5.00 MiB)
@btime res = filter_f(pred, A);
8.468 ms (17 allocations: 6.25 MiB)
Introducing the PushVector:
mutable struct PushVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    v::A
    l::Int
end

PushVector{T}() where {T} = PushVector(Vector{T}(undef, 4), 0)

Base.length(v::PushVector) = v.l
Base.size(v::PushVector) = (v.l,)

@inline function Base.getindex(v::PushVector, i)
    @boundscheck checkbounds(v, i)
    @inbounds v.v[i]
end

function Base.push!(v::PushVector, i)
    v.l += 1
    if v.l > length(v.v)
        resize!(v.v, v.l * 2)
    end
    v.v[v.l] = i
    return v
end

finish!(v::PushVector) = resize!(v.v, v.l)
using BenchmarkTools

function pushit(v)
    for i in 1:10^4
        push!(v, i)
    end
end

@btime begin
    p = PushVector{Int64}()
    pushit(p)
    finish!(p)
end
# 24.601 μs (13 allocations: 192.58 KiB)

@btime begin
    p = Vector{Int64}()
    pushit(p)
end
# 71.176 μs (14 allocations: 256.70 KiB)
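As a quick correctness check (my addition, trimmed and renamed so the snippet is self-contained), the PushVector idea can be verified against push! on a plain Vector:

```julia
# A trimmed copy of the PushVector idea, plus an assertion that it
# agrees element-for-element with an ordinary Vector.
mutable struct PV{T} <: AbstractVector{T}
    v::Vector{T}   # backing storage, usually overallocated
    l::Int         # logical length
end
PV{T}() where {T} = PV{T}(Vector{T}(undef, 4), 0)
Base.size(v::PV) = (v.l,)
Base.getindex(v::PV, i::Int) = v.v[i]
function Base.push!(v::PV, x)
    v.l += 1
    v.l > length(v.v) && resize!(v.v, 2 * v.l)
    v.v[v.l] = x
    return v
end
finish(v::PV) = resize!(v.v, v.l)   # trim and hand back the plain Vector

ref = Int[]
pv = PV{Int}()
for i in 1:1000
    push!(ref, i)
    push!(pv, i)
end
@assert finish(pv) == ref
```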
Mostly a joke
This is a user question, but here I immediately reach the people who can answer it:
In LinearMaps.jl I use resize! on a temporary array when combining several linear maps. However, this fails if the user applies a linear map that resizes or reshapes the incoming or outgoing vector. Therefore, we would like a cheap way to detect whether an array can be resized, without triggering the error and without using a try ... catch block.
Is this the way to go:
if Sys.WORD_SIZE == 64
    _flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 9)
elseif Sys.WORD_SIZE == 32
    _flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 5)
else
    error("Unknown word size")
end

_isshared(A::Array) = !(_flags(A) & 0x4000 == 0x0000)
I don't have a 32-bit machine, so I don't know if the position 5 is correct in that case. Do both the pointer to the data and the length of the array become 32-bit fields? And should I worry about the flags.how == 3 case if I allocated the array myself using the Array{T}(undef) constructor and just passed it along to other functions?
Do both the pointer to the data and the length of the array become a 32 bit field?
Mostly yes. But I think the proper way would be
_flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 1 + sizeof(Csize_t)>>1 + sizeof(Ptr{Cvoid})>>1 )
But then, there is the compile-time option STORE_ARRAY_LEN. I don't know how to check for that at runtime.
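As a sanity check on that formula (my own arithmetic, assuming the flags field directly follows the data pointer and the length, and that STORE_ARRAY_LEN is set), it reduces to the hard-coded UInt16 positions from the earlier snippet on both common word sizes:

```julia
# Index of the flags field, counted in UInt16 units, for a layout of
# Ptr (data), then Csize_t (length), then UInt16 (flags).
flag_index(ptrsize, sizesize) = 1 + sizesize >> 1 + ptrsize >> 1

@assert flag_index(8, 8) == 9   # 64-bit: Ptr and Csize_t are 8 bytes
@assert flag_index(4, 4) == 5   # 32-bit: both are 4 bytes
```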
And should I worry about the flags.how == 3 case ...?
I don't know, sorry. But is the exception-route so bad in your case?
No, don't do that. None of the code above is well defined.
The fun of hacking around :-). Any other suggestion for accessing the isshared flag from within Julia? I assume there is some overhead to using try ... catch versus checking a simple flag?
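For comparison, the exception route discussed above could look like the following sketch (my own; tryresize! is a hypothetical helper name, and the catch is deliberately broad):

```julia
# Attempt a resize! and report success instead of throwing; on Julia
# versions where an array with shared data cannot be grown, this
# returns false rather than erroring.
function tryresize!(A::Vector, n::Integer)
    try
        resize!(A, n)
        return true
    catch
        return false
    end
end

a = [1, 2, 3]
@assert tryresize!(a, 5)   # a plain Vector can always be resized
@assert length(a) == 5
```

The try block does cost something (mainly by inhibiting some optimizations in the enclosing function), which is exactly the overhead the question asks about.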
FWIW, I packaged @KristofferC's workaround (with his kind permission) as
https://github.com/tpapp/PushVectors.jl
because I found I was copy-pasting it into various pieces of code.
I intend to obsolete/retire it when this issue is fixed, but in the meantime it may be useful.