Julia: Performance of splatting a number

Created on 10 Sep 2018 · 21 comments · Source: JuliaLang/julia

In a few places the performance of splatting a number has shown up (https://github.com/JuliaLang/julia/pull/29060, https://github.com/JuliaLang/julia/pull/28764#issuecomment-419917783), and working around it gives significant speed-ups.
The code is well inferred, but there is a call to _apply instead of the actual method. This disappears if the number is wrapped in a tuple.

It would be nice to have this fixed so that these kinds of workarounds are no longer needed.
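For concreteness, a minimal illustration of the kind of workaround referred to above; the function names here are made up for the example:

consume(args...) = args   # hypothetical callee, stands in for any varargs function

splat_number(x)  = consume(x...)     # splats the Number directly
splat_wrapped(x) = consume((x,)...)  # workaround: wrap the number in a one-element tuple first

# On the Julia versions discussed here, @code_typed splat_number(1) contains a
# Core._apply call, while @code_typed splat_wrapped(1) does not.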

Ref https://github.com/JuliaLang/julia/pull/27434#issuecomment-400335764

cc @mbauman, @martinholters

Labels: inference, optimizer, performance


All 21 comments

We could special case numbers as we do tuples in the inliner. The more generic solution would have to emulate the iteration protocol.

Forgive me if this is a naive understanding of inference, but it seems like there's a middle ground here: isn't it fairly easy to see that iterate(::X, ::Any) is Const(nothing) for Xs that only iterate once?
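For reference, a Number iterates exactly once, so the second iterate call returns nothing for any state; that is the Const(nothing) fact referred to above:

iterate(1)           # (1, nothing)
iterate(1, nothing)  # nothing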

Yes, but that's still emulating the iteration protocol in the optimizer (inference already does some of this), and the iterate call isn't part of the code (since _apply does it internally). Basically you need to implement this TODO: https://github.com/JuliaLang/julia/blob/3c8119f4c1fe3a83a488e04caa834f26432317a2/base/compiler/ssair/inlining.jl#L831
or special-case it there.

I just noticed we share basically the same problem over at https://github.com/JuliaArrays/StaticArrays.jl/issues/361.

What if we changed the lowering of splatting so that f(x...) turns into something like Core._apply(f, Base.splatarg(x))? We could have Base.splatarg(x) = x by default (or possibly collect into a tuple via iteration?). For the more trivial cases like Number and StaticVector, a couple of overrides would solve this issue:

Base.splatarg(x::Number) = (x,)
Base.splatarg(v::StaticVector) = Tuple(v)

This does mean that splatting would become configurable by user types, but that seems generally consistent with the way we usually lower syntax into function calls.

Nice idea. Maybe it can be Base.splattable for consistency with Base.broadcastable.

Base.splattable ...

Interesting analogy; we could go with it, though I think the name is less evocative than splatarg.

@Keno's point about making inference aware of iteration in _apply might just be a better option if we want splatting to be inherently defined by iteration. Right now I can't think of a reason why you'd ever want splatting to be different, but maybe someone with a better vision of the Julian far future can (I'm looking at you @andyferris :-P)

Haha... no, I think I believe in strong interfaces, and guarantees the compiler doesn’t break for the purpose of performance optimizations.

Using splatarg at least puts control/correctness in the hands of users. On the other hand, it puts control/correctness in the hands of users...! It feels a bit like literal_pow - a clever and useful lowering trick that relies on users' good behaviour to maintain the illusion of the fundamental guarantee of referential transparency (not saying I've thought of anything better...).

So, to my personal tastes we would only resort to that if Keno’s approach is too messy. But my opinion isn’t strong.

Yes, something which can be overridden only really makes sense if there's a compelling reason to allow splatting to be different from iteration in some cases. Otherwise it's just a nasty performance hack. My question was really whether such a use case exists. In hindsight that's actually rather orthogonal to solving this issue.

Another thought: I don't know how hard it is to teach inference about iteration in _apply, but an interesting alternative would be to surface the C implementation of argument list construction in _apply to the Julia level. That is, to lower

f(x, y..., z...)

into something like

args = flatten_args((x,), y, z)
_apply(f, args)

where flatten_args applies the iteration protocol to construct the splatted argument list.
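A rough Julia-level sketch of what such a flatten_args could do (illustrative only; a real implementation would presumably be a builtin or would avoid the quadratic tuple growth shown here):

# Flatten several collections into a single argument tuple by running the
# iteration protocol over each of them.
function flatten_args(colls...)
    out = ()
    for c in colls
        y = iterate(c)
        while y !== nothing
            el, st = y
            out = (out..., el)
            y = iterate(c, st)
        end
    end
    return out
end

flatten_args((1,), 2:3, [4, 5])  # (1, 2, 3, 4, 5)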

OK - that seems pretty interesting!

It's currently tricky to surface the C implementation of _apply argument flattening in a useful way. Ideally we'd want something roughly like

  1. "When possible", arguments are flattened into a single tuple via iteration
  2. With dynamically known small total length, fall back to using a stack allocated buffer of Any. I guess that's not possible right now.
  3. With unknown total length, flatten args into a Vector{Any}

One approach to implementing "when possible" would be to generalize the IteratorSize trait to express lengths which are known based on the type of the input; a Base version of StaticArrays.Length{N}(). This would require that users know about and implement Base.IteratorSize for statically sized objects. A difficulty is that IteratorSize mixes together the expression of length via HasLength and size via HasShape.
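For illustration, a generalized trait might look something like the following; HasStaticLength and static_length are made-up names, not an existing Base API:

# Hypothetical trait in the spirit of StaticArrays.Length{N}: a length that is
# recoverable from the type alone.
struct HasStaticLength{N} end

static_length(::Type) = nothing                    # default: length not known from the type
static_length(::Type{<:Number}) = HasStaticLength{1}()
static_length(::Type{T}) where {T<:Tuple} = HasStaticLength{fieldcount(T)}()  # concrete tuple types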

Doing more in inference to constant-propagate the iteration state for small finite iterations seems helpful, and we might get rid of Base.indexed_iterate as part of that. I found the output of the following confusing on fairly recent julia-1.3:

julia> function iter_unpack(t)
           i1 = iterate(t)
           a = i1[1]
           i2 = iterate(t, i1[2])
           b = i2[1]
           b
       end
iter_unpack (generic function with 1 method)

julia> @code_warntype optimize=true iter_unpack((1,2))
Variables
  #self#::Core.Compiler.Const(iter_unpack, false)
  t::Tuple{Int64,Int64}
  i1::Union{Nothing, Tuple{Int64,Int64}}
  a::Int64
  i2::Union{Nothing, Tuple{Int64,Int64}}
  b::Int64

# big pile of code ...
# i1 and i2 are partially const but that seems not to be inferred

something which can be overridden only really makes sense if there's a compelling reason to allow splatting to be different from iteration in some cases

If we are going to add an API to manipulate splatting behavior, it would be great if it could also be dispatched on from the function side. For example, this lets you dispatch +(xs...) to sum(xs) when xs is an array. I think this is appealing because it would let any binary function op dispatch op(xs...) to reduce(op, xs) or foldl(op, xs). This would give us a concise and intuitive syntax for reduction. I'm thinking something like op(xs::GenericVararg{<:AbstractArray}), while op(xs::GenericVararg{<:Tuple}) is equivalent to op(xs::Vararg). It would also be nice if f.(xs) in op(f.(xs)...) were not materialized; we could then fuse the mapping and reduction by defining op(xs::GenericVararg{<:Broadcasted}).

@tkf that's a very interesting idea to get actual syntax for reduce and mapreduce while preserving the meaning of splatting.

+(xs...)        sum(xs)
*(xs...)        prod(xs)
min(xs...)      minimum(xs)
max(xs...)      maximum(xs)

+(abs.(xs)...)  mapreduce(abs, +, xs)
+(xs.^2...)     mapreduce(x->x^2, +, xs)

That last one is particularly appealing syntax. One problem is that it would mean defining a lot of op(xs::GenericVararg{<:Union{Broadcasted,AbstractArray}}) methods for many different op, which doesn't seem scalable. Just like the vectorized function problem again. Perhaps there's an acceptable lowering somewhere here which pairs some syntax with reduction in a similar way to how . and broadcast work.

One problem is that it would mean defining a lot of op(xs::GenericVararg{<:Union{Broadcasted,AbstractArray}}) methods for many different op, which doesn't seem scalable.

I think there are multiple solutions. For example:

Abstract-type based solution

abstract type AssociativeOperator <: Function end

(op::AssociativeOperator)(xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(op, xs)

struct Plus <: AssociativeOperator end
struct Multiply <: AssociativeOperator end
struct Vcat <: AssociativeOperator end

const + = Plus()
const * = Multiply()
const vcat = Vcat()

+(x, y) = ...
*(x, y) = ...
vcat(x, y) = ...

Different lowering

Alternatively, you can lower f(xs...) to splatapply(f, GenericVararg(xs)) and define

const AssociativeOperator = Union{
    typeof(*),
    typeof(+),
    typeof(vcat),
    ...and so on...
}

splatapply(f::AssociativeOperator, xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(f, xs)

I guess it's probably better to keep this issue focused on shorter-term solutions for the performance improvements, and I don't want to clutter it further. So I opened a new issue #32860 to discuss more wild ideas. @c42f Let's discuss the idea there.

Let me repeat the main points from https://github.com/JuliaLang/julia/pull/27434#issuecomment-400335764: Inference already simulates the iteration protocol (to some degree), however, that is solely used to determine the return type of the _apply, not to elide it:

julia> foo(x) = (x...,)
foo (generic function with 1 method)

julia> @code_typed foo(1)
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64}
└──      return %1
) => Tuple{Int64}

As you can see, inference correctly determines that splatting an Int64 will yield exactly one Int64 and calling tuple with it will give a Tuple{Int64}. The missing puzzle piece is having the inliner replace the _apply call with an appropriate sequence of iterate calls. Note that all the intermediate return types of these iterate calls have been determined during inference, but I think we lack the infrastructure to feed this information through to the optimizer. Given that information, it should be relatively easy (cough I guess) to synthesize the appropriate iterate sequence for cases with statically known, small number of resulting elements (e.g. a Number).

That said, inference does not currently propagate constants during the iteration emulation, so that e.g.

julia> @code_typed foo(@SVector [1,2,3])
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64,Vararg{Int64,N} where N}
└──      return %1
) => Tuple{Int64,Vararg{Int64,N} where N}

That might be fixable; however, recognizing that it is actually one of the elementary types (here a Tuple) that is being splatted might still prove advantageous, not sure. In hindsight, it would have been nice if we had added an initial iterates_as call to the iteration protocol, defaulting to iterates_as(x) = x. Then one could easily e.g. hand out the internal tuple from an SVector without risking the inconsistency between iterating and splatting that would be possible with the proposed splatarg.
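Concretely, for foo(1) above, the iterate sequence the inliner would need to synthesize amounts to something like this hand-written equivalent (illustrative only, not actual compiler output):

# What the inliner would ideally emit in place of Core._apply(Core.tuple, x)
# when x is a Number:
function foo_expanded(x::Number)
    y1 = iterate(x)        # inferred as Tuple{typeof(x), Nothing}
    el1 = y1[1]
    # iterate(x, y1[2]) is inferred as Const(nothing), so there is exactly one element
    return Core.tuple(el1)
end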

Regarding the more general issue, I did some experimenting that I also want to share.

I started by taking Tuple(x) == (x...,) as axiomatic and turning it around, lowering e.g. f(x..., y) as Core._apply(f, Tuple(x), (y,)). Of course, this needs a definition of Tuple(x) that does not rely on splatting, except for splatting Tuples, as Tuple(x::Tuple) = x is trivial. What I came up with is an iterate sequence (basically a manually unrolled loop) up to a certain limit, and if that is exceeded, calling a new built-in function that can turn a SimpleVector, NamedTuple, or Vector into a tuple---basically a trimmed-down version of _apply with the trivially inferred cases only.
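A minimal sketch of what such a manually unrolled iterate sequence looks like (purely illustrative; tuple_via_iterate is a made-up name, and the collect-based fallback merely stands in for the built-in described above):

function tuple_via_iterate(x)
    y = iterate(x)
    y === nothing && return ()
    a, s = y
    y = iterate(x, s)
    y === nothing && return (a,)
    b, s = y
    # ...unrolled further up to some small limit; past that, fall back to a
    # generic path that collects the remaining elements dynamically
    return (a, b, collect(Iterators.rest(x, s))...)
end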

I got this working (mostly) with modest effort, and my conclusions are: excellent for cases where the result of Tuple(x) is inferable (i.e. its size is statically known), so that the _apply call can be elided, but there are performance penalties otherwise, as having to store the intermediate values in a dynamically allocated tuple is less than ideal. E.g. splatting a vector became about 3x slower.

I have a feeling this is a dead end due to the unavoidable penalties in the size-not-statically-known case, but maybe someone has a good idea...

What I came up with is an iterate sequence (basically a manually unrolled loop) up to a certain limit

Right, the difficulty is all about the flattening "when possible" referred to in https://github.com/JuliaLang/julia/issues/29114#issuecomment-515843814.

We can try to get at that information via your partial unrolling approach, via traits, or via better inference. It's dissatisfying to introduce traits if we can solve the problem by being smarter with iteration in inference.

It would be interesting to see where you got to with the partial unrolling approach. It looks like part of that is on the mh/totuple_builtin branch?

The mh/totuple_builtin branch contains the code for the new builtin, but I'm not sure that's actually needed. While the code is simpler than for _apply, it doesn't seem to add any performance advantage compared to using _apply(tuple, x). Nevertheless, I'll try to bring my code into a less messy state and put it in a branch for others to experiment with. I'll post here once done.

I also had another idea that might save this approach. Say we lower f(x...) as _apply(f, Tuple(x)) and define Tuple(x) = _apply(tuple, x); then inlining would give _apply(f, _apply(tuple, x)), where the optimizer could elide the inner _apply and get back to _apply(f, x). However, we could still provide efficient Tuple constructors for specific types (and maybe a partially unrolled default fallback, as long as it is inlined so that the same elision of the inner _apply remains possible). Not sure when I'll find time to drive this further, though.
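Spelled out with placeholders (f and x are made up here just to make the chain of rewrites concrete):

f(args...) = args
x = (1, 2, 3)

step1 = Core._apply(f, Tuple(x))               # proposed lowering of f(x...)
step2 = Core._apply(f, Core._apply(tuple, x))  # after inlining Tuple(x) = Core._apply(tuple, x)
step3 = Core._apply(f, x)                      # what eliding the inner _apply should give

step1 == step2 == step3  # true; all three forms compute the same call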

Just pushed the current state of my efforts to mh/rework_apply.

Great, thanks @martinholters! Glancing at the unrolling, I'm a little more optimistic about that approach. To use Core._apply we need a sequence of collections which will all be flattened together into the final arg list. But there doesn't need to be a 1:1 correspondence between the input args in the surface syntax and the intermediate set of tuples. So I wonder whether we can use that to our advantage.

One inconvenience of wrapping the arguments in a function, whether it is splatarg or Tuple, is nested splatting. (I only learned we supported this yesterday, btw.) E.g. f((x...)...) becomes _apply(_apply, (f,), x). Sure, we could lower this to _apply(_apply, (f,), splatarg(x)) instead, but we also need to call splatarg on the elements of x.
