In a few places, the performance of splatting a number has shown up: https://github.com/JuliaLang/julia/pull/29060, https://github.com/JuliaLang/julia/pull/28764#issuecomment-419917783, and working around it gives significant speed-ups. The code is well inferred, but there is a call to `_apply` instead of the actual method. This disappears if the number is wrapped in a tuple. It would be nice to have this fixed so that these kinds of workarounds are no longer needed.
Ref https://github.com/JuliaLang/julia/pull/27434#issuecomment-400335764
cc @mbauman, @martinholters
We could special-case numbers in the inliner, as we do tuples. The more generic solution would have to emulate the iteration protocol.
Forgive me if this is a naive understanding of inference, but it seems like there's a middle ground here: isn't it fairly easy to see that `iterate(::X, ::Any)::Const(nothing)` for `X`s that only iterate once?
Yes, but that's still emulating the iteration protocol in the optimizer (inference already does some of this), except that the `iterate` calls aren't part of the code (since `_apply` does the iteration internally). Basically you need to implement this TODO: https://github.com/JuliaLang/julia/blob/3c8119f4c1fe3a83a488e04caa834f26432317a2/base/compiler/ssair/inlining.jl#L831 or special-case it there.
I just noticed we share basically the same problem over at https://github.com/JuliaArrays/StaticArrays.jl/issues/361.
What if we change the lowering of splatting so that `f(x...)` turns into something like `Core._apply(f, Base.splatarg(x))`? We can have `Base.splatarg(x) = x` by default (or possibly collect into a tuple via iteration?). For some more trivial cases like `Number` and `StaticVector`, a couple of overrides would solve this issue:

```julia
Base.splatarg(x::Number) = (x,)
Base.splatarg(v::StaticVector) = Tuple(v)
```

This does mean that splatting would become configurable by user types, but that seems generally consistent with the way we usually lower syntax into function calls.
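To make the proposal concrete, here is a minimal runnable sketch. All names are hypothetical stand-ins (nothing here is in `Base`), and ordinary splatting replaces `Core._apply` so the sketch runs on current Julia:

```julia
# Sketch of the proposed lowering: f(x...) would become
# Core._apply(f, splatarg(x)). Plain splatting stands in for Core._apply.
splatarg(x) = x             # default: splat via iteration, as today
splatarg(x::Number) = (x,)  # numbers iterate exactly once; wrap in a tuple

splat_call(f, x) = f(splatarg(x)...)
```

With this, `splat_call(tuple, 1)` takes the cheap tuple path instead of driving the iteration protocol through `_apply`.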
Nice idea. Maybe it can be `Base.splattable` for consistency with `Base.broadcastable`.

> `Base.splattable` ...

Interesting analogy; we could go with it, though I think the name is less evocative than `splatarg`.
@Keno's point about making inference aware of iteration in `_apply` might just be a better option if we want splatting to be inherently defined by iteration. Right now I can't think of a reason why you'd ever want splatting to be different, but maybe someone with a better vision of the Julian far future can (I'm looking at you @andyferris :-P)
Haha... no, I think I believe in strong interfaces, and guarantees the compiler doesn't break for the purpose of performance optimizations. Using `splatarg` at least puts control/correctness in the hands of users. On the other hand, it puts control/correctness in the hands of users...! It feels a bit like `literal_pow`: a clever and useful lowering trick, but one that relies on users' good behaviour to maintain the illusion of the fundamental guarantee of referential transparency (not saying I've thought of anything better...). So, to my personal taste we would only resort to that if Keno's approach is too messy. But my opinion isn't strong.
Yes, something which can be overridden only really makes sense if there's a compelling reason to allow splatting to be different from iteration in some cases. Otherwise it's just a nasty performance hack. My question was really whether such a use case exists. In hindsight that's actually rather orthogonal to solving this issue.
Another thought: I don't know how hard it is to teach inference about iteration in `_apply`, but an interesting alternative would be to surface the C implementation of argument list construction in `_apply` to the Julia level. That is, to lower

```julia
f(x, y..., z...)
```

into something like

```julia
args = flatten_args((x,), y, z)
_apply(f, args)
```

where `flatten_args` applies the iteration protocol to construct the splatted argument list.
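A minimal Julia-level sketch of such a `flatten_args` (the name comes from the lowering above and is hypothetical), collecting into a `Vector{Any}` much like `_apply`'s C implementation does internally:

```julia
# Apply the iteration protocol to each argument collection and concatenate
# everything into a single Vector{Any} of splatted arguments.
function flatten_args(colls...)
    args = Any[]
    for c in colls
        it = iterate(c)
        while it !== nothing
            v, s = it
            push!(args, v)
            it = iterate(c, s)
        end
    end
    return args
end
```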
OK - that seems pretty interesting!
It's currently tricky to surface the C implementation of `_apply` argument flattening in a useful way. Ideally we'd want something roughly like a stack-allocated `Vector{Any}`. I guess that's not possible right now.
One approach to implementing "when possible" would be to generalize the `IteratorSize` trait to express lengths which are known based on the type of the input; a `Base` version of `StaticArrays.Length{N}()`. This would require that users know about and implement `Base.IteratorSize` for statically sized objects. A difficulty is that `IteratorSize` mixes together the expression of length via `HasLength` and size via `HasShape`.
Doing more in inference to constant-propagate the iteration state for small finite iterations seems helpful, and we might get rid of `Base.indexed_iterate` as part of that. I found the output of the following confusing in fairly recent julia-1.3:

```julia
julia> function iter_unpack(t)
           i1 = iterate(t)
           a = i1[1]
           i2 = iterate(t, i1[2])
           b = i2[1]
           b
       end
iter_unpack (generic function with 1 method)

julia> @code_warntype optimize=true iter_unpack((1,2))
Variables
  #self#::Core.Compiler.Const(iter_unpack, false)
  t::Tuple{Int64,Int64}
  i1::Union{Nothing, Tuple{Int64,Int64}}
  a::Int64
  i2::Union{Nothing, Tuple{Int64,Int64}}
  b::Int64
# big pile of code ...
# i1 and i2 are partially const but that seems not to be inferred
```
> something which can be overridden only really makes sense if there's a compelling reason to allow splatting to be different from iteration in some cases
If we are going to add an API to manipulate splatting behavior, it would be great if it can also be dispatched on from the function side. For example, this lets you dispatch `+(xs...)` to `sum(xs)` when `xs` is an array. I think this is appealing because it would let any binary function `op` dispatch `op(xs...)` to `reduce(op, xs)` or `foldl(op, xs)`. This would give us a concise and intuitive syntax for reduction. I'm thinking something like `op(xs::GenericVararg{<:AbstractArray})`, while `op(xs::GenericVararg{<:Tuple})` is equivalent to `op(xs::Vararg)`. It would also be nice if `f.(xs)` in `op(f.(xs)...)` is not materialized; we can then fuse the mapping and reduction by defining `op(xs::GenericVararg{<:Broadcasted})`.
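A runnable sketch of this idea under the assumed lowering `f(xs...) → splatapply(f, GenericVararg(xs))`; every name here is hypothetical, not an existing API:

```julia
# Hypothetical wrapper carrying the to-be-splatted collection, so the
# callee can dispatch on its type.
struct GenericVararg{T}
    args::T
end

splatapply(f, v::GenericVararg) = f(v.args...)              # generic fallback
# dispatch +(xs...) to sum(xs) when xs is an array, as proposed:
splatapply(::typeof(+), v::GenericVararg{<:AbstractArray}) = sum(v.args)
```

With this lowering, `+(xs...)` on a million-element array would hit the `sum` method instead of a million-argument call.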
@tkf that's a very interesting idea to get actual syntax for `reduce` and `mapreduce` while preserving the meaning of splatting:

```julia
+(xs...)          sum(xs)
*(xs...)          prod(xs)
min(xs...)        minimum(xs)
max(xs...)        maximum(xs)
+(abs.(xs)...)    mapreduce(abs, +, xs)
+(xs.^2...)       mapreduce(x->x^2, +, xs)
```

That last one is particularly appealing syntax. One problem is that it would mean defining a lot of `op(xs::GenericVararg{<:Union{Broadcasted,AbstractArray}})` methods for many different `op`, which doesn't seem scalable. Just like the vectorized function problem again. Perhaps there's an acceptable lowering somewhere here which pairs some syntax with reduction in a similar way that `.` and broadcast work.
> One problem is that it would mean defining a lot of `op(xs::GenericVararg{<:Union{Broadcasted,AbstractArray}})` for many different `op`, which doesn't seem scalable.
I think there are multiple solutions. For example:

```julia
abstract type AssociativeOperator <: Function end

(op::AssociativeOperator)(xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(op, xs)

struct Plus <: AssociativeOperator end
struct Multiply <: AssociativeOperator end
struct Vcat <: AssociativeOperator end

const + = Plus()
const * = Multiply()
const vcat = Vcat()

+(x, y) = ...
*(x, y) = ...
vcat(x, y) = ...
```
Alternatively, you can lower `f(xs...)` to `splatapply(f, GenericVararg(xs))` and define

```julia
const AssociativeOperator = Union{
    typeof(*),
    typeof(+),
    typeof(vcat),
    # ...and so on...
}

splatapply(f::AssociativeOperator, xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(f, xs)
```
I guess it's probably a better idea to keep the discussion here focused on shorter-term performance improvements, and I don't want to further clutter this issue. So I opened a new issue #32860 to discuss more wild ideas. @c42f let's discuss the idea there.
Let me repeat the main points from https://github.com/JuliaLang/julia/pull/27434#issuecomment-400335764: Inference already simulates the iteration protocol (to some degree); however, that is solely used to determine the return type of the `_apply`, not to elide it:
```julia
julia> foo(x) = (x...,)
foo (generic function with 1 method)

julia> @code_typed foo(1)
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64}
└── return %1
) => Tuple{Int64}
```
As you can see, inference correctly determines that splatting an `Int64` will yield exactly one `Int64`, and that calling `tuple` with it will give a `Tuple{Int64}`. The missing puzzle piece is having the inliner replace the `_apply` call with an appropriate sequence of `iterate` calls. Note that all the intermediate return types of these `iterate` calls have been determined during inference, but I think we lack the infrastructure to feed this information through to the optimizer. Given that information, it should be relatively easy (cough, I guess) to synthesize the appropriate `iterate` sequence for cases with a statically known, small number of resulting elements (e.g. a `Number`).
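For a `Number`, the `iterate` sequence in question is short. A hand-written sketch of what eliding the `_apply` in `foo(x) = (x...,)` could expand to (`foo_expanded` is a hypothetical name; this is what the inliner would synthesize, not existing compiler output):

```julia
# For a Number, iterate(x) returns (x, nothing) and the second iterate
# returns nothing, so the expansion produces the 1-tuple (x,).
function foo_expanded(x)
    i1 = iterate(x)
    i1 === nothing && return ()
    v1, s1 = i1
    i2 = iterate(x, s1)
    i2 === nothing && return (v1,)
    error("expected x to iterate at most once")
end
```

Since every branch here is statically decidable for `x::Number`, the whole function folds down to `(x,)`.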
That said, inference does not currently propagate constants during the iteration emulation, so that e.g.

```julia
julia> @code_typed foo(@SVector [1,2,3])
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64,Vararg{Int64,N} where N}
└── return %1
) => Tuple{Int64,Vararg{Int64,N} where N}
```
That might be fixable; however, recognizing that it is actually one of the elementary types (here a `Tuple`) that is being splatted might still prove advantageous, not sure. In hindsight, it would have been nice if we had added an initial `iterates_as` call to the iteration protocol, defaulting to `iterates_as(x) = x`. Then one could easily e.g. hand out the internal tuple from an `SVector` without risking inconsistency between iterating and splatting as with the proposed `splatarg`.
Regarding the more general issue, I did some experimenting that I also want to share.
I started by taking `Tuple(x) == (x...,)` as axiomatic and turning it around, lowering e.g. `f(x..., y)` as `Core._apply(f, Tuple(x), (y,))`. Of course, this needs a definition of `Tuple(x)` that does not rely on splatting, except for splatting `Tuple`s, as `Tuple(x::Tuple) = x` is trivial. What I came up with is an `iterate` sequence (basically a manually unrolled loop) up to a certain limit, and if that is exceeded, calling a new built-in function that can turn a `SimpleVector`, `NamedTuple`, or `Vector` into a tuple; basically a trimmed-down version of `_apply` with the trivially inferred cases only.
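A rough sketch of such a partially unrolled `Tuple(x)` (hypothetical name `tuple_unrolled`, unroll limit of 2 for brevity; instead of the new built-in, the fallback here collects into a `Vector{Any}` and splats it, which is exactly the case the built-in would handle without splatting):

```julia
function tuple_unrolled(x)
    i1 = iterate(x)
    i1 === nothing && return ()
    v1, s1 = i1
    i2 = iterate(x, s1)
    i2 === nothing && return (v1,)
    v2, s2 = i2
    i3 = iterate(x, s2)
    i3 === nothing && return (v1, v2)
    # past the unroll limit: collect the rest dynamically
    rest = Any[]
    while i3 !== nothing
        v, s = i3
        push!(rest, v)
        i3 = iterate(x, s)
    end
    return (v1, v2, rest...)
end
```

When the length is statically known and below the limit, every branch folds away and the result type is a concrete tuple; past the limit, inference only sees `Tuple{...,Vararg{Any}}`, which is where the performance penalties come from.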
I got this working (mostly) with modest effort, and my conclusions are: excellent for cases with an inferable (= size statically known) result of `Tuple(x)`, where the `_apply` call can be elided; but performance penalties otherwise, as having to store the intermediate values in a dynamically allocated tuple is less than ideal. E.g. splatting a vector became about 3x slower.
I have a feeling this is a dead end due to the unavoidable penalties in the size-not-statically-known case, but maybe someone has a good idea...
> What I came up with is an `iterate` sequence (basically a manually unrolled loop) up to a certain limit
Right, the difficulty is all about the flattening "when possible" referred to in https://github.com/JuliaLang/julia/issues/29114#issuecomment-515843814. We can try to get at that information via your partial unrolling approach, via traits, or via better inference. It's dissatisfying to introduce traits if we can solve the problem by being smarter with iteration in inference.
It would be interesting to see where you got to with the partial unrolling approach. It looks like part of that is on the `mh/totuple_builtin` branch?
The `mh/totuple_builtin` branch contains the code for the new builtin, but I'm not sure that's actually needed. While the code is simpler than for `_apply`, it doesn't seem to add any performance advantage compared to using `_apply(tuple, x)`. Nevertheless, I'll try to bring my code into a less messy state and put it in a branch for others to experiment with. I'll post here once done.
I also had another idea that might save this approach. Say we lower `f(x...)` as `_apply(f, Tuple(x))` and define `Tuple(x) = _apply(tuple, x)`; then inlining would give `_apply(f, _apply(tuple, x))`, where the optimizer could elide the inner `_apply` and get back to `_apply(f, x)`. However, we could still provide efficient `Tuple` constructors for specific types (and maybe a partially unrolled default fallback, as long as it is inlined so that the same elision of the inner `_apply` remains possible). Not sure when I'll find time to drive this further, though.
Just pushed the current state of my efforts to `mh/rework_apply`.
Great, thanks @martinholters! Glancing at the unrolling, I'm a little more optimistic about that approach. To use `Core._apply` we need a sequence of collections which will all be flattened together into the final arg list. But there doesn't need to be a 1:1 correspondence between the input args in the surface syntax and the intermediate set of tuples, so I wonder whether we can use that to our advantage.
One inconvenience when wrapping the arguments in a function, whether it is `splatarg` or `Tuple`, is nested splatting. (I only learned we support this yesterday, btw.) E.g. `f((x...)...)` becomes `_apply(_apply, (f,), x)`. Sure, we could lower this to `_apply(_apply, (f,), splatarg(x))` instead, but we would also need to call `splatarg` on the elements of `x`.