From https://discourse.julialang.org/t/julia-ism-for-two-dimensional-map/10613/13:
Is there any way to recognize constructs such as
m[i,:]= [ 1.0, 2.0, 3.0 ]
and replace this with
m[i,1]= 1.0; m[i,2]=2.0; m[i,3]=3.0
(I know why, of course.) The latter is orders of magnitude faster than the former, and this seems like very low-hanging fruit. There are of course other similar constructs that could benefit from the same treatment. A code analysis of random Julia code could probably point to cases in which a memory allocation is followed by an immediate deallocation, and the compiler could then consider unrolling those.
PS: Until something like this is implemented, may I suggest that the sections with advice on performance, which mention issues such as type instability, include this one too? Just state that x[i,1]=y is much faster than x[i,:]=[y].
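The gap between the two idioms is easy to measure. A minimal sketch (the function names are illustrative, and the BenchmarkTools package is assumed to be installed, as in the timings quoted later in this thread):

```julia
using BenchmarkTools  # assumed installed; used only for the timings

m = zeros(3, 3)

# Slice-assignment version: the RHS Vector is heap-allocated, then copied in.
slicewise!(m, i) = (m[i, :] = [1.0, 2.0, 3.0])

# Unrolled version: three scalar stores, no heap allocation.
unrolled!(m, i) = (m[i, 1] = 1.0; m[i, 2] = 2.0; m[i, 3] = 3.0)

@btime slicewise!($m, 2)  # allocates a fresh 3-element Vector on every call
@btime unrolled!($m, 2)   # typically reports zero allocations
```

Both functions leave the row in the same state; only the allocation behavior differs.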
regards,
/iaw
Shouldn't you just use .= here?

How do you know what setindex! does with Colon for any m?
This would be a really good optimization to have. It's difficult to implement since, as Kristoffer said, the compiler has to prove that the rewrite is valid, and it's a pretty complex one. But if everything is inlined and unrolled, it could be a tractable allocation elimination optimization.
With the .= syntax, you can use a tuple instead. Small tuples like these are _far_ more efficient than arrays, and often are able to do precisely the optimization you're after here.
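Concretely, the tuple variant of the original example looks like this (a sketch of the suggestion above, not a guarantee of zero allocations on every Julia version):

```julia
m = zeros(3, 3)
i = 1

# Broadcasting a tuple into the row slice: no intermediate Vector
# needs to be heap-allocated for the right-hand side.
m[i, :] .= (1.0, 2.0, 3.0)
```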
@mbauman, the tuple version was suggested in the discourse thread, but simply unrolling the 3-element loops was still far faster (at least in Julia 0.6), so making sure the tuple version is fast would be a good starting point.
More generally, the problem is that inexperienced Julia programmers allocate unnecessary little arrays in lots of ways, not just the one cited here, because they've been trained by other languages that "vectorization is always good", and it will be difficult to get the compiler to recognize and unroll more than a tiny subset of these cases. (I've lost track of the number of times I've seen people do things like sum([x,y,z]) rather than x+y+z.) Some of these cases could use StaticArrays, and some could be handled by compiler optimizations, but mainly people need to re-calibrate their mental model of performance. It's not entirely clear to me whether a more clever compiler helps or hurts with this, since it could make reasoning about performance more complicated. Put another way, for a compiler optimization like this to be really useful, it should apply to a lot of cases (not just code written in a very specific idiom like x .= [....]) and it should be easy to reason about when it applies.
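For the StaticArrays route mentioned above, a minimal sketch (the StaticArrays package is assumed to be installed):

```julia
using StaticArrays  # assumed installed

# An SVector has a fixed, compile-time-known size, so it needs no
# heap allocation; operations on it can be fully unrolled.
v = SVector(1.0, 2.0, 3.0)
sum(v)  # comparable in speed to writing 1.0 + 2.0 + 3.0 directly
```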
Yup, I definitely do not disagree. The tuple still poses a challenge to the broadcast machinery as it is heterogeneous and we only use the special tuple implementation in the out-of-place broadcast. It may be worth a special case.
julia> @btime sum([x,y,z])
89.331 ns (2 allocations: 128 bytes)
julia> @btime sum( (x,y,z,) )
18.903 ns (1 allocation: 16 bytes)
julia> @btime x+y+z;
18.889 ns (1 allocation: 16 bytes)
in general, I think the Julia compiler could be more "warning-y". the need to optimize is much smaller when it can emit a warning on demand ("user, please consider replacing this array with a tuple?"). this would also help with Steven's concern: educating the rest of us.
not sure I understand Kristoffer's comment, and don't need to; I have no knowledge of the real internals. I was wondering when, in an assignment, the compiler cannot switch an array on the RHS into an equivalent tuple to avoid the alloc-dealloc, and whether such cases are hard to detect.
but in any case, the point is made. hope it helped. and thanks, Steven, for teaching us.
/iaw
@iwelch, Julia does include tools to track memory allocations: https://docs.julialang.org/en/stable/manual/profile/#Memory-allocation-analysis-1
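Besides the per-line `--track-allocation=user` analysis described in that manual section, the built-in `@allocated` macro gives a quick per-expression check. A sketch (values are typical, not guaranteed on every Julia version):

```julia
# Put the work in functions so the compiler sees concrete argument types.
f(x, y, z) = sum([x, y, z])  # allocates a temporary Vector
g(x, y, z) = x + y + z       # pure scalar arithmetic

# Warm-up calls so compilation costs are not counted below.
f(1.0, 2.0, 3.0); g(1.0, 2.0, 3.0)

@allocated f(1.0, 2.0, 3.0)  # nonzero: a Vector is heap-allocated
@allocated g(1.0, 2.0, 3.0)  # typically 0 after warm-up
```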