In a PR for a blog post, I described a number of challenges for several potential new AbstractArray types. Very briefly, the types and the challenges they exhibit are:

- ReshapedArrays: the only efficient way of indexing some types is by indexing the parent array directly (i.e., with an iterator optimized for the pre-reshaping array). This means that indexing a 2D array as B[i,j], for integer i and j, may not always be efficient.
- TransposedMatrix (for #4774): iteration is most efficient when it produces a cache-friendly sequence of array accesses. Unfortunately, algorithms that mix column-major and row-major arrays imply that we can't naively allow eachindex(A) to always return the most efficient iterator.
- OffsetArrays (sometimes called "Fortran arrays"): these violate the assumption that array indexes run from 1:size(A,d), and more generally raise the question of how one should match up elements of two different arrays.

In the blog post, only two concrete algorithms were considered: matrix-vector multiplication and copy!. At the risk of being insufficiently general, let's explore possible APIs in terms of these two functions.
EDIT: note that the thoughts on the API are evolving as a consequence of this discussion. To jump ahead to an updated proposal, see https://github.com/JuliaLang/julia/issues/15648#issuecomment-205306513. There may be further updates even later. The material below is left for reference, but some of it is outdated.
For the moment, let's worry about correctness and efficiency more than elegance. Correctness trumps all other considerations. Efficiency means that we have to be able to iterate over the matrix in cache-friendly order. That means that the array B has to be "in charge" of the sequence of values and pick its most efficient iterator.
I should emphasize that I haven't taken even the first step towards attempting to implement this API. So this is crystal-ball development, trying to imagine what might work while hoping to shake
out likely problems in advance via community feedback. That caveat understood, here's a possible implementation of matrix-vector multiplication:
RB = eachindex(B)
Rv = eachindex(RB[*,:], v)
Rdest = eachindex(RB[:,*], dest)
for (idest, iB, iv) in zip(Rdest, RB, Rv)
    dest[idest] += B[iB] * v[iv]
end
eachindex(iter, obj) should perform bounds-checking, to ensure that the indexes of iter align with obj (even handling things like OffsetArrays). The notation RB[*,:] means "corresponding to the second dimension of RB"; interestingly, this expression parses without error, attempting to dispatch to getindex with argument types Tuple{typeof(RB), Function, Colon}. Hopefully there isn't another pending meaning for indexing with a Function.
Because either idest or iv is constant over the "inner" loop (depending on the storage order of B), ideally one would like to hoist the corresponding access dest[idest] or v[iv]. I'd argue that's a job for the compiler, and not something we should have to bake into the API.
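For concreteness, here is roughly what the hand-hoisted version would look like when B happens to be column-major; this is a plain loop (none of the proposed API), shown only to illustrate the access pattern one would hope falls out automatically:

```julia
B = rand(4, 3); v = rand(3); dest = zeros(4)   # illustrative sizes
for j in 1:size(B, 2)
    vj = v[j]                     # constant over the inner loop, hence hoisted
    for i in 1:size(B, 1)
        dest[i] += B[i, j] * vj
    end
end
```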
Because zip(X, Y) does not allow Y to see the state of X, it seems quite possible that this could not be made efficient in general (consider the case described for ReshapedArrays, where the efficiently-parametrized iterator for the first dimension depends on the state of the index for the second dimension). If this proves to be the case, we have to abandon zip and write this more like
RB = eachindex(B)
for (iB, idest, iv) in couple(RB, (RB[:,*], dest), (RB[*,:], v))
    dest[idest] += B[iB] * v[iv]
end
where couple is a new exported function.
One interesting thing about this API is that sparse multiplication might be implemented efficiently just by replacing the first line with
RB = eachstored(B)
so that only "stored" elements of B
are visited. (@tkelman may have been the first to propose an eachstored
iterator.) It should be pointed out that several recent discussions have addressed some of the complications that arise from Inf
and NaN
, and it remains to be determined whether a version that doesn't ignore these complications can be written with this (or similar) API.
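As a rough sketch of what such an iterator could look like for SparseMatrixCSC (no such function exists; the name and the (row, column, value) return shape here are purely illustrative), one could walk only the stored entries in column-major order:

```julia
using SparseArrays

# Hypothetical eachstored for CSC: yields (row, col, value) for stored entries only.
eachstored_sketch(A::SparseMatrixCSC) =
    ((rowvals(A)[k], j, nonzeros(A)[k]) for j in 1:size(A, 2) for k in nzrange(A, j))

A = sprand(5, 5, 0.3)
for (i, j, v) in eachstored_sketch(A)
    @assert A[i, j] == v
end
```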
Likewise, copy! might be implemented
Rdest = eachindex(dest)
for (idest, isrc) in couple(Rdest, (Rdest, src))
    dest[idest] = src[isrc]
end
At least for the array types described here, this API seems plausibly capable of generating both correct and (fairly) efficient code. One might ideally want to write generic "tiled" implementations, but that seems beyond the scope of this initial effort.
This API is still more complicated than naive iteration. One might possibly be able to hide a lot of this with a magical @iterate macro,
@iterate B dest[i] += B[i,j]*v[j]
for matrix-vector multiplication and
@iterate dest dest[i] = src[i]
for copy!. The idea is that the @iterate macro would expand these expressions into their longer forms above. Note that the first argument to @iterate specifies which object is "in charge" of the sequence of operations. @iterate might be too limited for many things, but may be worth exploring. The KernelTools repository may have some useful tidbits.
Thanks a lot for writing the blog post and this issue, I learned a lot from it.
I have a question: You propose
RB = eachindex(B)
Rv = eachindex(RB[*,:], v)
If B is an Array, RB will simply be 1:length(B), which loses all the shape information of B and makes it harder to interpret the second eachindex call.
In general, if you can come up with a couple function (which I think would be great), do you need the prior eachindex call at all? Why not simply do for (iB, idest, iv) in couple(B, (B[:,*], dest), (B[*,:], v)), where again B is the array that determines the iteration order?
So if possible I would prefer having a single couple function that could be heavily specialized for different argument types, and one would need eachindex and zip only internally (called by couple).
I am also not sure if it is always a good idea to put one iterator _in charge_ of the whole iteration order. How would you, for example, formulate the loop for map!(dest, f, a, b), where the three arrays might have different storage orders? Could one implement a couple function that lets the three arrays dest, a, and b decide via "majority vote" how to iterate?
You're absolutely right; the intent was to keep all the shape information, and I overlooked the problem you pointed out. I was trying to avoid directly indexing arrays with something weird (the function *), since it seems like that could be unpopular. But what I wrote won't work. A possible alternative would be B[?,:], because ? is not defined yet, and so indexing arrays with ? seems like it might be OK.
Your two points also come together very nicely if one thinks about matrix-matrix (rather than matrix-vector) multiplication. First, you pretty much have to have a syntax that addresses the arrays directly, rather than some "master" iterator. Second, you're right that you probably don't want any iterator in charge. Something like this?
for (idest, iA, jdest, jB, kA, kB) in couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))
    dest[idest,jdest] += A[idest,kA]*B[kB,jB]
end
which could hopefully be simplified to
@iterate dest[i,j] += A[i,k]*B[k,j]
Here, there's not actually much magic in @iterate (other than parsing the pattern of indexes and array-accesses); all the hard work is being done by couple.
My big worry is that couple could be quite hard to write. Still, it seems like the right place to put the complexity: do a bunch of analysis right at the beginning of the loop to figure out the optimal access pattern.
In some situations it may be difficult/impossible to make it type-stable (though all the ?/: indexing is designed to make this more feasible). As a fallback one could do
function foo!(dest, A, B)
    iters = couple(...)
    _foo!(dest, A, B, iters)
end

@noinline function _foo!(dest, A, B, iters)
    for (idest, iA, jdest, jB, kA, kB) in iters
        ...
    end
end
However, for the @iterate macro this would need to be
function foo!(dest, A, B)
    iters = couple(...)
    @noinline function _foo!(dest, A, B, iters)
        for (idest, iA, jdest, jB, kA, kB) in iters
            ...
        end
    end
    _foo!(dest, A, B, iters)
end
I haven't yet tested whether _foo! could be placed inside foo! and yet solve the type-instability problem. (I suspect so, but worth testing.)
Ooooooh. This is now awfully close to an abstract index notation, which allows generalizing this to more than two dimensions:

for magic in joined(dest[:i,:j], A[:i,:k], B[:k,:j])
    dest[magic,:i,:j] += A[magic,:i,:k] * B[magic,:k,:j]
end
Obviously the syntax for the loop body can be improved, maybe with a macro, maybe with clever types (e.g. turning the indices i,j,k into functions that take the array as argument).
-erik
@eschnett, what you're suggesting is exactly what the @iterate macro does. Symbols don't have different types, so to do this via functions you have to introduce the appropriate types. But macros can parse the symbols, and emit code that is properly type-stable. In other words, the @iterate macro is effectively a pretty wrapper around couple.
I was trying to generalize the notation for : and ?, and then got side-tracked by what the loop body should look like. So instead of writing
couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))
one could write
coupled(dest[:i,:j,:k], A[:i,:j,:l], B[:j,:k,:m], C[:m])
which allows for more than two indices that can be used in arbitrary order, allowing transposition.
Let's look at it from the other side. It seems there are three things @iterate needs to do for an expression like dest[i,j] += A[i,k]*B[k,j]:
@iterate (i, j, k) begin # specify iteration order
    D[i,j] += A[i,k]*B[k,j]
end

# would expand to:
(sz1, sz2, sz3) = … # get loop lengths and validate array sizes
Ditr = eachindex(D, (1, 2)) # will want Val{(1,2)} or some such
Aitr = eachindex(A, (1, 2))
Bitr = eachindex(B, (2, 1))
Dstate = start(Ditr)
Astate = start(Aitr)
for _i = 1:sz1
    Bstate = start(Bitr) # B isn't dependent upon i, @iterate just restarts it every loop
    for _j = 1:sz2
        (iD, Dstate) = next(Ditr, Dstate) # D isn't dependent on k, so it hoists it
        Astate′ = Astate # A isn't dependent upon j, so we jump back to where we were
        for _k = 1:sz3
            (iB, Bstate) = next(Bitr, Bstate)
            (iA, Astate′) = next(Aitr, Astate′)
            D[iD] += A[iA]*B[iB]
        end
    end
    Astate = Astate′
end
I believe this is sufficiently general that @iterate will always be able to work out where to start, hoist, and hold iteration states. The magic, of course, now happens within eachindex(A, permutation). And while I think that will also be rather challenging to write for more than two dimensions, I think it'd be much simpler than trying to couple iterators together.
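To make the eachindex(A, permutation) idea concrete, here is a rough (and deliberately naive, not type-stable) sketch of what such an iterator could return for a plain array; the function name is made up and nothing like it exists in Base:

```julia
# Visit the cartesian indices of A so that order[1] is the fastest-varying
# dimension, order[2] the next-fastest, and so on.
function eachindex_ordered(A, order)
    perm = invperm(collect(order))
    return (CartesianIndex(Tuple(J)[perm]...)
            for J in CartesianIndices(map(d -> axes(A, d), order)))
end

A = reshape(1:6, 2, 3)
collect(eachindex_ordered(A, (2, 1)))  # visits (1,1), (1,2), (1,3), (2,1), ... i.e. row-major
```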
I'm sure you're aware, Tim, but I'd like to also point to TensorOperations.jl, which I think does some of @iterate (just not the fast iterated non-cartesian part). And although its reason for being is completely different, numpy.einsum is also somewhat similar. Lots of people seem to use it, but I've never been able to sink my teeth into it — I find it quite unreadable.
@eschnett, the reason I think you need to set this up in a type-stable manner is that some array types aren't efficiently indexed by an NTuple of integers. For example, if A is a SparseMatrixCSC, A[i,j] (for integer i, j) does not have impressive performance; a much better strategy is to iterate over an index into rowval/nzval and keep track of the corresponding i, j as you go. ReshapedArrays are another good example, and the one that drove me to think about this. For both of these cases, you'd really like to use a custom iterator specialized to extract the last bit of performance. (Particularly for the sparse case, if you only need to visit filled values the savings are _huge_.)
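That strategy, written out for dest = A*v with a CSC matrix (just standard SparseArrays accessors, nothing from the proposed API), looks like this; note that A[i,j] is never called:

```julia
using SparseArrays

function spmatvec!(dest, A::SparseMatrixCSC, v)
    fill!(dest, zero(eltype(dest)))
    rows, vals = rowvals(A), nonzeros(A)
    for j in 1:size(A, 2)
        vj = v[j]
        for k in nzrange(A, j)        # k indexes directly into rowval/nzval
            dest[rows[k]] += vals[k] * vj
        end
    end
    return dest
end

spmatvec!(zeros(4), sprand(4, 5, 0.4), rand(5))
```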
@mbauman, I suspect you're right about some of the challenges. What you're describing is basically how the macros in KernelTools work currently; I also wrote a nifty little macro (@time on steroids) that tests all possible loop orderings and reports the time for each, thus giving the user the chance to hard-code the one that works best. I should take a look at TensorOperations; I haven't looked in a long time, so I probably don't remember all the tools (or they may have changed).
But for generic code (that might have to handle Matrix and MatrixTranspose), it sure would be nice to be able to be flexible without having to write tons of cases out by hand. May be a pipe dream (I am sure that many have tried to do this before...e.g., Polly).
@mbauman, in your eachindex(A, (d1, d2)), am I reading this right that you're specifying that dimension d1 is the "inner" dimension and d2 is the "outer" dimension? Seems like a promising idea!
Yes, precisely. The core idea is that each array only needs to concern itself with the _order_ of iteration, and the @iterate macro itself handles starting, hoisting, and holding the iteration state at particular points in the expansion to ensure things stay in sync.
I guess the reason I didn't go this way is because I was attracted by the hope of solving the loop-ordering problem. Since the macro can't see the types, I was looking for a way for it to set things up in a form that would be amenable to type analysis.
@timholy I don't understand the connection between the order in which the indices are traversed and the syntax to describe which indices need to be coupled.
I agree that sparse (or reshaped, or symmetric where only part of the elements are stored, etc.) arrays require special care.
Let's take this loop as example:
for (i, j, k) in coupled(r[:i,:j,:k], a[:i,:j,:k], b[:j,:k,:i], c[:k,:i,:j])
    r[i,j,k] = a[i,j,k] + b[j,k,i] + c[k,i,j]
end
What would this look like in the ?/: notation?
Something like couple(iblock, jblock, kblock) where
iblock = (r[:,?,?], a[:,?,?], b[?,?,:], c[?,:,?])
jblock = (r[?,:,?], a[?,:,?], b[:,?,?], c[?,?,:])
kblock = (r[?,?,:], a[?,?,:], b[?,:,?], c[:,?,?])
I.e., you match the indexes with the :s.
One of the points perhaps worth making is that I was also looking for an analog of for i = 2:n..., and the ?/: syntax might generalize reasonably well there. But bad things happen if there are no ?, because then it would dispatch just to regular getindex.
All this is to simply explain my reasoning; while I'm attracted by solving the loop ordering problem, I recognize that this may not be possible/practical. (I don't know the literature here.)
@timholy Okay; that's a straightforward transformation from the (:i,:j,:k) syntax I like. There's essentially one block per index, specifying where the index appears for each array.
Determine the optimal loop order. This is hard… and an endless task if we consider BLAS-like partial loop striding. I say we punt, and require the user to specify the order.
I just want to support @timholy's point that I think (at least partially) solving the loop ordering problem would be very nice. Otherwise one would have to accept that generic library functions like map! and copy! might be slow for transposed matrices, since the user would not be able to influence the loop order.
Perhaps I'm most interested in a stepwise approach: if possible, pick an API that doesn't prevent us from tackling this someday, but punt on loop-ordering for now. Revamping how we do array iteration is a big problem no matter how we slice it, and we might make faster progress by first choosing the smallest feasible unit and getting that working.
I think that it still may be possible to do something about the for loop ordering problem, but it's a bear. To do so, we'll need to use a bunch of flat iterators. Let's take a step back from syntax and type-stability, and just look at what kinds of iterators are required for a flat matrix multiplication loop:
for i=1:ni, j=1:nj, k=1:nk
    D[i,j] += A[i,k]*B[k,j]
end
Consider a vector that's been reshaped into a matrix. How many different iterators does this type need to implement in order to efficiently participate as either A, B, or D in the above expression?
- D: This is column major! Great, so we can just use a range 1:length(A). Oh, but wait, there's an inner k loop that we don't care about. We'll need a lazy version of repeat(1:length(A), inner=[nk]), which I don't think we have. But we can write that iterator wrapper once and be done.
- B: Shucks, row major traversal. Gotta use row-major cartesian iteration. Oh, and we also have to cycle it ni times. Easy enough, we have cycle.
- A: Ok, column major again, that's good. But… how in the world do we express this? It needs to produce chunks of nk elements at a time, repeating each chunk nj times before moving on to the next chunk. Ok, we can write that iterator, too (a rough sketch of these wrappers follows below). But what if there was a fourth loop? I'm sure there's a way to do this generally, but man, it'll be a pain.

I think this demonstrates that there are two orthogonal concerns here. As far as the individual array goes, it's really only concerned with traversal ordering. couple (or whatever) can wrap those iterators with the required wrapper in order to comply with the structuring of the other loops.
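Just to gauge feasibility, the two wrappers described above can already be pieced together from Base iterators (sizes here are illustrative; a real implementation would want dedicated, inferable iterator types rather than these generators):

```julia
using Base.Iterators: flatten, repeated, partition

n, nj, nk = 6, 4, 3

# a lazy "repeat(1:n, inner=[nk])": each index repeated nk times
inner_repeat = flatten(repeated(i, nk) for i in 1:n)

# chunks of nk indices at a time, each chunk repeated nj times before moving on
chunk_repeat = flatten(flatten(repeated(chunk, nj)) for chunk in partition(1:n, nk))

collect(inner_repeat)  # 1,1,1, 2,2,2, 3,3,3, ...
collect(chunk_repeat)  # 1,2,3, 1,2,3, 1,2,3, 1,2,3, 4,5,6, ...
```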
So what kind of _information_ do we need to dynamically construct those iterators with full generality? An eachindex API would just need to know the ordering of dimensions. couple needs to know which indices are used in each array so it can arrange for an ordering and wrap iterators appropriately.
for (iD, iA, iB) in couple(D, (:i,:j), A, (:i,:k), B, (:k,:j))
    D[iD] += A[iA]*B[iB]
end
Now with this API, I can finally begin to imagine how coupled would work. It can try to find an ordering of ijk that satisfies the most arrays, then request the ordered index iterator from each array, then wrap it with the required iterator to reflect the chosen for-loop structure. Of course, with three-dimensional arrays, there'd also be partially optimal orderings. And this will almost certainly be dependent upon array size, too. Can we make JuMP a dependency of array iteration?

I think this is the fully general, fully orthogonal design. If we break this orthogonality, I'm afraid that we'll never stop writing different flavors of eachindex iterators for every single array type.
Can we make JuMP a dependency of array iteration?
I was also wondering if we'd need an optimizer. I doubt that's a direction we want to explore, unless it would also be useful for making more rational decisions about automatic inlining (which it surely would be). Punting to Polly, by marking a block or function with a future @polly hint, seems like a better way to handle this problem.
@mbauman, focusing on the reshaped-vector part of your comment above, a lot of the concerns can be resolved if we have an API for iterating along a particular dimension, aka #15459. This is a different form of orthogonality than you're talking about, and it would resolve the need for many different variants of repeat. It might also be easier to write: e.g., for a SparseMatrixCSC, as demonstrated in #15459 the notions of a "column iterator" and "row iterator" are much more straightforward than supporting some kind of repeat iterator that can start and stop anywhere. For ReshapedArrays, single-dimension iterators are annoying but straightforward; as described in the blog post, even when they require a div for construction it greatly reduces the cost of the whole loop.
A second issue is that fully-orthogonal iterators may not generalize well to efficient sparse iteration: you might want your sparse array iterator to "jump" to the next stored entry, and you might need to correspondingly advance any coupled iterators. This kind of thinking is what first suggested to me that something like couple is an important addition to the API: zip provides complete independence, but in zip(eachstored(A), eachstored(B)), if A is sparse and B is dense, the two will rapidly get out-of-sync. In contrast, in couple(eachstored(A), eachstored(B)), the state for B might be updated in a manner "slaved" to the changes in A. It would be nice to make this symmetric, so that no one iterator is "in charge," but I haven't thought that through yet. If it is symmetric, then couple isn't actually expressive enough on its own: in the example above, if A and B have different numbers of stored elements, should it visit the locations of only those that are stored in both? Or stored in either one? At a minimum there's an ||/&& problem here.
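To make the &&/"stored in both" flavor of that question concrete, here is a toy merge over two sorted index lists (the kind of thing rowval of two sparse vectors would give you); the ||/"stored in either" case would be the corresponding merge-union:

```julia
# Visit only positions stored in both ia and ib (both assumed sorted).
function stored_in_both(ia::Vector{Int}, ib::Vector{Int})
    out = Int[]
    p, q = 1, 1
    while p <= length(ia) && q <= length(ib)
        if ia[p] == ib[q]
            push!(out, ia[p]); p += 1; q += 1
        elseif ia[p] < ib[q]
            p += 1
        else
            q += 1
        end
    end
    return out
end

stored_in_both([1, 4, 7, 9], [2, 4, 9, 11])  # -> [4, 9]
```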
I suspect that array-iterators need to be a specialized class, supporting more operations than just start, next, and done. Thinking about the sparse case, one candidate is skip, where the skip-value could be positive or negative. This won't be terribly efficient if skip requires calling div (which it would for something like a ReshapedArray). But I think that's OK; we can't have everything. I'll be happy with 90% of everything :smile:.
In my mind you would express couple(eachstored(A), eachstored(B)) as either union or intersect depending on the operation being performed.
Thanks to #13412, I'm noticing that this is also possible: couple(any, x, y) or couple(all, x, y). I'm not sure you'd really use the any and all functions directly, but you can dispatch on them:
@eval foo(::$(typeof(any))) = 1
@eval foo(::$(typeof(all))) = 2
To help see what different APIs "feel like" for writing code, as an alternative to endlessly creating gists I opened ArrayIterationPlayground and invite collaborators. This doesn't (yet) contain any implementations of any API, but there's one file that implements a couple of algorithms using a fictitious API.
I'd recommend we keep discussion mostly here, though, since it's presumably of general interest.
The design space here is just ginormous even before you begin considering sparse arrays. I'm trying to bring a dose of reality to all the magic that's flying around. :)
My knee-jerk reaction is that trying to extract a single cartesian index out of a given iterator to share with another array will prove to be intractable to code generically. There are two operations that may or may not be possible: eachindex iterators may or may not be able to quickly compute the cartesian indices, and the array that the index is shared with may or may not be able to use a cartesian index (especially if you consider OffsetArrays, but I'm still not convinced we should support them without requiring a special offset index type). So you've got to support independent iteration no matter what.
I don't think you want to express this as foo[:, ?], since, to my reading, that'd require foo (be it an array or index iterator) to conjure up a wholly independent iterator object. Better to pass the shared index information separately and let couple sort it out.
As far as eachindex over particular dimensions goes, I'm in support. One possibility to allow for non-column-major access is to support something like eachindex(A, Order{2}(:), 42, Order{1}(2:size(A,3))) — that'd iterate over (1, 42, 2), (1, 42, 3), (1, 42, 4), …, (2, 42, 2). Expressing this with dispatch in a sensible way worries me… as does specializing it for each iterator type. And I don't think it helps couple simplify any of the work it requires, though, since it's still got to start and stop these iterators at the right points for a flat loop to work.
All that said, I'd be very happy to be proven wrong.
Reality is good. I agree that the challenge here is to define the impactful contribution that's feasible and doesn't prevent more progress by future generations.
The implementation of foo[:,?] I had in mind is just
immutable AnyEntry end
const ? = AnyEntry()
# Typically AA<:AbstractArray, but not sure I yet see a reason to require that
immutable DimSelection{AA,D<:Tuple{Vararg{Union{Colon,AnyEntry}}}}
    A::AA
end
@inline getindex(A, I::Union{Colon,AnyEntry}...) = DimSelection{typeof(A), typeof(I)}(A)
which is just basically a syntax for passing it on, as you say. I'm not sure whether it's a readability-enhancer, however, and the explicit eachindex is definitely on the table. (It's basically a "lazy eachindex," since it doesn't yet commit to which particular iterator type gets chosen.)
And yes, writing couple is going to be hard. Before trying that (with couple or any other API), I'm beginning to think the best thing is to see how well the API works before trying to implement it. The recent PRs from several kind folks addressing the TODO list in #15434 are a huge asset, since they help highlight "interesting" iteration cases.
To take out a little of the magic, I'm currently imagining something vaguely like this:
immutable CoupledIterator{I,F<:Tuple{Vararg{Function}}}
    iter::I
    itemfuns::F
end
This has a "master" iterator iter
, and a tuple-of-functions that produces the item
for each of the coupled iterators. next
might be defined generally as just
next(iter::CoupledIterator, state) = map(iter.itemfuns, state), next(iter.iter, state)
or something. That map is mapping a tuple of functions over a single object state, which is kinda the opposite of our usual meaning, but hopefully you get the drift. The point is, this produces an item for each of the arrays that went into the iteration. Since it can be of any type, it has at least the potential to allow for efficient access.
As an example, for sparse matrix-vector multiplication A'*v (so we're taking a transpose), I could imagine that couple(all, eachstored(A, (:,j)), v) would have a default fallback behavior of something like
function couple(::typeof(all), iter::SparseCSCColumnIterator, v::AbstractVector)
    CoupledIterator(iter, (state->state, state->state.row))
end
This doesn't work out all the details, but it suggests that couple maybe isn't as hard as feared. What this achieves is "slaving" the index for v to the iterator that only visits the stored items of A.
Now, if v is a ReshapedArray or itself sparse, this won't be optimal. I'm prepared to live with that in exchange for the benefit that one gets from handling the sparse matrix sensibly. I'm hoping that this kind of foundation will (1) give a big improvement over current behavior, (2) may not be an impossible dream, and (3) might be an API that's flexible enough that it won't stand in the way of anyone wanting to tackle more ambitious cases some day.
Does that help? Or did you already read my mind (so this is all old news to you) and then see another few miles ahead to big scary roadblocks?
It does help some, yes. I had been thinking you were prioritizing reshaped and offset arrays over eachstored.
I think the couple that you propose in that latest example can be more generally expressed as:
couple(iter, ::Union{Val{:row}, Val{:col}}...) # or dim numbers, or in the value domain, doesn't matter
Now you can also get the column index for the destination vector: (ia, iv, idest) = couple(eachstored(A), Val{:row}, Val{:col}). Heck, make those Val arguments tuples and now you can allow coupling A element-wise to another matrix of the same size (or transpose). That's tractable. But only A gets special handling. All other arrays used here get indexed by cartesian indices.

Where I get all fuzzy is how to extend this to algorithms like gemm, where there's no one array in charge of the entire triplet of loops. Or in allowing the slave arrays to have a say in the kind of iterator they're indexed by. It's those cases where I think I can see a combinatorial explosion of methods with crazy ambiguity problems coming over the horizon.
OK, over at ArrayIterationPlayground I've written out a number of algorithms in a proposed API, and that experience seems to suggest we're converging. (It's possible I wasn't consistent, as my API plans evolved as I tried new things.) I found that this API may suffice to implement things as complicated as cholfact!, provided that we only care about arrays with numeric indexes and not arrays indexed as A[:dog, :cat].
Here's a summary of where I am now:
inds(A, d) # index vector for dimension d (default 1:size(A, d))
inds(A) # tuple-of-inds
zeros(Int, -3:3, -1:5) # creates matrix of size 7-by-7 with the given inds
fill(val, indexes...) # likewise
a[first+1:(last+first)÷2] # copy (or view) of the first half of array, skipping the first element
# `index` and `stored` return "indexing hints," the laziest of wrappers
index(A, :, j) # lazy-wrapper indicating that one wants indexes associated with column j of A
stored(A, :, j) # just the stored values of A in column j
index(stored(A, :, j)) # just the row-indexes of the stored values in column j
index(A, :, ?) # row-index iterator for an arbitrary (unknown) column of A
index(A, Dimension{2}) # similar to index(A, ?, :, ?...)
# Likely: (notice this is different from sub, because this keeps original indexes)
index(A, first+1:last) # would iterate over 2:length(A) for a typical Array
index(A, first+1:last, j) # like above, but just over indexes for dimension 1
couple(iter1, iter2) # iterates over iter1, iter2 containers, keeping them in sync
# Do we want/need this? I suspect not (default would be `any`)
couple(any, stored(iter1), stored(iter2)) # visits i if either iter1[i] or iter2[i] has a value
couple(all, stored(iter1), stored(iter2)) # visits i if both iter1[i] and iter2[i] have values
icat(a, b) # index-preserving concatenation
# For example:
# icat(7:10, 3) 7:10 is indexed with 1:4, so this creates a vector indexed from 1:5 (numbers aren't tied to an index)
# icat(3, 7:10) creates a vector indexed from 0:4
# icat(5:7, 2:4) is an error, because they have overlapping indexes 1:3
# icat(5:7, OffsetArray(2:4, 4:6)) indexed from 1:6
# icat(5:7, OffsetArray(2:4, 5:7)) an error, non-contiguous indexes
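Side note on the first two lines of that summary: the "index vector for dimension d" concept corresponds closely to what is nowadays spelled axes(A, d), and packages like OffsetArrays show what a non-1-based index vector looks like in practice (purely illustrative, not part of the proposal):

```julia
using OffsetArrays

A = OffsetArray(rand(3, 4), -1:1, 0:3)   # a 3×4 array indexed by -1:1 and 0:3
axes(A, 1)   # spans -1:1, not 1:3
axes(A, 2)   # spans 0:3
A[-1, 0]     # valid; A[1, 1] is still in bounds here, but A[3, 4] is not
```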
Other than the Order type proposed by @mbauman above, I think this incorporates most of the feedback so far. I might prefer to tackle storage order in a second pass, since this is already peering pretty deeply into the crystal ball. And by having iterators for specific dimensions, that may be less relevant.
I've got a lot of teaching to do in the immediate future, so am unlikely to make much further progress for a while. But unless I hear otherwise I may slowly start implementing pieces of this. It's a lot of new exports, so obviously this deserves some thought.
This looks interesting. I'll probably implement them as well for https://github.com/eschnett/FlexibleArrays.jl.
Painting the bike shed: The name couple is too generic; it doesn't indicate how they are coupled. (When I read it first, it evoked a "truck-and-trailer" coupling, which would sequence the iterators -- also a nice idea.) Given that the function takes two iterators as arguments, it's kind of obvious that it combines them in some way, so that doesn't need to be part of the name. You could call it parallel, par, insync, synced, simultaneously, multi, union, intersect, zip, zipindex, izip, ...
Thanks for reminding me about FlexibleArrays; I knew I had been told about another "offset" array indexing package out there and just forgot what it was called and who wrote it. It's obviously further along than the one I've been referring to.

Anyway, on the bike shed: I agree couple is too vague, and I really like both insync and synced. zip is out because it already has a meaning different from couple: the iterators increment independently of one another, and that's the opposite of the intent here.
I'm not sold on the complexity of couple being necessary. But I haven't put anywhere near as much effort into this as you, so I'll just note my concerns. However, it's not clear to me that you would be able to define this functionality in a general way. Even if you could achieve 80% of the speed of a raw array in the basic case, I still think you are giving up the possibility of vectorizing and blocking in the kernel. At that point, I believe it becomes faster to make a copy first. That also means that the kernel library author only needs to ensure their algorithm is optimized (and tested) for regular Arrays and the array-subtype library author only needs to ensure that they have an optimized copy method.
@iterate B dest[i] += B[i,j]*v[j]
I'm fairly certain this is just special syntax for broadcast (with an accumulator?) or, more generally, another implementation of @devec?
Likewise, copy! might be implemented
My sense for copy! is that the most sensible definition is:
for x in keys(src)
    dest[x] = src[x]
end
which is specifically not an iteration over dest (except in the common array case where the keys of dest were the same as the keys of src).

The main place you need something like couple is in generalizing sparse matrix algebra. If we don't have a generic iteration framework, we need to implement special methods for each of, e.g., Tridiagonal*Bidiagonal, Bidiagonal*Tridiagonal, Banded*SparseMatrixCSC, etc. Conversion to full is very expensive for such matrices.
Yes, now that I think arrays should be thought of as associative containers, I agree that's the right meaning of copy!. In my current thinking, keys(src) is equivalent to index(stored(src)), and could be supported for arrays (which it isn't currently).

keys is particularly nice in that it doesn't guarantee an order to the iteration, possibly pointing a path towards a row-major solution.
If we don't have a generic iteration framework
I'm not against that; I'm only questioning whether couple is better than zip. From the top post, I understood that zip should be able to implement the algorithm, but might do a bit of extra work when computing the iterators? Re the initial motivating example of ReshapedArrays, I think the copy-then-kernel or generate-then-copy approach may be faster (and as a bonus resolves the aliasing question).
I should have clarified that your copy! implementation is what I think of as the right "meaning" of copy!, with one potential caveat: right now we support copy!(dest::Array{T,2}, src::Array{S,3}) as long as length(dest) >= length(src) and convert(T, s::S) works. A strict notion would probably favor forcing this to become something like copy!(vec(dest), vec(src)) or something (where vec creates a view). That would require that the array's indexes can be meaningfully converted to a linear index, which then gives a point of correspondence between the two arrays. Of course, for iterating through them, there will be cases where you might like something more efficient than a linear index. (For example, copy!(sub(A, 1:end-1, 1:end-1), src).) It's also a good example of where "just make a copy, that solves most performance problems" isn't viable: it might work for src, but not for dest.
Regarding couple vs zip, I guess my point isn't that one is better than the other, but that we need both because we want different behaviors in different circumstances. For something like computing ::SparseMatrixCSC * ::AbstractMatrix, I'm struggling to see how to use our regular zip---perhaps you can explain? As I'm seeing it, for zip I think the contract is that each iterator is independent of the other, and so I don't see a good way to skip over the non-stored elements of the first array without correspondingly "advancing" the iterator for the second.
Unless you're proposing that zip get some keyword arguments endowing it with flexible behavior? Something along those lines seems fine to me; I'm not trying to argue we need a new name, just that we need new functionality.
I also edited the top post to try to reduce confusion from the fact that the API proposal is being hammered out here, and hence the top post isn't very representative of my current thinking.
I haven't been much involved in various stages of iterator development, but looking at #1032 sparked an idea: what if there was a callback function that each iterator gets passed to before get/setindex? This would abstract away the problem of iterators needing to know about things like whether the array indices start at 1, 0 or some other number. The iterator can say "I point to the 3rd element in the 5th row" and the callback function is responsible for transforming this into a value that get/setindex! will accept.
This could solve:
The generic fallbacks for this callback function would define the "standard" array implementation (1-based indexing etc.), but users could provide specific methods for new AbstractArray subtypes to get different behavior.
That's basically what the status quo is with cartesian indices. In fact, when you index an Array with cartesian indices, it eventually gets reduced to a linear memory offset. There's a little extra overhead involved compared to just iterating over the linear range 1:length(A), but it's not terrible. Our array infrastructure was built upon the assumption that indexing with cartesian indices is relatively fast — either it's the fastest possible way to index into the array (what we call LinearSlow), or it can be converted to a linear index without much overhead (for LinearFast arrays).
That's not true, though, when ReshapedArrays wrap LinearSlow arrays. Accessing the elements of reshape(sprand(2,3,1.0), 6) as though it's a vector, for example, requires using division to figure out that the fifth element is actually pointing at (1,3) in the parent array. That's dog-slow, especially if you have to do it in every single loop. But if you know ahead of time that you're just going to iterate over all the elements in column major order, then you can be clever. With reshapes, the cleverness is just that the order of the elements is the same, regardless of the shape. So if you ask for an iterator over all indices, then ReshapedArray can just wrap the parent array's iterator, and it'll be just as fast as if it weren't reshaped at all. You, as the user, don't even need to know how it worked this magic: eachindex(A) just does it for you.
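A minimal way to see the kind of wrapper being described, using nothing but Base (PermutedDimsArray stands in here for any LinearSlow parent):

```julia
P = PermutedDimsArray(rand(3, 4), (2, 1))   # a 4×3 IndexCartesian ("LinearSlow") array
R = reshape(P, 12)                          # a Base.ReshapedArray wrapper, not a copy
R[5]   # every linear access like this has to divide to find the parent index (1, 2)
```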
The trouble is that you don't always want to iterate over all indices, and not always in column-major order. If you're able to describe the access pattern up-front when you request the index iterator, then it allows for cleverness of this nature, even in more complicated nested loops with multiple arrays. And if we're able to always ask arrays for index iterators, then maybe this can safely allow offset arrays to define non-standard Int indexing.
I see, thanks for the clear explanation. ReshapedArrays are a much harder problem than I was envisioning.
ReshapedArrays are a much harder problem than I was envisioning.
Only if you want to avoid performance regressions. If you don't care about that, they're really easy :smile:.
I came across @Jutho's TensorOperations.jl. The correspondence between the APIs proposed here and that in TensorOperations is fairly clear. As far as the DSL part of the solution to this issue goes, TensorOperations' DSL seems to be good at expressing loop-ridden algorithms without committing to any implementation scheme or even a commitment to what the allowed indices are.
I wanted to point out this excerpt from the README that talks about a clever implementation strategy:
For add!, trace! and the native implementation of contract!, the problem is recursively divided into smaller blocks by slicing along those dimensions which correspond to the largest strides for all of the arrays. When the subproblem reaches a sufficiently small size, it is evaluated by a separate kernel using a set of nested for loops. The implementation depends heavily on metaprogramming and Julia's unique @generated functions to implement this strategy efficiently for any dimensionality. The minimal problem size is a constant which could be tuned depending on the cache size. The modularity of the implementation also allows to easily replace the kernels if better implementations would exist (e.g. when more SIMD features become available).
I think parallel implementations can exploit facilities of this divide-and-conquer strategy as well. Food for thought.
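To illustrate the shape of that strategy (this is a toy sketch of mine, not TensorOperations' actual code, and it splits on the larger extent rather than on strides), a recursive blocked element-wise addition might look like:

```julia
function addblocks!(dest, src, I=axes(dest, 1), J=axes(dest, 2); cutoff=64)
    if length(I) * length(J) <= cutoff
        # small enough: plain nested loop over the block
        for j in J, i in I
            dest[i, j] += src[i, j]
        end
    elseif length(I) >= length(J)
        h = (first(I) + last(I)) ÷ 2            # split the row range
        addblocks!(dest, src, first(I):h, J; cutoff=cutoff)
        addblocks!(dest, src, h+1:last(I), J; cutoff=cutoff)
    else
        h = (first(J) + last(J)) ÷ 2            # split the column range
        addblocks!(dest, src, I, first(J):h; cutoff=cutoff)
        addblocks!(dest, src, I, h+1:last(J); cutoff=cutoff)
    end
    return dest
end

addblocks!(zeros(200, 300), rand(200, 300))
```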
Perhaps I am way off base on this one, but it seems to me that in its most general form this is an attempt to re-invent database joins, especially in the case of sparse matrices. But it has been known for some time that for large numbers of indexes (i.e. rows) it will be hard to beat hash tables on the specific (fields/columns) indexes being joined.
Essentially sparse matrix multiplication is a form of inner join, and sparse matrix addition is a form of outer join.
Put another way, while it may seem you want to march both arrays in some form of "simultaneous" manner, in the most general case, where you give up all prior knowledge of the memory layout, you would want to read one array in the most efficient linear manner possible, and then the other array, while using a pair of hashes that are dual to one another (e.g. hash 1 maps in column major order and hash 2 maps in row major order).
Ref. https://github.com/JuliaLang/julia/blob/296d5728a2974ac1e3efaf358288c40b05e3abee/stdlib/SparseArrays/src/higherorderfns.jl#L32 for a little work in this direction. (Namely a hacky common interface for iteration over SparseVectors and SparseMatrixCSCs sufficient for the purposes of sparse map and broadcast.) Best!
Is this issue still relevant after the revision of the iteration protocol?
Skimming quickly, I don't think it had anything to do with the protocol, but it may have been enabled by the changes to the broadcast framework, and perhaps motivates future work such as "For Loops 2.0: Index Notation And The Future Of Tensor Compilers" by Peter Ahrens: https://www.youtube.com/watch?v=Rp7sTl9oPNI

I don't think we need an issue open though, if there's nothing to change specifically.
(Oops, I've been preparing to post a possible new solution here then I just noticed that it's closed. But please allow me to continue the discussion. I think "how to use the best/a good-enough iteration strategy for a given set of containers?" is still an issue.)
I'd like to suggest an approach to this problem "dual" to the iterator-based APIs: use foldl (or reduce) and foreach derived from it for complex loops such as the ones discussed here. The basic API I came up with is:
using NDReducibles: ndreducible
using Referenceables: referenceable

# Implementing C += A * B
foreach(
    # Create a "virtual container" (reducible) where the indices are coupled
    # based on the specification below:
    ndreducible(
        referenceable(C) => (:i, :j),  # mark that we want to mutate C
        A => (:i, :k),
        B => (:k, :j),
    ),
) do (c, a, b)
    c[] += a * b
end
This tries to choose the best nesting order of the for loop. ATM, it only supports DenseArray and the like, composed with Transpose/Adjoint and Broadcasted. A few more demos and docs can be found here: https://tkf.github.io/NDReducibles.jl/dev/ (and https://tkf.github.io/Referenceables.jl/dev/). Note that, as you can see, I haven't optimized the terseness of the syntax yet.
This API is based on Transducers.jl's version of foldl. It means that it is as powerful as the iterator-based protocol in terms of the semantics. That is to say, things like early termination ("break") and filtering also work. However, by completely going to a "callback"-based API, the "loop executor" has much more control than with the iterator-based API. For example, it's possible to implement different parallel execution strategies based on the given set of containers (without macro magics). Fancy iterations like ones based on space-filling curves are also possible. In fact, a callback-based API naturally fits well with the divide-and-conquer approach. Furthermore, it allows you to define copying-to-Array-first as a valid strategy if the combination of the given custom arrays is too complicated to handle (of course, it may need to do a quick memory estimate and use this path only when the copies are not huge). Similarly, references to elements of arrays (such as c[] above) do not have to directly invoke getindex/setindex!. This could be an interesting optimization when none of the indices for the destination array(s) are used in the inner-most loop (i.e., read once in an outer loop, put it in a RefValue, use it in the inner loops, and then write it back in the outer loop). [1]
One point of view is that this API is a generalization of the broadcasting API. That is to say, copyto! can be derived from foldl or reduce but not the other way around. So, I think it is worth considering formalising foldl/reduce as the basic interface that is used from broadcasting as well as mapping, filtering, and reduction. Then, complex loops like GEMM can also be derived as a foreach (side-effect-only mapping).
[1] The last few points are actually doable in a zip-like iterator-based interface as well (but not with a foreach-like interface).
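As a small sketch of the "copyto! can be derived from foreach" claim, using only Base (the actual proposal above builds on Transducers.jl's foldl instead):

```julia
function mycopyto!(dest, src)
    foreach(zip(eachindex(dest), eachindex(src))) do (i, j)
        dest[i] = src[j]
    end
    return dest
end

mycopyto!(zeros(3), [1.0, 2.0, 3.0])
```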