Julia: `stack(vec_of_vecs)` for `vcat(vec_of_vecs...)`

Created on 2 May 2017  Â·  54Comments  Â·  Source: JuliaLang/julia

This comes up periodically and it's a bit embarrassing that our official answer right now involves a inefficient splatting. We should have a function – maybe call it stack? – that efficiently does what vcat(vec_of_vecs...) does without the splatting. It should also stack in slices in different dimensions so that you can either stack slices as rows or columns, etc.

arrays

Most helpful comment

To echo a previous comment, it would be nice to handle the empty case appropriately: just as reduce(+, Int[]) returns 0, so should reduce(vcat, Vector{Int}[]) return Int[]. An operation as basic as concatenating a vector of vectors shouldn't be so finicky!

All 54 comments

And return an Array or a custom array? Either way, there's a simple implementation here: http://stackoverflow.com/a/37488453/176071. In this context I'd swap the index ordering so it's column-major contiguous.

For the base implementation, this would just convert a vector of vectors to matrices, as a specific case. I think whether it stacks in column major or row major order should probably be controlled by a dimension argument. There are generalizations one can imagine doing here.

There's a general issue with the names of reduction operations on collections of things (e.g. one-argument join) are sometimes different from the names of operations on things given as varargs (e.g. string, which is effectively a splatted-out one-argument join), even if they are conceptually the same operation.

Examples are:

  • maximum (collection) vs. max (splat)
  • minimum vs. min
  • sum vs. +
  • prod vs. *
  • join vs. string
  • append! vs. push!

These are not so bad because the names are similar and standard, but vcat and stack would exacerbate the problem, because stack is certainly not a standard name for this operation.

In my opinion, a better name for stack is simply reduce(vcat, ...). This is inefficient because of substantial copying, but it would not change the semantics to define a new overload of reduce specialized on typeof(vcat) so as to avoid the excessive copying, as vcat does presently. Overall there should be no performance improvement from using a dedicated stack function versus a specialized reduce(vcat, ...) with the same code. This has the benefit of not requiring additional nomenclature to be learned.

For example, the task of "make all these vectors xss row vectors, then stack them vertically" translates to

reduce(vcat, map(transpose, xs))

which is quite readable, combines existing vocabulary and is nevertheless (thanks to RowVector) possible to make efficient through adding a specialization to reduce.

FWIW stack is already a standard term in the data frames/data tables world (and not only in Julia). See the DataTables docs. So I'd rather use a different name -- or go with @TotalVerb's suggestion.

I agree with @StefanKarpinski that we need a single-word function for this.

What about flatten?

FWIW, numpy uses stack for this operation.

flatten is a quite distinct operation that already exists.

The function stack could be used to mean one thing with arrays and another with data frames/tables – I don't think the operations are really at odds with each other, although IIUC the data table operation might be better called vcat.

I've been wondering if it's better to first support a "view" which you can copy, rather than the reverse process (where we have a function in Base which makes a contiguous copy, and later on we decide it would be nice to have a view for such an operation).

Thus, we could potentially have

m = NestedArray([[1, 2], [3, 4]]) # A 2x2 AbstractMatrix{Int} which is a "view", perhaps called `StackedArray`
copy(m) # contiguous in memory, `Matrix{Int}`. Perhaps `Array(m)` would be clearer/allow possibility for `similar` to return a nested structure.

and no stack function.

See also CatViews.jl by @ahwillia.

The function stack could be used to mean one thing with arrays and another with data frames/tables – I don't think the operations are really at odds with each other, although IIUC the data table operation might be better called vcat.

Yes, that's certainly possible. I just wanted to mention this because if other terms are available it would be better not to mix these two different operations (I wasn't aware of the Numpy precedent).

FWIW, data frames/tables already implement vcat, but stack is a a very different operation (which operates on a single data frame/table).

Another idea that occurred to me yesterday is to just have specialized methods for reduce(vcat, Vector{<:Vector}) and reduce(hcat, Vector{<:Vector}). It doesn't generalize, but it may be sufficient. There's no reason not to have optimized methods for those specific cases anyway.

What does stack do on data frames? From the name, I would expect it to stack on data frame on top of another with the same columns.

I think stack on dataframes is the same operation that in R is called gather. Correct me if i'm wrong. See this cheat sheet from RStudio.

Another idea that occurred to me yesterday is to just have specialized methods for reduce(vcat, Vector{<:Vector}) and reduce(hcat, Vector{<:Vector}). It doesn't generalize, but it may be sufficient. There's no reason not to have optimized methods for those specific cases anyway.

Yes, that's what @TotalVerb proposed above, right?

What does stack do on data frames? From the name, I would expect it to stack on data frame on top of another with the same columns.

This is vcat. stack rearranges rows and column in a data frame/table by creating one column for each level of a variable. It's a bit hard to describe in a few words, but see this example. That's indeed called gather or melt in the recent iterations of Hadley Wickham (from RStudio)'s R packages, but the term in base R and in Pandas is stack. Now that several variants of the name are commonly used, it's not very clear which terminology is the most standard though...

Correct me ifI'm wrong, but I thought the "stacked" form of a data frame was a bit like a sparse representation of the table, storing (row, column, value) tuples. (Or maybe that was called something else?)

Yes, that's more or less the result of stack, i.e. only three columns giving variable name, level and value, and which can be converted to one column for each variable (plus the value column) using unstack.

It's rather unfortunate that we can't use stack, which is the most natural name for this operation, because the name is used for an operation on data frames for which the term "stack" is neither standard nor intuitive. I'm somewhat inclined to "take" the term and let data frames use something more intuitively appropriate like gather or melt for that operation. I always liked the "melt" / "cast" terminology from R's reshape2.

Sure, the existence of the data frames/tables function shouldn't be a blocker if stack is natural in the present context. I'd need to do more research to check which of stack, gather and melt is the most common terminology (and most likely to remain so in the future) for data frames. BTW, we already have melt, which is equivalent to stack but with a slightly different interface (for those interested, the original discussion is here).

Sorry, I didn't expect my original comment to generate such a long sub-thread.

I'm not sure stack is really such an ideal name here; it's fine, but I don't think it's all that obvious or meaningful. I was hoping to find some inspiration from the myriad of SO posts asking about this, but they almost all use verbiage like convert or transform. I don't particularly like pigeon-holing this to just reduce(hcat given that it's so easy to generalize with Cartesian iteration such that the resulting size is (size(A[1])..., size(A)...).

One possible alternative: unnest. This makes it clear that it's operating on arrays of arrays and evokes the imagery of transforming A[i][j] to A[j, i].

I think I'd also lean towards doing the Array copy in base, with a package like CatViews or GrowableArrays supporting the custom array type.

The term "stack" makes sense to me because you're taking a bunch of separate vectors/slices of some shape and stacking them along an axis to create a structure with the same shape but an additional "stacked" dimension. I agree that it makes less sense when you're thinking of it in terms of indices, where nesting and unnesting is clearer – that could be a better way to think about it.

One way to express this could be how you'd like indices to be nested in the output. I.e. you could write nested(A, (2,(1,3))) to transform A so that the output value is indexed as B[j][i,k] where, if the original input was a vector of vectors of vectors, it would be indexed as A[i][j][k]. Indices would implicitly be numbered and any shape of input would be allowed, so the input could be A[i,j,k] or A[i][j,k] or A[i,j][k] or A[i][j][k].

The B = nested(A, (2,(1,3))) thing seems pretty cool. Would it return a view or a contiguous copy?

Assuming it's a view - one gotcha is that (these days) indices for different array types and dimensions could be of different types, so the (2, (1, 3)) part may need to be recognized as a compile time constant so that indices(B) (and even the internals of getindex, if different indices use different types of Integer) is type stable. E.g. Perhaps in the end you want to create a NestedArray{T, 3, (2, (1, 3))} <: AbstractArray{T, 3}) or something, a bit like how PermutedDimsArray stores the permutation as a type parameter (I'm now looking at it's outer constructor here and wondering if it is type stable?).

I think I'd also lean towards doing the Array copy in base, with a package like CatViews or GrowableArrays supporting the custom array type.

Right, this functionality seems perfect for being developed upon outside of Base and merged when we all like the API. To support custom indices, perhaps dense (or full) would be a good generic way to go from the (nested) view to the (stacked) copy? (unless we are happy to have similar always drop the nesting, in which case copy is natural).

I think the most straightforward approach here would be to optimize these two operations in Base:

julia> vov = [[1,2,3],[4,5,6],[7,8,9]]
3-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5, 6]
 [7, 8, 9]

julia> reduce(hcat, vov)
3×3 Array{Int64,2}:
 1  4  7
 2  5  8
 3  6  9

julia> mapreduce(transpose, vcat, vov)
3×3 Array{Int64,2}:
 1  2  3
 4  5  6
 7  8  9

Since these operations already produce the desired result, having optimized implementations for these and then telling people to use these as the idiom for doing this would work nicely without any risk of breaking anyone's code. There's also no reason not to optimize this. Any more sophisticated functionality can be developed out of Base in a package.

Summary: the action item here, rather than introduce any new verbs into the standard library, is to write optimized methods of reduce and mapreduce for arrays of arrays with vcat and hcat as reducers.

While at it, can we have (hpush!) ( or with underscores push_column!)
and vpush! (i.e. push_row!) ?

Perhaps that should be a separate issue, but it is closely linked,
implementation is very similar to what needs to be done to make the optimised reduce(hcat...

There aren't any resizeable multidimensional arrays in Base, is there?

(It would be a good feature for a nested array view/wrapper type, though!)

There aren't any resizeable multidimensional arrays in Base, is there?

Yes, but there are resizeable 1D Arrays.
push! adds an element to a Vector.
and by mixing reshape and push!, you can do this in 2D

(It would be a good feature for a nested array view/wrapper type, though!)

Sure, would.
Make a PR to CatViews.jl, that would kick-arse

and by mixing reshape and push!, you can do this in 2D

Correct me if I'm wrong, but I thought there's something in the implementation that prevents resizing if there exists a multidimensional view (being an Array, from reshape, not a something like a SubArray) of the same memory. (or perhaps doesn't prevent resizing, but does perform some unexpected memory copying that would make this less efficient, I can't remember exactly).

Make a PR to CatViews.jl, that would kick-arse

:+1:

TensorFlow calls this operation stack
so there some prior to it.

Further before shortly 1.0 release they renames it to stack, from its old name pack

In particular for larger than two-dimensional arrays such an explicit function would really help entering data.

From the already existing function names my personal favourite would be merge: It translates fairly easy to other (spoken) languages; unnest or stack are not exactly common verbs, and flatten and reshape somewhat imply a modifying indices. It also matches how I would probably explain it: merge(a) merges an array of arrays into a single array while keeping the total number of dimensions, e.g. a vector of vectors becomes a matrix.

I can also imagine an optional output container type as first argument, say merge(Array, a) since in (most?) cases data must be copied anyway. This would also give the possibility to say e.g. merge(ArrayView, a) to generate some view.

Merge is a really overloaded term and it's pretty unclear to me how this is a merge operation.

For the base implementation, this would just convert a vector of vectors to matrices, as a specific case. I think whether it stacks in column major or row major order should probably be controlled by a dimension argument. There are generalizations one can imagine doing here.

Indeed this is the way to go, mostly because there are a ton of design decision for how to actually use a Vector of Arrays. A good example of this is the VectorOfArray type in RecursiveArrayTools.jl (courtesy of @gabrielgellner).

https://github.com/JuliaDiffEq/RecursiveArrayTools.jl
https://github.com/JuliaDiffEq/RecursiveArrayTools.jl/blob/master/src/vector_of_array.jl

This is the backbone of all JuliaDiffEq solutions, the replacement of GrowableArrays, and there are definitely some decisions to make there, like how to iterate and count length (preserving the vector idea? Or rejecting that and now iterating as a matrix?). And there are still things to work out, like what is the appropriate norm to use?

So the rules one would like to impose on a view for an object like this is very application-specific, which is why I think it's probably best to leave it to packages like this (and CatViews) to implement.

(If you check the history, you'll also find earlier versions which check for reggedness and all sorts of things. We decided that was unnecessary and will just make the assumption it's not ragged in cases where it's not just using the indexing. Again, the design-space here is huge.)

tidyr has unnest for something similar to this. Also, dplyr has bindrows, which works both on individual dataframes and a list of dataframes

This is an optimization / feature issue so it can be punted to a 1.x release.

There are other functions that have the same problem. For instance merge for merging Dicts. Doing merge(array_of_dicts...) is prohibitively inefficient for large arrays. In this case, a new method for merge should suffice.

Of course, this can wait til after 1.0 is released. OTOH, I could make a PR, since the change should be easy and won't break stuff.

reduce(merge, array_of_dicts) would work here too, right?

Please make a PR if you feel like it, it would be great to have an efficient solution in 1.0 given how often this pattern appears.

reduce(merge, array_of_dicts) would work here too, right?

No, unless I am missing something. reduce is not a varargs function, but merge is. The problem is splatting a huge number of dicts as varargs. I searched _Collections and Data Structures_ for .... But, not all varargs functions have this problem. Eg for map(f,iters...) splatting a huge array of iterators is probably not a good idea. The only candidates I found are:

union(s1,s2,...) yes. Taking the union of 10^6 small arrays with many duplicate elements sounds reasonable. Likewise intersect, and maybe symdiff.

push! yes, but that's exactly what append! is for, so no.

unshift!, yes, and I don't see a counterpart to append! so yes. (I did not do a thorough search).

I can start with a PR for merge.

reduce is not a varargs function, but that's the point: you pass it a collection of arrays/dicts rather than splatting them, which eliminates the cost of splatting too many arguments.

Sorry, I misunderstood. Didn't realize you were giving reduce as an example that does not have this efficiency problem.

To be clear: the solution which appears to make consensus above is to use reduce(vcat, vec_of_vecs) instead of vcat(vec_of_vecs...). The same approach could probably be adopted for merge.

@nalimilan I re-read and understand that the consensus is to make reduce(merge, array_of_dicts) efficient. But, it turns out the the inefficiency was in helper function for merge. I fixed this, so that
merge(ten_thousand_dicts...) is now efficient.

I can make a pull request. But, PRs are supposed to include one or more tests. I wonder what kind of test to include. The most unpleasant aspect of the previous behavior is that 10^4 dicts will use all 16GB on my machine. Any ideas ?

@jlapeyre It's indeed a little tricky to test performance in our unit tests - mostly we test for correctness. The existing unit tests provide a harness for you to change the implementation (to be faster) with the confidence of knowing you didn't break anything in the meantime. Generally posting before/after benchmarks you did along with the PR and/or adding stuff for nanosoldier to run (BaseBenchmarks.jl) if you're really keen is sufficient to prove that you made something better.

@andyferris Yes, that's the question. Unit tests are for what Stefan calls "observable" behavior. (But, I am able to detect the desktop crashing because all memory is allocated, even though its non-observable.) Nanosoldier looks a bit ...complicated, but worthwhile. Its been a couple of years since I did a PR to julia. I am concentrating on finding a way to run all tests without allocating all memory. It looks like make testall has no way to limit cpus. Setting JULIA_NUM_THREADS to 1 and running Base.runtests(ncores=1) may work. Is there something obvious I am overlooking ? I guess asking a question on discourse so there is record in the proper place is the way to go.

For this PR to fix merge, I ran the most obvious tests, and will rely of travis for the rest. I'll include a benchmark in the PR comment. But, I have another PR to make that has the possibility to affect many more tests, so I need to find a way to run them all.

You don't need to deal with Nanosolider yourself, just make a PR against BaseBenchmarks to add a benchmark. In general better avoid using a lot of memory, a smaller test should be enough to measure the performance difference. You probably shouldn't add a standard test in Julia itself as these only check behavior, not performance.

Anyway open a PR first so that we can discuss the fix.

@nalimilan. I didn't add any tests to the standard Julia test suite. I just wanted to run them to be certain (better: less uncertain) that the change does not break anything. For the PR at hand, I just ran the test files that call merge explicitly. But, in the future, I may want to run all tests. I this case, I find that I run out of memory with the standard Julia test suite. Maybe because I am doing this on a desktop and not a dedicated server; I don't know.

Anyway, to move forward, I made the PR. #26440

Hmm. I see that I accidentally left "(#21672)", which may be confusing.

Doesn't this already work?

collect(Iterators.flatten([[1, 2, 3], [4, 5, 6]]))

We could just define reduce(vcat, vectors) as this?

A question which popped up when working on https://github.com/JuliaLang/julia/pull/26440 is whether we should define optimized methods only for reduce(merge, items::AbstractVector{<:Dict}), or for reduce(merge, items::AbstractVector) and use mapreduce(typeof, typejoin, items) to also take the optimized code path even for Vector{Any} objects which happen to contain only Dict entries. The former is definitely simpler to implement, but it can miss cases which would benefit from the optimization. Opinions?

Turns out the minimal fix to support reduce([hvcat], A) was simple. See https://github.com/JuliaLang/julia/pull/27188.

Here is a proposed solution. The other option is to define method for reduce(merge, items::AbstractVector{<:Dict}) that uses mapreduce to get the key types and also the value types.

It might be late to comment on this, but one advantage of a dedicated function is that it can handle empty vectors easily (eg. stack(Vector{Int}[])). reduce(vcat, Int[], vectors) works fine, but requires consciously thinking of the special-case. And it's not trivial to write with generic inputs.

We can probably improve the reduce specialization to handle the case of empty but concretely typed arrays, similar to how reduce(+, Int[]) works. Might be as simple as calling eltype(eltype(vectors)) first when choosing the element type, and only falling back to mapreduce(eltype, promote_type, vectors) (used currently in https://github.com/JuliaLang/julia/pull/27188) if that's not concrete.

To echo a previous comment, it would be nice to handle the empty case appropriately: just as reduce(+, Int[]) returns 0, so should reduce(vcat, Vector{Int}[]) return Int[]. An operation as basic as concatenating a vector of vectors shouldn't be so finicky!

Shouldn't be too hard to fix. We should just avoid calling mapreduce(eltype, promote_type, A) when isconcretetype(eltype(A)), and use the eltype directly instead. Would you make a PR?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ararslan picture ararslan  Â·  3Comments

musm picture musm  Â·  3Comments

sbromberger picture sbromberger  Â·  3Comments

Keno picture Keno  Â·  3Comments

dpsanders picture dpsanders  Â·  3Comments