Julia: Base.unzip()

Created on 11 Nov 2015  Â·  80Comments  Â·  Source: JuliaLang/julia

Hi there,

apologies if this has already been addressed somewhere, but is there a reason that there is no unzip() function in Base?

Ideally this would be a function that would take a Vector{Tuple{ ... }} and return a Tuple{Vector, ..., Vector} for output. E.g.

julia> v = [(1,"a",:meow), (2,"b",:woof), (3,"c",:doh!)]; unzip(v)
([1,2,3],ASCIIString["a","b","c"],[:meow,:woof,:doh!])

A naive implementation might be something like

function unzip(input::Vector)
    n = length(input)
    types  = map(typeof, first(input))
    output = map(T->Vector{T}(n), types)

    for i = 1:n
       @inbounds for (j, x) in enumerate(input[i])
           (output[j])[i] = x
       end
    end
    return (output...)
end
good first issue help wanted

Most helpful comment

We definitely need an unzip function in Base.

All 80 comments

We definitely need an unzip function in Base.

Or even simpler

unzip(a) = zip(a...) 

No

Trying the zip(...) thing on any non-toy problem shows that it's not so good pretty fast, unfortunately.

I just ran into this issue when I absentmindedly used the Python-esque zip(arr...) where length(arr)=6000. It was not pretty. A dedicated unzip would be nice. Maybe even throw a warning that you shouldn't splat anything nontrivial.

Maybe even throw a warning that you shouldn't splat anything nontrivial.

Like almost all other performance tips in julia, doing so is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting unknown size can be useful....). The only thing that matters is to not put it in performance critical path.

It looks like this is not covered in the performance tip?

Like almost all other performance tips in julia, doing so is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting unknown size can be useful....). The only thing that matters is to not put it in performance critical path.

Yeah, a warning is a bad idea.

It looks like this is not covered in the performance tip?

I think this should also be marked under differences from other languages. People coming from Python are probably used to the zip(*arr) syntax and zip(arr...) might be a tempting target.

You can dispatch only on arrays of tuples:

function unzip{T<:Tuple}(A::Array{T})
    res = map(x -> x[], T.parameters)
    res_len = length(res)
    for t in A
        for i in 1:res_len
            push!(res[i], t[i])
        end
    end
    res
end

Bump, we still need and unzip function.

BUMP if anyone is bored...


function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
    n = length(zipped)
    vec1 = Vector{T1}(n)
    vec2 = Vector{T2}(n)
    i = 1
    for pair in enumerate(zipped)
        vec1[i], vec2[i] = pair[2]
        i = i+1
    end
    return vec1, vec2
end

maybe <= 1.5x is

function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
    return map(first,zipped), map(last,zipped)
 end
unzip(zipped::Base.Iterators.Zip2) = (zipped.a, zipped.b)

But all those differ in how they treat length(a) != length(b). I think I'd favor the behavior of the first.

This trivial case apart, the desired semantics are not absolutely clear to me: Given an iterator itr, should unzip(itr) return an iterator unzipped, that generates iterators unzipped[1] to unzipped[N], where N is the length of the iterators generated by itr? What if the first 10000 values generated by itr are Tuples of length N==3, but then comes a Tuple of length 2? Throw when iterating at least one unzipped[n] up to that point? Or eagerly collect itr the moment unzip is called?

Think about

unzip(f(n) for n in Base.Iterators.countfrom(0))

for different fs to get an idea of my open questions.

There seems to be multiple use cases for an unzip, with different requirements.

A common case is involves Arrays of Tuples, in which the size is known and array shape must be preserved. Some suggestions here do not obey this constraint.

I have come up with a solution for this specific case using Base.Cartesian. It's up to twice as fast as the alternative commonly suggested alternative for NTuples, and more than 10x for heterogeneous tuples.
See: https://github.com/spalato/Destruct.jl
Feel free to use and steal.

Can we get this for generalized iterators and not just arrays? Basically just a collect_many modeled after collect?

I could give it a stab? The trickiest bit (I think) will be getting an equivalent of default_eltype for zero length eltype unknown iterators

Go for it, @bramtayl!

Tada, this seems to work. Great excuse not to finish my grading.

```julia
struct Unzip{Iterator}
iterator::Iterator
end

iterate(iterator::Unzip) = iterate(iterator.iterator)
iterate(iterator::Unzip, state) = iterate(iterator.iterator, state)

IteratorEltype(iterator::Unzip) = IteratorEltype(iterator.iterator)
eltype(iterator::Unzip) = eltype(iterator.iterator)

IteratorSize(iterator::Unzip) = IteratorSize(iterator.iterator)
axes(iterator::Unzip) = axes(iterator.iterator)
size(iterator::Unzip) = size(iterator.iterator)
length(iterator::Unzip) = length(iterator.iterator)

struct ModelArray{ElementType, NumberOfDimensions, Model, Rest <: Tuple} <:
AbstractArray{ElementType, NumberOfDimensions}
model::Model
rest::Rest
end

model_array(model, rest...) =
ModelArray{
Tuple{eltype(model), eltype.(rest)...},
ndims(model),
typeof(model),
typeof(rest)
}(model, rest)

axes(array::ModelArray) = axes(array.model)
size(array::ModelArray) = size(array.model)

IndexStyle(array::ModelArray) = IndexLinear()

@propagate_inbounds getindex(array::ModelArray, index::Int) = (
getindex(array.model, index),
getindex.(array.rest, index)...
)

@propagate_inbounds setindex!(array::ModelArray, value::Tuple, index::Int) = (
setindex!(array.model, value[1], index),
setindex!.(array.rest, tail(value), index)...
)

push!(array::ModelArray, value::Tuple) = (
push!(array.model, value[1]),
push!.(array.rest, tail(value))
)

TODO: use fieldtypes instead of value_field_types in a future with more constant propagation

value_field_types(a_type) =
ntuple(index -> Val(fieldtype(a_type, index)), fieldcount(a_type))

function similar(
array::Unzip,
::Type{ElementType},
dims::Dims
) where {ElementType, NumberOfDimensions}

inner_array(::Val{InnerElementType}) where InnerElementType =
    Array{InnerElementType}(undef, dims)
model_array(inner_array.(value_field_types(ElementType))...)

end

Unzip needs to replicate AbstractArray similar machinery

similar(a::Unzip, ::Type{T}) where {T} = similar(a, T, to_shape(axes(a)))
similar(a::Unzip{T}, dims::Tuple) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip{T}, dims::DimOrInd...) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::DimOrInd...) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::Tuple{Union{Integer, OneTo}, Vararg{Union{Integer, OneTo}}}) where {T} = similar(a, T, to_shape(dims))

similar(array::ModelArray, ::Type{ElementType}, dims::Dims) where ElementType =
similar(Unzip(nothing), ElementType, dims)

_similar_for(c::Unzip{Nothing}, T, itr, ::SizeUnknown) =
similar(c, T, 0)
_similar_for(c::Unzip{Nothing}, T, itr, ::HasLength) =
similar(c, T, Int(length(itr)::Integer))
_similar_for(c::Unzip{Nothing}, T, itr, ::HasShape) =
similar(c, T, axes(itr))

collect(iterator::Unzip) = _collect(
Unzip(nothing),
iterator,
IteratorEltype(iterator),
IteratorSize(iterator)
)

Examples

Generator(x -> (x, x + 1.0), [1]) |> Unzip |> collect
Generator(x -> (x, x + 1.0), [1, missing]) |> Unzip |> collect
zip([1], [1.0]) |> Unzip |> collect
[(1, 1.0)] |> Unzip |> collect
Filter(x -> x[1] == x[2], [(1, 1.0)]) |> Unzip |> collect
``

The above strategy turns out to be extremely sub-optimal for EltypeUnknown iterators, but it still works. I'm not sure I have the programming wherewithall to design a better system.

I'm thinking it would probably just need specialized ::ModelArray methods for grow_to! and collect_to! to be efficient, I'll see what I can cook up

Ok now that #30076 has gone through here are the last two methods to make this efficient (I think):

import Base: setindex_widen_up_to

maybe_setindex_widen_up_to(dest::AbstractArray{T}, el, i) where T = 
    if isa(el, T)
        @inbounds dest[i] = el
        dest
    else 
        setindex_widen_up_to(dest, el, i)
    end

setindex_widen_up_to(dest::ModelArray, el, i) = 
    model_array(maybe_setindex_widen_up_to.((dest.model, dest.rest...), el, i)...)

Should this go into base? Live as a package? If so, any comments before I cook up a PR?

I'm leaving the code here for now https://github.com/bramtayl/Unzip.jl

Is there a reason that it does not build?

It should only work on master

Can this become a PR, and then get merged?
Some docs would also need to be updated

I'm not sure I have the programming chops to make a PR. Or rather, I have some questions: what file should the code go into? What modifications do I need to make to take into account bootstrap order? Do I need boundschecking in the ModelArray constructor to make sure all the inputs are the same size? How can I adjust the code to take into account offset axes? What would be a good way to effectively benchmark this? How can I tell if the widening optimizations are working as intended?

Also, I've run into a bit of a problem. Not sure what do in the case of EltypeUnknown, SizeUnknown iterators where @default_eltype yields Union{}. We need to know at least the length of the tuple to start unzipping... I suppose, for an inference free implementation, we'd have to require the tuple length explicitly... ugh...

I think it is okay to make some assumption on the iterator which is to be unzipped (e.g. not returning tuples of different length) and it would okay to give an error message in the case this contract is not fulfilled.

Well, well. take that in python,
write this in julia:

s1,s2 = (1,2,3),('a','b','c')
s = tuple(zip(s1,s2)...)

@test tuple(zip(s...)...) == (s1,s2)
Test Passed

so

unzip(z) = tuple(zip(z...)...)

@test (unzip ∘ zip)(s1,s2) == (s1,s2)
Test Passed

One more impl, quickly usable.
Let's call it the haiku version of julia unzip

PS: edited v1 w array removed, was not type stable. v2 w tuple seems ok

We have an implementation; all that's required is for someone to make a PR to add it to Base. The issue is already tagged as "good first issue" and "help wanted" so hopefully someone feels like trying it.

@KristofferC

Sorry, code was so short it pass under my radar

using Test
unzip(z) = tuple(zip(z...)...)

s,t = [1:1_000_000...],[1:1_000_000...];
z = zip(s,t) |> collect;
@test_throws StackOverflowError unzip(z)

s,t = [1:100...],[1:100...];
z = zip(s,t) |> collect;
@time unzip(z);
# 4.105195 seconds (26.41 M allocations: 1.021 GiB, 17.28% gc time)

it's an useless toy

Yes, that's been noted:

Trying the zip(...) thing on any non-toy problem shows that it's not so good pretty fast, unfortunately.

I was referring to @bramtayl's Unzip package, which has a real implementation... but also appears to have been deleted from GitHub. So I guess we do not actually have an implementation. @bramtayl's is there any chance you can resurrect your Unzip code? Was there some reason you deleted it?

Of course, that's was clearly understandable.
I was just answering to Kristoffer about the splat option.
Didn't figure numbers were so bad.

Thanks for pointing out the correct track.
Still a bit hard for me at this point to contribute to Base.
Good luck to other people...

No worries. I didn't bother measuring the splatting option because I knew it was going to be terrible 😬

OK. That is surprisingly pretty good however in python

def unzip(z): return zip(*z)

def bench_zip_unzip():
    z = [*zip(range(1_000_000), range(1_000_000))]
    u = [*unzip(z)]

import timeit
timeit.timeit(bench_zip_unzip, number=1)
# 0.4816088059997128 second

one other thing that would be useful would be to allow it to take a function first argument, so that you could do e.g.

s,c = unzip(sincos, x)

for a vector (or array) x.

Copying some thoughts from a thread on Slack, the utility of unzip is that it works like collect does on a single iterable, materializing it into a concrete vector but producing as many vectors as there are elements in each tuple that the iterable yields. The iterable _must_ yield the same sized tuple each time; it need not, however, yield the same element types each time; unzip should handle this the same way that collect or comprehensions do (they're slightly different, so I guess that's a design choice), for each vector that it outputs. So the result of unzip(itr) should be the same as

ntuple(length(first(itr))) do i
    [t[i] for t in itr]
end

except that it only consumes itr a single time and it should error if any of the tuples has a different length than the first one. That will be a bit tricky to write 🤔. Mocked up example:

julia> itr = [(1,2,3),(3,1.5,'x')]
2-element Array{Tuple{Int64,Real,Any},1}:
 (1, 2, 3)
 (3, 1.5, 'x')

julia> unzip(itr)
([1, 3], Real[2, 1.5], Any[3, 'x'])

See #30987

Thanks, @bramtayl!

It will be good if we can unzip Array{Pair}, Base.Iterators.Pairs too. eg.

unzip([k=>v for (k,v) in zip('a':'c',1:3)]) == unzip(zip('a':'c',1:3))

bump

https://gist.github.com/bramtayl/2d6d323582f7c3158f9a1b5299997d68

Not skilled enough to make a PR, but I think this should be a pretty robust implementation

Would this be considered an acceptable solution?

unzip(z::Iterators.Zip) = z.is
unzip(itr) = unzip(collect(itr))
unzip(::AbstractArray) = error("Can only unzip array of tuples.")

@generated function unzip(a::AbstractArray{T}) where {T<:Tuple}
    types = T.types
    types[end] <: Vararg && error("Tuples in the array should have the same length.")
    isempty(types) && return :(())
    n = length(types)
    quote
        @nexprs $n i -> out_i = similar(a,$types[i])
        @inbounds for j in eachindex(a)
            @nexprs $n i -> out_i[j] = a[j][i]
        end
        return @ntuple $n out
    end
end

90% of this implementation is taken from the package Destruct.jl by @spalato, who commented earlier in this thread.

This implementation is maximally efficient for dense arrays of tuples (which was the motivating example for this issue). It is slightly less efficient for arrays with non-standard indexing, due to reliance on eachindex.

For the Zip iterator, it just returns the underlying iterators without truncating them to a common size. I.e. a,b,c = unzip(zip(a,b,c)) is a no-op.

For other iterators and generators, it first collects the output into an array and unzips that. This is not ideal, since it requires twice as much memory, but I simply don't have enough experience with iterator internals to implement it better. (This is also why I haven't tried to ressurect @bramtayl's implementation: I simply couldn't wrap my head around it yet :smile:).

AFAICT, the above code fulfils all the functional requirements set out by @StefanKarpinski above. If this implementation is deemed acceptable, despite using generated functions and sacrificing efficiency in some cases, I can add tests and documentation and open a PR.

The code in the gist was written relatively recently, so I think it should be copy-pastable? It sounds like its a lot more fully featured than what your proposing (works with arbitrary iterators, returning a variable number of items). Just need someone to figure out the PR details (something I never seem to be able to do successfully)

I'm also happy to explain whatever is helpful. Lispy-tuple-recursion and iterative widening are both very useful to know how to do regardless.

It sounds like its a lot more fully featured than what your proposing (works with arbitrary iterators, returning a variable number of items).

Iterators are supported, but with some overhead. I think the main difference between the two variants is that you produce arrays with missing values if the tuples have different length. Stefan's comment implies that an error should be thrown in this case, but I can see how using missing values could be sometimes more useful. If this behavior is preferred, I think it can be implemented in the generating-function-based approach as well.

I'm also happy to explain whatever is helpful. Lispy-tuple-recursion and iterative widening are both very useful to know how to do regardless.

Thanks, I might have to take you up on that offer yet :smile:. But I really don't feel qualified to finish what you started there. I just spent several hours trying to understand your implementation and still only have a general idea of how it works. And that's without even touching on the missing and widening stuff.

I also have to ask, why is this style preferred in Julia (in array library code, at least)? As far as I can tell, the end result is that a specialized method is produced for every combination of tuple and container type that is invoked in user code. Which is more or less the same as what a generated function would do. On the other hand, a generated function does exactly what it advertises, while the other solution goes back and forth through twenty layers of abstraction (this is not a metaphor; I think the call chain really is this deep, even though it all gets inlined in the end). I'm sorry if this sounds argumentative, and I'm also starting to think this may be the wrong place to ask such questions, but I'm genuinely curious why layered abstractions relying on vararg splatting on every other step are preferred to gen-funcs in general.

Two things:

1) Collecting into a array first is unworkable when some of the columns are small unions, because the eltype will be Any
2) Because Julia is a dynamic language, there is no return type. So you can't look at an iterator and guess what types each of the columns will end up being, or for that matter, how many columns there are going to be in the first place.

  1. Collecting into a array first is unworkable when some of the columns are small unions, because the eltype will be Any

You're right about that, of course. Though I think you are holding yourself to a higher standard than the standard library (no pun intended). Maybe it would be preferable to use your proposed changes to improve collect, so that it doesn't widen types more than necessary? Then less skilled people, such as myself, could get away with falling back on it without too high a cost :smile:.

Also, if the "iterator" passed to unzip is an actual array, there is the same problem: [(1,"a"), (2,3)] infers as Array{Tuple{Int64,Any},1}. Does your implementation tighten the output type to Tuple{Array{Int64,1}, Array{Union{Int64,String},1}} in this case as well?

Yes

Sorry, I didn't mean for that to sound short. I really appreciate all the work you're doing, and happy to collaborate on any solutions.

That's impressive. Can you explain, in general terms, how you do it? In my understanding that would require backtracking and converting some of the columns when you encounter a tuple with an unexpected length or type. And the result type could not be determined until the operation is complete, no?

Yes and yes. push_widen and setindex_widen_up_to already do this in collect for Base (and I outsource to them). Whether or not the result type can be determined depends on how smart inference is being that day, but the short answer is that because Julia is a dynamic language, it doesn't matter anyway.

But I do have a doctest for inference:

julia> stable(x) = (x, x + 0.0, x, x + 0.0, x, x + 0.0);

julia> @inferred unzip(Generator(stable, 1:4))

So, in some cases, inference can figure out what's going on just fine.

But in general, this is not type-stable, I guess, especially with arrays rather than generators?

Yes. Which is ok, that's par for the course in Julia.

Thanks for the explanations! Gives me a lot to think about.

Last question: Do you know why this behavior is not implemented for collect? Is it just that nobody got around to it yet, or is there some other hidden cost to consider?

I'm not sure I'm qualified to answer that question, but I think #25553 is the place to start looking

The answer is here https://github.com/JuliaLang/julia/pull/26501#issuecomment-373957902 I think

Hmm... correct me if I'm wrong, but #26501 (comment) and #26501 (comment) seems to suggest that diverging tuple fields widening to Any rather than Union is by design (because compile times). So the tradeoff here is partly in how much we want to tax the compiler.

@StefanKarpinski, could you chime in with an opinion?

Notice that there is some mismatch between what we expect from iterator protocol v1

annex - unzip as iterators dot

And how unzip works internally

annex - unzip as parallelized-iterator dot

Without seekable io / indexable iterator and/or buffering, that could never fit.

I took at stab at simplifying the code in my gist. I made a little headway but not much. If it's any comfort, many of the functions (map_unrolled, partial_map, zip_missing) would fit nicely into a unexported mini-module for unrolling tuple operations. I've been kicking around a plan for that for a long time.

It would be really nice to get this solved for 1.6.

@StefanKarpinski I remember that you had a full grasp of this -- is anything of deep import holding it back?

Only performance, Iirc.

@JeffreySarnoff, I may be misremembering some things, but I think there were some unresolved questions aside from performance:

  1. What to do if tuples have different length?

    • Fill in missings?

    • Throw error?

  2. What to do if the same field has different types in different tuples?

    • Error?

    • Promote to Any?

    • Promote to smallest possible Union?

    • Promote to parent type?

    • Widen numeric types?

  3. Lazy or strict semantics?

    • In particular, do we want unzip of zip be the identity? (Rather than throwing an error if the input arrays are of different size, as zip currently does).

Maybe some other things too, but it's been a while since I read through the entire thread.

Note that these issues only really arise in TypeUnknown iterators. For homogeneous tuple arrays there is the Destruct package, which implements a simple and efficient unzip operation.

33324 seems pretty good to me still. If we want to simplify the code I think the first step would be to solve #31909

I wrote up https://github.com/bramtayl/Unzip.jl and it should be registered 3 days from now.

_afaic a clear candidate after looking at_ dump(zip(avec, bvec))

  • unzip that which is zipped
    unzip(x::Base.Iterators.Zip) = x.is

Feel free to open up a PR over there, but I'm not so sure about that method. The behavior in my version unzip is to collect new vectors, not pass along old ones. z.is also pretty easy to type if you already have a zip.

_iknow uknow this is furiously obvious_

* `unzip` that which is `zip`ped
  unzip(x::Base.Iterators.Zip) = z.is

How is this obvious? It's one possible way to define unzip of zip, and has the (IMO major) disadvantage that unzip is not guaranteed to return arrays of equal length.

(it works better with z.is replaced by x.is)
It will return the arrays that were given to zip.
How is that a major disadvantage for unzip?
I conceived that its essential purpose is to recover what zip got.

Certainly, the unzip of a collected zip operates on equilength firsts and .. and lasts, irrespective of what zip got. And that is appropriate, as one can only process information that exists when the processing occur. The most performant approaches to unzip(x::Vector{Tuple{_}}) cannot be expected to perform as quickly as does doing x.is.

To forgo that very high throughput advancing some computation because it provides unequal length vectors iff unequal length vectors were given to zip seems unnecessary. Where that is undesirable, one may apply e.g.rtrim_longer_vecs(unzip(zipped)) and obtain equi-sized vectors.

@zzyxyzz I meant "obvious" in the sense that this is what dump(zip(a,b)) shows. I had not known about the inner structure of our zips until today .. so, I revised that phrase: "afaic a clear candidate after looking at dump(zip(avec, bvec))".

return arrays of equal length

good point

Maybe parents would be a good name for what function (x::Base.Iterators.Zip) = x.is end does

@JeffreySarnoff, I'm not saying that making unzip(zip(...)) a no-op is necessarily a bad idea. It's just that unzip has a surprising number of loose ends and subtle trade-offs for something that is so simple in principle. This is marked as "good first issue", but it really isn't. Over the years several people have taken a stab at it (@bramtayl has come furthest), but it's no accident that there is still no generally accepted solution.

Maybe this issue should be renamed, following Julia tradition, "Taking unzip seriously". :stuck_out_tongue:

Registered!

For unzip(x::Base.Iterators.Zip) where the lengths of x.is are equal, would you rather return x.is or a copies of the constituents?

@JeffreySarnoff

For unzip(x::Base.Iterators.Zip) where the lengths of x.is are equal, would you rather return x.is or a copies of the constituents?

My personal take on this:

In any decent implementation, unzip should call its source iterator exactly once for each input tuple. This forces unzip to construct the output arrays in lock-step; it cannot finish the first array before starting on the second. Thus, unzip cannot act as a lazy iterator in any reasonable sense. It is fundamentally a collect-like operation. So the safe, consistent thing to do would be to emulate the behavior of collect and always return freshly allocated arrays.

But I do admit that the temptation to special-case unzip(::Base.Iterators.Zip) is strong. And it wouldn't be the first time that safe, predictable behavior is sacrificed for performance in Julia. :smile:

Was this page helpful?
0 / 5 - 0 ratings