Hi there,
apologies if this has already been addressed somewhere, but is there a reason that there is no `unzip()` function in Base?
Ideally this would be a function that would take a `Vector{Tuple{...}}` and return a `Tuple{Vector, ..., Vector}` for output. E.g.

```julia
julia> v = [(1,"a",:meow), (2,"b",:woof), (3,"c",:doh!)]; unzip(v)
([1,2,3],ASCIIString["a","b","c"],[:meow,:woof,:doh!])
```
A naive implementation might be something like

```julia
function unzip(input::Vector)
    n = length(input)
    types = map(typeof, first(input))
    output = map(T -> Vector{T}(undef, n), types)  # `Vector{T}(n)` pre-0.7
    for i = 1:n
        @inbounds for (j, x) in enumerate(input[i])
            output[j][i] = x
        end
    end
    return (output...,)  # trailing comma needed on modern Julia
end
```
We definitely need an `unzip` function in Base.
Or even simpler:

```julia
unzip(a) = zip(a...)
```
Trying the zip(...) thing on any non-toy problem shows that it's not so good pretty fast, unfortunately.
I just ran into this issue when I absentmindedly used the Python-esque `zip(arr...)` where `length(arr) == 6000`. It was not pretty. A dedicated `unzip` would be nice. Maybe even throw a warning that you shouldn't splat anything nontrivial.
> Maybe even throw a warning that you shouldn't splat anything nontrivial.

Like almost all other performance tips in Julia, doing so is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting an unknown number of elements can be useful.) The only thing that matters is not to put it in a performance-critical path.

It looks like this is not covered in the performance tips?
> Like almost all other performance tips in Julia, doing so is perfectly fine and sometimes wanted for various reasons. The only thing that matters is not to put it in a performance-critical path.

Yeah, a warning is a bad idea.
> It looks like this is not covered in the performance tips?

I think this should also be noted under differences from other languages. People coming from Python are probably used to the `zip(*arr)` syntax, and `zip(arr...)` might be a tempting target.
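To make the trap concrete, here is a small sketch of the transliteration. It is fine at toy sizes like this one; the cost comes from the splat, which compiles a `zip` method specialized on the number of arguments, so it grows with `length(arr)`:

```julia
# Python's `list(zip(*arr))` transliterated to Julia. Harmless for
# 5 tuples, but compilation cost explodes for thousands of arguments.
arr = [(i, 10i) for i in 1:5]   # small array of 2-tuples
cols = collect(zip(arr...))     # splats all 5 tuples into zip
# cols == [(1, 2, 3, 4, 5), (10, 20, 30, 40, 50)]
```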
You can dispatch only on arrays of tuples:

```julia
# updated from the Julia 0.4 `unzip{T<:Tuple}` syntax to the 1.0 `where` form
function unzip(A::Array{T}) where {T <: Tuple}
    res = map(x -> x[], T.parameters)  # one empty Vector per tuple slot
    res_len = length(res)
    for t in A
        for i in 1:res_len
            push!(res[i], t[i])
        end
    end
    res
end
```
Bump, we still need an `unzip` function.
BUMP if anyone is bored...

```julia
# pre-1.0 code: `Zip2` and `Vector{T}(n)`; on 1.0+ the analogues are
# `Iterators.Zip` and `Vector{T}(undef, n)`
function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
    n = length(zipped)
    vec1 = Vector{T1}(undef, n)
    vec2 = Vector{T2}(undef, n)
    for (i, pair) in enumerate(zipped)
        vec1[i], vec2[i] = pair
    end
    return vec1, vec2
end
```
A perhaps ≤1.5x slower alternative is

```julia
function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
    return map(first, zipped), map(last, zipped)
end
```
```julia
unzip(zipped::Base.Iterators.Zip2) = (zipped.a, zipped.b)
```

But all those differ in how they treat `length(a) != length(b)`. I think I'd favor the behavior of the first.
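Concretely, the variants disagree once the inputs have unequal length. A sketch (note: `.is` is the internal field of the modern `Iterators.Zip`, so relying on it is an implementation detail):

```julia
a, b = [1, 2, 3], [10.0, 20.0]
z = zip(a, b)
# the iterating variants truncate to the shorter input:
c1, c2 = map(first, z), map(last, z)   # ([1, 2], [10.0, 20.0])
# the field-access variant hands back the originals, lengths untouched:
u1, u2 = z.is                          # ([1, 2, 3], [10.0, 20.0])
```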
This trivial case apart, the desired semantics are not absolutely clear to me: Given an iterator `itr`, should `unzip(itr)` return an iterator `unzipped` that generates iterators `unzipped[1]` to `unzipped[N]`, where `N` is the length of the iterators generated by `itr`? What if the first 10000 values generated by `itr` are `Tuple`s of length `N == 3`, but then comes a `Tuple` of length 2? Throw when iterating at least one `unzipped[n]` up to that point? Or eagerly collect `itr` the moment `unzip` is called?
Think about

```julia
unzip(f(n) for n in Base.Iterators.countfrom(0))
```

for different `f`s to get an idea of my open questions.
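For instance, with two hypothetical `f`s (a sketch): the first generator always yields 2-tuples, so a lazy unzip could in principle settle on `N == 2`; the second changes tuple length as it goes, so no choice of `N` is ever safe:

```julia
using Base.Iterators: countfrom, take

# every element is a 2-tuple: a lazy unzipped pair is conceivable
uniform = ((n, n^2) for n in countfrom(0))
collect(take(uniform, 3))   # [(0, 0), (1, 1), (2, 4)]

# tuple length varies with n: any fixed N is eventually wrong
ragged = (ntuple(identity, 1 + n % 3) for n in countfrom(0))
collect(take(ragged, 3))    # [(1,), (1, 2), (1, 2, 3)]
```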
There seem to be multiple use cases for an unzip, with different requirements.
A common case involves arrays of tuples, in which the size is known and the array shape must be preserved. Some suggestions here do not obey this constraint.
I have come up with a solution for this specific case using `Base.Cartesian`. It's up to twice as fast as the commonly suggested alternative for `NTuple`s, and more than 10x faster for heterogeneous tuples.
See: https://github.com/spalato/Destruct.jl
Feel free to use and steal.
Can we get this for generalized iterators and not just arrays? Basically just a collect_many modeled after collect?
I could give it a stab? The trickiest bit (I think) will be getting an equivalent of `default_eltype` for zero-length, eltype-unknown iterators.
Go for it, @bramtayl!
Tada, this seems to work. Great excuse not to finish my grading.
```julia
# Assumed imports so the definitions below extend Base functions
# (note: _collect and _similar_for are Base internals):
import Base: iterate, IteratorEltype, eltype, IteratorSize, axes, size,
    length, IndexStyle, getindex, setindex!, push!, similar, collect,
    _collect, _similar_for
using Base: @propagate_inbounds, tail, to_shape, Dims, DimOrInd, OneTo,
    SizeUnknown, HasLength, HasShape, Generator
using Base.Iterators: Filter
struct Unzip{Iterator}
iterator::Iterator
end
iterate(iterator::Unzip) = iterate(iterator.iterator)
iterate(iterator::Unzip, state) = iterate(iterator.iterator, state)
IteratorEltype(iterator::Unzip) = IteratorEltype(iterator.iterator)
eltype(iterator::Unzip) = eltype(iterator.iterator)
IteratorSize(iterator::Unzip) = IteratorSize(iterator.iterator)
axes(iterator::Unzip) = axes(iterator.iterator)
size(iterator::Unzip) = size(iterator.iterator)
length(iterator::Unzip) = length(iterator.iterator)
struct ModelArray{ElementType, NumberOfDimensions, Model, Rest <: Tuple} <:
AbstractArray{ElementType, NumberOfDimensions}
model::Model
rest::Rest
end
model_array(model, rest...) =
ModelArray{
Tuple{eltype(model), eltype.(rest)...},
ndims(model),
typeof(model),
typeof(rest)
}(model, rest)
axes(array::ModelArray) = axes(array.model)
size(array::ModelArray) = size(array.model)
IndexStyle(array::ModelArray) = IndexLinear()
@propagate_inbounds getindex(array::ModelArray, index::Int) = (
getindex(array.model, index),
getindex.(array.rest, index)...
)
@propagate_inbounds setindex!(array::ModelArray, value::Tuple, index::Int) = (
setindex!(array.model, value[1], index),
setindex!.(array.rest, tail(value), index)...
)
push!(array::ModelArray, value::Tuple) = (
push!(array.model, value[1]),
push!.(array.rest, tail(value))
)
value_field_types(a_type) =
ntuple(index -> Val(fieldtype(a_type, index)), fieldcount(a_type))
function similar(
array::Unzip,
::Type{ElementType},
dims::Dims
) where {ElementType, NumberOfDimensions}
inner_array(::Val{InnerElementType}) where InnerElementType =
Array{InnerElementType}(undef, dims)
model_array(inner_array.(value_field_types(ElementType))...)
end
similar(a::Unzip, ::Type{T}) where {T} = similar(a, T, to_shape(axes(a)))
similar(a::Unzip{T}, dims::Tuple) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip{T}, dims::DimOrInd...) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::DimOrInd...) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::Tuple{Union{Integer, OneTo}, Vararg{Union{Integer, OneTo}}}) where {T} = similar(a, T, to_shape(dims))
similar(array::ModelArray, ::Type{ElementType}, dims::Dims) where ElementType =
similar(Unzip(nothing), ElementType, dims)
_similar_for(c::Unzip{Nothing}, T, itr, ::SizeUnknown) =
similar(c, T, 0)
_similar_for(c::Unzip{Nothing}, T, itr, ::HasLength) =
similar(c, T, Int(length(itr)::Integer))
_similar_for(c::Unzip{Nothing}, T, itr, ::HasShape) =
similar(c, T, axes(itr))
collect(iterator::Unzip) = _collect(
Unzip(nothing),
iterator,
IteratorEltype(iterator),
IteratorSize(iterator)
)
Generator(x -> (x, x + 1.0), [1]) |> Unzip |> collect
Generator(x -> (x, x + 1.0), [1, missing]) |> Unzip |> collect
zip([1], [1.0]) |> Unzip |> collect
[(1, 1.0)] |> Unzip |> collect
Filter(x -> x[1] == x[2], [(1, 1.0)]) |> Unzip |> collect
```

The above strategy turns out to be extremely sub-optimal for `EltypeUnknown` iterators, but it still works. I'm not sure I have the programming wherewithal to design a better system.
I'm thinking it would probably just need specialized `::ModelArray` methods for `grow_to!` and `collect_to!` to be efficient; I'll see what I can cook up.
Ok, now that #30076 has gone through, here are the last two methods to make this efficient (I think):

```julia
import Base: setindex_widen_up_to

maybe_setindex_widen_up_to(dest::AbstractArray{T}, el, i) where T =
    if isa(el, T)
        @inbounds dest[i] = el
        dest
    else
        setindex_widen_up_to(dest, el, i)
    end

setindex_widen_up_to(dest::ModelArray, el, i) =
    model_array(maybe_setindex_widen_up_to.((dest.model, dest.rest...), el, i)...)
```
Should this go into base? Live as a package? If so, any comments before I cook up a PR?
I'm leaving the code here for now https://github.com/bramtayl/Unzip.jl
Is there a reason that it does not build?
It should only work on master
Can this become a PR, and then get merged?
Some docs would also need to be updated
I'm not sure I have the programming chops to make a PR. Or rather, I have some questions: what file should the code go into? What modifications do I need to make to take into account bootstrap order? Do I need boundschecking in the ModelArray constructor to make sure all the inputs are the same size? How can I adjust the code to take into account offset axes? What would be a good way to effectively benchmark this? How can I tell if the widening optimizations are working as intended?
Also, I've run into a bit of a problem. Not sure what to do in the case of `EltypeUnknown`, `SizeUnknown` iterators where `@default_eltype` yields `Union{}`. We need to know at least the length of the tuple to start unzipping... I suppose, for an inference-free implementation, we'd have to require the tuple length explicitly... ugh...
I think it is okay to make some assumptions about the iterator that is to be unzipped (e.g. not returning tuples of different lengths), and it would be okay to give an error message in case this contract is not fulfilled.
Well, well. Take that, Python! Write this in Julia:

```julia
using Test
s1, s2 = (1,2,3), ('a','b','c')
s = tuple(zip(s1,s2)...)
@test tuple(zip(s...)...) == (s1,s2)
# Test Passed
```

so

```julia
unzip(z) = tuple(zip(z...)...)
@test (unzip ∘ zip)(s1,s2) == (s1,s2)
# Test Passed
```

One more implementation, quickly usable. Let's call it the haiku version of Julia unzip.
PS: edited; v1 with arrays was removed since it was not type-stable. v2 with tuples seems OK.
https://github.com/JuliaLang/julia/issues/13942#issuecomment-155668226 and the comment below.
We have an implementation; all that's required is for someone to make a PR to add it to Base. The issue is already tagged as "good first issue" and "help wanted" so hopefully someone feels like trying it.
@KristofferC Sorry, the code was so short it passed under my radar.

```julia
using Test
unzip(z) = tuple(zip(z...)...)

s, t = [1:1_000_000...], [1:1_000_000...];
z = zip(s, t) |> collect;
@test_throws StackOverflowError unzip(z)

s, t = [1:100...], [1:100...];
z = zip(s, t) |> collect;
@time unzip(z);
# 4.105195 seconds (26.41 M allocations: 1.021 GiB, 17.28% gc time)
```

it's a useless toy
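For contrast, a splat-free version of the same job stays cheap; a sketch using broadcast over `first`/`last` (two linear passes, no giant argument lists):

```julia
z = collect(zip(1:1_000_000, 1:1_000_000))  # Vector of 2-tuples
s2, t2 = first.(z), last.(z)                # O(n) each, no splatting
s2 == 1:1_000_000                           # true
```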
Yes, that's been noted:

> Trying the `zip(...)` thing on any non-toy problem shows that it's not so good pretty fast, unfortunately.

I was referring to @bramtayl's Unzip package, which has a real implementation... but it also appears to have been deleted from GitHub. So I guess we do not actually have an implementation. @bramtayl, is there any chance you can resurrect your Unzip code? Was there some reason you deleted it?
Of course, that was clearly understandable.
I was just answering Kristoffer about the splat option.
I didn't figure the numbers were so bad.
Thanks for pointing out the correct track.
Still a bit hard for me at this point to contribute to Base.
Good luck to other people...
No worries. I didn't bother measuring the splatting option because I knew it was going to be terrible 😬
OK. What is surprising, however, is that it's pretty fast in Python:

```python
def unzip(z):
    return zip(*z)

def bench_zip_unzip():
    z = [*zip(range(1_000_000), range(1_000_000))]
    u = [*unzip(z)]

import timeit
timeit.timeit(bench_zip_unzip, number=1)
# 0.4816088059997128 seconds
```
One other thing that would be useful would be to allow it to take a function as its first argument, so that you could do e.g. `s, c = unzip(sincos, x)` for a vector (or array) `x`.
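A sketch of that convenience method, by analogy with `map(f, x)`. Both `unzip` methods here are hypothetical stand-ins: the one-argument form is a naive eager unzip over an array of same-length tuples, and the two-argument form just maps `f` first:

```julia
# naive eager unzip for an array of same-length tuples (hypothetical)
unzip(itr) = ntuple(i -> [t[i] for t in itr], length(first(itr)))
# the proposed function-first form (hypothetical)
unzip(f, itr) = unzip([f(x) for x in itr])

s, c = unzip(sincos, [0.0, pi / 2])
# s == [0.0, 1.0]; c[1] == 1.0 and c[2] ≈ 0.0
```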
Copying some thoughts from a thread on Slack: the utility of `unzip` is that it works like `collect` does on a single iterable, materializing it into a concrete vector, but producing as many vectors as there are elements in each tuple that the iterable yields. The iterable _must_ yield the same-sized tuple each time; it need not, however, yield the same element types each time; `unzip` should handle this the same way that collect or comprehensions do (they're slightly different, so I guess that's a design choice), for each vector that it outputs. So the result of `unzip(itr)` should be the same as

```julia
ntuple(length(first(itr))) do i
    [t[i] for t in itr]
end
```

except that it only consumes `itr` a single time, and it should error if any of the tuples has a different length than the first one. That will be a bit tricky to write 🤔. Mocked-up example:
```julia
julia> itr = [(1,2,3),(3,1.5,'x')]
2-element Array{Tuple{Int64,Real,Any},1}:
 (1, 2, 3)
 (3, 1.5, 'x')

julia> unzip(itr)
([1, 3], Real[2, 1.5], Any[3, 'x'])
```
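A single-pass sketch of that contract. `unzip1` is a hypothetical name, and the sketch deliberately skips element-type widening, so unlike the mocked-up example just above it only works when each column's first element type fits the rest:

```julia
function unzip1(itr)
    cols = nothing
    for t in itr
        if cols === nothing
            cols = ntuple(i -> [t[i]], length(t))  # seed one vector per slot
        elseif length(t) != length(cols)
            throw(ArgumentError("tuple length changed mid-stream"))
        else
            for i in 1:length(cols)
                push!(cols[i], t[i])  # would need widening for mixed eltypes
            end
        end
    end
    cols === nothing ? () : cols
end

unzip1([(1, 2.5), (3, 4.5)])  # ([1, 3], [2.5, 4.5])
```

It consumes the iterator exactly once and errors on ragged tuples, matching the stated semantics for the homogeneous-column case.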
See #30987
Thanks, @bramtayl!
It would be good if we could unzip `Array{Pair}` and `Base.Iterators.Pairs` too, e.g.

```julia
unzip([k => v for (k, v) in zip('a':'c', 1:3)]) == unzip(zip('a':'c', 1:3))
```
bump
https://gist.github.com/bramtayl/2d6d323582f7c3158f9a1b5299997d68
Not skilled enough to make a PR, but I think this should be a pretty robust implementation
Would this be considered an acceptable solution?

```julia
using Base.Cartesian: @nexprs, @ntuple  # needed for the macros below

unzip(z::Iterators.Zip) = z.is
unzip(itr) = unzip(collect(itr))
unzip(::AbstractArray) = error("Can only unzip an array of tuples.")

@generated function unzip(a::AbstractArray{T}) where {T <: Tuple}
    types = T.types
    isempty(types) && return :(())  # checked first, so types[end] is safe
    types[end] <: Vararg && error("Tuples in the array should have the same length.")
    n = length(types)
    quote
        @nexprs $n i -> out_i = similar(a, $types[i])
        @inbounds for j in eachindex(a)
            @nexprs $n i -> out_i[j] = a[j][i]
        end
        return @ntuple $n out
    end
end
```
90% of this implementation is taken from the package Destruct.jl by @spalato, who commented earlier in this thread.
This implementation is maximally efficient for dense arrays of tuples (which was the motivating example for this issue). It is slightly less efficient for arrays with non-standard indexing, due to its reliance on `eachindex`.
For the `Zip` iterator, it just returns the underlying iterators without truncating them to a common size, i.e. `a, b, c = unzip(zip(a, b, c))` is a no-op.
For other iterators and generators, it first collects the output into an array and unzips that. This is not ideal, since it requires twice as much memory, but I simply don't have enough experience with iterator internals to implement it better. (This is also why I haven't tried to resurrect @bramtayl's implementation: I simply couldn't wrap my head around it yet :smile:)
AFAICT, the above code fulfils all the functional requirements set out by @StefanKarpinski above. If this implementation is deemed acceptable, despite using generated functions and sacrificing efficiency in some cases, I can add tests and documentation and open a PR.
The code in the gist was written relatively recently, so I think it should be copy-pastable? It sounds like it's a lot more fully featured than what you're proposing (works with arbitrary iterators, returning a variable number of items). Just need someone to figure out the PR details (something I never seem to be able to do successfully).
I'm also happy to explain whatever is helpful. Lispy-tuple-recursion and iterative widening are both very useful to know how to do regardless.
> It sounds like it's a lot more fully featured than what you're proposing (works with arbitrary iterators, returning a variable number of items).
Iterators are supported, but with some overhead. I think the main difference between the two variants is that yours produces arrays with `missing` values if the tuples have different lengths. Stefan's comment implies that an error should be thrown in this case, but I can see how using `missing` values could sometimes be more useful. If this behavior is preferred, I think it can be implemented in the generated-function-based approach as well.
> I'm also happy to explain whatever is helpful. Lispy-tuple-recursion and iterative widening are both very useful to know how to do regardless.
Thanks, I might have to take you up on that offer yet :smile:. But I really don't feel qualified to finish what you started there. I just spent several hours trying to understand your implementation and still only have a general idea of how it works. And that's without even touching on the `missing` and widening stuff.
I also have to ask: why is this style preferred in Julia (in array library code, at least)? As far as I can tell, the end result is that a specialized method is produced for every combination of tuple and container type that is invoked in user code, which is more or less the same as what a generated function would do. On the other hand, a generated function does exactly what it advertises, while the other solution goes back and forth through twenty layers of abstraction (this is not a metaphor; I think the call chain really is that deep, even though it all gets inlined in the end). I'm sorry if this sounds argumentative, and I'm also starting to think this may be the wrong place to ask such questions, but I'm genuinely curious why layered abstractions relying on vararg splatting at every other step are preferred to generated functions in general.
Two things:
1) Collecting into an array first is unworkable when some of the columns are small unions, because the eltype will be `Any`.
2) Because Julia is a dynamic language, there is no return type. So you can't look at an iterator and guess what types each of the columns will end up being, or, for that matter, how many columns there are going to be in the first place.
> Collecting into an array first is unworkable when some of the columns are small unions, because the eltype will be `Any`.

You're right about that, of course. Though I think you are holding yourself to a higher standard than the standard library (no pun intended). Maybe it would be preferable to use your proposed changes to improve `collect`, so that it doesn't widen types more than necessary? Then less skilled people, such as myself, could get away with falling back on it without too high a cost :smile:.
Also, if the "iterator" passed to `unzip` is an actual array, there is the same problem: `[(1,"a"), (2,3)]` infers as `Array{Tuple{Int64,Any},1}`. Does your implementation tighten the output type to `Tuple{Array{Int64,1}, Array{Union{Int64,String},1}}` in this case as well?
Yes
Sorry, I didn't mean for that to sound short. I really appreciate all the work you're doing, and happy to collaborate on any solutions.
That's impressive. Can you explain, in general terms, how you do it? In my understanding that would require backtracking and converting some of the columns when you encounter a tuple with an unexpected length or type. And the result type could not be determined until the operation is complete, no?
Yes and yes. `push_widen` and `setindex_widen_up_to` already do this in collect for Base (and I outsource to them). Whether or not the result type can be determined depends on how smart inference is being that day, but the short answer is that because Julia is a dynamic language, it doesn't matter anyway.
But I do have a doctest for inference:

```julia
julia> stable(x) = (x, x + 0.0, x, x + 0.0, x, x + 0.0);

julia> @inferred unzip(Generator(stable, 1:4))
```
So, in some cases, inference can figure out what's going on just fine.
But in general, this is not type-stable, I guess, especially with arrays rather than generators?
Yes. Which is ok, that's par for the course in Julia.
Thanks for the explanations! Gives me a lot to think about.
Last question: do you know why this behavior is not implemented for `collect`? Is it just that nobody has gotten around to it yet, or is there some other hidden cost to consider?
I'm not sure I'm qualified to answer that question, but I think #25553 is the place to start looking
The answer is here https://github.com/JuliaLang/julia/pull/26501#issuecomment-373957902 I think
Hmm... correct me if I'm wrong, but #26501 (comment) and #26501 (comment) seem to suggest that diverging tuple fields widening to `Any` rather than `Union` is by design (because of compile times). So the tradeoff here is partly in how much we want to tax the compiler.
@StefanKarpinski, could you chime in with an opinion?
Notice that there is some mismatch between what we expect from the iterator protocol (v1) and how unzip works internally.
Without seekable IO / indexable iterators and/or buffering, that could never fit.
I took a stab at simplifying the code in my gist. I made a little headway, but not much. If it's any comfort, many of the functions (map_unrolled, partial_map, zip_missing) would fit nicely into an unexported mini-module for unrolling tuple operations. I've been kicking around a plan for that for a long time.
It would be really nice to get this solved for 1.6.
@StefanKarpinski I remember that you had a full grasp of this -- is anything of deep import holding it back?
Only performance, IIRC.
@JeffreySarnoff, I may be misremembering some things, but I think there were some unresolved questions aside from performance:

* Should tuples of unequal length produce an error, or `missing`s?
* Should heterogeneous element types widen to `Any` or to a `Union`?
* Should `unzip` of `zip` be the identity? (Rather than throwing an error if the input arrays are of different size, as `zip` currently does.)

Maybe some other things too, but it's been a while since I read through the entire thread.
Note that these issues only really arise for `EltypeUnknown` iterators. For homogeneous tuple arrays there is the Destruct package, which implements a simple and efficient `unzip` operation.
I wrote up https://github.com/bramtayl/Unzip.jl and it should be registered 3 days from now.
_afaic a clear candidate after looking at_ `dump(zip(avec, bvec))`:

* `unzip` that which is `zip`ped: `unzip(x::Base.Iterators.Zip) = x.is`
Feel free to open up a PR over there, but I'm not so sure about that method. The behavior of `unzip` in my version is to collect new vectors, not pass along old ones. `z.is` is also pretty easy to type if you already have a zip.
_iknow uknow this is furiously obvious_

> * `unzip` that which is `zip`ped: `unzip(x::Base.Iterators.Zip) = z.is`
How is this obvious? It's one possible way to define `unzip` of `zip`, and it has the (IMO major) disadvantage that `unzip` is not guaranteed to return arrays of equal length.
(It works better with `z.is` replaced by `x.is`.)
It will return the arrays that were given to `zip`.
How is that a major disadvantage for `unzip`? I conceived that its essential purpose is to recover what `zip` got.
Certainly, the unzip of a `collect`ed zip operates on equilength firsts and .. and lasts, irrespective of what `zip` got. And that is appropriate, as one can only process information that exists when the processing occurs. The most performant approaches to `unzip(x::Vector{Tuple{_}})` cannot be expected to perform as quickly as doing `x.is` does.
To forgo that very high throughput when advancing some computation, just because it provides unequal-length vectors iff unequal-length vectors were given to `zip`, seems unnecessary. Where that is undesirable, one may apply e.g. `rtrim_longer_vecs(unzip(zipped))` and obtain equi-sized vectors.
@zzyxyzz I meant "obvious" in the sense that this is what `dump(zip(a,b))` shows. I had not known about the inner structure of our zips until today .. so, I revised that phrase: "afaic a clear candidate after looking at `dump(zip(avec, bvec))`".
> return arrays of equal length

good point
Maybe `parents` would be a good name for what `function (x::Base.Iterators.Zip); x.is; end` does.
@JeffreySarnoff, I'm not saying that making `unzip(zip(...))` a no-op is necessarily a bad idea. It's just that `unzip` has a surprising number of loose ends and subtle trade-offs for something that is so simple in principle. This is marked as "good first issue", but it really isn't. Over the years several people have taken a stab at it (@bramtayl has come furthest), but it's no accident that there is still no generally accepted solution.
Maybe this issue should be renamed, following Julia tradition, "Taking unzip seriously". :stuck_out_tongue:
Registered!
For `unzip(x::Base.Iterators.Zip)` where the lengths of `x.is` are equal, would you rather return `x.is` or copies of the constituents?
@JeffreySarnoff

> For `unzip(x::Base.Iterators.Zip)` where the lengths of `x.is` are equal, would you rather return `x.is` or copies of the constituents?
My personal take on this:
In any decent implementation, `unzip` should call its source iterator exactly once for each input tuple. This forces `unzip` to construct the output arrays in lock-step; it cannot finish the first array before starting on the second. Thus, `unzip` cannot act as a lazy iterator in any reasonable sense. It is fundamentally a `collect`-like operation. So the safe, consistent thing to do would be to emulate the behavior of `collect` and always return freshly allocated arrays.
But I do admit that the temptation to special-case `unzip(::Base.Iterators.Zip)` is strong. And it wouldn't be the first time that safe, predictable behavior is sacrificed for performance in Julia. :smile: