Julia: Merge collect() into Vector()?

Created on 24 Apr 2016  ·  44Comments  ·  Source: JuliaLang/julia

I wonder whether collect() isn't always equivalent to Vector(), and could therefore be merged into it:

julia> methods(collect)
#3 methods for generic function "collect":
collect(r::Range{T<:Any}) at range.jl:753
collect{T}(::Type{T}, itr) at array.jl:217
collect(itr) at array.jl:224

The method for Range returns a vector, so it could be replaced with Array(), as both call vcat() in the end. convert(Vector, ...) currently fails and should be fixed.

The methods for general iterables are more complex, but they essentially always create an Array too. Even when they call similar(), it's on a 1:1 range, so they actually create a Vector. So the collect() machinery could be moved to Array()/Vector() to make them support conversion from any iterable.

This issue is related to discussions regarding similar() (https://github.com/JuliaLang/julia/issues/13731). Indeed, Julia provides a few functions which appear to behave differently, but in practice behave like Array, and callers end up depending on this assumption. This makes the API more complex and less explicit.

See also https://github.com/JuliaLang/julia/issues/12251 about merging full into Array.

arrays collections design

Most helpful comment

Yes, I agree we seem to need collect. Most (all?) current uses can be replaced by Array(itr), which is great, but we need a function that picks an appropriate array type (e.g. for materializing a lazy view of an immutable or distributed array, etc.). AbstractArray(itr) or DenseArray(itr) might be able to do that, but (1) that's pretty long, and (2) lazy views might still be subtypes of AbstractArray.

All 44 comments

Indeed. Let's keep this issue to discuss the case of collect then, to avoid polluting the other thread.

It would be a lot cleaner, as that is actually what people want when using "our" collect method. I only know the method name collect from Ruby where it is identical to our map (and their own map,...), just to add to the confusion.

collect returns an Array, not (always) a Vector. To usecollect rather than convert(Array,...) indicates better that values/objects ) are to be gathered which possibly do not yet exists and which possibly will not exist anymore as iterable after the collection (for example when collecting iterables of random numbers, user inputs or unique ids). My impression is thatconvert should work on things which exist and should not wait an indefinite time for the generation of a new value/event.

convert(Array, itr) might indeed look a bit surprising when doing an irreversible operation, so we could only support the Array(itr) form. I don't think it would really be less explicit than collect. The fact that the operation might block is dependent on/indicated by the type of the iterator, not be the function. Currently collect(1:3) will never block, but other types may do so.

By the way convert(Array, 1:3) is defined anyway because isa(1:3,AbstractArray) is true.

I really like this suggestion, lets get rid of collect!

I like the idea too, but it is ambiguous with constructors that take dimensions, e.g. Array{T}((m,n)). That will need to be resolved.

Yes, but how? Can we deprecate Array{T}(::Tuple) in favor of Array{T}(::Integer...)? Or would that introduce a performance penalty?

It's not about the performance penalty so much as many people prefer to use the dimension tuple form, and in some sense it's a better, more consistent way to specify this.

It might be too late for this, but I'd almost say arguments like reshape would be ideal: Array(iter, dims). A zeros matrix would be Array(repeated(0), (m,n)). We could allow passing an empty iterator to get an uninitialized array.

So if we don't want to get rid of Array{T}(::Tuple), we can't use Array{T}(itr) as a replacement for collect(T, itr). That's why I wonder whether it's really needed: yes it's consistent, but varargs are just another way of passing a tuple, so it's also consistent IMHO.

Anyway, we can still use convert(Array{T}, itr) to replace collect(T, itr), though it's more verbose. Opinions?

@JeffBezanson I'm not sure I get your point. That indeed sounds interesting, but it doesn't replace collect, and there's still an ambiguity with Array{T}(iter, dims).

Just to add a data point, the recently introduced constructors from iterables of BitArrays (#19018) also inevitably ended up having the slightly annoying feature that tuples of Ints are special-cased:

julia> BitArray([0,1,0])
3-element BitArray{1}:
 false
  true
 false

julia> BitArray((0,1,0))
0×1×0 BitArray{3}

julia> BitArray((false,true,false))
3-element BitArray{1}:
 false
  true
 false

It's probably less severe than the Array case. But perhaps if a general solution is not found and collect is ultimately kept, we should have bitcollect, by analogy with other methods like bitrand and bitbroadcast, instead of BitArray(iter)? (I don't like it at all, but it would be more consistent.)

It's probably less severe than the Array case. But perhaps if a general solution is not found and collect is ultimately kept, we should have bitcollect, by analogy with other methods like bitrand and bitbroadcast, instead of BitArray(iter)? (I don't like it at all, but it would be more consistent.)

I thought we were rather trying to get rid of the bit* functions in favor or more generic constructs. We should either support collect(arraytype, itr) or recommend using BitArray. Unfortunately, the former in ambiguous if you want to collect an iterator of arrays, and the second is ambiguous if you want to collect a tuple of integers...

I thought we were rather trying to get rid of the bit* functions in favor or more generic constructs.

In principle yes, except that good generic constructs to specify the output container type have not yet emerged to my knowledge (other than the constructors, of course).

good generic constructs to specify the output container type

Ref. #16740, work in that direction. Best!

Now that Array( ... ) has been deprecated, it would be totally possible just to rename collect to Array. Sadly, that would include Array(T, iter), but that's no worse than collect itself.

For Array{T} and Array{T,N} the only unambiguous signature available is Array{T}(iter, dims::Tuple). That might be ok sometimes, but unfortunately Array{T}(iter) is what we'd want most.

One option is to deprecate e.g. Array{T}(1,2,3) to Array{T,3}(1,2,3). Or we could require something like Array{T}(uninitialized, dims) for the current behavior, explicitly requesting uninitialized memory (uninitialized would be a special constant).

Since collect produces a vector, what about spelling this as Vector(itr) and Vector{T}(itr)?

Edit: I realized the problem with this as soon as I posted it – Vector(n) and Vector{T}(n) alread mean something and integers are iterable. Sad trombone...

Arrays are at this point one of the only collections that doesn't have constructors that take arbitrary iterators of contents to construct with, making them quite the odd one out. Perhaps we should consider deprecating Vector(dims...) and Vector(dims) instead and have blank(T, dims...) and blank(T, dims) (or maybe uninitialized?) functions similar to zeros or ones instead. Then Vector(itr) would be available for this kind of construction.

A blank what? Why isn't Dict() also blank? Should it be blank(Array{T}, dims)? Then I can have blank(MyAbstractArray{T}, args...)?

Then I can have blank(MyAbstractArray{T}, args...)?

That's essentially what #16740 implements, for ones, zeros etc.

If we decided to get rid of #undef so that creating an uninitialized array no longer makes sense (except maybe as an optimization in very specific cases), we could require choosing between zeros, ones and nulls to construct an array. In Nulls.jl, the latter function creates an Array{Union{Null, T}} filled with nulls, which can now be converted without making a copy to an Array{T} once all nulls have been replaced with valid values. (Cf. https://github.com/JuliaLang/julia/issues/9147.)

The name junk came up on triage this week and everyone seemed to rather like it. As in, if you want a matrix full of zeros, you write zeros(10, 10), if you want a matrix full of ones, you write ones(10, 10), and if you want a matrix full of junk, you write junk(10, 10) 😁. If we went with junk we could reclaim Vector to do what collect currently does.

With that, Array can do what collect currently does. Is the plan to deprecate collect? Just that collect has the ability to choose the right container for an iterator, which does not need to be dense array as it is now. For example an RaggedArray.jl can support (ragged) generators which then can collect to RaggedArrays etc. etc.

You would just write RaggedArray(itr) for that case.

... if you created itr yourself, of course. But Julia has a rudimentary system in place where iterators inform about the mutable collection they are naturally collected into via traits - and collect(itr) is a way to hook into this system where itr is just any iterable argument to a function something(itr). Replacing collect by Array makes this system more restrictive, retaining both collect and Array leaves it open for extension. In the end this is a design decision and I'll better not keep discussing, but those options are not equivalent.

I guess we could allow using AbstractArray(itr) to let itr choose the most appropriate container type (with Array as a fallback)?

If a has non-1 indices, currently convert(Vector, a) and collect(a) behave differently: the first throws an error and the second treats a as a list starting with the first item. Which would Vector(a) do?

Sorry I didn't mention this earlier.

Which would Vector(a) do?

For general iterable a, Vector(a) should treat a as a list starting with the first item? Best!

(If the following doesn't make sense, please check out #24595.) Deprecation process conundrum: We can deprecate Array(shape...)-like constructors in favor of Array(uninitialized, shape...)-like constructors in 0.7, allowing Array(...) to take on collect(...)'s behaviors in 1.0. Great. Sticking point: That sequence of changes allows deprecation of collect(...) to Array(...) equivalents only in 1.0 (or requires a hard break somewhere). Thoughts on how best to go about this process? Thanks!

Working plan from triage: Deprecate collect in 1.0. Best!

What the status of this issue? Triage just decided to use collect at https://github.com/JuliaLang/julia/pull/25217#issuecomment-353438898, which sounds contradictory with last comment.

Right, that's a good point. I'm not sure we need to fully deprecate collect but it would be nice if Vector(itr) worked at least. Then collect(itr) or collect(lazy) can be the function that says "go evaluate all of these things" and collect them for me.

Everything necessary to give e.g. Arrays the relevant subset of collect's behaviors between 0.7 and 1.0 is complete IIRC. So happily the better constructor behavior can come about independent of collect's evolution :).

Yes, I agree we seem to need collect. Most (all?) current uses can be replaced by Array(itr), which is great, but we need a function that picks an appropriate array type (e.g. for materializing a lazy view of an immutable or distributed array, etc.). AbstractArray(itr) or DenseArray(itr) might be able to do that, but (1) that's pretty long, and (2) lazy views might still be subtypes of AbstractArray.

but we need a function that picks an appropriate array type (e.g. for materializing a lazy view of an immutable or distributed array, etc.). AbstractArray(itr) or DenseArray(itr) might be able to do that, but (1) that's pretty long

The very unpopular #22630 proposal would solve that. But obviously not (2) and I also realize that at this stage in the release process it is probably not even worth considering that option anymore :) So treat this comment as noise.

The very unpopular #22630 proposal would solve that. But obviously not (2) and I also realize that at this stage in the release process it is probably not even worth considering that option anymore :) So treat this comment as noise.

IMHO Factory pattern was too entangled in OOP land to succeed.
A Service Locator pattern escapes this hell and provides opportunies regarding cloud/cluster ctx that we face regularly today
https://stackoverflow.com/questions/5698026/is-the-service-locator-pattern-any-different-from-the-abstract-factory-pattern

Anyway having a simple wrapper - a function, let's call it collect - to handle creation with correct type info is a good - if not essential - step in the good direction.
So +1 with

Yes, I agree we seem to need collect. Most (all?) current uses can be replaced by Array(itr), which is great, but we need a function that picks an appropriate array type (e.g. for materializing a lazy view of an immutable or distributed array, etc.). AbstractArray(itr) or DenseArray(itr) might be able to do that, but (1) that's pretty long, and (2) lazy views might still be subtypes of AbstractArray.

And move forward to 1.0 :)

Looks like we are set for 1.0 here.

Given last week's triage's decision to extend copy for materialization of Adjoint/Transpose rather than collect, and the accompanying extensive and inconclusive discussion regarding semantics and naming of materialization-like operations, deprecating collect (at least in its present form) once again seems the way forward?

Operating under that assumption, I prepared #25450 as a head start on the potential deprecation between 0.7 and 1.0. Best!

Yes (note that the title is a bit misleading, since the goal is to merge it into Array, not Vector; possibly by using copyto)

Discussing this with @mschauer

I just wanted to check in on the thinking here. I understand completely the motivation to take more control over things with teh constructors in generic library code.

But collect is extremely useful for users - in part because it doesn't require explicitly defining the templated types that are easy for package developers.

For example, start with the simplest example:

collect(1:10);

#Would this now require?
Array{Int64,1}(1:10);

#Tricky to train users on this... In fact, this was my first try
Array{Int, 1)(1:10) #Not an array of concrete types

Next, Imagine the following function which uses the current collect

f(a,b) = collect(a:b)

#I could write the following, but it would be wrong.
f(a,b) = Array{Int64, 1)(a:b)

The issue, of course, is that typeof(a) is not necessarily Int64

So I think the only way to do it correctly (without worrying about promotion... another big issue we didn't need to worry about in the f(a,b) = collect(a:b)

function f(a::T, b::T) where T
    Array{T,1)(a:b)
end

This is all very confusing for a new user, when all they want to do is get an array from a call to linspace and don't really want to worry about generic programming.

Right now the following works great

grid = collect(linspace(0.0, 1.0, 10))

and I would hate to see this require a bunch of generic programming just to teach something similar.

No, the proposal here is to use the un-parameterized Array as the replacement for collect. Your example would be Array(a:b). You could still specify an element type if you wanted, but it wouldn't be required. You could also use Vector or Matrix to specify just a particular dimensional if you wanted, but again, not required.

OK, then I think I agree with the change. As long as I don't have to teach my students about generic programming before they can do the basics!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

iamed2 picture iamed2  ·  3Comments

omus picture omus  ·  3Comments

sbromberger picture sbromberger  ·  3Comments

omus picture omus  ·  3Comments

wilburtownsend picture wilburtownsend  ·  3Comments