Julia: intersect, setdiff fallbacks do not preserve eltype

Created on 29 Aug 2017  Â·  12Comments  Â·  Source: JuliaLang/julia

julia> a = rand(1:10, 5)
5-element Array{Int64,1}:
  8
 10
  5
  2
  5

julia> b = rand(1:10, 5)
5-element Array{Int64,1}:
 8
 6
 7
 6
 2

julia> intersect(a, b)
2-element Array{Int64,1}:
 8
 2

julia> intersect(a, (x for x in b))
2-element Array{Any,1}:
 8
 2

julia> setdiff(a, b)
2-element Array{Int64,1}:
 10
  5

julia> setdiff(a, (x for x in b))
2-element Array{Any,1}:
 10
  5

The relevant methods are:

https://github.com/JuliaLang/julia/blob/6a47db885b8f959e757d2e2ae51ee1ce56fddd69/base/array.jl#L2392-L2406

https://github.com/JuliaLang/julia/blob/6a47db885b8f959e757d2e2ae51ee1ce56fddd69/base/array.jl#L2439-L2451

This is a bit subtle, however, because we need the promote type in cases where the eltypes can be promoted to a common type, but if no promotion is required then we can always just keep the eltype of the first argument. We could just tell people to use setdiff! or intersect! here but there are no fallbacks defined for the iterator cases of those:

julia> setdiff!(a, (x for x in b))
ERROR: MethodError: no method matching setdiff!(::Array{Int64,1}, ::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##20#21"))})
Closest candidates are:
  setdiff!(::IntSet, ::Any) at intset.jl:146
  setdiff!(::Set, ::Any) at set.jl:200

julia> intersect!(a, (x for x in b))
ERROR: MethodError: no method matching intersect!(::Array{Int64,1}, ::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##22#23"))})
collections

Most helpful comment

@rfourquet, I think whatever you want to do about this would be ok, including leaving it alone. We could call this a bug and fix it at any point :trollface:. Seriously, changing this would only be very mildly breaking so I'm going to take it off the milestone.

All 12 comments

For intersect and setdiff, the result is always a subset of the first argument, so we can unconditionally have the result of the same type as the first argument no? For union it's different though. I have been working recently on a couple of PRs touching this code, so assigning myself.

Yes, we could just unconditionally use the eltype of the first array, although when they can be promoted to a common type, would we prefer that? I.e. which is the preferable behavior:

julia> a = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

julia> b = [2.0, 3.0, 4.0]
3-element Array{Float64,1}:
 2.0
 3.0
 4.0

julia> a ∩ b # current behavior
2-element Array{Float64,1}:
 2.0
 3.0

julia> a ∩ b # proposed behavior
2-element Array{Int64,1}:
 2
 3

A possible hybrid would be to use

T = eltype(v1)
P = promote_eltype(v1, vs...)
R = isleaftype(P) || !isleaftype(T) ? P : T
ret = Vector{R}(0)

In other words, choose between the eltype of the first array and the promotion type, preferring the promotion type as long as it doesn't cause you to return an abstractly typed array where you could have returned a concretely typed array. But perhaps that's too complicated and just using the eltype of the first array is simpler and less surprising.

I find it more natural to promote to a common type and not treat the first argument as special. I see union as similar to cat, and intersect as similar to union, so I'd rather treat them consistently. It's not a big deal if the element type ends up as abstract in some rare cases: most of the time promotion will give a concrete type for values which are comparable using isequal.

Except for the very common case I posted at the top – where some of the arguments are generators whose type cannot be inferred. Then you just get Any arrays even though everything you're actually working with is the same type, which is super annoying. We could handle cases where the inputs are HasEltype false differently. Even if we preserve the eltype of the first input, if that's a generator we get the dreaded Any-typed result. We could do what comprehensions do – that's probably the best option in that case.

Yes, using the same approach as for comprehensions makes sense.

Given that IntSet can not have arbitrary eltype, we already have that op(::IntSet, itrs...) (where op=(intersect, union, etc)) always gives back an IntSet, with Int eltype. Also those operations with a Set as first argument gives back a Set. So the first argument is already treated specially.. mathematically it's not nice to have this asymetry, but code-wise I find it great for predictability. Even for union and symdiff, I would even consider setting the eltype of the output to that of the first argument, and erroring if this doesn't work (like for union(IntSet(), 'a')), and maybe have a promote=false keyword argument.

Independently from the discussion regarding the element type, I think the best solution to choose the returned container type would be to define a new promotion mechanism, also used by cat. See https://github.com/JuliaLang/julia/issues/18472.

I'm loathe to do this, but we should sort this eltype out for 1.0 since changing it will break code.

I explored a bit today the possibilities, and I still don't know what solution I prefer.

  • typejoin will often end up with an abstract type, but is type stable (I think)
  • promote_type between the "current" eltype and the new element to add: type unstable, depends on the values (e.g. union([1], [big(1)]) => [1], but union([1], [big(2)]) => BigInt[1, 2]). Moreover union([1.1], 2^60+1) still fails, because promote_type(Int,Float64) == Float64, but 2^60+1 can't be stored in a Set{Float64}.
  • try to push the next value, if an exception is thrown, use promote_type or typejoin. The advantage is that union(Set(1), [big(2)]) would give Set([1, 2]) instead of the more expensive Set(BigInt[1, 2]).

Also, the chosen solution could depend on the performance, but I don't know when I will have time to do a careful implementation (like e.g. for comprehensions or collect).

Now that the mutating versions exist, is it thinkable to allow ourself to do this in 1.x, by marking the non-mutating versions as "unstable feature" and direct people to the mutating versions if they want to control the eltype of the result? And anyway, this work would result in most cases in more well typed collections (but would be breaking if we start to throw an exception like in the case of promote_type).

promote_type does not need to depend on the values AFAICT. Even if the first set contains 1, you can take into account that the second set contains big(1) even if that value is not actually used, by simply taking the element type (and Any for iterators without an element type). I don't think it's different from typejoin. Anyway it sounds more appropriate to use typejoin here, just like comprehensions. See what the collect code does, which is similar to your third proposal based on try.

I agree typejoin (or promote_eltype, which is the current behavior) will be the simplest, and probably most predictible, but this doesn't address the issue expected by the OP ;-)
Regarding collect, its precisely by seeing how sophisticated it is that I gave up for now, I can't do that in a rush.

@rfourquet, I think whatever you want to do about this would be ok, including leaving it alone. We could call this a bug and fix it at any point :trollface:. Seriously, changing this would only be very mildly breaking so I'm going to take it off the milestone.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Keno picture Keno  Â·  3Comments

StefanKarpinski picture StefanKarpinski  Â·  3Comments

StefanKarpinski picture StefanKarpinski  Â·  3Comments

manor picture manor  Â·  3Comments

TotalVerb picture TotalVerb  Â·  3Comments