Julia: `findall(::Fix2{typeof(in)}, array)` and `_findin` make assumptions on finiteness

Created on 11 Sep 2018  路  9Comments  路  Source: JuliaLang/julia

When creating the predicate pred = in(something), it may be that something is not a finite, iterable collection, but that in(x, something) is still a fast operation. (In my case it's 3D points in a Sphere - fast to determine, but Sphere is an uncountable set).

In this case, the implementation of _findin(a, b) is not valid, because we assume b can be converted to a Set here.

This function is called by findall here, without consideration whether something (i.e. pred.x) is finite.

I would recommend changing this signature to the case where pred.x isa Union{AbstractArray, Tuple} for consistency, but I'm not sure if that will cause regressions for anyone?

search & find

Most helpful comment

We've been over this. But actually, SizeUnknown means that something is at least iterable, and in practice consumers will assume it's finite, so Set(itr) would still work. There is no value for a type that's not iterable at all, and I guess technically there also isn't a value that means the type's finite-ness is unknown.

I think all we can do here is limit that method to some types like AbstractArray, and other uses might have to explicitly write in(Set(x)) which frankly is for the best. In fact, this could give better performance if the container is already an AbstractSet but not a Set, since it avoids an unnecessary copy.

All 9 comments

Before https://github.com/JuliaLang/julia/pull/24673, this method was called by findin, and it has the same problem. I guess it's somewhat more severe now that it's called by find.

If we restrict the signature, calls will fall back to the most generic findall method. Maybe that's OK, but it's a bit annoying that custom types with a finite length which may be convertible to Set would have to define a better method manually. Maybe we should check Base.IteratorSize?

Since a Sphere is (probably?) not iterable, it shouldn't have to define IteratorSize. There is a slight impedance mismatch here, in that in is generically defined to check whether something is equal to an element of a collection, which is not quite the same meaning as being contained in a region.

I don't think that in need necessarily be restricted to being an element of a collection. Since a collection is discrete, in is naturally defined/implemented for it in terms of checking equality with each item the collection iterates. However, for non-discrete things, in could be defined/implemented in a more general way. I'm not sure how this relates to findin; presumably you don't want to iterate uncountably infinite objects like spheres.

I'm not sure how this relates to findin; presumably you don't want to iterate uncountably infinite objects like spheres.

To be clear, this issue is about the x argument find(in(x), y). You don't need to iterate over x to know whether an entry from y is present in x.

Hmm... Is there not a fallback definition for IteratorSize to unknown?

It falls back to HasLength unfortunately.

That seems... presumptuous.

We've been over this. But actually, SizeUnknown means that something is at least iterable, and in practice consumers will assume it's finite, so Set(itr) would still work. There is no value for a type that's not iterable at all, and I guess technically there also isn't a value that means the type's finite-ness is unknown.

I think all we can do here is limit that method to some types like AbstractArray, and other uses might have to explicitly write in(Set(x)) which frankly is for the best. In fact, this could give better performance if the container is already an AbstractSet but not a Set, since it avoids an unnecessary copy.

Even for tuples and small AbstractArrays it might be slower to use a Set than not (I suppose I should do some benchmarks to back up this claim).

If all the user needs to write is in(Set(x)) to get a speed up, at least that is concise and it鈥檚 very visible what kind of algorithm will occur under the hood (hash lookup) and gives ultimate control to the user. Maybe even explicit is better than implicit in this case?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

omus picture omus  路  3Comments

i-apellaniz picture i-apellaniz  路  3Comments

wilburtownsend picture wilburtownsend  路  3Comments

sbromberger picture sbromberger  路  3Comments

dpsanders picture dpsanders  路  3Comments