Currently there are three families of search & find functions:
In the find
family, find
, findn
return indexes of non-zero or true
values. findfirst
, findlast
, findprev
and findnext
are very similar to find
, but kind of iterative, and they additionally allow looking for an element in a collection (the latter behavior being similar to findin
). The findmin
and findmax
functions are different, as they return the value and index of the min/max. Finally, findnz
is even more different as it only works on matrices and returns a tuple of vectors (I,J,V) for the row- and column-index and value.
In the search
family, [r]search
and [r]searchindex
look for strings/chars/regex in a string (though they also support bytes), the former returning a range, the latter the first index. searchsorted
, searchsortedlast
and searchsortedfirst
look for values equal to or lower than an argument, and return a range for the first, and index for the two others.
indexin
is the same as findin
(i.e. returns index of elements in a collection), but it returns 0
for elements that were not found, instead of a shorter vector.
I hope that summary is exact. Please correct me if not.
Questions/ideas:
findin
be renamed to find
, as the signatures do not conflict? That would mean findfirst
, findlast
, findprev
and findnext
would just be iterating versions of find
. Currently find
offers less methods than the others.indexin
could be renamed to findin
to reunite the family (or add an argument to switch behaviors?)find
and search
? I suggest we rename all search
functions to find*
: searchsorted*
would become findsorted*
, searchindex
would be merged with findfirst
, rsearchindex
with findlast
.search
could be renamed to findfirstrange
, and rsearch
to findlastrange
, making them find any sequence of values in any collection, and not only in strings; if not, nicer names could be findstr
and findrstr
.find[tab][tab]
.findfirst
, findlast
, findprev
and findnext
could be replaced/supplemented with an iterator eachfind
/findeach
?Ah, just found https://github.com/JuliaLang/julia/issues/5664. Keeping this issue open because of the long summary above.
There's also https://github.com/JuliaLang/julia/pull/7327, which is about finding maximum and minimum. It could be more logical to change findmax
and findmin
to return the linear index (was indmax
and indmin
currently do), in order to be consistent with other find*
functions. Just and idea.
See the spreadsheet at https://docs.google.com/spreadsheets/d/1ZLnlYQyRIWa50-mxOmKHNCaJEzGShLVSEkfid2KA-Ps/edit#gid=0
Any comments?
+1 to the general unification of these APIs. If there is general consensus, we should also do it immediately.
I'm a little disappointed this hasn't garnered any attention. I'd love to see these functions cleaned up, too… I often have difficulty remembering which function to use if it's been a little while. I'm afraid that nobody has dared comment simply because the scope here is potentially huge.
Some thoughts:
findnext
and friends into a findeach
iterator. It'd be a net loss of functionality as you couldn't change the predicate of the iterator mid-stream… but that use-case is pretty limited.sorted
methods should be moved into dispatch on a SortedVector
type. There's been talk of this before but I don't think there's an open issue for it.In general, though, I think we should start from scratch and define what we want the find
name to mean as @JeffBezanson suggests in https://github.com/JuliaLang/julia/issues/5664#issuecomment-34346576. In that context, I think there are a few key questions:
find
by value? I think everyone will agree that find(A)
and find(predicate, A)
return indices of nonzero elements and where predicate(elt)
returns true, respectively. The next/prev versions _also_ support a findnext(A, v, i)
method to find the next _value_ v
after index i
. Similarly, findin(a, b)
returns indexes in a
that match a _value_ in b
. I think we should only fold findin→find if the general form supports find by value. In practice, most folks simply use find(A.==v)
for this purpose, but that doesn't compose well with a findeach
iterated method.As a practical matter, we simply cannot fold in all these behaviors to one name, so perhaps that is the significant difference between search and find?
find(A) # Indices of nonzero elements
find(predicate, A) # would prefer to duck-type predicate and just call it
find(A, values) # like findin; would prefer to duck-type values and just iterate over them
find(A, value) # somewhat like search (but all at once); just want to check equality
search
and find
to be consistent in operating iteratively or all-at-once? Currently, I think some of the confusion for me comes from the fact that the two names have very similar purposes but different interfaces.In terms of orthogonal design elements, we currently have a jumbled mishmash of combinations of:
Thanks, that's a useful way of summarizing the requirements
It took me a while to get there, but I think that's the most useful way to think about this. Then the question is simply what names we want to give to those capabilities. Here's one terribly disruptive possibility:
| | nonzeros | predicate test | in collection c
| equal to value v
| sequence or regex s
|
| --- | --- | --- | --- | --- | --- |
| Iteratively forwards | search(A)
| search(pred,A)
| searchin(A,c)
| searcheq(A,v)
| searchseq(A,s)
|
| Iteratively backwards | rsearch(A)
| rsearch(pred,A)
| rsearchin(A,c)
| rsearcheq(A,v)
| rsearchseq(A,s)
|
| All at once | find(A)
| find(pred,A)
| findin(A,c)
| findeq(A,v)
| findseq(A,s)
|
The search
and rsearch
methods would take an optional integer argument for the start index. I don't think we can tuck iterative searching entirely into an iterator object since you often want to start at a known _index_ (and not just some internal iterator state). We could also add methods to return an iterator, but I don't have a good name for that — findeach
becomes a bit clumsy with -in
, -eq
and -seq
suffixes.
(I don't really like this because it makes gives a pretty limited meaning to the very nice name search
, but it's just a spitball to get the ball rolling.)
Nice table. We may be able to merge searcheq
/findeq
into search
/find
. For such a basic usage, using most basic function makes sense.
But I'm not a fan of the search
vs. find
naming convention: I don't find it very obvious, and the fact that there's no common prefix makes it harder to find using ?
or tab completion. Why not findf
(f
for forward
) and findr
? Or even findfirst
and findlast
?
Or even
findfirst
andfindlast
?
or findnext
and findprev
.
I went with this because I find combinations of more than two words pretty unreadable and I went with suffixes for the kind of searching operation. This would result with things like findnextin
or findprevseq
.
I think we could only unify the -eq
functions if we restrict the predicates to be ::Function
or ::Union(DataType, Function)
(which is currently called Base.Callable
, but is terribly misleading since _anything_ can be callable these days).
We could call them all find*
:
find(A, start::Int, rev::Bool=false) # or dir::Order.Ordering=Order.Forward
Edit: this needs more explanation: without the start index, it's an all-at-once operation:
| | nonzeros | predicate test | in collection c
| equal to value v
| sequence or regex s
|
| --- | --- | --- | --- | --- | --- |
| Iteratively forwards | find(A,1)
| find(pred,A,1)
| findin(A,c,1)
| findeq(A,v,1)
| findseq(A,s,1)
|
| Iteratively backwards | find(A, endof(A),true)
| find(pred,A, endof(A),true)
| findin(A,c, endof(A),true)
| findeq(A,v, endof(A),true)
| findseq(A,s, endof(A),true)
|
| All at once | find(A)
| find(pred,A)
| findin(A,c)
| findeq(A,v)
| findseq(A,s)
|
dir::Order.Ordering=Order.Forward
Way clearer than a Boolean. A Boolean would be something I'd need to check the docs for every time I encountered it.
That's what I had initially, but Base.Order and Order.Forward are both unexported… and so I changed to match the sorting API. That's a minor issue, though. I'm not sure I like having both iterative and all-at-once behaviors under the same name.
@mbauman, I really like your distinction of _operation_ verses _how to operate_ ("iteratively" vs "all-at-once").
I would suggest that the "all-at-once" operations are actually closer to filtering, though, and these seem like different enough concepts from the "iterative" operations to warrant a distinct name.
I actually kind of liked your first search
vs find
suggestion. I'm not a fan of the current search
vs find
situation--it is a mishmash--but if there was actually some rationale to the distinction, I would (probably) be fine with it, and the one you suggested seems reasonable to me.
(Also: referring to the table for find
above, to me, find(A,1)
suggests reducing along dimension 1--a Matlab-ism, to be sure, but still one present in a number of Julia functions--sum
, prod
, max
, etc.--I think that would be inconsistent and somewhat confusing.)
@kmsquire As I understand it, the outcome of the threads you link to is that there's no clear distinction between "find" and "search", except that the latter insists on the process and may no return anything (but in our case, find
may not return anything either...) Stretching that idea a bit, one could decide that find
means "get all matches" and search
means "return the first occurrence after a given index, possibly going in reverse direction, so that I can write a more complex search loop". But I'm not sure that's really obvious.
Otherwise, I find that the idea of merging forward/reverse search is appealing, but it can get quite confusing as the start index is not in the same position in find(A, 1)
and find(pred, A, 1)
. Maybe the search order should always be specified before that, to make the distinction between the two series of functions clearer: find(A, Order.Forward, 1)
and find(pred, A, Order.Forward, 1)
. That way, the index would be optional when you only want to get the first (or last for Order.Reverse
) result.
I think the distinction at least makes some sense if you describe the two
cases as eg
"Find all matches"
and
"Search for the first match after the given index"
where the first one will indeed always find all matches (though they may be
zero) but the other one might not find the next match if there is none.
How about this plan, in which everything would be called find*
.
find*
versions return all matches.Order.Forward
or Order.Backward
as an argument, they return an iterator.This is essentially @mbauman's last table from https://github.com/JuliaLang/julia/issues/10593#issuecomment-90249829, except that the boolean is replaced with Order.Forward
or Order.Backward
, and that another row is added for getting an iterator.
I can have a look at a PR if you agree.
I think a large class of find functions can be replaced by a single findeach
iterator as suggested above by @mbauman . The general signature is:
findeach(needle, haystack[, startpoint][, rev])
The needle
is generally a predicate, and we can add special predicate objects to handle the most popular ones:
!iszero
for find(a)
; this already existsequalto(y) = x->isequal(x,y)
occursin(c) = x->(x in c)
occursin
should return an object we can dispatch on to implement optimizations like _sortedfindin
. It can also make a set of values on construction for fast lookup.
Here are implementations of most of the existing find functions in terms of findeach
:
findnext(a, i) = first(findeach(!iszero, a, i))
findfirst(a) = first(findeach(!iszero, a, start(a)))
findnext(a, v, i) = first(findeach(equalto(v), a, i))
findfirst(a, v) = first(findeach(equalto(v), a, start(a)))
findnext(p, a, i) = first(findeach(p, a, i))
findfirst(p, a) = first(findeach(p, a, start(a)))
findprev(a, i) = first(findeach(!iszero, a, i, Reverse))
findlast(a) = first(findeach(!iszero, a, endof(a), Reverse))
findprev(a, v, i) = first(findeach(equalto(v), a, i, Reverse))
findlast(a, v) = first(findeach(equalto(v), a, endof(a), Reverse))
findprev(p, a, i) = first(findeach(p, a, i, Reverse))
findlast(p, a) = first(findeach(p, a, endof(a), Reverse))
find(p, a) = collect(findeach(p, a))
find(a) = collect(findeach(!iszero, a))
findin(a, b) = collect(findeach(occursin(b), a))
search(str, chars[, start]) = first(findeach(occursin(chars), str[, start]))
search(str, substr[, start]) = first(findeach(substr, str[, start]))
match(regex, str[, start]) = first(findeach(regex, str[, start]))
matchall(regex, str) = [ m.match for m in findeach(regex, str) ]
eachmatch(regex, str) = findeach(regex, str)
Reverse order could be indicated several ways: a boolean, a keyword argument, a Reverse
object, etc. I'm flexible on that point.
At first, the version that takes an index argument can only be provided for arrays and strings. For other iterators, you can approximate the behavior by passing a rest
iterator.
A slightly controversial choice I made above was to pass a sequence instead of a predicate to search for a substring. Hopefully there aren't any cases where a sequence can be confused with a predicate.
One way to expand the search order elegantly would be to instead give give a stride size. With this scheme, 1 would be forward, -1 would be backward, but you could also specify any arbitrary stride. (Note, idea shamelessly stolen from python's range creation)
@JeffBezanson I think findeach
is compatibly with what I proposed in the Search & Find Julep. Your findeach
function corresponds to search
in Proposal 1 and to find
in Proposal 2, right? Can we base the discussion on these proposals (and modify them as needed)? There are many parameters to take into account at the same time to propose a complete plan.
Yes, it's similar. To complete the picture a bit, I would add findeach
and then deprecate findin
, search
, rsearch
, match
, eachmatch
, and functions that search for a given value (using equalto(x)
instead), and functions that search for nonzeros (as we did for countnz
). I would keep findfirst
and findlast
because they're convenient (you don't need to pass a start index).
findmin
, findmax
, searchsorted*
, indexin
, indmin
, and indmax
I think can be handled separately, or just left alone.
That leaves searchindex
, which I believe is just an optimization of search
. It might not be needed any more; we should look into that.
Would you make a PR against the Julep to add a Proposal 3 with a similar table as for the two others? That really helps seeing the consistency of the plan and identifying what needs to be deprecated/added.
OK, I'll do that.
A thought, can we define in(c)
instead of occursin(c)
? Currying arguments like this has a long and distinguished pedigree, and it seems a lot nicer to write findeach(in(c), x)
.
For that matter, we could define ==(x)
instead of equalsto(x)
.
Yes, curried versions of ==
and isequal
would make sense. in
is more awkward in that we'd want to curry on the second argument, not the first.
We could have a curried ∋
operator, although that does require Unicode input.
Or we could use findeach(_ in c, x)
ala #5571.
This would be great to have for 1.0, but seems like we need an assignee that will see it through by code freeze.
Actually, I have a PR locally which merges search*
functions with find*
. I'll try to push it soon.
Yes, curried versions of == and isequal would make sense. in is more awkward in that we'd want to curry on the second argument, not the first.
I am not too familiar with the exact definition of the term currying, but aren't other useful cases also on the second argument. For ==
there is no distinction of course, but suppose we also want find(<(x), collection)
?
Temporary fix if we don't get to this in time for feature freeze: move all these functions into a Search
stdlib package and then fix it with a major version bump of that package in the future.
PR link? (Oh, perhaps it's not pushed yet.)
Status update which covers everything in the Proposal 3 of the Search and Find Julep and in related discussion points:
search
and rsearch
. The only question that remains to be decided is whether it's OK for findnext(::String, ::String)
and similar to return a range. searchindex
and rsearchindex
are deprecated in favor of first(findnext(...))
and first(findprev(...))
.ismatch
into contains
. match
and eachmatch
can be kept as-is, since they return a special RegexMatch
object rather than indices (see also issue https://github.com/JuliaLang/julia/issues/19250 and previous PR https://github.com/JuliaLang/julia/pull/18028).findin(a, b)
in favor of find(occursin(b), a)
. See also issue https://github.com/JuliaLang/julia/issues/24967.find
to return the same index type as pairs
/keys
and makes it work with any collection. It's surprisingly simple, though it's waiting for a Nanosoldier run (which may prompt some adjustments to the AbstractArray{Bool}
method).findfirst
, findlast
findnext
and findprev
to accept/return the same type of index as find
since https://github.com/JuliaLang/julia/pull/24774.findnext
, findprev
, findfirst
and findlast
to return nothing
rather than 0
when there is no match, in order to support arrays with custom indices and arbitrary collections. See also https://github.com/JuliaLang/Juleps/issues/47.find
to findall
so that its purpose is more explicit.findn(x)
in favor of find(!iszero, x)
now that find
returns cartesian indices.findnz
to the new SparseArrays stdlib module (see also issue https://github.com/JuliaLang/julia/issues/24910).findfirst
and findlast
return cartesian indices for HasShape
iterators, for consistency with arrays.findmin
and findmax
so that their names are more consistent with find*
functions returning indices rather than (index, value)
tuples.findeach
function proposed by @JeffBezanson in the Julep can be introduced after 1.0, and findnext
/findprev
/findfirst
/findlast
be made simple convenience functions around it.searchsorted*
functions to findsorted*
and reorders their arguments for consistency with other find*
functions (see also issue https://github.com/JuliaLang/julia/issues/24883).Even if the API implemented in the above PRs is much more consistent than the previous one, I wonder a few more changes wouldn't make it even better. I'm now tempted to say that an ideal design would involve renaming find
to findall
(which is more explicit), and replace findfirst(needle, haystack)
with find(needle, haystack)
, findnext(needle, haystack, i)
with find(needle, haystack, i)
, findlast(needle, haystack)
with find(needle, haystack; rev=true)
and find(needle, haystack, i; rev=true)
.
The two potential issues with this proposal are
find
change cannot happen in a single deprecation cycle, at least for the two-argument version. But we could require using the three-argument form in 0.7, given that you just need to pass the first/last index of the haystack
explicitly.rev
. If not, we can still make it a positional argument.Opinions?
I like the idea of renaming find
to findall
, but I also like the clarity of findfirst
.
findall
is beautifully descriptive :).
I think the set findall
, findfirst
, findnext
, findlast
and findprev
is nicely explicit. Sure, there's some overlap but it kind of mirrors what we now have for string indices.
@JeffBezanson I don't think so, see the check list in my comment above. In particular there's still #24774 and JuliaLang/Juleps#47.
We can also rename find
to findall
since that proposal got a lot of support above (but that's really easy).
Just wanted to say that despite the fact that I barely interacted with @nalimilan's work here, the new names seem so compelling that when I recently had a 0.6-based project, I could barely remember the old way of doing this stuff and had to look at the docs several times ("right, search
, that's what I was looking for..."). A clear sign of success.
Hi everyone! I'm not sure if anyone will see this but I am having some trouble finding the indices from a user inputted string.
I have a string of ID's I am interested in finding:
ppl_id = ["T2DG0200031", "T2DG0200032", "T2DG0200033"]
and want to search the second column of a matrix which has all ID's ( of type Array{AbstractString,2}) for the indices that match the strings in ppl_id.
Any ideas...? It seems I can't get this to work with find() or findeach() or what you guys have mentioned above..
Questions are better suited for the forum: https://discourse.julialang.org/
Most helpful comment
Just wanted to say that despite the fact that I barely interacted with @nalimilan's work here, the new names seem so compelling that when I recently had a 0.6-based project, I could barely remember the old way of doing this stuff and had to look at the docs several times ("right,
search
, that's what I was looking for..."). A clear sign of success.