Hello all
Currently it is possible in Julia:
x = [1 1 3 4 1]
findall(x->x==1,vec(x))
3-element Array{Int64,1}:
1
2
5
If I use a string array:, it will still work:
x = ["1" "1" "3" "4" "1"]
findall(x->x=="1",vec(x))
3-element Array{Int64,1}:
1
2
5
But if I instead have one complete string ie;
x = "11341"
findall(x->x=="1",vec(x)) (Doesn't work will give an error)
There doesn't seem to be any easy function/way to go about this, so I am "missing" a findall feature, instead of having to split a string into substrings/chars.
Inspired by these posts here (made by me earlier on Discourse):
https://discourse.julialang.org/t/find-index-of-all-occurences-of-a-string/23044
https://discourse.julialang.org/t/suggestion-findall-to-work-on-strings/23143
Kind regards
Since we have findfirst(substring, string) and findnext(substring, string, i) returning UnitRange{Int}, it would be natural to have findall(substring, string). A natural definition seems like it could be something like:
function findall(t::Union{AbstractString,Regex}, s::AbstractString)
found = UnitRange{Int}[]
i = firstindex(s)
while true
r = findnext(t, s, i)
isnothing(r) && return found
push!(found, r)
i = nextind(s, last(r))
end
end
which gives output like
julia> findall("ing", "Spinning laughing dancing")
3-element Array{UnitRange{Int64},1}:
6:8
15:17
23:25
julia> findall(r"\w+", "Spinning laughing dancing")
3-element Array{UnitRange{Int64},1}:
1:8
10:17
19:25
Note that my implementation above does not terminate for findall("", "foo") and similar (cases where isempty(r) is true)— it's not entirely clear to me what the desired result is in such cases?
To handle the empty-range case, maybe :
function findall(t::Union{AbstractString,Regex}, s::AbstractString)
found = UnitRange{Int}[]
i = firstindex(s)
while true
r = findnext(t, s, i)
isnothing(r) && return found
push!(found, r)
j = isempty(r) ? first(r) : last(r)
j > lastindex(s) && return found
@inbounds i = nextind(s, j)
end
end
which gives e.g.
julia> findall("", "foo")
4-element Array{UnitRange{Int64},1}:
1:0
2:1
3:2
4:3
julia> findall(r"\w*", "Spinning laughing dancing")
6-element Array{UnitRange{Int64},1}:
1:8
9:8
10:17
18:17
19:25
26:25
It's also not clear to me what the desired behavior is for overlapping substrings? Do we want to find only disjoint substrings (the code above), or do we want all substrings even if they are overlapping?
For the latter behavior, just change my code above to j = first(r). In this case, you get e.g.:
julia> findall(r"\w*", "Spinning laughing dancing")
26-element Array{UnitRange{Int64},1}:
1:8
2:8
3:8
4:8
5:8
6:8
7:8
8:8
9:8
10:17
11:17
12:17
13:17
14:17
15:17
16:17
17:17
18:17
19:25
20:25
21:25
22:25
23:25
24:25
25:25
26:25
I can imagine both possibilities being useful. Maybe a disjoint::Bool=true keyword?
eachmatch(::Regex, ::AbstractString) has a overlap::Bool=false kwarg, so seems natural to reuse that here too.
Personally, I like @fredrikekre suggestion with reusing the overlap::Bool=false, to make it more useable, but personally for your findall("", "foo") @stevengj , I would prefer empty result (not an error, just "nothing"), since it would make more sense for an user than having all indices of each string as in your example here:
julia> findall("", "foo")
4-element Array{UnitRange{Int64},1}:
1:0
2:1
3:2
4:3
So maybe instead just an empty UnitRange array?
Thirdly, I just wondered why would you sometimes get ranges from 1:8 and then 9:8, is it also reading the string backwards? Maybe this should be a optional kwarg as well.
Just suggestions from a users point of view.
Kind regards
@AhmedSalih3d, a range like 9:8 is a standard way in Julia to specify an empty range "starting" at 9, i.e. the range "between " indices 8 and 9. Empty ranges like 9:8 are a valid answer above because r"\w*" matches zero-or-more "word" characters. Another example is:
julia> findnext(r"\b", "foo bar", 2)
4:3
because the first word boundary after position 2 is between indices 4 and 3 in this string.
Although the result my code returns for findall("", "foo") may seem weird, I think it is consistent with this, whereas returning an empty array would not be — if you ask a weird question, you get a weird answer.
So, in summary, it seems that we want:
function findall(t::Union{AbstractString,Regex}, s::AbstractString; overlap::Bool=false)
found = UnitRange{Int}[]
i, e = firstindex(s), lastindex(s)
while true
r = findnext(t, s, i)
isnothing(r) && return found
push!(found, r)
j = overlap || isempty(r) ? first(r) : last(r)
j > e && return found
@inbounds i = nextind(s, j)
end
end
or equivalent?
Anyone who wants to put together a PR with some tests and documentation is welcome to grab this code.
Yes, that looks correct to me.
Most helpful comment
eachmatch(::Regex, ::AbstractString)has aoverlap::Bool=falsekwarg, so seems natural to reuse that here too.