Assuming that #22572 and #22511 go through tests and get accepted I have the following proposal.
My experience from making #22572 and #22511 go though CI show that I would expect many users to make errors in indexing strings using something like somestring[2:end-1]
. The problem is that such a construct will pass typical tests as it is valid for ASCII characters but invalid in general.
Here are my thoughts how this risk could be minimized:
nextind
and prevind
functions and the use of SubString
type;count::Int
to nextind
and prevind
that defaults to 1
and indicates the number of characters to move in nextind
/prevind
(such cases happen in the code);count::Int
to chop
that defaults to 1
giving number of characters to remove at the end of the string;lchop(s::AbstractString, count:Int) = SubString(s, nextind(s, 1, count), endof(s))
(I am not sure about a name) that is identical to chop
but removes characters from the beginning of the string.This list would cover the most common cases I have encountered. I do not make a PR here because #22572 and #22511 have to be merged first and maybe there would be already some comments regarding the proposal.
Makes sense to me, feel free to file separate PRs (unless somebody disagrees). However I'm less convinced about lchop
since it wouldn't be symmetric with chop
. Deprecating chop
in favor of rchop
could make sense, but no language seems to do that except OCaml's ExtLib. Relevant Rosetta page: https://www.rosettacode.org/wiki/Substring/Top_and_tail (with a broken example for Julia, FWIW :-).
See also the radical approach at https://github.com/JuliaLang/julia/issues/9297 (which is unlikely to be implemented in 1.0, and maybe ever).
I have not been aware of #9297 and have written something similar independently for testing purposes :).
I believe we need lchop
but I was not sure about the name either. Another solutions I was thinking about were:
tail::Bool=true
;count<0
we chop -count
characters from the head;chop(s::AbstractString, tail::Int=1, head::Int=0)
allowing to specify number of characters to chop from the tail and from the head at the same time.Having reviewed the code base for common cases of what portions of strings are extracted the following patterns appear (along with what I propose to implement):
nchar
characters from a string:extend
first
function forAbstractString
with additional argumentnchar
with default value equal to 1;
- get last
nchar
characters from a string:
extendlast
function forAbstractString
with additional argumentnchar
with default value equal to 1;- chop last
tail
characters from a string or chop firsthead
characters from a string:
extendchop
function to the following signaturechop(s::AbstractString, head::Integer=0, tail::Integer=1)
, wherehead
andtail
give number of characters to chop from head and tail;- get byte index of a character
nchar
characters to the left from a given byte index:
extendprevind
function withnchar
argument with default value equal to 1;- get byte index of a character
nchar
characters to the right from a given byte index:
extendnextind
function withnchar
argument with default value equal to 1;
If this would be acceptable I can push an appropriate PR.
Makes sense to me. first(x, n)
would even be useful for any collection (DataArrays used to provide a head
function for that inspired from R).
Now I see that there is a problem with the reuse of first
and last as their return type is eltype(x)
. So for strings first("abc")
is 'a'
not "a"
. Maybe introduce head(x,n)
and tail(x,n)
?
However, I am not sure if they are not used in some standard package.
Good catch, but maybe that's not an issue: the one-argument method has to return an eltype(x)
object, but the two-argument method has to return a collection since the number of requested elements could be one or several.
OK
I have decided to partition this issue into several PRs to make it more manageable #23805 is the first (fundamental low level) part.
head(x, n)
and tail(x, n)
may be better names for first(x, n)
and last(x, n)
since otherwise there's possible confusion that first(x)
and first(x, 1)
don't return the same thing.
@StefanKarpinski Is the confusion really problematic though? I'm not sure what's our policy regarding the difference between one- and two-argument methods of the same function. In some sense, it's similar to getindex
returning either a collection or a scalar depending on the type of the index; but of course the difference here is that the return type would depend on the number of arguments rather than on their types.
Anyway, I'd be fine with head
and tail
too.
Might be fine and not confusing at all. Just mentioning it as a consideration.
I would stay with first
and last
.
After rethinking it head
and tail
would be new in base. People probably already have used such natural names in their code and this could lead to some problems (e.g. method conflict or piracy).
Last thing to decide is what those functions should return. I opt for:
first
and last
return String
(as probably most of the string is discarded)chop
returns SubString
(as probably most of the string is retained)I would recommend join(Iterators.take(string, n))
to avoid introducing new verbs. last
might make some sense, however.
(I guess there's nothing wrong with first
either; it would be nice if join(take(...))
would share the same implementation though.)
@TotalVerb Agreed with new verbs. That is why I want to go with first
and last
.
As for performance join/take approach is way slower unfortunately and does not perform a check for end of string. Here is the benchmark:
Here goes the definition of first
(I have not pushed it yet - I am waiting for some other PRs to merge):
function first(s::AbstractString, nchar::Integer)
nchar == 1 && return s[1:1]
s[1:nextind(s, 1, nchar-1)]
end
and the tests:
julia> x = randstring(1000)
julia> @benchmark join(Iterators.take(x, 1))
BenchmarkTools.Trial:
memory estimate: 256 bytes
allocs estimate: 7
--------------
minimum time: 1.493 μs (0.00% GC)
median time: 1.586 μs (0.00% GC)
mean time: 1.705 μs (1.41% GC)
maximum time: 247.402 μs (96.89% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark first(x, 1)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 42.454 ns (0.00% GC)
median time: 44.320 ns (0.00% GC)
mean time: 49.177 ns (5.29% GC)
maximum time: 2.157 μs (95.13% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark join(Iterators.take(x, 2))
BenchmarkTools.Trial:
memory estimate: 256 bytes
allocs estimate: 7
--------------
minimum time: 1.633 μs (0.00% GC)
median time: 1.679 μs (0.00% GC)
mean time: 1.786 μs (1.37% GC)
maximum time: 252.394 μs (96.88% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark first(x, 2)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 49.801 ns (0.00% GC)
median time: 58.257 ns (0.00% GC)
mean time: 60.764 ns (3.88% GC)
maximum time: 2.035 μs (93.74% GC)
--------------
samples: 10000
evals/sample: 993
julia> @benchmark join(Iterators.take(x, 100))
BenchmarkTools.Trial:
memory estimate: 656 bytes
allocs estimate: 12
--------------
minimum time: 8.864 μs (0.00% GC)
median time: 11.663 μs (0.00% GC)
mean time: 11.916 μs (0.00% GC)
maximum time: 76.045 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark first(x, 100)
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 1
--------------
minimum time: 246.046 ns (0.00% GC)
median time: 247.114 ns (0.00% GC)
mean time: 262.899 ns (0.90% GC)
maximum time: 1.539 μs (72.80% GC)
--------------
samples: 10000
evals/sample: 438
Reading it now I think that first(s, 10)
and last(s, 10)
do read very nicely. We should probably have corresponding functionality for all collections that can support it.
Strings will need a separate implementation anyway (because of nextind
/prevind
). So in this thread I would leave it as is (unless it is really uncomplicated otherwise I would make a separate PR for this).
For general collections:
collection[start(collection):start(collection)+9]
and collection[end-9:end]
would not work?Strings will need a separate implementation anyway
Sure, but the API can be generalized. Was not suggesting it should be done here.
are there any collections except strings for which
collection[start(collection):start(collection)+9]
andcollection[end-9:end]
would not work?
Indexing and iteration are not the same at all so that would absolutely not work for general collections. The state object (which need not be a valid index) can be passed to next
but not necessarily to getindex
.
Related to discussion in #23805 about thisind
. This is what I would propose:
prevind
/nextind
with nchar=0
with it as I believe it would be confusing (I have explained why in #23805)thisind
is a proper name; for me a natural name is fixind
as this is what we want to do (but thisind
would be OK for me also if there is a general preference for this)fixind(str::AbstractString, i::Integer, forward::Bool=false)
, the additional argument forward
would say if we want to go forward if we are at invalid byte index (if true
) or backward (if false
), by default we go backwardforward==true
ArgumentError
if i<1 || i>endof(str)
j
such that j>=i
forward==false
ArgumentError
if i<1 || i>=next(str, endof(str))[2]
j
such that j<=i
Prompted by #23880 (and earlier fixes I have made) a pattern that is commonly used is s[i:j-1]
and now it raises error (in the past it was acceptable and was not ambiguous).
Now it can be written as s[i, prevind(s,j)]
or chop(s[i:j])
but the former is clumsy and the later is a bit inefficient.
If there would be a decision that handling of a pattern select characters from i to j, excluding j is important enough getindex
and SubString
could be extended with an additional parameter chop::Bool=false
which would make the range exclusive if it is set to true
.
getindex
generally doesn't take arguments other than the indices, so adding chop
would be unusual.
IMHO these things would be better handled via a StringIndex
type (#9297). At the time, @StefanKarpinski was afraid it would be too verbose, but the current codebase is more verbose than before since i-1
now uses prevind
anyway. Maybe we could revisit that approach. It would be interesting to see what the nalimilan/stringindex
branch would look like on current master and with more convenience features (like accepting 1
instead of start(s)
).
first
and last
from this proposal.Improved chop
is implemented in #24126.
There is only one open point left in this issue - should thisind
(or fixind
) be reintroduced?
And if yes - what should be the functionality: is forward
argument needed or only going to the beginning of the current character is enough.
@nalimilan I would discuss the issue of StringIndex
in #9297 and close this issue when thisind
point is decided.
In #24414 I propose how the implementation of thisind
could look like (and I have decided to stick with the original name). If that PR gets accepted or it is decided that thisind
is not needed this issue can be closed.
I think that we should encourage users to use only valid indices, but thisind
can come handy e.g. when UTF-8 strings are read in chunks as it was indicated in discussion in #23805.
All done in this issue. Close?
Most helpful comment
I would stay with
first
andlast
.After rethinking it
head
andtail
would be new in base. People probably already have used such natural names in their code and this could lead to some problems (e.g. method conflict or piracy).Last thing to decide is what those functions should return. I opt for:
first
andlast
returnString
(as probably most of the string is discarded)chop
returnsSubString
(as probably most of the string is retained)