Julia: Improvement of the functions for handling string indexing

Created on 19 Sep 2017 · 26Comments · Source: JuliaLang/julia

Assuming that #22572 and #22511 go through tests and get accepted I have the following proposal.

My experience from making #22572 and #22511 go though CI show that I would expect many users to make errors in indexing strings using something like somestring[2:end-1]. The problem is that such a construct will pass typical tests as it is valid for ASCII characters but invalid in general.

Here are my thoughts how this risk could be minimized:

improve the presentation of the problem in https://docs.julialang.org/en/latest/manual/strings/#man-strings-1 (especially the part starting with You can perform arithmetic and other operations with end, just like a normal value: that actually suggests the wrong way to index strings - in my opinion - although it is valid); in particular describe nextind and prevind functions and the use of SubString type;
add additional argument count::Int to nextind and prevind that defaults to 1 and indicates the number of characters to move in nextind/prevind (such cases happen in the code);
add additional argument count::Int to chop that defaults to 1 giving number of characters to remove at the end of the string;
add a function lchop(s::AbstractString, count:Int) = SubString(s, nextind(s, 1, count), endof(s)) (I am not sure about a name) that is identical to chop but removes characters from the beginning of the string.

This list would cover the most common cases I have encountered. I do not make a PR here because #22572 and #22511 have to be merged first and maybe there would be already some comments regarding the proposal.

strings

Source

bkamins

👍3

Most helpful comment

I would stay with first and last.
After rethinking it head and tail would be new in base. People probably already have used such natural names in their code and this could lead to some problems (e.g. method conflict or piracy).

Last thing to decide is what those functions should return. I opt for:

first and last return String (as probably most of the string is discarded)
chop returns SubString (as probably most of the string is retained)

bkamins on 21 Sep 2017

👍2

All 26 comments

Makes sense to me, feel free to file separate PRs (unless somebody disagrees). However I'm less convinced about lchop since it wouldn't be symmetric with chop. Deprecating chop in favor of rchop could make sense, but no language seems to do that except OCaml's ExtLib. Relevant Rosetta page: https://www.rosettacode.org/wiki/Substring/Top_and_tail (with a broken example for Julia, FWIW :-).

See also the radical approach at https://github.com/JuliaLang/julia/issues/9297 (which is unlikely to be implemented in 1.0, and maybe ever).

nalimilan on 19 Sep 2017

I have not been aware of #9297 and have written something similar independently for testing purposes :).
I believe we need lchop but I was not sure about the name either. Another solutions I was thinking about were:

adding third parameter tail::Bool=true;
stating that when count<0 we chop -count characters from the head;
make a definition chop(s::AbstractString, tail::Int=1, head::Int=0) allowing to specify number of characters to chop from the tail and from the head at the same time.

bkamins on 19 Sep 2017

Having reviewed the code base for common cases of what portions of strings are extracted the following patterns appear (along with what I propose to implement):

get first nchar characters from a string:

extend first function for AbstractString with additional argument nchar with default value equal to 1;

get last nchar characters from a string:
extend last function for AbstractString with additional argument nchar with default value equal to 1;

chop last tail characters from a string or chop first head characters from a string:
extend chop function to the following signature chop(s::AbstractString, head::Integer=0, tail::Integer=1), where head and tail give number of characters to chop from head and tail;

get byte index of a character nchar characters to the left from a given byte index:
extend prevind function with nchar argument with default value equal to 1;

get byte index of a character nchar characters to the right from a given byte index:
extend nextind function with nchar argument with default value equal to 1;

If this would be acceptable I can push an appropriate PR.

bkamins on 21 Sep 2017

👍1

Makes sense to me. first(x, n) would even be useful for any collection (DataArrays used to provide a head function for that inspired from R).

nalimilan on 21 Sep 2017

Now I see that there is a problem with the reuse of first and last as their return type is eltype(x). So for strings first("abc") is 'a' not "a". Maybe introduce head(x,n) and tail(x,n)?
However, I am not sure if they are not used in some standard package.

bkamins on 21 Sep 2017

Good catch, but maybe that's not an issue: the one-argument method has to return an eltype(x) object, but the two-argument method has to return a collection since the number of requested elements could be one or several.

nalimilan on 21 Sep 2017

bkamins on 21 Sep 2017

I have decided to partition this issue into several PRs to make it more manageable #23805 is the first (fundamental low level) part.

bkamins on 21 Sep 2017

head(x, n) and tail(x, n) may be better names for first(x, n) and last(x, n) since otherwise there's possible confusion that first(x) and first(x, 1) don't return the same thing.

StefanKarpinski on 21 Sep 2017

👍1

@StefanKarpinski Is the confusion really problematic though? I'm not sure what's our policy regarding the difference between one- and two-argument methods of the same function. In some sense, it's similar to getindex returning either a collection or a scalar depending on the type of the index; but of course the difference here is that the return type would depend on the number of arguments rather than on their types.

Anyway, I'd be fine with head and tail too.

nalimilan on 21 Sep 2017

Might be fine and not confusing at all. Just mentioning it as a consideration.

StefanKarpinski on 21 Sep 2017

Last thing to decide is what those functions should return. I opt for:

first and last return String (as probably most of the string is discarded)
chop returns SubString (as probably most of the string is retained)

bkamins on 21 Sep 2017

👍2

I would recommend join(Iterators.take(string, n)) to avoid introducing new verbs. last might make some sense, however.

(I guess there's nothing wrong with first either; it would be nice if join(take(...)) would share the same implementation though.)

TotalVerb on 22 Sep 2017

@TotalVerb Agreed with new verbs. That is why I want to go with first and last.
As for performance join/take approach is way slower unfortunately and does not perform a check for end of string. Here is the benchmark:

Here goes the definition of first (I have not pushed it yet - I am waiting for some other PRs to merge):

function first(s::AbstractString, nchar::Integer)
    nchar == 1 && return s[1:1]
    s[1:nextind(s, 1, nchar-1)]
end

and the tests:

julia> x = randstring(1000)

julia> @benchmark join(Iterators.take(x, 1))
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  7
  --------------
  minimum time:     1.493 μs (0.00% GC)
  median time:      1.586 μs (0.00% GC)
  mean time:        1.705 μs (1.41% GC)
  maximum time:     247.402 μs (96.89% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark first(x, 1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     42.454 ns (0.00% GC)
  median time:      44.320 ns (0.00% GC)
  mean time:        49.177 ns (5.29% GC)
  maximum time:     2.157 μs (95.13% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark join(Iterators.take(x, 2))
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  7
  --------------
  minimum time:     1.633 μs (0.00% GC)
  median time:      1.679 μs (0.00% GC)
  mean time:        1.786 μs (1.37% GC)
  maximum time:     252.394 μs (96.88% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark first(x, 2)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     49.801 ns (0.00% GC)
  median time:      58.257 ns (0.00% GC)
  mean time:        60.764 ns (3.88% GC)
  maximum time:     2.035 μs (93.74% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark join(Iterators.take(x, 100))
BenchmarkTools.Trial:
  memory estimate:  656 bytes
  allocs estimate:  12
  --------------
  minimum time:     8.864 μs (0.00% GC)
  median time:      11.663 μs (0.00% GC)
  mean time:        11.916 μs (0.00% GC)
  maximum time:     76.045 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark first(x, 100)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     246.046 ns (0.00% GC)
  median time:      247.114 ns (0.00% GC)
  mean time:        262.899 ns (0.90% GC)
  maximum time:     1.539 μs (72.80% GC)
  --------------
  samples:          10000
  evals/sample:     438

bkamins on 22 Sep 2017

Reading it now I think that first(s, 10) and last(s, 10) do read very nicely. We should probably have corresponding functionality for all collections that can support it.

StefanKarpinski on 22 Sep 2017

👍1

Strings will need a separate implementation anyway (because of nextind/prevind). So in this thread I would leave it as is (unless it is really uncomplicated otherwise I would make a separate PR for this).

For general collections:

are there any collections except strings for which collection[start(collection):start(collection)+9] and collection[end-9:end] would not work?
is there a single abstract type for which it could be defined?

bkamins on 22 Sep 2017

Strings will need a separate implementation anyway

Sure, but the API can be generalized. Was not suggesting it should be done here.

StefanKarpinski on 22 Sep 2017

are there any collections except strings for which collection[start(collection):start(collection)+9] and collection[end-9:end] would not work?

Indexing and iteration are not the same at all so that would absolutely not work for general collections. The state object (which need not be a valid index) can be passed to next but not necessarily to getindex.

StefanKarpinski on 22 Sep 2017

Related to discussion in #23805 about thisind. This is what I would propose:

Introduce a new name for this functionality and not overload prevind/nextind with nchar=0 with it as I believe it would be confusing (I have explained why in #23805)
Decide if thisind is a proper name; for me a natural name is fixind as this is what we want to do (but thisind would be OK for me also if there is a general preference for this)
The signature for fixind(str::AbstractString, i::Integer, forward::Bool=false), the additional argument forward would say if we want to go forward if we are at invalid byte index (if true) or backward (if false), by default we go backward
The behavior would be:
if forward==true
- throw ArgumentError if i<1 || i>endof(str)
- otherwise return first valid index j such that j>=i
if forward==false
- throw ArgumentError if i<1 || i>=next(str, endof(str))[2]
- otherwise return first valid index j such that j<=i

bkamins on 23 Sep 2017

Prompted by #23880 (and earlier fixes I have made) a pattern that is commonly used is s[i:j-1] and now it raises error (in the past it was acceptable and was not ambiguous).
Now it can be written as s[i, prevind(s,j)] or chop(s[i:j]) but the former is clumsy and the later is a bit inefficient.

If there would be a decision that handling of a pattern select characters from i to j, excluding j is important enough getindex and SubString could be extended with an additional parameter chop::Bool=false which would make the range exclusive if it is set to true.

bkamins on 26 Sep 2017

getindex generally doesn't take arguments other than the indices, so adding chop would be unusual.

IMHO these things would be better handled via a StringIndex type (#9297). At the time, @StefanKarpinski was afraid it would be too verbose, but the current codebase is more verbose than before since i-1 now uses prevind anyway. Maybe we could revisit that approach. It would be interesting to see what the nalimilan/stringindex branch would look like on current master and with more convenience features (like accepting 1 instead of start(s)).

nalimilan on 27 Sep 2017

23960 implements `first` and `last` from this proposal.

bkamins on 2 Oct 2017

Improved chop is implemented in #24126.

bkamins on 13 Oct 2017

There is only one open point left in this issue - should thisind (or fixind) be reintroduced?
And if yes - what should be the functionality: is forward argument needed or only going to the beginning of the current character is enough.

@nalimilan I would discuss the issue of StringIndex in #9297 and close this issue when thisind point is decided.

bkamins on 21 Oct 2017

In #24414 I propose how the implementation of thisind could look like (and I have decided to stick with the original name). If that PR gets accepted or it is decided that thisind is not needed this issue can be closed.

I think that we should encourage users to use only valid indices, but thisind can come handy e.g. when UTF-8 strings are read in chunks as it was indicated in discussion in #23805.

bkamins on 31 Oct 2017

All done in this issue. Close?

bkamins on 30 Nov 2017

Was this page helpful?

0 / 5 - 0 ratings

Related issues

add special display for ≈ test failures

StefanKarpinski · 3Comments

The devdocs page for functions is outdated

ararslan · 3Comments

TypeVars in tuples in type parameters

iamed2 · 3Comments

Implicit "end" for file-wide modules.

manor · 3Comments

Getting Openssl error while building from source on ubuntu 14.04.3 cmake version 2.8.12.2

arshpreetsingh · 3Comments

Julia: Improvement of the functions for handling string indexing

Most helpful comment

All 26 comments

23960 implements first and last from this proposal.

Related issues

23960 implements `first` and `last` from this proposal.