Julia: make sure string function index bounds are correctly documented

Created on 9 Jan 2018  Â·  22Comments  Â·  Source: JuliaLang/julia

Most helpful comment

I do not know if it is the right place too keep track of this, but somewhere in the documentation we should note that length is not additive for Strings, e.g.:

julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")

julia> length(a), length(b)
(1, 1)

julia> a*b, length(a*b)
("∀", 1)

as this is a natural property people could assume in their code.

All 22 comments

The same with nextind and prevind.

In particular the part of docs of nextind:

If i is out of bounds in s return i + 1.

and related part of docs of prevind is not working as documented, e.g.:

julia> nextind("x", 2)
ERROR: BoundsError: attempt to access "x"
  at index [2]
Stacktrace:
 [1] nextind(::String, ::Int64) at .\strings\string.jl:121
 [2] top-level scope

julia> nextind("x", -1)
ERROR: BoundsError: attempt to access "x"
  at index [-1]
Stacktrace:
 [1] nextind(::String, ::Int64) at .\strings\string.jl:121
 [2] top-level scope

Also see https://github.com/JuliaLang/julia/pull/24255 (I did not update that PR because I do not know what is the expected behavior).

The docs are what I initially thought the bounds should be, what they actually ended up being is:

  • thisind(s, i): 0 ≤ i ≤ ncodeunits(s)+1
  • nextind(s, i): 0 ≤ i < ncodeunits(s)+1
  • prevind(s, i): 0 < i ≤ ncodeunits(s)+1

Reasoning: it's useful and common to want to pass a just-out-of-bounds index and it makes sense to do so as long as the operation isn't going to go even more out of bounds. So the full in-bounds-or-just-out-of-bounds range of indices is ok for thisind since it never goes further out of bounds than it already is. For nextind you can only pass in the lower just-out-of-bounds value and move up. Similarly, for prevind you can only pass in the upper just-out-of-bounds value and move down. If you limit these bounds any further a lot of generic code is forced to add annoying and pointless boundary condition handling cases whereas with these bounds code works "just works" generically and correctly. In other words, this is a strict as we can empirically make these bounds.

Thanks. The only missing piece of specification is how:

  • nextind(s,i,n) should work when i+n>ncodeunits(s)+1
  • prevind(s,i,n) should work when i+n<0

This is actually what is stopping me with #24255. Or maybe it is unspecified (i.e. it should return any value that is not within bounds of the string - then the current implementation is OK - only performance should be improved)

Does it matter? You can't do any computations with those out-of-bounds indices anyway. The only thing you can really do with such an index is check if it is out of bounds or not and compare it to other indices. So I think the only constraints need to be that the functions are consistent when called with the same string object and that they should be strictly monotonic in n.

The issue is about notion of:

the same string object

If we define it as identical under === then current implementation is OK I think.

If we define it as two strings identical under == and of the same type then we have the following problem:

julia> nextind(chop("∀∀"), 1, 2)
7

julia> nextind(chop("∀a"), 1, 2)
5

strings chop("∀a") and chop("∀∀") have the same type and are identical under ==. They are not egal (===).

It is OK for me to leave it as is, but I want to have a clear documenation. The reason is that people develop custom AbstractStrings and we need to be precise. Otherwise we will have discussions what is a bug and what is an implementation detail (this could also affect what we see as breaking change in the future).

And PR https://github.com/JuliaLang/julia/pull/25525 shows that it might matter (I have spotted that problem because of inconsistent behavior of nextind 😄).

chop("∀a") === chop("∀∀") is false though.

I would be fine with defining imaginary out-of-bounds characters to all be exactly 1 code unit. The SubString type would need to be updated not to just pass through to its underlying string object.

All you write is exactly what I want to say.

I think (I will check it to be 100% sure) that we can simply remove methods nextind/prevind(s::SubString, i, n) from substring.jl and make them fall back to generic methods without any penalty right now and it should be consistent with:

I would be fine with defining imaginary out-of-bounds characters to all be exactly 1 code unit.

Currently it should not have a performance penalty I think because nextind/prevind(s, i, n) are generic anyway.

The only issue is that in the future probably those generic methods should be specialized for String type to improve performance (e.g to call next in nextind once we know we are at a valid index which should be faster than calling isvalid on each index) and then also additional SubString{String} methods could be implemented in a similar fashion also.

So In short - I believe there is a simple way to make nextind/prevind for SubString behave now as you have proposed and in the future it is also possible to keep this contract.

I could make an appropriate PR showing this approach - should I (I am asking because you are assigned to this issue).

The only issue is that in the future probably those generic methods should be specialized for String type to improve performance

We can add SubString method specializations at that point.

I could make an appropriate PR showing this approach - should I?

Please do! Assignment just means I'm responsible to make sure it gets done, not that I have to actually do. Thank you!

Apart from PRs I have made going back to the code of this issue I want to make sure this is intended:

julia> nextind("1", 0, 0)
0

julia> prevind("1", 2, 0)
2

and also reverseind requires update of the documentation as:

julia> reverseind("1", 0)
2

julia> reverseind("1", 2)
0

which is not covered by the contract given in the docs (I do not know if it was intended or not).

Those three cases make me realize that actually those the only three places in the whole Base where thisind can be called with 0 or ncodeunits(n)+1 value.

I fully agree with:

nextind(s, i): 0 ≤ i < ncodeunits(s)+1
prevind(s, i): 0 < i ≤ ncodeunits(s)+1

but then the question is in what situations it is good that thisind would accept 0 or ncodeunits(s)+1 as a second argument?

In Base those are only those three places I have indicated, and you may decide that:

  • this is a desired behavior (then we leave it as is);
  • actually we do not want this behavior (then restricting thisind(s, i) to 0 < i ≤ ncodeunits(s) is a natural choice);
  • we might want this behavior but decide to restrict thisind(s, i) to 0 < i ≤ ncodeunits(s) anyway and in those three places add some custom handling.

Fortunately thisind is a new function (introduced in Sep 2017 if I remember correctly) and its current range of accepted second argument is even fresher so I think we have some flexibility here.

but then the question is in what situations it is good that thisind would accept 0 or ncodeunits(s)+1 as a second argument?

You may want to try changing the bounds to 1 ≤ i ≤ ncodeunits(s) and see what breaks. It's quite a bit. In particular, I recall that it made the generic definition of reverseind break. One bit of insight into the problem is that for zero-length strings, that index range is empty so you can't actually call thisind at all on a zero length string, which seems fishy. Another issue is that one can (as reverseind does), do a computation that results in a just-out-of-bounds index, then do a transformation of that which should be within a character, and then want to find the start of that character. If you allow 0 and ncodeunits(s)+1 then this all just works. If you don't then you end up having a lot of special cases to deal with specially.

Thanks. Actually this led me to finding some more bugs:

  1. https://github.com/JuliaLang/julia/pull/25531#issuecomment-357900106) - I will fix it there;
  2. https://github.com/JuliaLang/julia/issues/12906#issuecomment-357904875 - probably in the discussion there we will have a decision what to do.

I did the check what you ask and here are my conclusions:

  1. only definition of endof would have to be changed (but the additional check we would have to do here for ncodeunits(s)==0 would be removed from thisind implementation to this cancels out);
  2. reverseind should work without changes.

everything else should not be impacted (fortunately thisind is new and not much depends on it).

So in short the narrow thisind would mean:

  1. benefits:

    • we are sure that thisind returns a valid index if it does not throw an error;

    • we can always change the definition to wide approach (with 0 and ncodeunits(s)+1) in the future which will not be breaking;

    • shorter body of thisind which might simplify inlining it;

  2. negatives:

    • thisind for zero-length string will always throw an error as zero-length string has no valid index (now it accepts 0 and 1)

I am indifferent, i.e. I can live with both definitions given they are properly documented as both approaches have their benefits and costs.

@nalimilan - you were working with those find* functions I guess. Do you have any preference here?

In the initial PR I wasn't a fan of nextind accepting/returning out-of-bounds indices, but I've kind of lost track of all these discussions so I'm not sure anymore...

One additional benefit of wide thisind is that it is guaranteed not to throw an error on nextind(s,i) and prevind(s,i) result. Here the only problem is that this property does not hold for nextind(s,i,n) for n>1 as discussed above.

In short: @StefanKarpinski - I think we have all cards about thisind on the table. The difference is minimal. I go ahead with PRs using current definition of thisind (accepting 0 and ncodeunits(s)+1), but it is easy to change.

@StefanKarpinski: please just indicate your final preference as there are bugs whose fixing depends on the decision 😄.

I need some time to review all of these issues and changes. I'll be able to look at it some time today.

While we're here, why abbreviate index to ind ?
thisindex?
nextindex?
...

We are done with my PRs cleaning up the errors in the implementations of the related functions I could find. They now all work assuming thisind accepts 0 and ncodeunits(s)+1.

So we can leave it as is and start writing precise documentations the functions or reconsider range of accepted values in thisind.

@StefanKarpinski I know this is a busy period in Julia development but this change will be seriously breaking for strings ecosystem if we want to change it in the future. I can live with both definitions. You have worked on this more and longer so I leave it for you to decide 😄.

I do not know if it is the right place too keep track of this, but somewhere in the documentation we should note that length is not additive for Strings, e.g.:

julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")

julia> length(a), length(b)
(1, 1)

julia> a*b, length(a*b)
("∀", 1)

as this is a natural property people could assume in their code.

It is additive for valid UTF-8 strings, however, which could be noted at the same time.

Was this page helpful?
0 / 5 - 0 ratings