Julia: unicode whitespace not recognized by lstrip()

Created on 23 May 2018  ·  10 Comments  ·  Source: JuliaLang/julia

Consider:

julia> a = " \U02009 foo"
"   foo"

julia> lstrip(a)
"  foo"

Note that \U02009 is the Unicode character THIN SPACE (U+2009), which lstrip() leaves in place.

I don't know much about string processing or unicode, but it seems to me that lstrip() is doing something that's perhaps too naive:

const _default_delims = [' ','\t','\n','\v','\f','\r']
...
function lstrip(s::AbstractString, chars::Chars=_default_delims)

I realize you can override chars, but I would suggest expanding the default character set for string trimming to include the Unicode space separators (ref https://www.fileformat.info/info/unicode/category/Zs/list.htm).
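
As a stopgap, the extra separators can be passed explicitly through chars. This is a minimal sketch, assuming a Julia version where lstrip(s, chars) accepts a vector of characters. The extra_spaces list is an illustrative subset of category Zs, not a complete one, and the names extra_spaces and delims are made up for this example:

```julia
# Illustrative subset of Unicode space separators (category Zs); not exhaustive.
extra_spaces = ['\u00a0', '\u2002', '\u2003', '\u2009', '\u3000']

# The ASCII defaults quoted above, plus the extras.
delims = vcat([' ', '\t', '\n', '\v', '\f', '\r'], extra_spaces)

a = " \u2009 foo"
stripped = lstrip(a, delims)   # the leading ' ', U+2009 and ' ' all match delims
println(repr(stripped))
```

The drawback is the one raised in this issue: the caller has to enumerate and maintain the list by hand.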

strings unicode

All 10 comments

What do other languages do here?

Note that the other chars ('\t','\n','\v','\f','\r') are control characters.

Ruby and Crystal leave it; Python and Rust strip it.

Stripping unicode whitespace by default seems reasonable to me.

Related idea: Define lstrip and friends to take a function argument, where the function is used to determine what should be stripped, i.e.

lstrip(f, s::AbstractString) = # ...
lstrip(s::AbstractString) = lstrip(c->isspace(c) || c in _default_delims, s)

It seems like this would allow us to more concisely express what should be skipped, rather than adding all Unicode space characters to _default_delims. Also, as an example in this hypothetical universe, lstrip("~~~1", '~') would instead be written lstrip(==('~'), "~~~1").
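
A rough sketch of what the predicate form could look like. The name lstrip_by is hypothetical, chosen so that it does not clash with Base.lstrip, and this is not necessarily how Base would implement it:

```julia
# Hypothetical helper, named lstrip_by to avoid clashing with Base.lstrip.
# Drops leading characters for as long as pred(c) is true.
function lstrip_by(pred, s::AbstractString)
    for (i, c) in pairs(s)
        # First character that should be kept: return everything from here on.
        pred(c) || return s[i:end]
    end
    return ""   # every character matched the predicate
end

lstrip_by(isspace, " \u2009 foo")   # Unicode-aware whitespace stripping
lstrip_by(==('~'), "~~~1")          # the '~' example from the comment above
```

With isspace as the predicate, the thin space from the original report is stripped without listing it anywhere.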

This is essentially equivalent to overriding chars, though, isn’t it? If that’s the case then we might as well just leave it as-is (but I’m still in support of adding the Unicode chars to _default_delims).

Yeah, but you don't need to enumerate all Unicode whitespace chars in _default_delims.

The existing _default_delims would already be covered by isspace, so it could just be:

lstrip(s::AbstractString) = lstrip(isspace, s::AbstractString)
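
The claim is easy to check at the REPL. A small sketch; default_delims here is a local copy of the list quoted in the issue, not the Base constant:

```julia
# Local copy of the default delimiter list quoted in the issue.
default_delims = [' ', '\t', '\n', '\v', '\f', '\r']

# Every default delimiter satisfies isspace ...
covers_defaults = all(isspace, default_delims)

# ... and so does THIN SPACE, which the list does not contain.
thin_space_is_space = isspace('\u2009')
thin_space_in_defaults = '\u2009' in default_delims

println((covers_defaults, thin_space_is_space, thin_space_in_defaults))
```

This should print (true, true, false): isspace is a superset of the current defaults and also covers the character from the report.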

In summary, our choices are:

  1. do nothing
  2. add all unicode spaces to _default_delims

    • complicated to keep up-to-date, and will be inefficient (since it won't be able to use range checks or short-circuiting)
  3. have a non-exported way to allow predicates for just isspace

    • seems silly, but if we can't come to an agreement, is at least better than 1.
  4. allow predicates as a second argument (#27309)

    • non-breaking
    • straightforward to document
    • violates general style rule of having function arg as first argument
    • we do violate it in other places (e.g. split/rsplit).
  5. move chars to first argument

    • breaking
    • ordering to me seems somewhat odd
  6. allow only predicate function, move to first (#27232)

    • general consensus was against
  7. allow predicate as a first argument, keep other chars as second argument (take commits from #27232 up to d7ae0748f8e2bcafba93192eb165b1ef2fd1783e)

    • non-breaking
    • a little weird that the argument order changes.

Given that it is a fairly minor function, my vague preference is option 4.

  8. 4 + 7: allow predicates as first or second argument; allow non-predicate chars as second only.

A big :-1: to 4, :+1: to 7.
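
For concreteness, the "predicate first, plain chars second" layout could be dispatched roughly like this. It is only a sketch: strip_left and CharSet are hypothetical names used so that Base.lstrip is left untouched, and CharSet is a simplified stand-in for the chars types Base accepts:

```julia
# Hypothetical names: strip_left and CharSet are for illustration only.
const CharSet = Union{AbstractChar, AbstractVector{<:AbstractChar}}

# Predicate goes first, following the usual function-argument-first style.
function strip_left(pred, s::AbstractString)
    for (i, c) in pairs(s)
        pred(c) || return s[i:end]
    end
    return ""
end

# Non-predicate chars stay in second position, as in the existing signature.
strip_left(s::AbstractString, chars::CharSet) = strip_left(in(chars), s)

# Default: Unicode-aware whitespace via isspace.
strip_left(s::AbstractString) = strip_left(isspace, s)
```

Because the string and the character set have distinct types, the two two-argument methods do not collide, so existing calls such as strip_left("~~~1", '~') keep working alongside strip_left(==('~'), "~~~1").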
