Julia: unicode whitespace not recognized by lstrip()

Created on 23 May 2018 · 10 Comments · Source: JuliaLang/julia

Consider:

julia> a = " \U02009 foo"
"   foo"

julia> lstrip(a)
"  foo"

Note that \U02009 (U+2009) is the Unicode character 'THIN SPACE'.

I don't know much about string processing or unicode, but it seems to me that lstrip() is doing something that's perhaps too naive:

const _default_delims = [' ','\t','\n','\v','\f','\r']
...
function lstrip(s::AbstractString, chars::Chars=_default_delims)

I realize you can override chars, but I would suggest that we expand the default character set for string trimming to include Unicode separators (ref https://www.fileformat.info/info/unicode/category/Zs/list.htm).

strings unicode

sbromberger
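A minimal sketch of the behaviour described in the report, assuming the delimiter list quoted above. The names `default_delims`, `thin_in_defaults` and `stripped` are illustrative only. It checks that THIN SPACE is not in the hard-coded list and shows the workaround the report mentions, passing an explicit character set to `lstrip`:

```julia
# The hard-coded delimiter list quoted in the issue.
default_delims = [' ', '\t', '\n', '\v', '\f', '\r']

a = " \u2009 foo"      # space, U+2009 THIN SPACE, space, then "foo"

# THIN SPACE is not in the default list, so a membership test fails.
thin_in_defaults = '\u2009' in default_delims

# Workaround: pass an explicit character set as the second argument.
stripped = lstrip(a, [' ', '\u2009'])

println(thin_in_defaults)  # false
println(stripped)          # foo
```

The workaround only helps callers who already know which Unicode spaces to expect, which is why the report asks for a broader default.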

All 10 comments

What do other languages do here?

Note that the other chars ('\t','\n','\v','\f','\r') are control characters.

simonbyrne on 23 May 2018

Ruby and Crystal leave it; Python and Rust strip it.

ararslan on 23 May 2018

Stripping unicode whitespace by default seems reasonable to me.

StefanKarpinski on 23 May 2018

👍1

Related idea: Define lstrip and friends to take a function argument, where the function is used to determine what should be stripped, i.e.

lstrip(f, s::AbstractString) = # ...
lstrip(s::AbstractString) = lstrip(c->isspace(c) || c in _default_delims, s)

It seems like this would allow us to more concisely express what should be skipped, rather than adding all Unicode space characters to _default_delims. Also, as an example in this hypothetical universe, lstrip("~~~1", '~') would instead be written lstrip(==('~'), "~~~1").

ararslan on 23 May 2018

👍1
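The predicate-first idea in the comment above could be sketched roughly as follows. `lstrip_pred` is a hypothetical name chosen so as not to extend `Base.lstrip`; it is an illustration under that assumption, not the implementation that was proposed or merged:

```julia
# Hypothetical helper (not Base API): drop leading characters for which
# the predicate `f` returns true.
function lstrip_pred(f, s::AbstractString)
    for (i, c) in pairs(s)
        # Stop at the first character that should be kept.
        f(c) || return SubString(s, i)
    end
    # Every character matched the predicate, so nothing is left.
    return SubString(s, 1, 0)
end

println(lstrip_pred(isspace, " \u2009 foo"))   # foo
println(lstrip_pred(==('~'), "~~~1"))          # 1
```

With this shape the caller picks the character class through an ordinary function, so no list of Unicode spaces has to be maintained.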

This is essentially equivalent to overriding chars, though, isn’t it? If that’s the case then we might as well just leave it as-is (but I’m still in support of adding the Unicode chars to _default_delims).

sbromberger on 23 May 2018

Yeah, but you don't need to enumerate all Unicode whitespace chars in _default_delims.

ararslan on 23 May 2018

The existing _default_delims would already be covered by isspace, so it could just be:

lstrip(s::AbstractString) = lstrip(isspace, s::AbstractString)

simonbyrne on 23 May 2018

👍2
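The claim in the comment above, that `isspace` already covers the existing delimiters, can be checked directly. This is a small sketch; `default_delims` copies the list quoted in the issue, and the extra Unicode spaces are examples picked here for illustration:

```julia
# Delimiter list quoted in the issue.
default_delims = [' ', '\t', '\n', '\v', '\f', '\r']

# Every existing delimiter satisfies isspace.
all_covered = all(isspace, default_delims)

# A few Unicode space separators (THIN SPACE, NO-BREAK SPACE, IDEOGRAPHIC SPACE).
unicode_spaces = ['\u2009', '\u00a0', '\u3000']
also_covered = all(isspace, unicode_spaces)

println(all_covered)    # true
println(also_covered)   # true
println(isspace('a'))   # false
```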

In summary, our choices are:

1. do nothing
2. add all unicode spaces to _default_delims
   - complicated to keep up-to-date, and will be inefficient (since it won't be able to use range checks or short-circuiting)
3. have a non-exported way to allow predicates for just isspace
   - seems silly, but if we can't come to an agreement, is at least better than 1.
4. allow predicates as a second argument (#27309)
   - non-breaking
   - straightforward to document
   - violates general style rule of having function arg as first argument
     - we do violate it in other places (e.g. split/rsplit).
5. move chars to first argument
   - breaking
   - ordering to me seems somewhat odd
6. allow only predicate function, move to first (#27232)
   - general consensus was against
7. allow predicate as a first argument, keep other chars as second argument (take commits from #27232 up to d7ae0748f8e2bcafba93192eb165b1ef2fd1783e)
   - non-breaking
   - a little weird that the argument order changes.
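To make the last option in the list above concrete (predicate as first argument, chars kept as second), here is a rough sketch of how the methods could coexist. `lstrip_opt` is a hypothetical name used to avoid touching `Base.lstrip`, and the method set is an assumption for illustration, not the code from the referenced pull requests:

```julia
# Hypothetical sketch: predicate-first method does the actual work.
function lstrip_opt(f, s::AbstractString)
    for (i, c) in pairs(s)
        f(c) || return SubString(s, i)   # keep everything from here on
    end
    return SubString(s, 1, 0)            # all characters were stripped
end

# Existing call shapes keep working by forwarding to the predicate method.
lstrip_opt(s::AbstractString, c::AbstractChar) = lstrip_opt(==(c), s)
lstrip_opt(s::AbstractString, chars::AbstractVector{<:AbstractChar}) =
    lstrip_opt(in(chars), s)

# Default: strip anything isspace accepts, including Unicode spaces.
lstrip_opt(s::AbstractString) = lstrip_opt(isspace, s)

println(lstrip_opt(" \u2009 foo"))          # foo
println(lstrip_opt("~~~1", '~'))            # 1
println(lstrip_opt("xyxz", ['x', 'y']))     # z
println(lstrip_opt(isdigit, "42abc"))       # abc
```

The sketch shows why the option is non-breaking: string-first calls dispatch as before, while the new predicate form sits in the first position, which is also where the argument-order oddity noted in the list comes from.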