Consider:
julia> a = " \U02009 foo"
"   foo"
julia> lstrip(a)
"  foo"
Note that \U02009 is the Unicode character 'THIN SPACE' (U+2009).
I don't know much about string processing or Unicode, but it seems to me that lstrip() is doing something that's perhaps too naive:
const _default_delims = [' ','\t','\n','\v','\f','\r']
...
function lstrip(s::AbstractString, chars::Chars=_default_delims)
I realize you can override chars, but I would suggest that we expand the default character set for string trimming to include Unicode separators (ref https://www.fileformat.info/info/unicode/category/Zs/list.htm).
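For reference, the override mentioned above might look like the following. This is only a sketch: the delimiter list is copied from the snippet quoted earlier, and the variable names delims and stripped are made up for illustration.

```julia
a = " \u2009 foo"

# The default delimiter list quoted above, extended by hand with THIN SPACE (U+2009).
delims = [' ', '\t', '\n', '\v', '\f', '\r', '\u2009']

# Passing the collection as the second argument overrides the default set.
stripped = lstrip(a, delims)
```

The drawback is that every caller has to know which Unicode spaces to list, which is what the suggestion above tries to avoid.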
What do other languages do here?
Note that the other chars ('\t','\n','\v','\f','\r') are control characters.
Ruby and Crystal leave it; Python and Rust strip it.
Stripping unicode whitespace by default seems reasonable to me.
Related idea: define lstrip and friends to take a function argument, where the function is used to determine what should be stripped, i.e.
lstrip(f, s::AbstractString) = # ...
lstrip(s::AbstractString) = lstrip(c->isspace(c) || c in _default_delims, s)
It seems like this would allow us to express what should be skipped more concisely, rather than adding all Unicode space characters to _default_delims. Also, as an example in this hypothetical universe, lstrip("~~~1", '~') would instead be written lstrip(==('~'), "~~~1").
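A minimal sketch of what such a predicate-based method could look like. The name lstrip_pred is hypothetical and chosen so it does not collide with Base; this is not the actual Base implementation, just one way the idea might be written.

```julia
# Hypothetical helper (not Base's lstrip): drop leading characters for which
# the predicate `f` returns true.
function lstrip_pred(f, s::AbstractString)
    for (i, c) in pairs(s)
        # `i` is a valid index into `s`, so the result starts at the first kept character.
        f(c) || return SubString(s, i)
    end
    # Every character matched the predicate (or `s` is empty): return an empty SubString.
    return SubString(s, 1, 0)
end

# Default method: strip anything `isspace` accepts, which includes Unicode spaces.
lstrip_pred(s::AbstractString) = lstrip_pred(isspace, s)

thin = lstrip_pred(" \u2009 foo")      # thin space goes along with the ASCII spaces
tilde = lstrip_pred(==('~'), "~~~1")   # a single character expressed as a predicate
```

With this shape, stripping a specific character is just another predicate, so no separate chars argument is needed in the sketch.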
This is essentially equivalent to overriding chars, though, isn't it? If that's the case then we might as well just leave it as-is (but I'm still in support of adding the Unicode chars to _default_delims).
Yeah, but you don't need to enumerate all Unicode whitespace chars in _default_delims.
The existing _default_delims would already be covered by isspace, so it could just be:
lstrip(s::AbstractString) = lstrip(isspace, s::AbstractString)
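The claim that isspace already covers the existing defaults can be checked directly. A small sketch, with the list copied from the snippet quoted earlier and hypothetical variable names:

```julia
# The default delimiter list quoted from the issue.
default_delims = [' ', '\t', '\n', '\v', '\f', '\r']

# Does every default delimiter satisfy isspace?
covered = all(isspace, default_delims)

# THIN SPACE (U+2009): recognized by isspace, but absent from the list.
thin_is_space = isspace('\u2009')
thin_in_list = '\u2009' in default_delims
```

If covered is true and thin_in_list is false, switching the default to isspace only widens the set of stripped characters.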
In summary, our choices are:
add all Unicode spaces to _default_delims
have a non-exported way to allow predicates for just isspace
allow predicates as a second argument (#27309) (... split/rsplit)
move chars to first argument
allow only predicate function, move to first (#27232)
allow predicate as a first argument, keep other chars as second argument (take commits from #27232 up to d7ae0748f8e2bcafba93192eb165b1ef2fd1783e)
Given that it is a fairly minor function, my vague preference is option 4.
chars as second only. A big :-1: to 4, :+1: to 7.