Julia: Should `split(str)` split on Non-breaking spaces?

Created on 13 Nov 2019 · 4 Comments · Source: JuliaLang/julia

split(str) splits on U+00A0 : NO-BREAK SPACE [NBSP]

julia> split("a\u00a0b")
2-element Array{SubString{String},1}:
 "a"
 "b"

Python does this also:

>>> "a\u00a0b".split()
['a', 'b']
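
A small Python sketch of the same effect, for comparison: `str.isspace` reports U+00A0 as whitespace and the no-argument `split` separates on it, while an explicit `" "` separator leaves the no-break space alone. The variable names are only illustrative.

```python
# U+00A0 NO-BREAK SPACE, written as an escape so it is visible in the source.
NBSP = "\u00a0"
text = "a" + NBSP + "b"

# Python classifies NBSP as whitespace ...
nbsp_is_space = NBSP.isspace()

# ... and the no-argument form splits on it,
default_parts = text.split()

# while an explicit ASCII-space separator does not.
space_parts = text.split(" ")

print(nbsp_is_space, default_parts, space_parts)
```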

Is this the behaviour one wants?
By definition, a non-breaking space is all about preventing line breaks from being placed in the wrong spot, so it is something that can be inserted during typesetting.

However, I have also seen it used inside single words that happen to contain spaces*. I don't know how common this is.

I have a set of embeddings for which reading was broken
because it was using split(line) to break up the line,
and the dataset was encoding words with spaces (or in this case multi-character symbols with spaces) using non-breaking spaces.
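
To illustrate the failure mode, here is a minimal Python sketch (Python's no-argument `split` behaves the same way on U+00A0). The file layout, the sample line, and the helper `parse_embedding_line` are all assumptions made up for the example; the actual dataset format is not given here.

```python
def parse_embedding_line(line):
    """Parse one line of a hypothetical 'token v1 v2 ...' embeddings file.

    Assumes fields are separated by single ASCII spaces and that any space
    inside a token is encoded as U+00A0.
    """
    # Strip only the line terminator; a bare strip() would also remove
    # no-break spaces at the edges of the line.
    fields = line.rstrip("\r\n").split(" ")
    token = fields[0]
    values = [float(v) for v in fields[1:]]
    return token, values


# Made-up sample line: the token contains a no-break space.
line = "New\u00a0York 0.1 0.2 0.3\n"

# The no-argument split cuts the token apart at the NBSP.
broken_token = line.split()[0]

# Splitting on the ASCII space keeps the token whole.
token, values = parse_embedding_line(line)
```

With the sample line, `broken_token` is only the first half of the token, whereas `token` still contains the no-break space.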

My initial instinct was that non-breaking spaces should be treated as part of the words on either side, and thus not be split on.
Now I am not so sure.

PCRE says that it does break up words:

julia> match(r".\b", "x\u00a0y")
RegexMatch("x")
  • English is a weird language sometimes, since A) those are permitted and B) they are rarely acknowledged.
    Sometimes you see this in names, e.g. for Diana Wynne Jones, the surname is Wynne Jones. Or Anna Rose Smith can have Rose not as a middle name but as part of a compound first name with a space: Anna Rose.
breaking strings

All 4 comments

Words can contain spaces, for example the open compounds in English (e.g. "ice cream").
But there is no code point for inner-word spaces. The closest is the word joiner U+2060, but that has no visible width. This relates back to the decision to encode graphemes. Splitting at word boundaries is therefore only possible with a dictionary, so my take is that splitting at white space is a sane default.

Hard to resist applying the "breaking" tag here. But what does it mean in this context? :joy:

Since this is breaking we can go a bit crazier:
Maybe split should default to splitting only on ASCII white-space.

Maybe where white-space includes the space character, \t, \n, and \r.
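
As a rough illustration of what such a narrower default would do, here is a Python sketch that splits only on the four characters listed above. The helper name `split_narrow` and the sample string are hypothetical; this is not how Julia's `split` is implemented, only a model of the proposed behaviour.

```python
import re

# Only the characters listed above: space, tab, newline, carriage return.
_NARROW_WS = re.compile(r"[ \t\n\r]+")


def split_narrow(text):
    """Split on runs of space, tab, newline, or carriage return only."""
    # re.split yields empty strings for leading/trailing separators;
    # drop them so the result resembles a no-argument split.
    return [part for part in _NARROW_WS.split(text) if part]


# Made-up sample with a no-break space inside "ice cream".
sample = " ice\u00a0cream\tand  cake\r\n"

narrow_parts = split_narrow(sample)   # NBSP stays inside its word
unicode_parts = sample.split()        # NBSP is treated as a separator
```

Under this rule a no-break space stays attached to the surrounding text, while other Unicode space characters (for example U+2003 EM SPACE) would also stop acting as separators, which is part of what makes the change breaking.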
