split(str) splits on U+00A0 : NO-BREAK SPACE [NBSP]
julia> split("a b")
2-element Array{SubString{String},1}:
"a"
"b"
Python does this also:
>>> "a b".split()
['a', 'b']
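Because the NBSP is invisible in the snippets above, here is the same Python check with the character written as an explicit escape. This is only a small reproduction, not a statement about what the behaviour should be.

```python
s = "a\u00a0b"               # NBSP (U+00A0) written as an explicit escape

print(s.split())             # ['a', 'b']  (no-argument split treats NBSP as whitespace)
print("\u00a0".isspace())    # True
print(s.split(" ") == [s])   # True  (splitting on U+0020 only leaves the string whole)
```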
Is this the behaviour one wants?
By the definition of a non-breaking space, it is all about avoiding line breaks placed in the wrong spot, so it can be inserted during typesetting.
However, I have also seen it used within single words that happen to contain spaces*. I don't know how common this is.
I have a set of embeddings for which reading was broken because it was using split(line) to break up the line, and the dataset was encoding words with spaces (or in this case multi-character symbols with spaces) using non-breaking spaces.
My initial instinct was that non-breaking spaces should be treated as part of the words to either side, and thus not split on.
Now I am not so sure.
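To make the embeddings problem concrete, here is a minimal Python sketch. The line format (a token followed by floats, separated by single U+0020 spaces) and the function name `parse_embedding_line` are assumptions for illustration, not the actual dataset or reader.

```python
def parse_embedding_line(line):
    """Parse one 'token v1 v2 ...' line, splitting on U+0020 only.

    Hypothetical format: fields are separated by single ASCII spaces,
    and the token itself may contain NBSP (U+00A0).
    """
    token, *values = line.rstrip("\r\n").split(" ")
    return token, [float(v) for v in values]


line = "ice\u00a0cream 0.25 -1.5\n"

# Whitespace split: the NBSP inside the token is treated as a separator,
# so the token is cut in two and the field count is off by one.
print(line.split())                      # ['ice', 'cream', '0.25', '-1.5']

# Splitting on U+0020 only keeps the token in one piece.
token, vector = parse_embedding_line(line)
print(token == "ice\u00a0cream", vector)  # True [0.25, -1.5]
```

Passing an explicit delimiter sidesteps the question of what the default should be, at the cost of being stricter about the input format.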
PCRE says that it does break up words:
julia> match(r".\b", "x y")
RegexMatch("x")
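For comparison, Python's `re` module (a different engine from PCRE) also places a word boundary next to an NBSP. The sketch below writes the character as an escape so it is visible.

```python
import re

text = "x\u00a0y"                 # x, NBSP, y

m = re.search(r".\b", text)
print(m.group())                  # x  (a boundary sits between "x" and the NBSP)

# NBSP is not a word character, so \w+ yields two separate words.
print(re.findall(r"\w+", text))   # ['x', 'y']
```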
* Anna Rose Smith can have Rose not as the middle name but as part of a compound first name with a space: Anna Rose.
Words can contain spaces, for example in English the open compounds (e.g. "ice cream").
But there is no code point for inner-word spaces. The closest is the word joiner U+2060, but this has no visible length. This relates back to the decision to encode graphemes. So splitting at word boundaries is only possible with a dictionary, so my take is that splitting at white space is a sane default.
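The Unicode properties behind this can be inspected with Python's `unicodedata` module. This is a small illustrative check: NBSP is a space separator like the ordinary space, while the word joiner is a zero-width format character and not whitespace at all.

```python
import unicodedata

# Map each code point to (name, general category, counts as whitespace?).
info = {
    f"U+{ord(ch):04X}": (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())
    for ch in ("\u0020", "\u00a0", "\u2060")
}

for code_point, props in info.items():
    print(code_point, props)
# U+0020 ('SPACE', 'Zs', True)
# U+00A0 ('NO-BREAK SPACE', 'Zs', True)
# U+2060 ('WORD JOINER', 'Cf', False)

# A word joiner therefore survives a whitespace split, but it renders with no width.
print("ice\u2060cream".split() == ["ice\u2060cream"])   # True
```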
Hard to resist applying the "breaking" tag here. But what does it mean in this context? :joy:
Since this is breaking we can go a bit crazier:
Maybe split should default to splitting only on ANSI white-space.
Maybe where white-space includes ' ', \t, \n, \r.
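A rough sketch of what such a default could look like, written in Python for easy comparison with the earlier example. The helper name `split_ascii_ws` is hypothetical, and the separator set is exactly the four characters listed above; this is not how either language currently behaves.

```python
import re

# Only the four characters proposed above count as separators.
_ASCII_WS = re.compile(r"[ \t\n\r]+")


def split_ascii_ws(s):
    """Split on runs of ' ', tab, newline and carriage return; drop empty pieces."""
    return [tok for tok in _ASCII_WS.split(s) if tok]


sample = "  a\u00a0b c\td\r\n"
print(split_ascii_ws(sample) == ["a\u00a0b", "c", "d"])   # True  (NBSP stays inside its token)
print(sample.split())                                     # ['a', 'b', 'c', 'd']
```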