I believe there's an issue with the implementation of iterate
on the result of graphemes("š¤¦š¼āāļø")
.
The following works as expected:
julia> s = "š¤¦š¼āāļø"
"š¤¦š¼\u200dāļø"
julia> length(s) == 5
true
julia> [s[i] for i in eachindex(s)]
5-element Array{Char,1}:
'š¤¦': Unicode U+1F926 (category So: Symbol, other)
'š¼': Unicode U+1F3FC (category Sk: Symbol, modifier)
'\u200d': Unicode U+200D (category Cf: Other, format)
'ā': Unicode U+2642 (category So: Symbol, other)
'ļø': Unicode U+FE0F (category Mn: Mark, nonspacing)
julia> using Unicode
julia> length(graphemes(s)) == 1
true
However, this is the error I get when I try to collect the result of the Iterator in the latest stable release of Julia (v1.5.1).
julia> collect(graphemes(s))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
[1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
[2] _collect at ./array.jl:630 [inlined]
[3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
[4] top-level scope at REPL[5]:1
The implementation of length
appears to be correct, but I think there's a bug in the implementation of iterate
:
julia> for (i, c) in enumerate(graphemes(s))
println(i, " ", c)
end
1 š¤¦š¼ā
2 āļø
I believe the correct behavior in this case is to return just the first element. iterate
seems to be returning the Symbol ā
as well.
I can reproduce this on master.
I think the problem is that the state
passed to isgraphemebreak!
needs to be initialized on the first iteration with the null character. The following fixes the issue for me. Is this the correct way to do it?
diff --git a/base/strings/unicode.jl b/base/strings/unicode.jl
index 235f85184d..029787b3c1 100644
--- a/base/strings/unicode.jl
+++ b/base/strings/unicode.jl
@@ -671,7 +671,7 @@ function length(g::GraphemeIterator{S}) where {S}
return n
end
-function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
+function iterate(g::GraphemeIterator{S}, i_=(Int32(0),firstindex(g.s))) where {S}
s = g.s
statei, i = i_
state = Ref{Int32}(statei)
@@ -679,6 +679,7 @@ function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
y = iterate(s, i)
y === nothing && return nothing
c0, k = y
+ i == firstindex(g.s) && isgraphemebreak!(state, eltype(S)(0x00000000), c0)
while k <= ncodeunits(s) # loop until next grapheme is s[i:j]
c, ā = iterate(s, k)
isgraphemebreak!(state, c0, c) && break
I don't think that should be required. If it is, that potentially seems like a bug in utf8proc
No, it seems to be documented: https://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html#aae83bdcabf3a97c1046c0700ba353640. Apparently, we don't need to pass the state around though, we can just always set it to zero initially.
I wrote that documentation :). Yes, we don't need to pass the state around. Setting it to zero is fine. If that's required though that's a bug in utf8proc where it should reset the state by itself.
I just thought this was intended because that's how it's done in length(::GraphemeIterator)
, but perhaps that was just a workaround.
It's definitely not supposed to be needed because eltype(S)(0x00000000)
is not a character that occurs in the source. Something else is wrong (maybe a missing reset of the state in utf8proc). And of course it should be refactored to not need to useless carry around the state.
Thanks for the comment @Keno and @simeonschaub for looking into this!
I am going through the source of utf8proc.c
and utf8proc.h
to see if I can narrow it down.
UAX#29 defines the grapheme cluster boundary rules in this section. Specifically there's this rule: http://www.unicode.org/reports/tr29/#GB11
Do not break within emoji modifier sequences or emoji zwj sequences. -- GB11 | \p{Extended_Pictographic} Extend* ZWJ | Ć | \p{Extended_Pictographic}
These lines in grapheme_break_simple
manage the rule for this:
(lbc == UTF8PROC_BOUNDCLASS_E_ZWG && // GB11 (requires additional handling below)
tbc == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) ? false : // ----
My understanding is that this is implemented as a state machine.
The grapheme_break_simple
function that contains the code above is called from grapheme_break_extended
which manages the state machine.
I think it is this part of the "state machine" that is causing the issue.
else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
*state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
*state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
else
*state = tbc;
}
I did some print style debugging to figure out what is going on.
Here is the state machine transition in the following format:
tbc (tbc -> boundclass) : from_state -> to_state = break_permitted
'š¼' (UTF8PROC_BOUNDCLASS_EXTEND) : UTF8PROC_BOUNDCLASS_START -> UTF8PROC_BOUNDCLASS_EXTEND = false
'\u200d' (UTF8PROC_BOUNDCLASS_ZWJ) : UTF8PROC_BOUNDCLASS_EXTEND -> UTF8PROC_BOUNDCLASS_ZWJ = false
'ā' (UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) : UTF8PROC_BOUNDCLASS_ZWJ -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = true
'ļø' (UTF8PROC_BOUNDCLASS_EXTEND) : UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false
You can see that when the boundclass is UTF8PROC_BOUNDCLASS_ZWJ
and the state goes from UTF8PROC_BOUNDCLASS_EXTEND
to UTF8PROC_BOUNDCLASS_ZWJ
. I believe this state should be UTF8PROC_BOUNDCLASS_E_ZWG
instead. The current implementation of the state machine "loses" the information that it was part of an extending grapheme cluster.
I made the following change
if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
*state = UTF8PROC_BOUNDCLASS_OTHER;
// Special support for GB11 (emoji extend* zwj / emoji)
else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
*state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
*state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
else
*state = tbc;
}
+ else if (*state == UTF8PROC_BOUNDCLASS_EXTEND && tbc == UTF8PROC_BOUNDCLASS_ZWJ)
+ *state = UTF8PROC_BOUNDCLASS_E_ZWG;
else
*state = tbc;
And now the state machine goes from UTF8PROC_BOUNDCLASS_EXTEND
to UTF8PROC_BOUNDCLASS_E_ZWG
.
'š¼' (UTF8PROC_BOUNDCLASS_EXTEND) : UTF8PROC_BOUNDCLASS_START -> UTF8PROC_BOUNDCLASS_EXTEND = false
'\u200d' (UTF8PROC_BOUNDCLASS_ZWJ) : UTF8PROC_BOUNDCLASS_EXTEND -> UTF8PROC_BOUNDCLASS_E_ZWG = false
'ā' (UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) : UTF8PROC_BOUNDCLASS_E_ZWG -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false
'ļø' (UTF8PROC_BOUNDCLASS_EXTEND) : UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false
I ran make check
in the utf8proc
folder after making this change and the tests pass.
_Edit: Ignore the text below in this comment. See next comment for more information._
~However, I am still getting behavior that doesn't make sense.~
Specifically, there's this rule that says:
Break at the start and end of text, unless the text is empty.
I don't understand why the specification says "break at the start" or what "unless the text is empty" means. What "text" is being referred to here?
This is the implementation for this rule:
(lbc == UTF8PROC_BOUNDCLASS_START) ? true : // GB1
This returns true
at every UTF8PROC_BOUNDCLASS_START
. Is that a correct? Maybe I'm misinterpreting the specification but that doesn't seem like what it should do, right?
I changed that line to the following:
(lbc == UTF8PROC_BOUNDCLASS_START) ? false
This is obviously a violation of the specification and with this change make check
also fails.
But with this change and with the fix to the state machine that I mentioned above I'm getting the behavior I expect. This also seems to work for every other case I experiment with. I didn't run a full test suite though and this is my first foray into the Unicode specification and I've only read this section.
Any suggestions here would be very welcome.
I took a second stab at this just to explore again what is going on. It turns out I had a bug in my implementation! (I had re-implemented the state machine in Julia for my testing and messed up one line there.)
This is the only change I needed to make to resolve this issue, exactly the same as the previous comment.
if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
*state = UTF8PROC_BOUNDCLASS_OTHER;
// Special support for GB11 (emoji extend* zwj / emoji)
else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
*state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
*state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
else
*state = tbc;
}
+ else if (*state == UTF8PROC_BOUNDCLASS_EXTEND && tbc == UTF8PROC_BOUNDCLASS_ZWJ)
+ *state = UTF8PROC_BOUNDCLASS_E_ZWG;
else
*state = tbc;
Ignore the part about the UTF8PROC_BOUNDCLASS_START
state from the previous comment.
The only question I have is what other states should transition to UTF8PROC_BOUNDCLASS_E_ZWG
?
Most helpful comment
I wrote that documentation :). Yes, we don't need to pass the state around. Setting it to zero is fine. If that's required though that's a bug in utf8proc where it should reset the state by itself.