Julia: `collect(graphemes("šŸ¤¦šŸ¼ā€ā™‚ļø"))` results in ERROR: ArgumentError: destination has fewer elements than required

Created on 21 Sep 2020  Ā·  9Comments  Ā·  Source: JuliaLang/julia

I believe there's an issue with the implementation of iterate on the result of graphemes("šŸ¤¦šŸ¼ā€ā™‚ļø").

The following works as expected:

julia> s = "šŸ¤¦šŸ¼ā€ā™‚ļø"
"šŸ¤¦šŸ¼\u200dā™‚ļø"

julia> length(s) == 5
true

julia> [s[i] for i in eachindex(s)]
5-element Array{Char,1}:
 'šŸ¤¦': Unicode U+1F926 (category So: Symbol, other)
 'šŸ¼': Unicode U+1F3FC (category Sk: Symbol, modifier)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 'ā™‚': Unicode U+2642 (category So: Symbol, other)
 'ļø': Unicode U+FE0F (category Mn: Mark, nonspacing)

julia> using Unicode

julia> length(graphemes(s)) == 1
true

However, this is the error I get when I try to collect the result of the Iterator in the latest stable release of Julia (v1.5.1).

julia> collect(graphemes(s))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[5]:1

The implementation of length appears to be correct, but I think there's a bug in the implementation of iterate:

julia> for (i, c) in enumerate(graphemes(s))
       println(i, " ", c)
       end
1 šŸ¤¦šŸ¼ā€
2 ā™‚ļø

I believe the correct behavior in this case is to return just the first element. iterate seems to be returning the Symbol ā™‚ as well.

bug strings unicode

Most helpful comment

I wrote that documentation :). Yes, we don't need to pass the state around. Setting it to zero is fine. If that's required though that's a bug in utf8proc where it should reset the state by itself.

All 9 comments

I can reproduce this on master.

I think the problem is that the state passed to isgraphemebreak! needs to be initialized on the first iteration with the null character. The following fixes the issue for me. Is this the correct way to do it?

diff --git a/base/strings/unicode.jl b/base/strings/unicode.jl
index 235f85184d..029787b3c1 100644
--- a/base/strings/unicode.jl
+++ b/base/strings/unicode.jl
@@ -671,7 +671,7 @@ function length(g::GraphemeIterator{S}) where {S}
     return n
 end

-function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
+function iterate(g::GraphemeIterator{S}, i_=(Int32(0),firstindex(g.s))) where {S}
     s = g.s
     statei, i = i_
     state = Ref{Int32}(statei)
@@ -679,6 +679,7 @@ function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
     y = iterate(s, i)
     y === nothing && return nothing
     c0, k = y
+    i == firstindex(g.s) && isgraphemebreak!(state, eltype(S)(0x00000000), c0)
     while k <= ncodeunits(s) # loop until next grapheme is s[i:j]
         c, ā„“ = iterate(s, k)
         isgraphemebreak!(state, c0, c) && break

I don't think that should be required. If it is, that potentially seems like a bug in utf8proc

No, it seems to be documented: https://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html#aae83bdcabf3a97c1046c0700ba353640. Apparently, we don't need to pass the state around though, we can just always set it to zero initially.

I wrote that documentation :). Yes, we don't need to pass the state around. Setting it to zero is fine. If that's required though that's a bug in utf8proc where it should reset the state by itself.

I just thought this was intended because that's how it's done in length(::GraphemeIterator), but perhaps that was just a workaround.

It's definitely not supposed to be needed because eltype(S)(0x00000000) is not a character that occurs in the source. Something else is wrong (maybe a missing reset of the state in utf8proc). And of course it should be refactored to not need to useless carry around the state.

Thanks for the comment @Keno and @simeonschaub for looking into this!

I am going through the source of utf8proc.c and utf8proc.h to see if I can narrow it down.

UAX#29 defines the grapheme cluster boundary rules in this section. Specifically there's this rule: http://www.unicode.org/reports/tr29/#GB11

Do not break within emoji modifier                    sequences or emoji zwj sequences.
--
GB11 | \p{Extended_Pictographic}                      Extend* ZWJ | Ɨ | \p{Extended_Pictographic}

These lines in grapheme_break_simple manage the rule for this:

(lbc == UTF8PROC_BOUNDCLASS_E_ZWG &&              // GB11 (requires additional handling below)
 tbc == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) ? false : // ----

My understanding is that this is implemented as a state machine.
The grapheme_break_simple function that contains the code above is called from grapheme_break_extended which manages the state machine.

I think it is this part of the "state machine" that is causing the issue.

else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
  if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
    *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
  else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
    *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
  else
    *state = tbc;
}

I did some print style debugging to figure out what is going on.

Here is the state machine transition in the following format:

tbc (tbc -> boundclass) : from_state -> to_state = break_permitted
'šŸ¼'     (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_START                 -> UTF8PROC_BOUNDCLASS_EXTEND                = false
'\u200d' (UTF8PROC_BOUNDCLASS_ZWJ)                   : UTF8PROC_BOUNDCLASS_EXTEND                -> UTF8PROC_BOUNDCLASS_ZWJ                   = false
'ā™‚'      (UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) : UTF8PROC_BOUNDCLASS_ZWJ                   -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = true
'ļø'       (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false

You can see that when the boundclass is UTF8PROC_BOUNDCLASS_ZWJ and the state goes from UTF8PROC_BOUNDCLASS_EXTEND to UTF8PROC_BOUNDCLASS_ZWJ. I believe this state should be UTF8PROC_BOUNDCLASS_E_ZWG instead. The current implementation of the state machine "loses" the information that it was part of an extending grapheme cluster.

I made the following change

    if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
      *state = UTF8PROC_BOUNDCLASS_OTHER;
    // Special support for GB11 (emoji extend* zwj / emoji)
    else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
      if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
        *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
      else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
        *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
      else
        *state = tbc;
    }
+   else if (*state == UTF8PROC_BOUNDCLASS_EXTEND && tbc == UTF8PROC_BOUNDCLASS_ZWJ)
+       *state = UTF8PROC_BOUNDCLASS_E_ZWG;       
    else
      *state = tbc;

And now the state machine goes from UTF8PROC_BOUNDCLASS_EXTEND to UTF8PROC_BOUNDCLASS_E_ZWG.

'šŸ¼'     (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_START                 -> UTF8PROC_BOUNDCLASS_EXTEND                = false
'\u200d' (UTF8PROC_BOUNDCLASS_ZWJ)                   : UTF8PROC_BOUNDCLASS_EXTEND                -> UTF8PROC_BOUNDCLASS_E_ZWG                 = false
'ā™‚'      (UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) : UTF8PROC_BOUNDCLASS_E_ZWG                 -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false
'ļø'       (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false

I ran make check in the utf8proc folder after making this change and the tests pass.


_Edit: Ignore the text below in this comment. See next comment for more information._

~However, I am still getting behavior that doesn't make sense.~

Specifically, there's this rule that says:

Break at the start and end of text, unless the text is empty.

I don't understand why the specification says "break at the start" or what "unless the text is empty" means. What "text" is being referred to here?

This is the implementation for this rule:

(lbc == UTF8PROC_BOUNDCLASS_START) ? true :       // GB1

This returns true at every UTF8PROC_BOUNDCLASS_START. Is that a correct? Maybe I'm misinterpreting the specification but that doesn't seem like what it should do, right?

I changed that line to the following:

(lbc == UTF8PROC_BOUNDCLASS_START) ? false

This is obviously a violation of the specification and with this change make check also fails.

But with this change and with the fix to the state machine that I mentioned above I'm getting the behavior I expect. This also seems to work for every other case I experiment with. I didn't run a full test suite though and this is my first foray into the Unicode specification and I've only read this section.

Any suggestions here would be very welcome.

I took a second stab at this just to explore again what is going on. It turns out I had a bug in my implementation! (I had re-implemented the state machine in Julia for my testing and messed up one line there.)

This is the only change I needed to make to resolve this issue, exactly the same as the previous comment.

    if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
      *state = UTF8PROC_BOUNDCLASS_OTHER;
    // Special support for GB11 (emoji extend* zwj / emoji)
    else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
      if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
        *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
      else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
        *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
      else
        *state = tbc;
    }
+   else if (*state == UTF8PROC_BOUNDCLASS_EXTEND && tbc == UTF8PROC_BOUNDCLASS_ZWJ)
+       *state = UTF8PROC_BOUNDCLASS_E_ZWG;       
    else
      *state = tbc;

Ignore the part about the UTF8PROC_BOUNDCLASS_START state from the previous comment.

The only question I have is what other states should transition to UTF8PROC_BOUNDCLASS_E_ZWG?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

arshpreetsingh picture arshpreetsingh  Ā·  3Comments

ararslan picture ararslan  Ā·  3Comments

felixrehren picture felixrehren  Ā·  3Comments

tkoolen picture tkoolen  Ā·  3Comments

StefanKarpinski picture StefanKarpinski  Ā·  3Comments