Elixir: String.length counting wrongly for CRLF = "\r\n"

Created on 21 Feb 2018 · 3 Comments · Source: elixir-lang/elixir

Environment

  • Elixir & Erlang versions (elixir --version):
    Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [kernel-poll:false]
  • Operating system:
    Ubuntu 14.04

Current behavior

I know String.length counts codepoints, not bytes, so a multibyte character is counted as one char/codepoint. But as far as I know, "\r\n" is still two chars/codepoints and two bytes, which means String.length is counting wrongly.

iex(1)> String.length "\n"
1
iex(2)> String.length "\r"
1
iex(3)> String.length "\r\n"
1

Expected behavior

iex(3)> String.length "\r\n"
2

All 3 comments

String.length counts graphemes. For example, "é" can be written as two codepoints, "e" and the combining acute accent, but it is counted as a single grapheme. Unicode also considers "\r\n" to be a single grapheme.
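
To make the difference between graphemes, codepoints, and bytes concrete, here is a minimal sketch. The module name LengthDemo and its sizes/1 function are hypothetical and exist only for illustration; String.length/1, String.codepoints/1, and byte_size/1 are the standard functions.

```elixir
# Hypothetical helper module, not part of Elixir itself.
defmodule LengthDemo do
  # Returns the three common "sizes" of a UTF-8 string.
  def sizes(string) do
    %{
      graphemes: String.length(string),
      codepoints: string |> String.codepoints() |> length(),
      bytes: byte_size(string)
    }
  end
end

# CRLF: one grapheme, but two codepoints and two bytes.
IO.inspect(LengthDemo.sizes("\r\n"))

# "e" followed by the combining acute accent (U+0301):
# one grapheme, two codepoints, three bytes.
IO.inspect(LengthDemo.sizes("e\u0301"))
```

If the codepoint count is what you actually need, counting the result of String.codepoints/1 as above gives 2 for "\r\n".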

After reading http://www.unicode.org/reports/tr29/ I agree with the behavior, but it is quite an unexpected one. Seems like I am not the first one confused by this; see https://github.com/elixir-lang/elixir/issues/2883.

For something like ê or ž it makes sense and is expected: multibyte characters are counted as one. This is also quite obvious in the documentation. Since \r\n is still two characters, you do not expect it to be counted as one. Maybe I am mixing it up with codepoints in my head.
It might be a good idea to explicitly mention this \r\n case in the documentation (and also for String.reverse). Also, I would not consider \r\n to be a grapheme at all, since back on typewriters these were two different operations (CR was more like today's Home key, if I remember correctly).

Something like this is just unexpected:
String.reverse(String.reverse("\n\r"))
"\r\n"
String.length "\r\n"
1

The number of characters and the number of graphemes do not necessarily correlate. In some languages, two separate letters, such as ch, are considered a single grapheme. However, we do not implement locale-specific grapheme behaviour.

Also, the following property does not hold for Unicode: given an arbitrary string x, reverse(reverse(x)) is not always x. For example, take the combining acute accent (U+0301) and e. The string with the accent first, "\u0301e", becomes "e\u0301" when reversed, which is printed as é, and when reversed again it still remains "e\u0301".

And we actually expect the same to be true for \r\n: when reversed, it should still remain \r\n.
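
The reversal behaviour described here can be checked with a short sketch. The variable names are illustrative only, and the comments assume String.reverse/1 reverses by grapheme, as stated in this thread.

```elixir
# A combining acute accent (U+0301) placed before "e" has no base
# character to attach to, so the string splits into two graphemes.
accent_first = "\u0301e"

once = String.reverse(accent_first)  # the accent now follows "e"
twice = String.reverse(once)         # "e" plus accent is one grapheme

IO.inspect(once == "e\u0301")        # true
IO.inspect(twice == once)            # true
IO.inspect(twice == accent_first)    # false: reverse(reverse(x)) != x

# The same happens with line endings: "\n\r" is two graphemes,
# but once reversed into "\r\n" it is a single grapheme.
crlf = String.reverse("\n\r")
IO.inspect(crlf == "\r\n")                  # true
IO.inspect(String.reverse(crlf) == "\r\n")  # true
```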

In any case, this discussion is quite moot. We implement the Unicode Standard and we will continue to follow it. So if you don't agree with the current behaviour, a change needs to be proposed to the standard, and once we update the Unicode database in Elixir, the changes will be propagated.
