Erlang/OTP 20 [erts-9.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]
Elixir 1.5.2
Ubuntu 17.10
Strings produced with the ~s sigil sometimes contain a byte order mark (BOM). When that happens, comparing them with the same string written in plain double quotes fails.
Reproduction:
iex(50)> s1 = ~s(2017-11-29\t2017-12-02\t"QUOTED"\t"ABCDE"\t"EFG"\t"HIJKLMON"\tPQR)
"2017-11-29\t2017-12-02\t\"QUOTED\"\t\"ABCDE\"\t\"EFG\"\t\"HIJKLMON\"\tPQR"
iex(51)> s2 = "2017-11-29\t2017-12-02\t\"QUOTED\"\t\"ABCDE\"\t\"EFG\"\t\"HIJKLMON\"\tPQR"
"2017-11-29\t2017-12-02\t\"QUOTED\"\t\"ABCDE\"\t\"EFG\"\t\"HIJKLMON\"\tPQR"
iex(52)> s1 == s2
false
The BOM is visible upon inspection with i:
iex(53)> i s1
Term
"2017-11-29\t2017-12-02\t\"QUOTED\"\t\"ABCDE\"\t\"EFG\"\t\"HIJKLMON\"\tPQR"
Data type
BitString
Byte size
62
Description
This is a string: a UTF-8 encoded binary. It's printed surrounded by
"double quotes" because all UTF-8 encoded codepoints in it are printable.
Raw representation
<<239, 187, 191, 50, 48, 49, 55, 45, 49, 49, 45, 50, 57, 9, 50, 48, 49, 55, 45, 49, 50, 45, 48, 50, 9, 34, 81, 85, 79, 84, 69, 68, 34, 9, 34, 65, 66, 67, 68, 69, 34, 9, 34, 69, 70, 71, 34, 9, 34, 72, ...>>
Reference modules
String, :binary
Implemented protocols
IEx.Info, List.Chars, Inspect, String.Chars, Collectable
iex(54)> i s2
Term
"2017-11-29\t2017-12-02\t\"QUOTED\"\t\"ABCDE\"\t\"EFG\"\t\"HIJKLMON\"\tPQR"
Data type
BitString
Byte size
59
Description
This is a string: a UTF-8 encoded binary. It's printed surrounded by
"double quotes" because all UTF-8 encoded codepoints in it are printable.
Raw representation
<<50, 48, 49, 55, 45, 49, 49, 45, 50, 57, 9, 50, 48, 49, 55, 45, 49, 50, 45, 48, 50, 9, 34, 81, 85, 79, 84, 69, 68, 34, 9, 34, 65, 66, 67, 68, 69, 34, 9, 34, 69, 70, 71, 34, 9, 34, 72, 73, 74, 75, ...>>
Reference modules
String, :binary
Implemented protocols
IEx.Info, List.Chars, Inspect, String.Chars, Collectable
I would expect the string comparison to pass. I hit this exact case in an ExUnit test: the test failed, printed two identical-looking strings, and reported that they failed the equality check, which is very confusing.
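As a workaround sketch, the three extra bytes (62 vs. 59 in the output above) match the UTF-8 encoding of U+FEFF, so they can be stripped before comparing. The module name `BomHelper` and its function are hypothetical and not part of Elixir; the strings are shortened stand-ins for s1 and s2:

```elixir
defmodule BomHelper do
  # Hypothetical helper, not part of Elixir's standard library.
  # The UTF-8 byte order mark is the three bytes 0xEF 0xBB 0xBF (U+FEFF).

  # Drop a leading BOM if present.
  def strip_bom(<<0xEF, 0xBB, 0xBF, rest::binary>>), do: rest
  # Otherwise return the string unchanged.
  def strip_bom(string) when is_binary(string), do: string
end

s1 = "\uFEFF2017-11-29\tPQR"   # pasted text carrying a leading BOM
s2 = "2017-11-29\tPQR"         # the same text typed by hand

s1 == s2                         # false
byte_size(s1) - byte_size(s2)    # 3
BomHelper.strip_bom(s1) == s2    # true
```

This only handles a BOM at the very start of the binary, which is the case shown in the report.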
I don’t think the sigil is adding the BOM but rather your editor. It is the
same as “é” which can be written in two different ways in Unicode and they
won’t be equal. Elixir won’t convert between the two. It is all up to you
and your editor. Similar to any zero width white spaces in Unicode you may
add.
Maybe ExUnit could show better diffs in this case, but I don’t think it is
an issue with sigils.
I'm able to reproduce this by copy-pasting the s1 string from Chrome to GNOME Terminal. All settings are on defaults. But as you noted, this is environment-related, since if I type the same string with the sigil straight into the iex session in the console, the BOM is not added.
I guess the issue can be closed, though I'm really unhappy about this.
Yeah, I totally understand that. It is the same issue as this:
iex(1)> "é" == "é"
false
or using a zero-width space:
iex(1)> "a" == "a"
false
It makes you pull your hair until you figure out what is really happening. It is not a behaviour specific to Elixir either.
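To make the two cases concrete, here is a small sketch that spells out the invisible differences with escapes. The variable names are illustrative, and the specific code points (U+00E9, U+0301, U+200B) are assumptions about what such examples typically contain:

```elixir
composed = "\u00E9"      # "é" as a single code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by a combining acute accent (U+0301)

composed == decomposed                           # false: the bytes differ
String.equivalent?(composed, decomposed)         # true: canonically equivalent
String.normalize(decomposed, :nfc) == composed   # true after NFC normalization

with_zwsp = "a\u200B"    # "a" followed by a zero-width space (U+200B)

with_zwsp == "a"                                 # false
String.replace(with_zwsp, "\u200B", "") == "a"   # true once the space is removed
```

`==` compares binaries byte by byte, so normalization or stripping has to be applied explicitly where this kind of equivalence is wanted.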
For now, let's show \uFEFF on inspected strings (instead of showing nothing). That should at least make it more obvious without changing its representation.
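Independently of how inspect renders the character, the hidden bytes can be exposed by asking for the raw binary or the code points. A minimal sketch, with `s` as an illustrative value:

```elixir
s = "\uFEFFabc"   # "abc" preceded by a BOM

# Force the byte-level representation instead of the quoted string.
inspect(s, binaries: :as_binaries)
# => "<<239, 187, 191, 97, 98, 99>>"

# Or look at the code points: U+FEFF shows up as the first element.
String.to_charlist(s) == [0xFEFF, ?a, ?b, ?c]   # true
```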
I'd like to give this issue a go if it's up for grabs.
Man, you guys are quick. I was about to write that I'll happily contribute a PR.