Related: https://github.com/facebook/reason/issues/1384, #1053, #799
@jordwalke you said a while ago that this is a matter of finding the right unicode printing library (while marking it as a "good first task"). Do you have any idea of possible candidates that you would want someone to explore if they were to tackle this?
This doesn't occur with ocamlformat - any idea why the two would act differently?
Apparently, a well-known issue.
Please consider bumping its priority. It correctly matches the input in switches, but it makes the language unusable if your alphabet's differences don't stop at å, ä, ö.
How curious.

(the value here is correct, I'm surprised that the comment is not being turned to junk)
I did some more investigation.
I don't know much about the refmt internals, but as far as I can tell, we print out strings here: https://github.com/facebook/reason/blob/43146a22f95dd9ce10d407c6f251ed85541fcae8/src/reason-parser/reason_pprint_ast.ml#L2377
Here pp = Format.fprintf (but Printf.fprintf and sprintf etc repro also as long as "%S" is used)
I believe the problem is that the %S formatter can't handle unicode properly. Presumably it does some faulty escaping or similar? I've tested this locally, and using %S seems to reproduce the problem.
According to https://caml.inria.fr/pub/docs/manual-ocaml/libref/Printf.html, this is
S: convert a string argument to OCaml syntax (double quotes, escapes).
Compare this to the line below https://github.com/facebook/reason/blob/43146a22f95dd9ce10d407c6f251ed85541fcae8/src/reason-parser/reason_pprint_ast.ml#L2378
Which does print properly (and doesn't use %S).
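To make the difference between the two conversions concrete, here is a minimal sketch using only the standard library. It assumes nothing about refmt itself; the names `a_ring`, `with_upper_s` and `with_lower_s` are made up for illustration.

```ocaml
(* "å" is U+00E5, encoded in UTF-8 as the two bytes 195 and 165. *)
let a_ring = "\195\165"

(* %S converts the string to OCaml syntax, so every byte outside
   printable ASCII becomes a decimal \ddd escape. *)
let with_upper_s = Printf.sprintf "%S" a_ring

(* %s copies the bytes through untouched; the quotes are added by hand. *)
let with_lower_s = Printf.sprintf "\"%s\"" a_ring

let () =
  print_endline with_upper_s;  (* prints "\195\165" *)
  print_endline with_lower_s   (* prints "å" on a UTF-8 terminal *)
```

So %S is arguably not faulty: it produces a valid literal for the same bytes, just not the one the user typed.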
Hopefully someone smarter than me can advise: in this case, since we want to reproduce what the user types, is there a need for additional escaping? Or are we lucky and can we simply replace this line with the following? (Call me a wishful thinker)
| Pconst_string (i, None) -> pp f "\"%s\"" i
Edit: Tried making the change and running tests, it seems we do still need to do some escaping because the escape sequences seem to be parsed when reading in input. Not sure why this is necessary... Relevant code is here I think: https://github.com/facebook/reason/blob/master/src/reason-parser/reason_lexer.mll#L692-L707
https://github.com/facebook/reason/blob/master/src/reason-parser/reason_lexer.mll#L451-L459
https://github.com/facebook/reason/blob/master/src/reason-parser/reason_lexer.mll#L762-L810
@Schmavery, you'll still need to escape double quotes (as well as '\' '\n' '\t' '\r' '\b')
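A rough sketch of what escaping only those characters could look like, assuming all other bytes can be copied through as-is. `escape_preserving_utf8` is a hypothetical helper, not something that exists in refmt or ocamlformat.

```ocaml
(* Hypothetical helper: escape only the characters listed above and copy
   every other byte through unchanged, so multi-byte UTF-8 sequences
   stay intact. *)
let escape_preserving_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (function
      | '"' -> Buffer.add_string buf "\\\""
      | '\\' -> Buffer.add_string buf "\\\\"
      | '\n' -> Buffer.add_string buf "\\n"
      | '\t' -> Buffer.add_string buf "\\t"
      | '\r' -> Buffer.add_string buf "\\r"
      | '\b' -> Buffer.add_string buf "\\b"
      | c -> Buffer.add_char buf c)
    s;
  Buffer.contents buf
```

Note that this also passes other control characters and invalid UTF-8 through raw, and a string the user wrote as "\195\165" would come back out as "å", so it is probably not a complete answer on its own.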
Is this all that’s needed: https://github.com/ocaml-ppx/ocamlformat/blob/2d973e38571f574b773809d5d02f155cf2782676/src/Fmt_ast.ml#L314
If someone fixes it quickly it will go out in the release
@jordwalke No, shoot - I was wrong. ocamlformat has the same issue with unicode strings. Sorry for the noise.
Did some investigation a while ago, and as @Schmavery indicates, OCaml's Format scrambles the unicode because of the way we're printing.
@jordwalke in my testing, String.escaped exhibits the same problem.
Can anyone explain why we need to parse in escape sequences when lexing rather than just keeping "\n" as two characters (for example)? Obviously the lexer would need to be aware of some of these (like escaped quotes) in order to be able to detect the correct end of string, but it seems like other than that there's no reason to store the string that way.
Edit: I guess it's because we want to conform to the ocaml AST definition which would presumably expect string in this form. Nevermind. Maybe if there's a way to keep the original string around (only for formatting)... I guess ocamlformat has similar problems.
Because String.length would break otherwise.
Imagine let x = "🙈" encoded in utf8. The 🙈 is represented by multiple bytes,
4 to be exact, in utf8. I.e. String.length x === 4.
If you write the literal notation for each byte, you get:
let x = "\240\159\153\136";
If the escape sequences were kept as written instead of being interpreted, you'd get String.length === 16.
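The two lengths can be checked directly in plain OCaml. In this sketch a {| |} literal, which performs no escape processing, stands in for what would be stored if the lexer kept the escapes as written; the names `interpreted` and `uninterpreted` are made up.

```ocaml
(* The lexer interprets each \ddd escape as one byte, so this is the
   4-byte UTF-8 encoding of U+1F648. *)
let interpreted = "\240\159\153\136"

(* A {| |} literal keeps the backslashes and digits as separate characters. *)
let uninterpreted = {|\240\159\153\136|}

let () =
  Printf.printf "%d %d\n"
    (String.length interpreted)
    (String.length uninterpreted)
  (* prints: 4 16 *)
```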
@IwanKaramazow my mistake was only thinking about the context of reformatting, where this wouldn't matter (afaict). Obviously the string should be escaped at runtime. Thanks for explanation.
" I guess it's because we want to conform to the ocaml AST definition which would presumably expect string in this form. Nevermind. Maybe if there's a way to keep the original string around (only for formatting)... "
Yes! We can do exactly what we do for comments, which is to keep the original comments around, separate from the AST. Then when printing strings, we can look up the original file content for that location range, and pull the strings from there instead of trying to print the content that was stored in the AST.
We have a format called reason_binary that we can --parse/--print and it includes (comments, AST). We could add a new format called reason_binary_with_strings which has (comments, AST, strings). This would be the form that the printer actually uses. We would merely keep the old form reason_binary so that we can perform upgrades from previous versions of Reason.
Feel free to take a shot at that one.
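For anyone picking this up, the lookup step (pulling a literal's original text out of the source by location range) might look roughly like the sketch below. The `range` type and `original_text` are hypothetical stand-ins using plain byte offsets; the real printer would presumably derive the offsets from the AST's location info.

```ocaml
(* Hypothetical stand-in for a location range: byte offsets into the
   source, start inclusive, stop exclusive. *)
type range = { start_ofs : int; stop_ofs : int }

(* Return the exact bytes of the source covered by [r], or None if the
   range does not fit inside the source. *)
let original_text (source : string) (r : range) : string option =
  if r.start_ofs < 0
     || r.stop_ofs < r.start_ofs
     || r.stop_ofs > String.length source
  then None
  else Some (String.sub source r.start_ofs (r.stop_ofs - r.start_ofs))

let () =
  let source = "let x = \"\240\159\153\136\";" in
  (* The literal, including both quotes, occupies bytes 8 through 13. *)
  match original_text source { start_ofs = 8; stop_ofs = 14 } with
  | Some text -> print_endline text
  | None -> print_endline "<range out of bounds>"
```

Since the bytes are never decoded or re-escaped, whatever encoding the file used comes back out unchanged.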
I'm just surprised there's not an easier way to get from the escaped string content back to the original bytes that were in the file. Is that really the case?
See #1780 for a very basic, incomplete attempt at using uutf for better UTF-8 encoded string printing.
@jordwalke I think the challenge in going from escaped strings back to the original content is that some assumption about or detection of the original encoding needs to be performed.
Thanks.
We could also just stuff the original bytes into a ppx attribute that is never printed couldn't we? Then when printing the string literal, you look to see if there is that secret (non-printed) attribute, and if so grab its string content.
Yeah - I think your suggestion of pulling raw bytes from the original source file is the best way to ensure string literals stay exactly as they were originally. That was the case I was thinking of when mistakenly thinking ocamlformat already solved this - if formatting is disabled for a file with [@@@ocamlformat.disabled] then the input is preserved exactly, leaving unicode and manual formatting intact.
I don't know much about mll, but is there a way to call "quoted_string" (which is how the {| |} strings are parsed), and then effectively run the string(s) function on the result:
https://github.com/facebook/reason/blob/master/src/reason-parser/reason_lexer.mll#L762
So that we can get both forms?
I would also suggest avoiding doing this when the --print is anything but --print re. (There's a configuration record we could use to store the mode so that the lexer can know what mode we're in).
FWIW, ocamlformat currently preserves unicode within literal {| |} strings, but not normal " " strings, since the bytes of literal strings are just passed straight through using Format.pp_print_string.
This should be fixed by https://github.com/facebook/reason/pull/1838 ! 😄
Yep