Reason: [refmt] Do not convert unicode strings

Created on 24 Jan 2018 · 21 Comments · Source: reasonml/reason

refmt automatically turns strings like "└── " into this:

Making it practically useless.

This is from esy, where we pinned to 28a8922. (but the issue persists on the latest commit)

rauanmayemir

👍3

Most helpful comment

Related: https://github.com/facebook/reason/issues/1384, #1053, #799

@jordwalke you said a while ago that this is a matter of finding the right unicode printing library (while marking it as a "good first task"). Do you have any idea of possible candidates that you would want someone to explore if they were to tackle this?

Schmavery on 24 Jan 2018

👍3

All 21 comments

This doesn't occur with ocamlformat - any idea why the two would act differently?

hcarty on 24 Jan 2018

Apparently, a well-known issue.

Please consider bumping its priority. It correctly matches the input in switches, but it makes the language unusable if your alphabet's differences don't stop at å, ä, ö.

rauanmayemir on 24 Jan 2018

How curious.

(the value here is correct, I'm surprised that the comment is not being turned to junk)

rauanmayemir on 24 Jan 2018

I did some more investigation.
I don't know much about the refmt internals, but as far as I can tell, we print out strings here: https://github.com/facebook/reason/blob/43146a22f95dd9ce10d407c6f251ed85541fcae8/src/reason-parser/reason_pprint_ast.ml#L2377

Here pp = Format.fprintf (but Printf.fprintf, sprintf, etc. also reproduce it as long as "%S" is used).

I believe the problem is that the %S formatter can't handle unicode properly. Presumably it does some faulty escaping or similar? I've tested this locally, and using %S seems to reproduce the problem.

According to https://caml.inria.fr/pub/docs/manual-ocaml/libref/Printf.html, this is
S: convert a string argument to OCaml syntax (double quotes, escapes).
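A minimal sketch of the difference, using only the standard library and the string from the report (it assumes the source file is saved as UTF-8). %S goes through the same escaping as String.escaped, which rewrites every byte outside printable ASCII as a decimal \ddd escape, while %s copies the bytes through untouched:

```ocaml
(* UTF-8 bytes of the box-drawing prefix from the report. *)
let s = "└── "

(* %S quotes the string and escapes each non-ASCII byte as \ddd. *)
let with_S = Printf.sprintf "%S" s

(* %s copies the bytes unchanged; the quotes are added by hand. *)
let with_s = Printf.sprintf "\"%s\"" s

let () =
  print_endline with_S;              (* "\226\148\148\226\148\128\226\148\128 " *)
  print_endline with_s;              (* "└── " *)
  print_endline (String.escaped s)   (* same escapes as %S, without the quotes *)
```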

Compare this to the line below https://github.com/facebook/reason/blob/43146a22f95dd9ce10d407c6f251ed85541fcae8/src/reason-parser/reason_pprint_ast.ml#L2378
Which does print properly (and doesn't use %S).

Hopefully someone smarter than me can advise: in this case, since we want to reproduce what the user types, is there a need for additional escaping? Or are we lucky and can we simply replace this line with the following? (Call me a wishful thinker)

| Pconst_string (i, None) -> pp f "\"%s\"" i

Edit: Tried making the change and running tests, it seems we do still need to do some escaping because the escape sequences seem to be parsed when reading in input. Not sure why this is necessary... Relevant code is here I think: https://github.com/facebook/reason/blob/master/src/reason-parser/reason_lexer.mll#L692-L707
https://github.com/facebook/reason/blob/master/src/reason-parser/reason_lexer.mll#L451-L459
https://github.com/facebook/reason/blob/master/src/reason-parser/reason_lexer.mll#L762-L810

Schmavery on 24 Jan 2018

@Schmavery, you'll still need to escape double quotes (as well as '\' '\n' '\t' '\r' '\b')
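One way this could look, as a rough sketch rather than what refmt does: escape only the characters listed above (plus the remaining ASCII control bytes) and pass every byte >= 128 through unchanged. The name escape_keep_utf8 is hypothetical, and the sketch assumes the string is valid UTF-8; arbitrary binary data would be printed raw.

```ocaml
(* Hypothetical helper: escape quotes, backslashes and control characters,
   but leave bytes >= 128 alone so UTF-8 sequences survive intact. *)
let escape_keep_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '"' -> Buffer.add_string buf "\\\""
      | '\\' -> Buffer.add_string buf "\\\\"
      | '\n' -> Buffer.add_string buf "\\n"
      | '\t' -> Buffer.add_string buf "\\t"
      | '\r' -> Buffer.add_string buf "\\r"
      | '\b' -> Buffer.add_string buf "\\b"
      | c when Char.code c < 32 || Char.code c = 127 ->
        (* Other ASCII control bytes keep the decimal \ddd form. *)
        Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c))
      | c -> Buffer.add_char buf c)
    s;
  Buffer.contents buf

let () = print_endline ("\"" ^ escape_keep_utf8 "└── \n" ^ "\"")
```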

hhugo on 24 Jan 2018

Is this all that’s needed: https://github.com/ocaml-ppx/ocamlformat/blob/2d973e38571f574b773809d5d02f155cf2782676/src/Fmt_ast.ml#L314

If someone fixes it quickly it will go out in the release

jordwalke on 24 Jan 2018

@jordwalke No, shoot - I was wrong. ocamlformat has the same issue with unicode strings. Sorry for the noise.

hcarty on 24 Jan 2018

Did some investigation a while ago and, as @Schmavery indicates, OCaml's Format scrambles the unicode with the way we're printing.

IwanKaramazow on 24 Jan 2018

@jordwalke in my testing, String.escaped exhibits the same problem.
Can anyone explain why we need to parse in escape sequences when lexing rather than just keeping "\n" as two characters (for example)? Obviously the lexer would need to be aware of some of these (like escaped quotes) in order to be able to detect the correct end of string, but it seems like other than that there's no reason to store the string that way.

Edit: I guess it's because we want to conform to the ocaml AST definition which would presumably expect string in this form. Nevermind. Maybe if there's a way to keep the original string around (only for formatting)... I guess ocamlformat has similar problems.

Schmavery on 24 Jan 2018

Because String.length would break otherwise.
Imagine let x = "🙈" encoded in utf8. The 🙈 is represented by multiple bytes,
4 to be exact, in utf8. I.e. String.length x === 4.
If you write the literal notation for each byte, you get:
let x = "\240\159\153\136";
If you don't escape the literals, you'll get String.length === 16.
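The three cases above can be checked directly in plain OCaml (assuming a UTF-8 source file). The last binding simulates a lexer that keeps the escape sequences as literal text instead of decoding them:

```ocaml
(* The emoji typed directly: 4 UTF-8 bytes. *)
let emoji = "🙈"

(* The same 4 bytes written as decimal escapes; the lexer decodes them. *)
let decoded = "\240\159\153\136"

(* What the string would hold if escapes were kept as text:
   four groups of backslash plus three digits. *)
let undecoded = "\\240\\159\\153\\136"

let () =
  Printf.printf "%d %d %d\n"
    (String.length emoji) (String.length decoded) (String.length undecoded)
  (* prints: 4 4 16 *)
```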

IwanKaramazow on 24 Jan 2018

@IwanKaramazow my mistake was only thinking about the context of reformatting, where this wouldn't matter (afaict). Obviously the string should be escaped at runtime. Thanks for explanation.

Schmavery on 24 Jan 2018

" I guess it's because we want to conform to the ocaml AST definition which would presumably expect string in this form. Nevermind. Maybe if there's a way to keep the original string around (only for formatting)... "

Yes! We can do exactly what we do for comments, which is to keep the original comments around, separate from the AST. Then when printing strings, we can look up the original file content for that location range, and pull the strings from there instead of trying to print the contents that were stored in the AST.
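A rough sketch of that lookup, under stated assumptions: the span record below is a hypothetical stand-in for the byte offsets carried by a real AST location, and original_literal is an illustrative name, not an existing refmt function. It returns the literal exactly as typed, or None so the caller can fall back to printing the AST value.

```ocaml
(* Hypothetical stand-in for the byte range of a string constant. *)
type span = { start_ofs : int; end_ofs : int }

(* Slice the literal out of the original source text, quotes included.
   Returns None if the range is out of bounds or is not a "..." literal. *)
let original_literal (source : string) (loc : span) : string option =
  let len = loc.end_ofs - loc.start_ofs in
  if loc.start_ofs < 0 || len < 2 || loc.end_ofs > String.length source then
    None
  else
    let text = String.sub source loc.start_ofs len in
    if text.[0] = '"' && text.[len - 1] = '"' then Some text else None

let () =
  let src = "let x = \"└── \";" in
  let loc =
    { start_ofs = String.index src '"'; end_ofs = String.rindex src '"' + 1 }
  in
  match original_literal src loc with
  | Some lit -> print_endline lit
  | None -> print_endline "(fall back to the AST value)"
```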

We have a format called reason_binary that we can --parse/--print and it includes (comments, AST). We could add a new format called reason_binary_with_strings, which has (comments, AST, strings). This would be the form that the printer actually uses. We would merely keep the old form reason_binary so that we can perform upgrades from previous versions of Reason.
Feel free to take a shot at that one.

I'm just surprised there's not an easier way to get from the escaped string content back to the original bytes that were in the file. Is that really the case?

jordwalke on 24 Jan 2018

👍1

See #1780 for a very basic, incomplete attempt at using uutf for better UTF-8 encoded string printing.

hcarty on 24 Jan 2018

@jordwalke I think the challenge in going from escaped strings back to the original content is that some assumption about or detection of the original encoding needs to be performed.

hcarty on 24 Jan 2018

Thanks.

We could also just stuff the original bytes into a ppx attribute that is never printed couldn't we? Then when printing the string literal, you look to see if there is that secret (non-printed) attribute, and if so grab its string content.
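To make the idea concrete, here is a toy sketch. The record types are simplified stand-ins for the real Parsetree, and the attribute name reason.raw_literal is made up for illustration; nothing here is existing refmt code.

```ocaml
(* Simplified, hypothetical stand-ins for AST attributes and constants. *)
type attribute = { name : string; payload : string }
type string_const = { value : string; attrs : attribute list }

(* Made-up name for the hidden attribute carrying the original bytes. *)
let raw_attr_name = "reason.raw_literal"

(* Prefer the stashed original text; otherwise fall back to %S escaping. *)
let print_string_const (c : string_const) : string =
  match List.find_opt (fun a -> a.name = raw_attr_name) c.attrs with
  | Some a -> a.payload
  | None -> Printf.sprintf "%S" c.value

let () =
  let with_raw =
    { value = "└── ";
      attrs = [ { name = raw_attr_name; payload = "\"└── \"" } ] }
  in
  print_endline (print_string_const with_raw);
  print_endline (print_string_const { value = "└── "; attrs = [] })
```

The fallback branch keeps today's behavior for ASTs that were not produced by the lexer, for example ones converted from another syntax.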

jordwalke on 24 Jan 2018

Yeah - I think your suggestion of pulling raw bytes from the original source file is the best way to ensure string literals stay exactly as they were originally. That was the case I was thinking of when mistakenly thinking ocamlformat already solved this - if formatting is disabled for a file with [@@@ocamlformat.disabled] then the input is preserved exactly, leaving unicode and manual formatting intact.

hcarty on 25 Jan 2018

I don't know much about mll, but is there a way to call "quoted_string" (which is how the {| |} strings are parsed), and then effectively run the string(s) function on the result:

https://github.com/facebook/reason/blob/master/src/reason-parser/reason_lexer.mll#L762

So that we can get both forms?

I would also suggest avoiding doing this when the --print is anything but --print re. (There's a configuration record we could use to store the mode so that the lexer can know what mode we're in).

jordwalke on 25 Jan 2018

👍1

FWIW, ocamlformat currently preserves unicode within literal {| |} strings, but not normal " " strings, since the bytes of literal strings are just passed straight through using Format.pp_print_string.