Version Used:
Version 16.7.0 Preview 2.0 [30112.204.master]
Steps to Reproduce:
```
csc a.cs
```
where a.cs is a file that contains bytes that are not valid UTF-8.
The compiler attempts to use UTF-8 encoding, fails, and falls back to the default OS ANSI code page.
https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/EncodedStringText.cs#L24-L52
Proposal:
One option is to set CodePage to UTF-8 by default when the project targets net5.
@ryzngard @clairernovotny FYI. For our deterministic rebuild scenario we could capture the OS code page in the PDB. However, that doesn't completely solve the problem, since the code page might not be available on the machine when rebuilding. In that case (if this turned out to be a real problem) we could in theory have a service (database) that provides all existing code pages and download the missing one from there.
Another problem with capturing the OS code page is that by doing so we would make the PDB non-deterministic even in the case when all source files are UTF8 encoded but /codepage is not specified.
@jaredpar @agocke @gafter FYI
> Another problem with capturing the OS code page is that by doing so we would make the PDB non-deterministic even in the case when all source files are UTF8 encoded but /codepage is not specified.
We should only capture fallback encoding if the compiler actually used it and thus the compilation is already non-deterministic.
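A minimal sketch of that idea, assuming a hypothetical `build_pdb_metadata` helper and a made-up `fallback-encoding` key (neither is an existing Roslyn or PDB name): the OS code page is recorded only when some file actually failed strict UTF-8 decoding.

```python
from typing import Dict


def build_pdb_metadata(sources: Dict[str, bytes], os_code_page: str) -> Dict[str, str]:
    """Return the encoding metadata that would be recorded (hypothetical shape)."""
    metadata: Dict[str, str] = {}
    for data in sources.values():
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            # The fallback was really used, so the build already depends
            # on the OS code page; only now is it worth recording.
            metadata["fallback-encoding"] = os_code_page
    return metadata
```

When every file is valid UTF-8 the result is empty regardless of `os_code_page`, so the recorded metadata stays identical across machines.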
> One option is to set CodePage to UTF-8 by default when the project targets net5.
Slight change: net5 or higher. Essentially this should be the new default going forward.
> We should only capture fallback encoding if the compiler actually used it and thus the compilation is already non-deterministic.
Curious: why do we feel this is more severe than say the non-determinism that comes from floating point constant folding? Both are host specific forms of non-determinism and it's possible, to some degree, to control both of them. For instance we could force a conv.r4 at every layer of floating point folding to make the value fairly deterministic.
> Curious: why do we feel this is more severe than say the non-determinism that comes from floating point constant folding?
Isn't the floating point deterministic as long as we use the same version of the runtime? Or is there a dependency on the OS/CPU?
Adding the OS encoding would unnecessarily make the build dependent on OS configuration, where it may not be today (say all files are UTF8 encoded and the OS encoding is never used by the compiler).
I'm a little confused by the problem described in the issue description. We failed to use a UTF-8 encoding, how can we reasonably fall back to UTF-8? I am not very familiar with CodePages so apologies if I'm missing something obvious.
We already do force a conv.r4 at every step of constant folding. I believe that is required by the C# language specification. Perhaps you're thinking of decimal, where we rely on the runtime library to do the math for us, and the library is different in different versions. https://github.com/dotnet/runtime/issues/1611
@RikkiGibson we are falling back to the OS ANSI code page (unless /codepage is specified explicitly).
@gafter So the decimal operations would only depend on the version of corlib (that the compiler uses), correct?
@RikkiGibson
Our encoding story is complex and strongly influenced by how the native compiler handled encoding. Essentially though in the absence of an explicit encoding we do the following:
In the presence of an explicit /codePage argument the compiler will consider that code page only and no others.
What @tmat is asking for here is essentially that when no /codePage is specified and we're at net5 or higher then pretend like UTF-8 was explicitly specified. This would be done at our MSBuild layer where we make other TF based decisions.
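The proposed default can be sketched as a small decision function. This is only an illustration in Python, not the MSBuild logic itself: `effective_code_page` and its parameters are hypothetical, and comparing a bare version tuple against 5.0 is a simplifying assumption about how "net5 or higher" would be detected. The value 65001 is the Windows code page identifier for UTF-8.

```python
from typing import Optional, Tuple

UTF8_CODE_PAGE = 65001  # Windows code page identifier for UTF-8


def effective_code_page(
    explicit: Optional[int], target_version: Tuple[int, int]
) -> Optional[int]:
    """Pick the code page to pass to the compiler (hypothetical helper)."""
    if explicit is not None:
        # An explicit /codePage always wins; only that code page is considered.
        return explicit
    if target_version >= (5, 0):
        # Proposed behavior: act as if UTF-8 had been specified explicitly.
        return UTF8_CODE_PAGE
    # Existing behavior: pass nothing and let the compiler detect the encoding,
    # possibly falling back to the OS code page.
    return None
```

Under these assumptions, a project targeting 5.0 with no explicit setting gets 65001, an older target gets no code page (today's behavior), and an explicit value such as 1252 is passed through unchanged.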
@tmat Yes