Sdk: Can generate the file without BOM?

Created on 5 May 2018 · 14 Comments · Source: dotnet/sdk

Steps to reproduce

dotnet new xxx

Expected behavior

All files are generated without a BOM

Actual behavior

All files are generated with a BOM

Environment data

dotnet --info output:

Product Information:
Version: 2.1.104
Commit SHA-1 hash: 48ec687460

Runtime Environment:
OS Name: Windows
OS Version: 10.0.16299
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\2.1.104\

Microsoft .NET Core Shared Framework Host

Version : 2.0.6
Build : 74b1c703813c8910df5b96f304b0f2b78cdf194d

All 14 comments

I believe this has been fixed in 2.1.300. @peterhuene, I remember you looked into something similar in the past. Can you confirm?

An issue was fixed regarding modifying solution files with dotnet sln with dotnet/cli#8199 for 2.1.300.

@seanmars is there a particular generated file you're expecting to see the BOM? That is to say, what is the exact command you're running?

I think the .cs, .json, .css ... files should not have a BOM; only the sln file needs one, right?
But right now, if you use dotnet new mvc (or webapi), every file is generated with a BOM.

(Three screenshots hosted on Imgur were attached here.)

I do see UTF-8 BOMs with a lot of source files in both https://github.com/aspnet/templating and https://github.com/dotnet/templating.

I found these related issues:
https://github.com/aspnet/templating/issues/500
https://github.com/dotnet/templating/pull/477

Originally it seems that this was done to force Visual Studio to treat the files as UTF-8, but that might have been before charset in .editorconfig was respected. Thus, it might be worth raising the issue again with both of the above repos to see if the time has come to remove the BOMs from the templates.

Is there any news?

I am running into this:

C:\testapps\threeapp>type Program.cs
∩╗┐using System;

namespace threeapp
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

C:\testapps\threeapp>dotnet --version
3.0.100-preview4-010345

Do we just need someone to go through the templates and remove the BOMs?
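For anyone who wants to check a generated file without relying on how the console renders it: the stray characters in front of `using` are the three UTF-8 BOM bytes (EF BB BF), which code page 437 displays as ∩╗┐. Below is a minimal sketch that tests for those bytes. The helper name `StartsWithUtf8Bom` and the temp file standing in for Program.cs are made up for illustration; they are not part of the SDK.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: true when the data begins with the UTF-8 BOM (EF BB BF).
bool StartsWithUtf8Bom(byte[] data) =>
    data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;

// Stand-in for a freshly generated Program.cs (the temp path is illustrative only).
string samplePath = Path.GetTempFileName();
File.WriteAllText(samplePath, "using System;\n",
    new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));

byte[] sampleBytes = File.ReadAllBytes(samplePath);
Console.WriteLine(StartsWithUtf8Bom(sampleBytes) ? "BOM present" : "no BOM"); // BOM present
Console.WriteLine(BitConverter.ToString(sampleBytes, 0, 3));                  // EF-BB-BF

File.Delete(samplePath);
```

Pointing `File.ReadAllBytes` at a real template output instead of the temp file gives the same check for an actual project.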

@vijayrkn @mlorbetske to comment. These templates live in dotnet/templating.

❗ Source files need to be generated _with BOM_. Otherwise, certain editors will treat them in a non-uniform manner, and eventually someone will accidentally save the file with question marks (the encoding-error fallback character). Normally I see this when author names in files get messed up, but recently we found a curly quote in dotnet/winforms that was incorrectly saved. These errors are easy to miss and (in many cases) hard to fix, so we create the file with a BOM to avoid the problem altogether.
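To make the failure mode concrete, here is a small sketch of how a curly quote turns into a question mark once text passes through an encoding that cannot represent it, plus the .NET switch that controls whether a UTF-8 BOM is emitted. `SaveAsAscii` is a hypothetical helper that simulates a misconfigured editor; it is not taken from any real tool.

```csharp
using System;
using System.Text;

// Hypothetical helper: round-trips text through ASCII, the way an editor might
// if it guessed the wrong encoding. Unrepresentable characters become '?'.
string SaveAsAscii(string text) =>
    Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));

// "don’t" written with a curly apostrophe (U+2019).
Console.WriteLine(SaveAsAscii("don\u2019t")); // don?t

// The constructor argument decides whether writers emit the 3-byte BOM.
var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
var utf8WithoutBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
Console.WriteLine(utf8WithBom.GetPreamble().Length);    // 3
Console.WriteLine(utf8WithoutBom.GetPreamble().Length); // 0
```

The damage is silent: once the file is saved that way, nothing on disk indicates which character the question mark used to be.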

We have a secondary benefit that the BOM triggers an early exit in the automatic encoding detection algorithm in .NET, so editors like Visual Studio load files faster. It's a small win and not really significant compared to the problem above, but I find it interesting. 😄
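As a rough illustration of BOM-driven detection (this uses `StreamReader`, and is not a claim about what Visual Studio does internally): when `detectEncodingFromByteOrderMarks` is true, the reader switches to UTF-8 as soon as it sees the BOM, regardless of the fallback encoding it was given. `DetectEncodingName` is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: reports which encoding StreamReader settled on.
string DetectEncodingName(byte[] fileBytes, Encoding fallback)
{
    using var reader = new StreamReader(new MemoryStream(fileBytes), fallback,
        detectEncodingFromByteOrderMarks: true);
    reader.ReadToEnd(); // detection only happens once data has been read
    return reader.CurrentEncoding.WebName;
}

// Deliberately wrong fallback, to show that the BOM overrides it.
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

byte[] body = Encoding.UTF8.GetBytes("using System;");
byte[] withBom = new byte[body.Length + 3];
withBom[0] = 0xEF; withBom[1] = 0xBB; withBom[2] = 0xBF;
body.CopyTo(withBom, 3);

Console.WriteLine(DetectEncodingName(withBom, latin1)); // utf-8
Console.WriteLine(DetectEncodingName(body, latin1));    // no BOM: the fallback is kept
```

Without the BOM the reader has nothing to go on and keeps whatever encoding the caller guessed, which is where the garbling described above starts.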

Given how contentious this has always been, I'd prefer to leave things as they are unless there's a compelling reason to change the content to exclude the BOM. It is worth noting that the presence or absence of the BOM is determined by the source content, so while say CS files have a BOM, it's not required that JS or RB files do - this is a choice that a template author can make to best suit their audience. With the comment by @sharwell, it seems like the most prudent thing to do for tools built on .NET consuming these content files is to leave the BOM in the content - is this agreeable?

If we want to have a longer discussion on this, I can move the issue to the dotnet/templating repo.

That sounds reasonable to me.

I don't think there is a compelling reason either to include or to exclude the BOM, and I agree with what @sharwell said. But I believe other files (like js, css...) run into the same problem. Why don't they use a BOM? It's really interesting! 🤪

@seanmars Some file formats do not allow the BOM (e.g. JSON), and others have a history of not supporting it well (e.g. many Java-based tools, since BOM handling there is manual and often overlooked). C# templates have long contained the BOM so tooling surrounding these files is well-equipped to handle it correctly.
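For context on the JSON case: RFC 8259 says implementations must not add a BOM to the start of networked-transmitted JSON text, while parsers may choose to ignore one. When content has to be handed to a consumer that rejects the BOM, one option is to strip it first. A minimal sketch, with `StripUtf8Bom` as a hypothetical helper name:

```csharp
using System;
using System.Text;

// Hypothetical helper: returns the data without a leading UTF-8 BOM, if any.
ReadOnlySpan<byte> StripUtf8Bom(ReadOnlySpan<byte> data) =>
    data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF
        ? data.Slice(3)
        : data;

// A two-character JSON document, "{}", preceded by a BOM.
byte[] jsonWithBom = { 0xEF, 0xBB, 0xBF, (byte)'{', (byte)'}' };

ReadOnlySpan<byte> cleaned = StripUtf8Bom(jsonWithBom);
Console.WriteLine(cleaned.Length);                   // 2
Console.WriteLine(Encoding.UTF8.GetString(cleaned)); // {}
```

Input that has no BOM passes through unchanged, so the helper is safe to apply unconditionally.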

FWIW, at least for C#, to quote from ECMA-334 5th Edition (https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-334.pdf):

  1. Conformance (PDF Page 25):
A conforming implementation of C# shall interpret characters in conformance with
the Unicode Standard. Conforming implementations shall accept Unicode source
files encoded with the UTF-8 encoding form.

7.1 Programs (PDF Page 35):

Conforming implementations shall accept Unicode source files encoded with the
UTF-8 encoding form (as defined by the Unicode standard), and transform them
into a sequence of Unicode characters. Implementations can choose to accept and
transform additional character encoding schemes (such as UTF-16, UTF-32, or
non-Unicode character mappings).

Nothing in here says that a source file has to contain the BOM, so if you are looking for the be-all and end-all, it will not be found in the standard...

That being said, in every Visual Studio version we have ever used, the templates have always contained the BOM. We have commit hooks that enforce it internally because of some of the issues @sharwell has mentioned. In our case, a portion of code contained some exotic characters required by a third-party library, and they were garbled by text editors that did not respect the fact that the file was indeed UTF-8. As he says, having the BOM avoids more issues than it causes. YMMV.
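For readers wondering what such an enforcement check might look like, here is a rough sketch of a scan that lists the .cs files under a directory that lack a UTF-8 BOM. It is not the hook described above. The helper names `FileHasUtf8Bom` and `FindFilesMissingBom` and the throwaway temp directory are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: reads at most three bytes and checks for EF BB BF.
bool FileHasUtf8Bom(string path)
{
    var head = new byte[3];
    using var stream = File.OpenRead(path);
    int read = 0;
    while (read < head.Length)
    {
        int n = stream.Read(head, read, head.Length - read);
        if (n == 0) break; // file is shorter than three bytes
        read += n;
    }
    return read == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
}

// Hypothetical helper: every file matching the pattern that has no BOM.
List<string> FindFilesMissingBom(string root, string pattern) =>
    Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories)
        .Where(p => !FileHasUtf8Bom(p))
        .ToList();

// Demo on a throwaway directory; a real hook would scan the repository root.
string demoRoot = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(demoRoot);
File.WriteAllText(Path.Combine(demoRoot, "WithBom.cs"), "class A {}", new UTF8Encoding(true));
File.WriteAllText(Path.Combine(demoRoot, "NoBom.cs"), "class B {}", new UTF8Encoding(false));

foreach (string file in FindFilesMissingBom(demoRoot, "*.cs"))
    Console.WriteLine($"missing BOM: {Path.GetFileName(file)}"); // missing BOM: NoBom.cs

Directory.Delete(demoRoot, recursive: true);
```

A hook built on this would exit with a non-zero code whenever the list is non-empty, which is enough to block the commit.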
