std::string::String should :
&str or [u8] should be with or without a UTF BOM, as chosen by the caller.
P.S. I am not a native English speaker, so what I describe may differ from my intended meaning.
Hey there. Thank you for your interest in designing Rust!
I think you can find more interest in the issue if you post it over at: https://internals.rust-lang.org/ :)
String and &str are explicitly designed to store and refer to UTF-8 data, so parsing/storing/emitting a BOM doesn't quite make sense. Its use in UTF-8 text is even explicitly discouraged by the Unicode standard itself, as it's essentially useless and it also breaks ASCII-compatibility.
Copy of my comment on https://github.com/rust-lang/rust/issues/50386#issuecomment-386035566, which was about "stripping" the BOM.
It's not clear what is being proposed. Which APIs exactly should strip?
And more importantly, why? I tend to think of these standard library APIs as low-level primitives, and feel that BOM removal would tend to belong more in a higher-level library that might, for example, also support multiple encodings and detect the presence of a BOM to help pick one. And even then, maybe not always. https://docs.rs/encoding_rs/0.7.2/encoding_rs/struct.Decoder.html#impl has different methods for different use cases, and only some of them remove a BOM.
@H2CO3 @SimonSapin
UTF-8 with a BOM should be accepted. What I want is just for the BOM to be dropped, without disturbing other components.
And it's rather difficult to tell whether a BOM is present, because it does not show up in println! output.
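For what it's worth, a BOM that survives decoding is just the character U+FEFF at the start of the string, so it can be checked for and removed in user code without any change to String. The sketch below is only an illustration: has_bom and strip_bom are hypothetical helper names, not std APIs, though str::starts_with, str::strip_prefix and str::as_bytes are.

```rust
/// Hypothetical helper: true if `s` begins with U+FEFF (the decoded BOM).
fn has_bom(s: &str) -> bool {
    s.starts_with('\u{FEFF}')
}

/// Hypothetical helper: `s` without a leading U+FEFF, if there is one.
fn strip_bom(s: &str) -> &str {
    s.strip_prefix('\u{FEFF}').unwrap_or(s)
}

fn main() {
    let text = "\u{FEFF}hello";

    // The BOM is part of the string even though it is normally invisible
    // when printed with `{}`.
    println!("{}", text);

    // Looking at the raw bytes makes it visible: U+FEFF encodes as EF BB BF.
    println!("{:?}", &text.as_bytes()[..3]); // prints [239, 187, 191]

    println!("has BOM: {}", has_bom(text));
    println!("stripped: {}", strip_bom(text));
}
```

strip_bom borrows from its input, so removing the BOM this way does not copy the string.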
String shouldn't have magic treatment for any characters. It makes sense for encoding/decoding methods to have options for handling the BOM, such as those @SimonSapin mentioned, but none of that should happen internally to String.
We may consider adding a method like
pub fn from_utf8_with_bom(vec: Vec<u8>) -> Result<String, FromUtf8Error>
or even
pub fn from_utf8_with_optional_bom(vec: Vec<u8>) -> Result<String, FromUtf8Error>
But as other comments say, this shouldn't affect String's internals. It could also just live in an external crate rather than in std.
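To make the second signature concrete, here is a rough sketch of how it could look as a free function outside std. This is only one possible reading of the proposal, not an existing API: the function body and the UTF8_BOM constant are assumptions made for illustration.

```rust
use std::string::FromUtf8Error;

// The UTF-8 encoding of U+FEFF.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Hypothetical helper (not in std): decode `vec` as UTF-8, dropping a
/// leading BOM if one is present.
fn from_utf8_with_optional_bom(mut vec: Vec<u8>) -> Result<String, FromUtf8Error> {
    if vec.starts_with(&UTF8_BOM) {
        // Remove the three BOM bytes in place before validating.
        vec.drain(..UTF8_BOM.len());
    }
    String::from_utf8(vec)
}

fn main() {
    let with_bom = vec![0xEF, 0xBB, 0xBF, b'h', b'i'];
    let without_bom = b"hi".to_vec();

    // Both inputs decode to the same String.
    assert_eq!(from_utf8_with_optional_bom(with_bom).unwrap(), "hi");
    assert_eq!(from_utf8_with_optional_bom(without_bom).unwrap(), "hi");

    // Invalid UTF-8 after the BOM is still rejected.
    assert!(from_utf8_with_optional_bom(vec![0xEF, 0xBB, 0xBF, 0xFF]).is_err());
}
```

The stricter from_utf8_with_bom variant is left out here: FromUtf8Error has no public constructor, so a version that rejects input lacking a BOM would need its own error type.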