RFC: std::string::String could provide options about UTF BOM

Created on 3 May 2018 · 6 Comments · Source: rust-lang/rfcs

std::string::String should:

  • correctly accept a byte stream with or without a UTF BOM,
  • when converted to &str or [u8], include or omit the UTF BOM as the caller chooses.

P.S. I am not a native English speaker, so what I describe may differ from my intended meaning.

T-libs

All 6 comments

Hey there. Thank you for your interest in designing Rust!

I think you can find more interest in the issue if you post it over at: https://internals.rust-lang.org/ :)

String and &str are explicitly designed to store and refer to UTF-8 data, so parsing/storing/emitting a BOM doesn't quite make sense. Its use in UTF-8 text is even explicitly discouraged by the Unicode standard itself, as it's essentially useless and it also breaks ASCII-compatibility.
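To illustrate the point, here is a minimal sketch (the helper name `decode_keeping_bom` is made up for this example) showing that `String::from_utf8` already accepts BOM-prefixed input. The BOM is valid UTF-8, so it is simply kept as an ordinary leading U+FEFF character with no special treatment.

```rust
/// UTF-8 encoding of U+FEFF, the byte order mark.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Hypothetical helper: decode bytes as UTF-8 without touching a BOM.
/// Returns None if the bytes are not valid UTF-8.
fn decode_keeping_bom(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

/// Builds a BOM-prefixed byte buffer for the given ASCII/UTF-8 text.
fn with_bom(text: &str) -> Vec<u8> {
    let mut bytes = UTF8_BOM.to_vec();
    bytes.extend_from_slice(text.as_bytes());
    bytes
}
```

Decoding `with_bom("hello")` this way gives a string whose first `char` is `'\u{FEFF}'`, and whose byte length is 8 rather than 5, because the BOM occupies three bytes in UTF-8.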

Copy of my comment on https://github.com/rust-lang/rust/issues/50386#issuecomment-386035566, which was about "stripping" the BOM.

It's not clear what is being proposed. Which APIs exactly should strip?

And more importantly, why? I tend to think of these standard library APIs as low-level primitives, and feel that BOM removal would tend to belong more in a higher-level library that might for example also support multiple encodings and detect the presence of a BOM to help pick one. And even then, maybe not always. https://docs.rs/encoding_rs/0.7.2/encoding_rs/struct.Decoder.html#impl has different methods for different use cases, only some of which remove a BOM.

@H2CO3 @SimonSapin
UTF-8 with a BOM should be accepted. All I want is for the BOM to be dropped so that it does not disturb other components.
It is also rather difficult to detect a BOM, because it does not show up in println! output.
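For what it's worth, a leading BOM can be detected and removed in user code with existing `str` methods. The following is only a sketch, and the function names `has_bom`, `strip_bom` and `debug_repr` are invented for illustration. Debug formatting (`{:?}`) escapes the BOM, which makes it visible where plain `println!("{}", s)` does not.

```rust
/// The byte order mark as a Unicode scalar value.
const BOM: char = '\u{FEFF}';

/// Returns true if the text begins with a byte order mark.
fn has_bom(text: &str) -> bool {
    text.starts_with(BOM)
}

/// Returns the text without a single leading BOM, borrowing from the input.
/// Text that has no BOM is returned unchanged.
fn strip_bom(text: &str) -> &str {
    text.strip_prefix(BOM).unwrap_or(text)
}

/// Debug formatting escapes the BOM, so it shows up as an escape sequence.
fn debug_repr(text: &str) -> String {
    format!("{:?}", text)
}
```

Because `strip_bom` returns a subslice, no copy is made, and the original `String` is left untouched for callers that still want the BOM.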

String shouldn't have magic treatment for any characters. It makes sense for encoding/decoding methods to have options for handling the BOM, such as those @SimonSapin mentioned, but none of that should happen internally to String.

We may consider adding a method like

pub fn from_utf8_with_bom(vec: Vec<u8>) -> Result<String, FromUtf8Error>

or even

pub fn from_utf8_with_optional_bom(vec: Vec<u8>) -> Result<String, FromUtf8Error>

But, as the other comments say, this shouldn't affect String's internals. It could also just live in an external crate rather than std.
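As a rough sketch of how such a function could look outside std: the free function below reuses the proposed name `from_utf8_with_optional_bom`, but it is not an existing std API, and `to_bytes` is a hypothetical counterpart for the output side. Both work at the byte boundary and leave `String` itself alone.

```rust
use std::string::FromUtf8Error;

/// UTF-8 encoding of U+FEFF, the byte order mark.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Hypothetical helper (not in std): decode UTF-8 bytes, dropping one
/// leading BOM if present. Input without a BOM is decoded as-is.
fn from_utf8_with_optional_bom(mut vec: Vec<u8>) -> Result<String, FromUtf8Error> {
    if vec.starts_with(UTF8_BOM) {
        // Remove the three BOM bytes in place before validation.
        vec.drain(..UTF8_BOM.len());
    }
    String::from_utf8(vec)
}

/// Hypothetical helper (not in std): encode text to bytes, prepending a
/// BOM only when the caller asks for one.
fn to_bytes(text: &str, with_bom: bool) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.len() + UTF8_BOM.len());
    if with_bom {
        out.extend_from_slice(UTF8_BOM);
    }
    out.extend_from_slice(text.as_bytes());
    out
}
```

Keeping the choice in an explicit `with_bom` argument means the caller decides at the encoding step, which matches the request in the original post without giving any character special status inside `String`.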
