RFC: std::string::String could provide options about UTF BOM

Created on 3 May 2018 · 6 Comments · Source: rust-lang/rfcs

std::string::String should:

correctly accept a byte stream with or without a UTF BOM,
when converted to &str or [u8], include or omit the UTF BOM as the caller chooses.

P.S. I am not a native English speaker, so what I describe may differ from my intended meaning.

T-libs

hanyuwei70

👎7

Most helpful comment

String shouldn't have magic treatment for any characters. It makes sense for encoding/decoding methods to have options for handling the BOM, such as those @SimonSapin mentioned, but none of that should happen internally to String.

joshtriplett on 4 May 2018

👍8

All 6 comments

Hey there. Thank you for your interest in designing Rust!

I think you can find more interest in the issue if you post it over at: https://internals.rust-lang.org/ :)

Centril on 3 May 2018

String and &str are explicitly designed to store and refer to UTF-8 data, so parsing/storing/emitting a BOM doesn't quite make sense. Its use in UTF-8 text is even explicitly discouraged by the Unicode standard itself, as it's essentially useless and it also breaks ASCII-compatibility.
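As a small illustration of the point above (a sketch, not part of the original comment): String::from_utf8 already accepts BOM-prefixed input, because the bytes EF BB BF are simply the valid UTF-8 encoding of U+FEFF. The BOM is kept as an ordinary leading char, and the resulting text is no longer pure ASCII.

```rust
fn main() {
    // "abc" preceded by the UTF-8 encoding of U+FEFF (bytes EF BB BF).
    let bytes = vec![0xEF, 0xBB, 0xBF, b'a', b'b', b'c'];

    // The BOM bytes are well-formed UTF-8, so the conversion succeeds as-is.
    let s = String::from_utf8(bytes).expect("BOM-prefixed input is valid UTF-8");

    // The BOM is kept as an ordinary leading `char`; nothing is stripped.
    assert_eq!(s.chars().next(), Some('\u{feff}'));
    assert_eq!(s.len(), 6); // 3 BOM bytes + 3 ASCII bytes
    assert_eq!(s.chars().count(), 4);

    // The text is no longer pure ASCII, although "abc" on its own is.
    assert!(!s.is_ascii());
    assert!("abc".is_ascii());
}
```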

H2CO3 on 3 May 2018

👍4

Copy of my comment on https://github.com/rust-lang/rust/issues/50386#issuecomment-386035566, which was about "stripping" the BOM.

It’s not clear what is being proposed. Which APIs exactly should strip?

And more importantly, why? I tend to think of these standard library APIs as low-level primitives, and feel that BOM removal would tend to belong more in a higher-level library that might, for example, also support multiple encodings and detect the presence of a BOM to help pick one. And even then, maybe not always. https://docs.rs/encoding_rs/0.7.2/encoding_rs/struct.Decoder.html#impl has different methods for different use cases, only some of which remove a BOM.

SimonSapin on 4 May 2018

👍6

@H2CO3 @SimonSapin
UTF-8 with a BOM should be accepted. What I want is just to let the BOM go so that it does not disturb other components.
And it is rather difficult to detect a BOM, because it does not show up in println!.
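For readers hitting the same problem, here is a minimal sketch of detecting and dropping a leading BOM on the caller's side using only existing std methods. The helper name strip_bom is hypothetical and is not a std API.

```rust
/// The byte order mark as a `char` (U+FEFF); in UTF-8 it is the bytes EF BB BF.
const BOM: char = '\u{feff}';

/// Hypothetical helper: returns `s` without one leading BOM,
/// or `s` unchanged if it does not start with a BOM.
fn strip_bom(s: &str) -> &str {
    s.strip_prefix(BOM).unwrap_or(s)
}

fn main() {
    let with_bom = "\u{feff}hello";

    // `{}` output gives no visible hint of the BOM, so check explicitly.
    println!("starts with BOM: {}", with_bom.starts_with(BOM));

    // Looking at the raw bytes in hex also reveals the EF BB BF prefix.
    println!("{:x?}", with_bom.as_bytes());

    // Drop the BOM before handing the text to other components.
    let clean = strip_bom(with_bom);
    assert_eq!(clean, "hello");
}
```

Because strip_bom only borrows a subslice, it does not copy or reallocate the string.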

hanyuwei70 on 4 May 2018

We may consider adding a method like

pub fn from_utf8_with_bom(vec: Vec<u8>) -> Result<String, FromUtf8Error>

or even

pub fn from_utf8_with_optional_bom(vec: Vec<u8>) -> Result<String, FromUtf8Error>

But as the other comments say, this shouldn't affect String's internals. This could also just live in an external crate rather than std.
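A minimal sketch of how the second signature could be written as a free function outside std, assuming "optional BOM" means "strip one leading UTF-8 BOM if present, then validate". The function is hypothetical and does not exist in std.

```rust
use std::string::FromUtf8Error;

/// UTF-8 encoding of U+FEFF, the byte order mark.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Hypothetical free-function version of the proposed method:
/// removes one leading UTF-8 BOM if present, then validates the rest.
fn from_utf8_with_optional_bom(mut vec: Vec<u8>) -> Result<String, FromUtf8Error> {
    if vec.starts_with(&UTF8_BOM) {
        // Remove the three BOM bytes in place, reusing the allocation.
        vec.drain(..UTF8_BOM.len());
    }
    // Ordinary UTF-8 validation; String itself needs no special casing.
    String::from_utf8(vec)
}

fn main() {
    let with_bom = vec![0xEF, 0xBB, 0xBF, b'h', b'i'];
    let without_bom = vec![b'h', b'i'];

    assert_eq!(from_utf8_with_optional_bom(with_bom).unwrap(), "hi");
    assert_eq!(from_utf8_with_optional_bom(without_bom).unwrap(), "hi");
}
```

One consequence of this design: on invalid input, the bytes recoverable from the FromUtf8Error no longer include the BOM, since it was removed before validation.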

WiSaGaN on 25 May 2018
