I would like to propose the addition of a new binary string type `BString` to stdlib. Like `String`, it would be a subtype of `AbstractString`, but unlike `String`, it can only hold sequences of Unicode characters in the range U+0000 to U+00FF. In memory, `BString` encodes each character as one byte (like ISO 8859-1). `BString` would have the same in-memory layout as `String`, such that encoding a `String` value into UTF-8 and returning it as a `BString` byte sequence would be a no-op.
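To make the "one character = one byte" encoding concrete, here is a small sketch using only existing Base functions (no `BString` type exists; the variable names are illustrative). It shows that Latin-1 stores the code point directly as a byte, that UTF-8 needs two bytes for the same character, and what happens when UTF-8 bytes are read back as if they were Latin-1.

```julia
s = "caf\u00e9"                          # "café", with é = U+00E9

# Latin-1: each character in U+0000..U+00FF is stored as its code point.
latin1_bytes = UInt8[UInt8(c) for c in s]

# UTF-8 (what String stores): é takes two bytes, 0xc3 0xa9.
utf8_bytes = collect(codeunits(s))

# Decoding Latin-1 is just byte -> code point.
roundtrip = String(Char.(latin1_bytes))

# Reading UTF-8 bytes as Latin-1 keeps ASCII intact but splits é in two.
misread = String(Char.(utf8_bytes))
```

The ASCII part of `misread` is unchanged, which is the property the proposal relies on: an algorithm that only looks at ASCII characters sees the same data under either interpretation.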
The name `BString` could be ambiguously interpreted as "binary string", "byte string" or "basic Latin string", because this type has multiple functions and advantages over `String`:
- It would be well suited to process arbitrary binary data (as character = byte) using all the convenient string-processing and I/O functions available for `AbstractString`, but without a UTF-8 decoder always running in the background.
- It would also be well suited for processing text data where only ASCII characters are of interest (even if the text data is UTF-8 encoded!).
- A particular advantage of `BString` over `Vector{UInt8}` is that `BString` has the exact same in-memory representation as `String`, and therefore conversion between `String` and `BString` would be a no-op in all situations where the algorithm does not care about non-ASCII characters.
Imagine, for example, that you write a parser, such as for a CSV file, which only cares about ASCII metacharacters (in the case of a CSV file: commas, quotes and linefeeds). Such an algorithm does not care in the least about the UTF-8 character encoding. Everything other than the metacharacters is just a byte sequence, whether in ISO 8859, UTF-8 or EUC-JP, that gets passed on as such. However, with `String` there is essentially a UTF-8 decoder running in the background all the time, often completely unnecessarily, if only ASCII characters are of interest. By reinterpreting a `String` variable as a `BString` variable, a programmer can essentially tell Julia: I'm not interested here in UTF-8 decoding any Unicode characters at all, either because I only look for ASCII characters, or because this is really arbitrary binary data (e.g., a JPEG header), and I merely want to use the string parsing functionality that comes with `AbstractString`. `BString` would do exactly that.
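For comparison, the CSV scenario can already be approximated today by scanning the bytes returned by `codeunits`. The following is a rough sketch, not a full CSV parser; `split_csv_bytes` is a hypothetical helper name, and the quote handling is deliberately simplistic (it toggles on every `"` and keeps the quotes in the output). It is safe on UTF-8 input because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for a comma or a quote.

```julia
# Split one CSV record on commas by looking only at raw bytes.
# Hypothetical helper for illustration; not part of Base.
function split_csv_bytes(line::String)
    bytes = codeunits(line)            # byte view of the string, no decoding
    fields = String[]
    start = 1                          # byte index where the current field begins
    in_quotes = false
    for (i, b) in enumerate(bytes)
        if b == UInt8('"')
            in_quotes = !in_quotes     # commas inside quotes are not separators
        elseif b == UInt8(',') && !in_quotes
            push!(fields, String(bytes[start:i-1]))
            start = i + 1
        end
    end
    push!(fields, String(bytes[start:end]))   # last field
    return fields
end

fields = split_csv_bytes("a,\"b,c\",caf\u00e9")
```

The non-ASCII field is passed through byte for byte without ever being decoded, which is the behavior the proposal asks for; what this sketch lacks is the rest of the `AbstractString` API on top of the byte view.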
Offering the programmer both a UTF-8 (`String`) and a binary (`BString`) variant of the `AbstractString` library would essentially be doing exactly what Perl does (where each string has a built-in UTF-8 flag that says whether each element of the string sequence is a byte or a Unicode character). In fact, the dynamic Perl string type with UTF-8 flag would in Julia then be identical to `Union{String,BString}`. This has worked extremely well since Perl 5.8, and binary string processing (having the full `String` API available for binary data) is something that I very much miss in Julia.
The main reason why `BString` should go into stdlib, and not into a package, is very simple: to keep the methods available for `String` and `BString` exactly aligned. I would therefore like to implement each `BString` method one line below the corresponding implementation of the `String` method (i.e., in the same file!), such that when future extensions to the `String` API are made, `BString` is updated as well. Also, I feel that providing a binary string type is an extremely basic and elementary function that should be part of the standard library.
`Vector{UInt8}` has a completely different memory layout and function API from `String` (motivated by the mutable MATLAB-like matrix type, with dimensions, etc.) and is therefore no replacement for `BString`. `BString` would naturally offer regular expressions, number formatting, substring searching, and lots of other string processing and I/O functionality that `Vector{UInt8}` does not.
Julia has already decided to make `String` a quite different data type from a `Vector` of characters, and therefore we need a binary, non-UTF-8 version of `String` as well.
Use `BinaryString` from https://github.com/JuliaString/Strs.jl?
> The main reason why BString should go into stdlib, and not into a package, is very simple: to keep the methods available for String and BString exactly aligned
That's the case with almost all extensions of abstract types, and the solution is that Base writes generic methods that work on the concrete type as long as you define a few core methods on it. Why wouldn't that work in this case?
> However, with String there is essentially a UTF-8 decoder running in the background all the time, often completely unnecessarily, if only ASCII characters are of interest.
It's unclear what you mean by a UTF-8 decoder "running in the background". Decoding only happens if you actually access a character from a string, and then only that character is decoded. Note that you can already iterate a string as bytes using `codeunits`, which will give you essentially what you want for ASCII strings. That said, the overhead of string iteration being able to handle UTF-8 as well is actually quite minimal:
julia> @btime collect($("a"^100));
880.824 ns (1 allocation: 496 bytes)
julia> @btime [Char(i) for i in codeunits($("a"^100))];
794.622 ns (1 allocation: 496 bytes)
There are some further gains to be made if you use `UInt8` for representing your characters instead of `Char`, but then again, that's what `codeunits(::String)` already does.
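As a small illustration of the two access paths (character indexing, which decodes on demand, versus `codeunits`, which exposes raw `UInt8` values), the following uses only existing Base functions; the variable names are illustrative.

```julia
s = "h\u00e9llo"                 # "héllo": é occupies bytes 2 and 3

second_char = s[2]               # decodes just this one character
mid_is_valid = isvalid(s, 3)     # byte 3 is a continuation byte, not a character start
next_index = nextind(s, 2)       # index of the character after é

cu = codeunits(s)                # byte view over the same data
second_byte = cu[2]              # lead byte of é, no decoding involved
l_count = count(==(UInt8('l')), cu)   # ASCII search done purely on bytes
```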
Note that what you're proposing is just a string type that uses the Latin-1 encoding. That could be added in a minimal external package, defining `Latin1String <: AbstractString` and `Latin1Char <: AbstractChar`. You could provide a `reinterpret(::Type{Latin1String}, ::String)` method that shares the same memory as the original `String` object, but reinterprets it as being Latin-1 encoded. You can do ASCII-oriented string processing on that, ignoring the mojibake that results from incorrectly interpreting any UTF-8 data as Latin-1. The `Latin1Char` type would wrap a `UInt8` and would be what you get from indexing into a `Latin1String`. This actually sounds like a pretty fun little package to write. I don't think there's any need for this to live in Base: there's no technical reason for it (works perfectly well as a package) and no social reason for it (most people don't need this).
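A minimal sketch of what such a package could look like follows. The names `Latin1String` and `Latin1Char` are taken from the comment above; everything else (the field names, wrapping a `String` rather than some other storage, the choice of which methods to define) is an assumption made for illustration, not an existing package. It defines only the core methods and relies on Base's generic `AbstractString` and `AbstractChar` fallbacks for the rest.

```julia
# Hypothetical sketch; these types do not exist in Base.
struct Latin1Char <: AbstractChar
    byte::UInt8
    # Inner constructor only, so no catch-all default constructor is generated.
    Latin1Char(b::UInt8) = new(b)
end
Latin1Char(u::UInt32) = Latin1Char(UInt8(u))   # InexactError above U+00FF
Base.codepoint(c::Latin1Char) = UInt32(c.byte)

struct Latin1String <: AbstractString
    data::String                                # shared with the original, not copied
    Latin1String(s::String) = new(s)
end

# Reinterpret an existing String as Latin-1 without copying it.
Base.reinterpret(::Type{Latin1String}, s::String) = Latin1String(s)

# Core string interface: every byte is exactly one character.
Base.ncodeunits(s::Latin1String) = ncodeunits(s.data)
Base.codeunit(::Latin1String) = UInt8
Base.codeunit(s::Latin1String, i::Integer) = codeunit(s.data, i)
Base.isvalid(s::Latin1String, i::Int) = 1 <= i <= ncodeunits(s)
Base.eltype(::Type{Latin1String}) = Latin1Char
function Base.iterate(s::Latin1String, i::Int = 1)
    i > ncodeunits(s) && return nothing
    return Latin1Char(codeunit(s, i)), i + 1
end

utf8 = "a,caf\u00e9"                       # 6 characters, 7 bytes as UTF-8
l1 = reinterpret(Latin1String, utf8)
n_chars = length(l1)                       # counts bytes, via the generic fallback
n_commas = count(c -> c == ',', l1)        # ASCII-oriented processing still works
lead = codepoint(l1[6])                    # first half of the mojibake for é
```

With only these definitions, generic functions such as `length`, indexing and iteration come from Base, which is the mechanism the earlier reply refers to when it asks why generic methods would not suffice.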