Julia: Feature request: a binary/byte string type `BString` in stdlib

Created on 10 Oct 2020  ·  3 Comments  ·  Source: JuliaLang/julia

I would like to propose the addition of a new binary string type BString to stdlib. Like String, it would be a subtype of AbstractString, but unlike String, it can only hold sequences of Unicode characters in the range U+0000 to U+00FF. In memory, BString encodes each character as one byte (like ISO 8859-1). BString would have the same in-memory layout as String such that encoding a String value into UTF-8 and returning it as a BString byte sequence would be a no-op.

The name BString is deliberately ambiguous: it could be read as "binary string", "byte string" or "basic Latin string", because this type serves several purposes and has several advantages over String:

  • It would be well suited to process arbitrary binary data (as character = byte) using all the convenient string-processing and I/O functions available for AbstractString, but without a UTF-8 decoder always running in the background.

  • It would also be well suited for processing text data where only ASCII characters are of interest (even if the text data is UTF-8 encoded!).

A particular advantage of BString over Vector{UInt8} is that BString has the exact same in-memory representation as String and therefore conversion between String and BString would be a no-op in all situations where the algorithm does not care about non-ASCII characters.

Imagine, for example, that you write a parser, such as for a CSV file, which only cares about ASCII metacharacters (in the case of a CSV file: commas, quotes and linefeeds). Such an algorithm does not care in the least about the UTF-8 character encoding. Everything other than the metacharacters is just a byte sequence, whether it is in ISO 8859, UTF-8 or EUC-JP, that gets passed on as such. However, with String there is essentially a UTF-8 decoder running in the background all the time, often completely unnecessarily, if only ASCII characters are of interest. By reinterpreting a String variable as a BString variable, a programmer can essentially tell Julia: I'm not interested here in UTF-8 decoding any Unicode characters at all, either because I only look for ASCII characters, or because this is really arbitrary binary data (e.g., a JPEG header), and I merely want to use the string parsing functionality that comes with AbstractString. BString would do exactly that.

Offering the programmer both a UTF-8 (String) and a binary (BString) variant of the AbstractString library would essentially be doing exactly what Perl does (where each string has a built-in UTF-8 flag that says whether each element of the string sequence is a byte or a Unicode character). In fact, the dynamic Perl string type with UTF-8 flag would in Julia then be identical to Union{String,BString}. This has worked extremely well since Perl 5.8, and binary string processing (having the full String API available for binary data) is something that I very much miss in Julia.

The main reason for why BString should go into stdlib, and not into a package, is very simple: to keep the methods available for String and BString exactly aligned. I would therefore like to implement each BString method one line below the corresponding implementation of the String method (i.e., in the same file!), such that when future extensions to the String API are made, BString is updated as well. Also, I feel that providing a binary string type is an extremely basic and elementary function that should be part of the standard library.

Vector{UInt8} has a completely different memory layout and function API from String (motivated by the mutable MATLAB-like matrix type, with dimensions, etc.) and is therefore no replacement for BString. BString would naturally offer regular expressions, number formatting, substring searching, and lots of other string processing and I/O functionality that Vector{UInt8} does not.

Julia has already decided to make String a quite different data type from a Vector of characters, and therefore we need a binary, non-UTF-8 version of String as well.

strings

All 3 comments

Use BinaryString from https://github.com/JuliaString/Strs.jl?

> The main reason for why BString should go into stdlib, and not into a package, is very simple: to keep the methods available for String and BString exactly aligned

That's the case with almost all extensions of abstract types, and the solution is that Base provides generic methods that work on any concrete subtype as long as you define a few core methods for it. Why wouldn't that work in this case?

> However, with String there is essentially a UTF-8 decoder running in the background all the time, often completely unnecessarily, if only ASCII characters are of interest.

It's unclear what you mean by a UTF-8 decoder "running in the background". Decoding only happens if you actually access a character from a string and then only that character is decoded. Note that you can already iterate a string as bytes using codeunits, which will give you essentially what you want for ASCII strings. That said, the overhead of string iteration being able to handle UTF-8 as well is actually quite minimal:

julia> using BenchmarkTools

julia> @btime collect($("a"^100));
  880.824 ns (1 allocation: 496 bytes)

julia> @btime [Char(i) for i in codeunits($("a"^100))];
  794.622 ns (1 allocation: 496 bytes)

There are some further gains to be made if you use UInt8 for representing your characters instead of Char, but then again, that's what codeunits(::String) already does.
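As a rough illustration of the codeunits approach, here is a minimal sketch of scanning a CSV-like line for commas at the byte level. The function name `comma_offsets` and the sample line are made up for this example. It relies on the fact that in UTF-8 an ASCII byte such as `,` never occurs inside a multi-byte sequence, so a byte comparison is enough and no character is decoded.

```julia
# Hypothetical helper (not part of Base): collect the byte offsets of every
# comma in a String by walking its code units, without decoding any characters.
function comma_offsets(line::String)
    offsets = Int[]
    for (i, b) in enumerate(codeunits(line))
        # Compare raw bytes; UInt8(',') is 0x2c.
        b == UInt8(',') && push!(offsets, i)
    end
    return offsets
end

line = "naïve,日本語,plain"
offs = comma_offsets(line)

# A comma's byte offset is always a valid String index, so the field before
# it can be sliced out with prevind.
first_field = line[1:prevind(line, offs[1])]
```

The offsets are byte positions, so the non-ASCII fields are passed through untouched as byte sequences, which is the behaviour the original post asks for.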

Note that what you're proposing is just a string type that uses the Latin-1 encoding. That could be added in a minimal external package, defining Latin1String <: AbstractString and Latin1Char <: AbstractChar. You could provide a reinterpret(::Type{Latin1String}, ::String) method that shares the same memory as the original String object, but reinterprets it as being Latin-1 encoded. You can do ASCII-oriented string processing on that, ignoring the mojibake that results from incorrectly interpreting any UTF-8 data as Latin-1. The Latin1Char type would wrap a UInt8 and would be what you get from indexing into a Latin1String. This actually sounds like a pretty fun little package to write. I don't think there's any need for this to live in Base: there's no technical reason for it (works perfectly well as a package) and no social reason for it (most people don't need this).
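To make the suggestion concrete, here is a minimal sketch of what such a wrapper might look like. It is only an illustration under simplifying assumptions: the type is hypothetical, it wraps the String through a plain constructor rather than a reinterpret method, and it returns ordinary Char values (Char(byte) is the Latin-1 code point for that byte) instead of defining a separate Latin1Char type. Only the core AbstractString methods are defined; length, searching, comparison and collect come from Base's generic fallbacks.

```julia
# Hypothetical type for illustration; not part of Base or of any existing package.
struct Latin1String <: AbstractString
    data::String  # the wrapped String is referenced, not copied
end

# Core AbstractString interface: one code unit (byte) per character.
Base.ncodeunits(s::Latin1String) = ncodeunits(s.data)
Base.codeunit(::Latin1String) = UInt8
Base.codeunit(s::Latin1String, i::Integer) = codeunit(s.data, i)
Base.isvalid(s::Latin1String, i::Integer) = 1 <= i <= ncodeunits(s)

function Base.iterate(s::Latin1String, i::Integer = 1)
    i > ncodeunits(s) && return nothing
    # Simplification: return a Char with code point U+0000..U+00FF
    # rather than a dedicated Latin1Char type.
    return Char(codeunit(s, i)), Int(i) + 1
end

s = Latin1String("é,x")           # "é" is two bytes in UTF-8
n = length(s)                     # counts bytes, not Unicode characters
pos = findfirst(==(','), s)       # generic search from Base
chars = collect(Latin1String("é"))  # the mojibake: each UTF-8 byte becomes one Char
```

With this definition every byte index is a valid string index, so ASCII-oriented processing never has to worry about landing in the middle of a character; the cost is that non-ASCII UTF-8 text shows up as mojibake, as noted above.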
