V: Better string type

Created on 9 Sep 2020 · 18 Comments · Source: vlang/v

There are 2 problems with string:

  1. string is null terminated like in C, so slicing has to allocate new memory for the string (see #5480). This can be very bad for performance.
  2. It acts just like an array of bytes - it has length, can be indexed and sliced - even though it holds UTF-8 encoded data. This can often lead to code accidentally not handling multi-byte UTF-8 code points (see #6320).
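The second problem can be seen in Go, whose strings also expose bytes through `len` and indexing. This is a minimal sketch in Go rather than V; the helper names `byteLen` and `runeLen` are made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen reports the length of s in bytes, which is what len() returns for a Go string.
func byteLen(s string) int { return len(s) }

// runeLen reports the number of Unicode code points in s.
func runeLen(s string) int { return utf8.RuneCountInString(s) }

func main() {
	s := "cześć" // 'ś' and 'ć' each take two bytes in UTF-8
	fmt.Println(byteLen(s))   // 7
	fmt.Println(runeLen(s))   // 5
	fmt.Printf("%#x\n", s[3]) // 0xc5: the first byte of 'ś', not a whole character
}
```

Code that treats `s[3]` as "the fourth character" silently gets half of a code point.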

IMO ideally a UTF-8 string type would look like this:

struct String {
pub mut:
  bytes []byte
}
mut str := 'cześć'
str << ', hi' // can append, which can use spare capacity in `str.bytes.cap`
// decode str into runes as needed
for r in str {
  // typeof(r) is rune
}

There is no len field, because the number of code points is not known. There is no direct indexing or slicing, because the positions of code points are not known. If you don't need decoding and just want to read the bytes, use the str.bytes array. This means the default behaviour is correct, and efficient operations on bytes are still quite easy.
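A rough sketch of how such a type could behave, written in Go because the rest of this thread compares against Go. The `String` type and its `Append` and `Runes` methods are hypothetical stand-ins for the proposal, not an existing V or Go API.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// String is a hypothetical stand-in for the proposed type: it exposes only bytes,
// with no length in code points and no indexing by code point.
type String struct {
	Bytes []byte
}

// Append adds UTF-8 text, reusing spare capacity in Bytes when there is any.
func (s *String) Append(text string) {
	s.Bytes = append(s.Bytes, text...)
}

// Runes decodes the bytes front to back and calls yield once per code point.
func (s String) Runes(yield func(r rune)) {
	for b := s.Bytes; len(b) > 0; {
		r, size := utf8.DecodeRune(b) // size is at least 1 for non-empty input
		yield(r)
		b = b[size:]
	}
}

func main() {
	str := String{Bytes: []byte("cześć")}
	str.Append(", hi")
	count := 0
	str.Runes(func(r rune) { count++ })
	fmt.Println(count, len(str.Bytes)) // 9 11
}
```

The only way to learn the number of code points here is to decode, which is the point of the proposal.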

Feature Request

All 18 comments

Slicing will always copy data, like in Java. Keeping a 1 MB string in memory because there's a 5 byte slice re-using the data is a lot worse than copying the bytes.
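For comparison, Go takes the opposite default: a substring shares the original's memory, and the programmer copies explicitly when a small piece should not keep a large string alive. A minimal sketch, assuming Go 1.18 or later for `strings.Clone`; `prefixCopy` is a made-up helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixCopy returns the first n bytes of s as an independent copy,
// so the (possibly large) original can be garbage collected.
func prefixCopy(s string, n int) string {
	return strings.Clone(s[:n])
}

func main() {
	big := strings.Repeat("x", 1<<20) // a 1 MB string
	view := big[:5]                   // shares big's backing memory, no allocation
	owned := prefixCopy(big, 5)       // allocates 5 bytes, independent of big

	fmt.Println(view == owned, len(owned)) // true 5
}
```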

It also makes ownership and autofree much easier.

[]rune will be used for UTF8 codepoints, like in Go.

Having immutable strings like in Java and Go is a huge benefit for performance and concurrency, that's not going to change.

> Slicing will always copy data, like in Java. Keeping a 1 MB string in memory because there's a 5 byte slice re-using the data is a lot worse than copying the bytes.

It looks like Java introduced the copying only recently.

Yes, immutable Strings are very good. For example, if they are used as keys in a map, it is essential. But @ntrel does not question this.

The core problem right now is that a UTF-8 representation (byte array) does not allow direct access to a rune at an arbitrary position, because a rune can consist of multiple bytes. Iterating from the first byte is the only straightforward choice. Of course, iterating backward also works, but getting the byte index from a rune index IMHO always requires iterating from the beginning or the end.
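The linear scan described here looks roughly like this in Go. This is an illustrative sketch; `byteOffset` is a made-up name, not a library function.

```go
package main

import "fmt"

// byteOffset returns the byte offset at which the code point with index
// runeIdx starts, or -1 if s has fewer code points. It must scan from the start.
func byteOffset(s string, runeIdx int) int {
	if runeIdx < 0 {
		return -1
	}
	n := 0
	for off := range s { // range yields the starting byte offset of each code point
		if n == runeIdx {
			return off
		}
		n++
	}
	return -1
}

func main() {
	fmt.Println(byteOffset("cześć", 4)) // 5
	fmt.Println(byteOffset("cześć", 9)) // -1
}
```

The cost grows with the index, so code point indexing on UTF-8 is O(n) rather than O(1).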

What about different internal representations (like recently in Java)? Because a string is immutable, we know in advance whether it contains only runes that fit into 1 byte or 2 bytes. So internally there could be 3 implementations: US-ASCII (1 byte per rune), UTF-16 without surrogate pairs (2 bytes per rune) or UTF-32 (4 bytes per rune). Then accessing any arbitrary position would be possible.
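Choosing among those three representations would need one pass over the string when it is created. A sketch in Go under that assumption; `unitWidth` is a hypothetical helper, and the thresholds follow the 1/2/4 byte split suggested in this comment.

```go
package main

import "fmt"

// unitWidth returns the smallest fixed width in bytes (1, 2 or 4) that can
// hold every code point of s.
func unitWidth(s string) int {
	width := 1
	for _, r := range s {
		switch {
		case r > 0xFFFF: // outside the Basic Multilingual Plane
			return 4
		case r > 0x7F: // not US-ASCII
			width = 2
		}
	}
	return width
}

func main() {
	fmt.Println(unitWidth("hello")) // 1
	fmt.Println(unitWidth("cześć")) // 2
	fmt.Println(unitWidth("hi 😀"))  // 4
}
```

With a fixed width per string, the code point at index i starts at byte i times the width, at the price of up to four times the memory of ASCII text.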

> The core problem right now is that a UTF-8 representation (byte array) does not allow direct access to a rune at an arbitrary position

It'll work like in Go

for rune in string {

and

[]rune(s)[i]
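For reference, this is how the two forms behave in Go itself; whether V ends up matching this exactly is an assumption. `runeAt` is a made-up helper name.

```go
package main

import "fmt"

// runeAt returns the i-th code point of s by converting to []rune, which
// decodes the whole string and allocates a new slice. It panics if i is out of range.
func runeAt(s string, i int) rune {
	return []rune(s)[i]
}

func main() {
	s := "cześć"
	// range decodes UTF-8 on the fly: off is a byte offset, r is a rune.
	for off, r := range s {
		fmt.Printf("%d:%c ", off, r) // 0:c 1:z 2:e 3:ś 5:ć
	}
	fmt.Println()
	fmt.Printf("%c\n", runeAt(s, 3)) // ś
}
```

Note that the byte offsets skip from 3 to 5, and that the `[]rune` conversion costs a full decode plus an allocation for every call.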

> It looks like Java introduced the copying only recently.

8 years ago

http://java-performance.info/changes-to-string-java-1-7-0_06/

How time flies! It just feels like 1 or 2 years ago.

You often refer to Go - what are the problems with Go that you try to avoid/workaround with V?

> Keeping a 1 MB string in memory because there's a 5 byte slice re-using the data is a lot worse than copying the bytes.

That choice should be up to the programmer. Allocating is expensive in many cases.

Array slicing in V does not allocate, this is inconsistent. And -autofree has to support this for arrays, so it could for strings too.

> []rune will be used for UTF8 codepoints, like in Go.

That's not UTF-8, that's UTF-32.

> immutable strings like in Java and Go is a huge benefit for performance and concurrency,

That's why I used pub mut not __global. The byte contents don't change. The byte length can, if it's a mut string variable.

No, it's UTF8, not UTF32.

If I understand it correctly, UTF-8 is an encoding to encode characters to byte sequences where each character can be represented by 1 or more bytes. Are you saying that []rune is a byte array? What then is the difference to []byte? Or is rune a 32 bit uint and hence []rune an array of 32 bit uints? Then IMHO UTF-32 would be the right name for it.

rune is an alias of int, and byte of u8. A rune is large enough to store any Unicode character, and []rune is a slice of these Unicode characters. You can read more about this here: https://stackoverflow.com/questions/19310700/what-is-a-rune
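In Go, which the linked answer describes, `rune` is an alias for `int32`, so a `[]rune` stores one 4-byte code point per element while the string itself stays UTF-8. A small sketch of the size difference; `storageBytes` is a made-up helper name.

```go
package main

import (
	"fmt"
	"unsafe"
)

// storageBytes returns how many bytes the contents of s occupy as UTF-8
// and as a []rune holding the same text.
func storageBytes(s string) (utf8Bytes, runeBytes int) {
	runes := []rune(s)
	return len(s), len(runes) * int(unsafe.Sizeof(rune(0)))
}

func main() {
	u, r := storageBytes("cześć")
	fmt.Println(u, r) // 7 20

	// Converting back re-encodes the code points as UTF-8.
	fmt.Println(string([]rune("cześć")) == "cześć") // true
}
```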

Also, I prefer having a separate rune type for denoting Unicode characters. It makes the distinction from byte clear and improves performance, as @medvednikov said.

So runes can't be UTF-8, right?

@vmcrash they are

> That choice should be up to the programmer. Allocating is expensive in many cases.

I agree. V will allow both approaches.

> Array slicing in V does not allocate, this is inconsistent.

Yes, this will be changed.
