V: Better string type

Created on 9 Sep 2020 · 18 comments · Source: vlang/v

There are two problems with string:

1. string is null-terminated like in C, so slicing has to allocate new memory for the string (see #5480). This can be very bad for performance.
2. It acts just like an array of bytes (it has a length and can be indexed and sliced) even though it holds UTF-8 encoded data. This can easily lead to code that accidentally mishandles multi-byte UTF-8 code points (see #6320).
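The second problem can be illustrated with a short Go sketch, since Go strings are also byte sequences holding UTF-8. The helper names `byteAndRuneCounts` and `validPrefix` are made up for this example and are not part of V or Go.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts returns the length of s in bytes and in code points.
func byteAndRuneCounts(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

// validPrefix reports whether the first n bytes of s are valid UTF-8,
// i.e. whether cutting at byte n avoids splitting a code point.
func validPrefix(s string, n int) bool {
	return utf8.ValidString(s[:n])
}

func main() {
	s := "cześć"
	b, r := byteAndRuneCounts(s)
	fmt.Println(b, r)              // 7 5: "ś" and "ć" take two bytes each
	fmt.Println(validPrefix(s, 3)) // true: the cut falls after "cze"
	fmt.Println(validPrefix(s, 4)) // false: the cut falls inside "ś"
}
```

Byte-based slicing is only safe when the cut happens to land on a code point boundary, which is exactly what is easy to overlook.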

IMO ideally a UTF-8 string type would look like this:

struct String {
pub mut:
  bytes []byte
}
mut str := 'cześć'
str << ', hi' // can append, which can use spare capacity in `str.bytes.cap`
// decode str into runes as needed
for r in str {
  // typeof(r) is rune
}

There is no len field, because the number of code points is not known. There is no direct indexing or slicing, because the positions of code points are not known. If you don't need decoding and just want to read the bytes, use the str.bytes array. This means the default behaviour is correct, and efficient operations on bytes are still quite easy.
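A rough Go analogue of that design might look like the sketch below. The `String` type and its `Append` and `EachRune` methods are hypothetical names chosen for illustration; this is not how V or Go implement strings.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// String is a hypothetical wrapper mirroring the proposed type: it exposes
// the raw bytes but offers no code-point length and no code-point indexing.
type String struct {
	Bytes []byte
}

// Append adds text, reusing spare capacity in Bytes when there is any.
func (s *String) Append(t string) {
	s.Bytes = append(s.Bytes, t...)
}

// EachRune decodes the bytes one code point at a time and passes each to visit.
func (s String) EachRune(visit func(r rune)) {
	for b := s.Bytes; len(b) > 0; {
		r, size := utf8.DecodeRune(b) // size is at least 1 for non-empty input
		visit(r)
		b = b[size:]
	}
}

func main() {
	str := String{Bytes: []byte("cześć")}
	str.Append(", hi")
	count := 0
	str.EachRune(func(r rune) { count++ })
	fmt.Println(len(str.Bytes), count) // 11 bytes, 9 code points
}
```

Code that only needs bytes works on `Bytes` directly, while anything character-oriented has to go through the decoding loop.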

Feature Request

ntrel

👍3

All 18 comments

Slicing will always copy data, like in Java. Keeping a 1 MB string in memory because there's a 5 byte slice re-using the data is a lot worse than copying the bytes.

medvednikov on 9 Sep 2020

It also makes ownership and autofree much easier.

medvednikov on 9 Sep 2020

[]rune will be used for UTF8 codepoints, like in Go.

medvednikov on 9 Sep 2020

Having immutable strings like in Java and Go is a huge benefit for performance and concurrency, that's not going to change.

medvednikov on 9 Sep 2020

> Slicing will always copy data, like in Java. Keeping a 1 MB string in memory because there's a 5 byte slice re-using the data is a lot worse than copying the bytes.

It looks like Java introduced the copying only recently.

Yes, immutable Strings are very good. For example, if they are used as keys in a map, it is essential. But @ntrel does not question this.

vmcrash on 9 Sep 2020

👍1

The core problem right now is that a UTF-8 representation (byte array) does not allow direct access to a rune at an arbitrary position, because a rune can consist of multiple bytes. Iterating from the first byte is the only straightforward choice. Of course, iterating backward also works, but getting the byte index from a rune index IMHO always requires iterating from the beginning or the end.
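The linear scan described here can be sketched in Go as follows. `byteOffset` is a hypothetical helper written for this example.

```go
package main

import "fmt"

// byteOffset returns the byte index at which the n-th code point (0-based)
// of s starts, or -1 if s has fewer than n+1 code points.
// It has to walk the string from the start, so it costs O(len(s)).
func byteOffset(s string, n int) int {
	count := 0
	for i := range s { // i is the byte index where each code point starts
		if count == n {
			return i
		}
		count++
	}
	return -1
}

func main() {
	s := "cześć"
	fmt.Println(byteOffset(s, 3)) // 3: "ś" starts right after "cze"
	fmt.Println(byteOffset(s, 4)) // 5: "ć" starts after the two bytes of "ś"
	fmt.Println(byteOffset(s, 5)) // -1: there are only 5 code points
}
```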

What about different internal representations (like recently in Java)? Because a string is immutable, we know in advance whether it contains only runes that fit into 1 byte or 2 bytes. So internally there could be 3 implementations, US-ASCII (1 byte), UTF-16 (2 bytes) or UTF-32 (4 bytes). Then accessing any arbitrary position would be possible.
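Choosing the representation up front could look roughly like this Go sketch. `unitWidth` is a hypothetical helper, and it assumes the 2-byte form only holds code points up to U+FFFF, so that every code point really fits in one fixed-size unit.

```go
package main

import "fmt"

// unitWidth returns the smallest fixed unit size in bytes (1, 2 or 4)
// that can hold every code point of s in a single unit.
// Assumption: 1 byte covers US-ASCII only, 2 bytes cover up to U+FFFF.
func unitWidth(s string) int {
	width := 1
	for _, r := range s {
		switch {
		case r > 0xFFFF:
			return 4 // nothing wider exists, so stop scanning
		case r > 0x7F:
			width = 2
		}
	}
	return width
}

func main() {
	fmt.Println(unitWidth("hello")) // 1
	fmt.Println(unitWidth("cześć")) // 2
	fmt.Println(unitWidth("a😀"))    // 4
}
```

The scan is done once when the string is created, which is affordable only because the string never changes afterwards.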

vmcrash on 9 Sep 2020

> The core problem right now is that a UTF-8 representation (byte array) does not allow direct access to a rune at an arbitrary position

It'll work like in Go:

for rune in string {

and

[]rune(s)[i]
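For reference, this is what the two Go forms do today. `runeAt` is a hypothetical wrapper added for the example; whether V ends up with the same syntax or cost is not settled in this thread.

```go
package main

import "fmt"

// runeAt returns the i-th code point of s using the []rune conversion.
// The conversion decodes the whole string first, so it is O(len(s)) in
// time and memory. It panics if i is out of range.
func runeAt(s string, i int) rune {
	return []rune(s)[i]
}

func main() {
	s := "cześć"
	// Ranging over a string yields (byte index, code point) pairs.
	for i, r := range s {
		fmt.Printf("byte %d: %c\n", i, r)
	}
	fmt.Printf("%c\n", runeAt(s, 3)) // ś
	fmt.Printf("%#x\n", s[3])        // 0xc5, only the first byte of ś
}
```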

medvednikov on 9 Sep 2020

> It looks like Java introduced the copying only recently.

8 years ago

http://java-performance.info/changes-to-string-java-1-7-0_06/

medvednikov on 9 Sep 2020

How time flies! It just feels like 1 or 2 years ago.

You often refer to Go. What are the problems with Go that you are trying to avoid or work around with V?

tmssngr on 10 Sep 2020

> Keeping a 1 MB string in memory because there's a 5 byte slice re-using the data is a lot worse than copying the bytes.

That choice should be up to the programmer. Allocating is expensive in many cases.

Array slicing in V does not allocate, this is inconsistent. And -autofree has to support this for arrays, so it could for strings too.
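The two choices can be shown with Go byte slices, used here as a stand-in because Go slicing also shares the backing array. `shareSlice` and `copySlice` are hypothetical helpers for this example.

```go
package main

import "fmt"

// shareSlice returns a view of the first n bytes of b.
// No allocation: the result aliases b's backing array.
func shareSlice(b []byte, n int) []byte {
	return b[:n]
}

// copySlice returns an independent copy of the first n bytes of b.
// It allocates, but the result does not keep b alive.
func copySlice(b []byte, n int) []byte {
	out := make([]byte, n)
	copy(out, b)
	return out
}

func main() {
	data := []byte("hello world")
	view := shareSlice(data, 5)
	dup := copySlice(data, 5)
	data[0] = 'J' // visible through the view, not through the copy
	fmt.Println(string(view), string(dup)) // Jello hello
}
```

The shared view is cheap but ties its lifetime to the original buffer, while the copy costs an allocation and lets the original be freed.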

ntrel on 10 Sep 2020

👍1

> []rune will be used for UTF8 codepoints, like in Go.

That's not UTF-8, that's UTF-32.

> immutable strings like in Java and Go is a huge benefit for performance and concurrency,

That's why I used pub mut not __global. The byte contents don't change. The byte length can, if it's a mut string variable.

ntrel on 10 Sep 2020

No, it's UTF8, not UTF32.

medvednikov on 10 Sep 2020

If I understand it correctly, UTF-8 is an encoding that maps characters to byte sequences, where each character is represented by one or more bytes. Are you saying that []rune is a byte array? How is it different from []byte then? Or is rune a 32-bit uint and hence []rune an array of 32-bit uints? Then IMHO UTF-32 would be the right name for it.
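The difference in storage is easy to measure in Go, where rune is a 32-bit integer type holding one code point. `storageSizes` is a hypothetical helper for this example, and the sketch says nothing about how V lays out its own rune arrays.

```go
package main

import "fmt"

// storageSizes returns how many bytes s occupies as UTF-8 and how many
// bytes its code points occupy as a []rune.
func storageSizes(s string) (utf8Bytes, runeBytes int) {
	runes := []rune(s)            // decodes UTF-8 into one rune per code point
	return len(s), len(runes) * 4 // a Go rune is an int32: 4 bytes each
}

func main() {
	s := "cześć"
	u, r := storageSizes(s)
	fmt.Println(u, r)                   // 7 20
	fmt.Println(string([]rune(s)) == s) // true: converting back re-encodes as UTF-8
}
```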

vmcrash on 11 Sep 2020

👍1

rune is an alias of int, and byte is an alias of u8. rune has that range so that it can store Unicode characters. []rune is a slice of these Unicode characters. You can read more about this here: https://stackoverflow.com/questions/19310700/what-is-a-rune

Delta456 on 11 Sep 2020

Also, I prefer having a separate rune type for denoting Unicode characters. It makes the distinction from byte clear and increases performance, as @medvednikov said.

Delta456 on 11 Sep 2020

So runes can't be UTF-8, right?

vmcrash on 11 Sep 2020

@vmcrash they are

Delta456 on 11 Sep 2020

> That choice should be up to the programmer. Allocating is expensive in many cases.

I agree. V will allow both approaches.

> Array slicing in V does not allocate, this is inconsistent.

Yes, this will be changed.

medvednikov on 12 Sep 2020

👍1
