Years ago I filed #6714, hoping that it would one day be possible to construct a string efficiently without wasted allocations.
I hereby propose that we stop waiting and just do it explicitly with a new type in a new internal package.
package builder // import "internal/builder"
// String is like a bytes.Buffer but never allows its internal
// byte slice to escape. The zero value is usable.
type String struct {
// ...
}
func (s *String) Len() int { ... }
func (s *String) Grow(n int) { ... }
func (s *String) WriteByte(b byte) { ... }
func (s *String) WriteString(s string) { ... }
func (s *String) Write(b []byte) { ... }
func (s *String) String() string { ... }
The (*String).String
method would use unsafe
to make a string header out of an internal []byte
slice header.
The Write
methods could even recycle old too-small []byte
backing arrays as they grow.
There would be no Reset
or Bytes
or Truncate
or Read
methods. Nothing that could mutate the []byte
once it was unsafely converted to a string
.
The implementation would have to take care to use a new []byte
backing array on any resize after an unsafe string
was constructed.
This package doesn't sound too magical to require compiler/runtime helps
and it will be generally useful.
Perhaps add it to the strings package instead?
one possible implementation:
// invariant: with all grow methods, we must make sure len(bytes) <
cap(bytes)
// When String is called, we freeze the byte slice by reslicing it so that
len == cap
// and then just return the String itself as a string header using unsafe
(this relying
// on the fact that the first two fields in a slice header is the same as a
string header..
type String struct {
bytes []byte
}
We can even make compiler do the transform for us (like Java's
StringBuilder).
e.g.
transform all variable string concatenations like:
s := fmt.Sprintf("A%d", n) + "ABC"
into:
var sb builder.String
fmt.Fprintf(&sb, "A%d", n)
sb.WriteString("ABC")
s := sb.String()
+1 on this as I'm actually using something quite similar already.
I realized that we don't need to maintain any invariant: the underlying
[]byte is append-only, so we can convert that to a string at any time
without worrying whether the later updates could modify the string's
underlying buffer.
bytes.Buffer already does no allocations up to 64 bytes, but then to make a string of course there is one allocation.
It could be smarter.
I would prefer we find a way not to add more ways to create strings. There are several already.
I would love a publicly available API for this rather than an internal package
Java has had this same problem since its start: There is no way for any caller outside the java.lang
package to create a String
that doesn't copy the backing character array. As noted here, even Go's bytes.Buffer
makes a copy before exposing the accumulated string, in fear that subsequent operations against the bytes.Buffer
would change the would-be-shared backing array.
It would nice to have a method on bytes.Buffer
(or something similar) like Take
, which returns a string using the backing array鈥攚ithout copying it鈥攁nd then invalidates or resets the Buffer
, so that subsequent operations against it would either panic or use a fresh backing array.
As the original description here notes,
The implementation would have to take care to use a new []byte backing array on any resize after an unsafe string was constructed.
A method like Take
makes that need explicit, though it does impose unnecessary cost when the last operation against the buffer is to take a string from it.
a new internal package
What's the reason and effect of internal
being involved in this proposal?
Edit: The effect is that it would prevent code outside of GOROOT to import the package.
I have a number of questions and concerns.
Doing this right in the compiler seems tractable, it avoids additional API, it avoids internal-only API, and it avoids forcing programmers to make unsafe decisions. It seems better on every axis than introducing a new package. But maybe there is some consideration I am missing.
@rsc,
Why? This proposal gives no justification or intended use cases, making it very hard to evaluate as an engineering decision.
See #6714 and the bugs referencing it. The justification is to reduce allocations and thus make the GC run less often and thus reduce CPU usage.
Why only internal? Is there some compelling reason that the standard library is special and needs this magic capability but code outside the standard library does not? It's hard to believe people won't just copy it into their own projects, which is not something we want to encourage.
Only because I figured it would be less controversial to start internal and see if we like the API first. We could export it later.
This is an unsafe operation and will be overused and used incorrectly.
No, the API would be designed such that it's impossible to use incorrectly or unsafely. It would only use unsafe itself, like reflect does, even though reflect is a safe part of the language. I tried to convey that in the top post.
I still believe the compiler can do significantly better. My recent reading about other systems with immutability (in particular, Midori) suggests that it's actually not too hard to identify the string(buf) conversions in which buf is not retained elsewhere and not used after the conversion; those can reuse buf for the storage when appropriate.
Significantly? In any case, that is #6714, filed in 2013. We can keep waiting, of course.
@robpike, @seh,
bytes.Buffer already does no allocations up to 64 bytes, but then to make a string of course there is one allocation.
Yes, but bytes.Buffer
exposes its innards via (*Buffer).Bytes()
etc. It would need to keep track more state to use unsafe
safely.
A method like
Take
makes that need explicit, though it does impose unnecessary cost when the last operation against the buffer is to take a string from it.
If we did that, we'd need all the safety bookkeeping in that type anyway, so at that point I'd rather just make (*bytes.Buffer).String
be the Take
function, so there's no new API surface.
But only if we decide that bytes.Buffer
wants to get into this business and use unsafe.
The builder.String
type proposed here is significantly more useful than the compiler optimization described in #6714.
Determining there are no live aliases to a []byte
(which is what #6714 proposes) will fail the moment it is wrapped and passed as an io.Writer
. This is a very common thing to do when constructing a string via repeated calls to fmt.Fprintf
.
A builder.String
can be used as an io.Writer and then invalidate its internal []byte
after the String
method is called.
Sorry, @bradfitz, I missed that the []byte readers were omitted. I agree that the API is safe, and I agree that that's preferred over unsafe methods on bytes.Buffer. That resolves a big concern I had. There's still the question of what the benefit would be. What important benchmarks would get faster, and by how much? I remember that there have been past conversations, but #6714 in particular is pretty light on detail, and the other optimizations in the system have changed since then anyway.
@crawshaw, I'm not sure that's quite true. If you call func f(io.Writer) with f(w), and we know from escape analysis that f does not retain a pointer to w, and we also know from looking at w's methods that none of them expose w.buf (or even know that f doesn't call the ones that do), then we know that there are no new aliases to w.buf due to the call to f. The compiler could plausibly put that together today for fmt.Fprintf(&buf, ...) where buf is a bytes.Buffer. I believe I understand this problem much better than I did six months ago, certainly much better than in 2013 when Brad filed #6714 originally. But I have other things on my plate ahead of this.
Related: #18822 (re: immutability and live references to []byte)
After discussion with @golang/proposal-review, it sounds like maybe we can make bytes.Buffer's String method avoid the copy (provided there has been no call to Bytes) but remember that it gave out a string and make its own copy on any future mutation. Let's start with that "String is just more efficient" and go from there.
@bradfitz et al., mind if I take this?
@cespare, I already have the start of a CL. I've been waiting for years for this, so I'd like to give it a shot. :-)
CL https://golang.org/cl/37767 mentions this issue.
Mirroring my comments from https://go-review.googlesource.com/c/37767/:
Yeah, I don't like this CL.
I don't think we can reuse the bytes.Buffer type for this magic and still have readable code.
But if we made a new type that didn't permit any escapes, then the magic would be very tiny and readable.
It turns out that the bytes.Buffer
type is too large and flexible to add this magic transparently.
Is the plan to add a new, dedicated string-building API for 1.10? (I hope so.)
@josharian, TBD. That's why I added the NeedsDecision label, to discuss with people later.
Per discussion with @golang/proposal-review, it seems OK to put this in package strings with a new name (strings.Buffer? strings.Builder?) and have a more restricted API (one that is like bytes.Buffer but without all the problematic bits).
The only thing I personally care about is that my code that says
var b bytes.Buffer
fmt.Fprintf(&b, ...)
s := b.String()
only needs to change the type used on the first line and nothing else.
Another reason to keep the API identical (but more restricted) is to allow us to use it in the compiler and other programs that must run with old Go versions, by doing something like:
// +build !go1.10
type Buffer struct {
bytes.Buffer
}
// +build go1.10
type Buffer struct {
strings.Buffer
}
@bradfitz is it OK to send a CL for the new API?
We're already using our own Builder implementation that is pretty much what @rsc described in https://github.com/golang/go/issues/18990#issuecomment-306297226 so it's essentially ready to go.
@cespare Go for it (@bradfitz is still more or less on leave).
Change https://golang.org/cl/74931 mentions this issue: strings: add Builder
Thanks; I sent https://golang.org/cl/74931. I have a few questions about the API which we can discuss on the CL.
Change https://golang.org/cl/83255 mentions this issue: strings: fix two Builder bugs allowing mutation of strings, remove ReadFrom
The string method at https://github.com/golang/go/commit/3058d38632aea679c96cd41156b2751c97578a2d#diff-4c939e2b8a761c673f94b9505cdc2d08L22 relies on the assumption that the layout of the first two struct fields of the string header and slice header are the same. At what point does warning from the reflect package about "may change in a later release" become untenable?
The standard lib can rely on implementation details that other packages can't, since it is distributed with that implementation.
The standard lib can rely on implementation details that other packages can't, since it is distributed with that implementation.
Yup.
At what point does warning from the reflect package about "may change in a later release" become untenable?
If the layout changes, unit tests will fail and we'll make the String method a big uglier (but likely still a single line).
I've always doubted we'd be able to ever change StringHeader or SliceHeader, though.
Are there benchmarks available for strings.Builder
anywhere?
I'm not finding any as part of CL 74931. I figure there's a good chance there are some existing benchmarks, perhaps somewhere else. I'm implementing strings.Builder
for another architecture, and I wanted to reuse existing benchmarks, if possible, instead of spending time recreating them myself. Or is TestBuilderAllocs
the only one?
Are there benchmarks available for strings.Builder anywhere?
Not that I'm aware of.
Change https://golang.org/cl/96980 mentions this issue: strings: add Builder benchmarks comparing bytes.Buffer and strings.Builder
Are there benchmarks available for strings.Builder anywhere?
For posterity, the CL mentioned above adds some nice benchmarks.
I also found from https://talks.godoc.org/github.com/campoy/gotalks/go1.10/talk.slide#29 that @campoy made some benchmarks for that talk at github.com/campoy/gotalks/go1.10/strings
.
Change https://golang.org/cl/102479 mentions this issue: all: use strings.Builder instead of bytes.Buffer where appropriate
Most helpful comment
Per discussion with @golang/proposal-review, it seems OK to put this in package strings with a new name (strings.Buffer? strings.Builder?) and have a more restricted API (one that is like bytes.Buffer but without all the problematic bits).
The only thing I personally care about is that my code that says
only needs to change the type used on the first line and nothing else.