Hello,
Just like we have FormatFloat(...) string and AppendFloat(...) []byte, it would be nice
to avoid a costly string conversion when parsing numbers from a byteslice.
Simple code like:
func main() {
s := "123.45"
r := bytes.NewBufferString(s)
runtime.UpdateMemStats()
println(runtime.MemStats.Mallocs)
var v float64
fmt.Fscanf(r, "%f", &v)
runtime.UpdateMemStats()
println(runtime.MemStats.Mallocs)
}
says it's doing 7 allocations in the Scanf.
Regarding "careful compilation", see my comment in the duplicate issue #3197.
From issue #3197: I attempted to convert strconv's Parse internals to use []byte instead, and then make the existing string versions make an unsafe []byte of the string's memory using reflect.SliceHeader / reflect.StringHeader, but the reflect package already depends on strconv, so there's an import loop.
Implemented here, for discussion at some point: http://golang.org/cl/5759057/ The byte view approach may be considered either gross or lovely. Not sure. The grossness is isolated, well tested, and can't escape to callers at least.
Yeah, there's a whole bunch of string<->[]byte optimizations that would be nice. issue #2204 comes to mind. And in another project, I want to be able to do a lookup in a map[string]T with a []byte key. It'd be nice if: var m map[string]T = ... var key []byte = ... v, ok := m[string(key)] ... didn't generate garabage too.
I think we have an issue about "escape analysis of strings" already. If a string([]byte) does not escape, it won't be stored so further modifications to the []byte are safe even if the string was created zero-copishly. Ah it's issue #2205, but it's about []byte(string). The string([]byte) case does not require read-only flagging, I think, only ordinary escape analysis.
This issue was just causing performance problems for a user inside Google. They had a
large sstable with float64 []byte values. Each strconv.ParseFloat(string(value), 64) was
generating garbage (which ended up being a lot of garbage)
As a medium-term hack, I showed they could do:
func unsafeString(b []byte) string {
return *(*string)(unsafe.Pointer(&b))
}
freq, err := strconv.ParseFloat(unsafeString(val), 64)
But I felt bad even pointing that out.
I'd really like issue #2205 to be the answer to this issue, so we don't have to add new
API.
That is, I'd like this to be garbage-free:
var b []byte = ....
freq, err := strconv.ParseFloat(string(b), 64)
The compiler should do the equivalent unsafeString thing, if it can statically determine
that the string never escapes strconv.ParseFloat.
Yes, the strconv.ParseFloat could then observe concurrent mutations of b via its string,
like this:
var b []byte = ....
go mutateBytes(b)
freq, err := strconv.ParseFloat(string(b), 64)
But that was already a data race.
If we implement no-op byte to string conversions, there are a couple of library changes that would have to be made to use it effectively. Right now, ParseFloat leaks its string parameter because it's captured in the error handling. With the change, it would be advantageous to copy the string instead. Coincidentally, I don't think we have a good way to copy strings. We've never had the need to before.
doesn't append([]byte{}, str...) do the trick for copying the string?
(into a []byte, of course, we can't really copy a string.)
if the optimization is implemented, just convert the copied []byte
to string will achieve the real copy-string effect (well, maybe
it copies the string multiple times, but it doesn't matter much IMO).
http://play.golang.org/p/ghP85_uea9 appears to copy the backing contents of a string.
Dave, you're looking at the address of the StringHeader, not its backing contents. For instance: http://play.golang.org/p/-AI1pPNIwW shows the addresses are different, even though their backing contents are the same. You really need unsafe to determine whether two strings alias the same memory. But </tangent>, yes: the strconv package would have to take care to not retain its input strings, even in error messages. And we'd need to write tests to guarantee that strconv never regresses and starts allocating. But that's easy and we do that in a number of other packages.
I'm running into high allocations converting []byte to string to use with the ParseXXX functions. However, I also agree that we shouldn't add new functions just to handle bytes.
According to gcflags=-m, the ParseXXX functions leak the input str to err. Unfortunately, this means that ParseBool(string(bytes)) still incurs an allocation even in the event that compiler can reason that ParseBool never mutates the input.
Given that errors are expected to be rare, maybe we could copy the input string in the error message. This way, the escape analysis can properly conclude that the input str does not escape.
The advantage is that ParseBool(string(bytes)) could incur no extra allocations.
The disadvantage is that every time an error occurs, a copy of the input is being made (the string could potentially be very large)..
Nearly 4 years and still a problem.
People just work around it in their own code with unsafe hacks, or making their own forks of the Parse functions that take []byte.
I propose adding four new functions to the strconv package:
func ParseBoolBytes(b []byte) (value bool, err error)
func ParseFloatBytes(b []byte, bitSize int) (f float64, err error)
func ParseIntBytes(b []byte, base int, bitSize int) (i int64, err error)
func ParseUintBytes(b []byte, base int, bitSize int) (n uint64, err error)
Thoughts?
/cc @rsc @griesemer @ianlancetaylor @adg
FWIW, a similar problem (in more severe form) exists now in math/big where we mimic the strconv API. We have already Int.Append, Float.Append (matching strconv.AppendInt, strconv.AppendFloat) working on []byte buffers, and int.Text, Float.Text (matching strconv.FormatInt, strconv.FormatFloat) producing string results (exactly for performance reasons).
There's also big.ParseFloat and Float.Parse (matching strconv.ParseFloat) working on strings. big.ParseInt, etc. is planned for completeness, and so are matching functions for big.Rats.
The numbers/strings tend to be longer for math/big which may make this even more important.
If we extend the strconv API, the matching math/big API would probably extend by
Factory functions:
Methods:
Apropos smart compilation: The compiler can know in some cases that an incoming []byte argument is never modified (this information can be available in the binary interface and export data). If the argument is a string s, the compiler could be clever about not allocating memory for []byte(s) when passing []byte(s) as the respective parameter. (Compiler people: do we do this already?) But this would require our API to work on []byte slices rather than strings as is the case now.
The other way round is more tricky: If we have a []byte b, we cannot avoid making a copy when passing string(b) because (in general) it's much harder to prove that b is not changed underfoot, e.g. by another goroutine.
/cc @randall77
We've been making the argument that we could theoretically improve the compiler and escape analysis for years to no avail.
@bradfitz The point I am making is that the API would have to be []byte oriented in the first place for the (relatively simple) compiler optimization to work. That is, maybe we need to bite the bullet and add these functions. The string versions would be trivial wrappers (which in turn could benefit from the compiler optimization once present).
@griesemer re: the other way around is more tricky:
ISTM that for the string case you should be able to discount other goroutines as long as there are no synchronisation points, as then there would be a race condition (the read to convert string to []byte vs the write to the slice), and with a race condition we don't guarantee correctness so it may still be a viable technique.
You'd also have to make sure that the function in question didn't make any external calls to code that might modify the slice.
@rogpeppe It's much harder to know statically that there's (globally) no goroutines with synchronization than to know whether a single function doesn't modify an incoming bytes. External calls by that single function could tell (via export data) whether they modify or store incoming byte arguments elsewhere.
As I was arguing somewhere else, we don't have to know globally that there are no goroutines with synchronization. We only have to know locally that the function we are calling has no synchronization points. We can't know that if there are any calls through an interface method, but otherwise we can.
@ianlancetaylor good point!
@ianlancetaylor exactly.
CL https://golang.org/cl/23787 mentions this issue.
Can't happen until Go 1.8.
@dsnet raised a great point in https://go-review.googlesource.com/#/c/23787

Unless the compiler guys say this ain't happening, I'm personally still willing to wait for the compiler to address this.
I kindly wanted to ask @josharian @randall77 @griesemer and other compiler folks what y'all think about this; do y'all think we should wait for the compiler optimizations or should we be biting the bullet and implement these functions as suggested by @griesemer here https://github.com/golang/go/issues/2632#issuecomment-144846932, then later the string implementation could use these optimizations?
@bradfitz great point in https://github.com/golang/go/issues/2632#issuecomment-66061061, I was checking with sample usage for the number of allocations when writing the CL, we should in deed include tests to make sure the strconv package never regresses.
Also cc @dr2chase @dvyukov
I don't think anyone is working on this at the moment. It's doable though. We'd have to add some data to the export info about whether any synchronization is done inside functions.
I'd recommend we wait and do this in the compiler. But that's not a promise on timeline...
/cc @griesemer for export data.
@randall77 It's trivial to add additional information about functions in the export data, even in a backward-compatible way. We just need to have that information.
From the current direction, it seems to me that this issue is a subset or perhaps even duplicate of https://github.com/golang/go/issues/2205.
I would like to avoid doing this. I still think we might be able to do something to make these conversions free and avoid 2x API bloat everywhere. I'd rather hold out for that. People who are super-sensitive to these allocations today can easily copy+modify the current code and ship that in their programs.
FWIW, here is working (but tricky) patches to make ParseXXX no escaping.
string([]byte(s)) is used just for cloning strings. I don't know better way to do this though.
(I realized that extending escape analysis isn't an easy path for this.)
diff --git a/src/strconv/atoi.go b/src/strconv/atoi.go
index 66df149..d6f645a 100644
--- a/src/strconv/atoi.go
+++ b/src/strconv/atoi.go
@@ -24,11 +24,11 @@ func (e *NumError) Error() string {
}
func syntaxError(fn, str string) *NumError {
- return &NumError{fn, str, ErrSyntax}
+ return &NumError{fn, string([]byte(str)), ErrSyntax}
}
func rangeError(fn, str string) *NumError {
- return &NumError{fn, str, ErrRange}
+ return &NumError{fn, string([]byte(str)), ErrRange}
}
const intSize = 32 << (^uint(0) >> 63)
@@ -134,7 +134,7 @@ func ParseUint(s string, base int, bitSize int) (uint64, error) {
return n, nil
Error:
- return n, &NumError{"ParseUint", s, err}
+ return n, &NumError{"ParseUint", string([]byte(s)), err}
}
// ParseInt interprets a string s in the given base (2 to 36) and
@@ -180,7 +180,7 @@ func ParseInt(s string, base int, bitSize int) (i int64, err error) {
un, err = ParseUint(s, base, bitSize)
if err != nil && err.(*NumError).Err != ErrRange {
err.(*NumError).Func = fnParseInt
- err.(*NumError).Num = s0
+ err.(*NumError).Num = string([]byte(s0))
return 0, err
}
cutoff := uint64(1 << uint(bitSize-1))
@hirochachacha, what are you trying to accomplish there?
@bradfitz I'm not sure this is deserved to send a CL.
given:
package main
import "strconv"
func main() {
s := []byte("1234")
strconv.ParseFloat(string(s[:2]), 64)
strconv.Atoi(string(s[2:4]))
}
$ go build -gcflags -m a.go
before:
./a.go:7: string(s[:2]) escapes to heap
./a.go:8: string(s[2:4]) escapes to heap
./a.go:6: main ([]byte)("1234") does not escape
after:
./a.go:6: main ([]byte)("1234") does not escape
./a.go:7: main string(s[:2]) does not escape
./a.go:8: main string(s[2:4]) does not escape
I believe @hirochachacha is making the same point I made in issuecomment-135883394. Suppose there was some compiler optimization that could prove that a function never mutates an input []byte and that there are no synchronization events, then it could use a string in its place. This optimization still wouldn't be helpful to the strconv package since the input escapes through the error value (and thus making it really difficult for the compiler to prove anything about the input).
Since this compiler optimization doesn't exist yet, I vote that we don't address this right now. Always copying the input in the error case may lead to unexpectedly bad performance in situations where a string is parsed in a tight loop and expected to fail.
@dsnet I agree with you at all points. Sorry for the duplication, I didn't notice that.
Most helpful comment
Nearly 4 years and still a problem.
People just work around it in their own code with unsafe hacks, or making their own forks of the Parse functions that take []byte.
I propose adding four new functions to the strconv package:
Thoughts?
/cc @rsc @griesemer @ianlancetaylor @adg