Go: strconv: add equivalents of Parsexxx() with []byte arguments

Created on 29 Dec 2011 · 49Comments · Source: golang/go

Hello,

Just like we have FormatFloat(...) string and AppendFloat(...) []byte, it would be nice
to avoid a costly string conversion when parsing numbers from a byteslice.

Simple code like:

func main() {
  s := "123.45"
  r := bytes.NewBufferString(s)
  runtime.UpdateMemStats()
  println(runtime.MemStats.Mallocs)
  var v float64
  fmt.Fscanf(r, "%f", &v)
  runtime.UpdateMemStats()
  println(runtime.MemStats.Mallocs)
}

says it's doing 7 allocations in the Scanf.

FrozenDueToAge GarbageCollector NeedsDecision Performance Thinking

Source

remyoudompheng

👍3

Most helpful comment

Nearly 4 years and still a problem.

People just work around it in their own code with unsafe hacks, or making their own forks of the Parse functions that take []byte.

I propose adding four new functions to the strconv package:

func ParseBoolBytes(b []byte) (value bool, err error)
func ParseFloatBytes(b []byte, bitSize int) (f float64, err error)
func ParseIntBytes(b []byte, base int, bitSize int) (i int64, err error)
func ParseUintBytes(b []byte, base int, bitSize int) (n uint64, err error)

Thoughts?

/cc @rsc @griesemer @ianlancetaylor @adg

bradfitz on 1 Oct 2015

👍13

All 49 comments

Comment 1:

Sounds like a nice idea.
Naming suggestions? "ParseFloatBytes"?

_Labels changed: added priority-later, removed priority-triage._

_Status changed to HelpWanted._

adg on 4 Jan 2012

Comment 2:

I would like to think more about this.
It doubles everything, and we may be able to avoid
it with careful compilation.

rsc on 9 Jan 2012

Comment 3:

Regarding "careful compilation", see my comment in the duplicate issue #3197.

bradfitz on 5 Mar 2012

Comment 4:

_Issue #3197 has been merged into this issue._

bradfitz on 5 Mar 2012

Comment 5:

From issue #3197:
I attempted to convert strconv's Parse internals to use []byte instead, and then make
the existing string versions make an unsafe []byte of the string's memory using
reflect.SliceHeader / reflect.StringHeader, but the reflect package already depends on
strconv, so there's an import loop.

bradfitz on 5 Mar 2012

Comment 6:

In the meanwhile, when I had to parse many numbers and save some allocations, I would
convert a big []byte to a big string and slice from it, instead of slicing from the
[]byte and string-ifying each chunk.

remyoudompheng on 5 Mar 2012

Comment 7:

I wonder if reflect.SliceHeader / StringHeader should be mirrored in package runtime.

bradfitz on 5 Mar 2012

Comment 8:

Implemented here, for discussion at some point:
http://golang.org/cl/5759057/
The byte view approach may be considered either gross or lovely. Not sure. The grossness
is isolated, well tested, and can't escape to callers at least.

bradfitz on 6 Mar 2012

Comment 9:

I am much less worried about the implementation than the expansion of the API.

rsc on 6 Mar 2012

Comment 10:

True but this API was already extended with Append* for []byte efficiency, and this just
mirrors that.
And we already have nearly-duplicate strings and bytes packages.  This []byte/string
duplicity is unfortunate, but this is far from the only place.
Still, Later.

bradfitz on 6 Mar 2012

Comment 11:

Like I said in #2, we may be able to do better with Parse
than with Format/Append.  The distinction is that Parse
accepts strings as input, while Format returns strings as
output.  It is easier to analyze the former than the latter.

rsc on 6 Mar 2012

Comment 12:

Yeah, there's a whole bunch of string<->[]byte optimizations that would be nice.
issue #2204 comes to mind.  And in another project, I want to be able to do a lookup in a
map[string]T with a []byte key. It'd be nice if:
      var m map[string]T = ...
      var key []byte = ...
      v, ok := m[string(key)]
...  didn't generate garabage too.

bradfitz on 6 Mar 2012

Comment 13:

I think we have an issue about "escape analysis of strings" already. If a string([]byte)
does not escape, it won't be stored so further modifications to the []byte are safe even
if the string was created zero-copishly. Ah it's issue #2205, but it's about
[]byte(string).
The string([]byte) case does not require read-only flagging, I think, only ordinary
escape analysis.

remyoudompheng on 6 Mar 2012

Comment 14:

_Labels changed: added performance._

bradfitz on 25 Apr 2013

Comment 15:

_Labels changed: added garbage._

bradfitz on 26 Apr 2013

Comment 16:

This issue was just causing performance problems for a user inside Google. They had a
large sstable with float64 []byte values. Each strconv.ParseFloat(string(value), 64) was
generating garbage (which ended up being a lot of garbage)
As a medium-term hack, I showed they could do:
    func unsafeString(b []byte) string {
       return *(*string)(unsafe.Pointer(&b))
    }
    freq, err := strconv.ParseFloat(unsafeString(val), 64)
But I felt bad even pointing that out.
I'd really like issue #2205 to be the answer to this issue, so we don't have to add new
API.
That is, I'd like this to be garbage-free:
     var b []byte = ....
     freq, err := strconv.ParseFloat(string(b), 64)
The compiler should do the equivalent unsafeString thing, if it can statically determine
that the string never escapes strconv.ParseFloat.
Yes, the strconv.ParseFloat could then observe concurrent mutations of b via its string,
like this:
     var b []byte = ....
     go mutateBytes(b)
     freq, err := strconv.ParseFloat(string(b), 64)
But that was already a data race.

bradfitz on 26 Apr 2013

Comment 17:

If we implement no-op byte to string conversions, there are a couple of library changes
that would have to be made to use it effectively. Right now, ParseFloat leaks its string
parameter because it's captured in the error handling. With the change, it would be
advantageous to copy the string instead.
Coincidentally, I don't think we have a good way to copy strings. We've never had the
need to before.

DanielMorsing on 27 Apr 2013

Comment 18:

doesn't append([]byte{}, str...) do the trick for copying the string?
(into a []byte, of course, we can't really copy a string.)
if the optimization is implemented, just convert the copied []byte
to string will achieve the real copy-string effect (well, maybe
it copies the string multiple times, but it doesn't matter much IMO).

minux on 27 Apr 2013

Comment 19:

http://play.golang.org/p/ghP85_uea9 appears to copy the backing contents of a string.

davecheney on 27 Apr 2013

Comment 20:

Dave, you're looking at the address of the StringHeader, not its backing contents.  For
instance: http://play.golang.org/p/-AI1pPNIwW shows the addresses are different, even
though their backing contents are the same.
You really need unsafe to determine whether two strings alias the same memory.
But </tangent>, yes: the strconv package would have to take care to not retain its
input strings, even in error messages.  And we'd need to write tests to guarantee that
strconv never regresses and starts allocating. But that's easy and we do that in a
number of other packages.

bradfitz on 28 Apr 2013

Comment 21:

_Status changed to Thinking._

rsc on 30 Jul 2013

Comment 22:

_Labels changed: added go1.3maybe._

rsc on 27 Nov 2013

Comment 23:

_Labels changed: added release-none, removed go1.3maybe._

rsc on 4 Dec 2013

Comment 24:

_Labels changed: added repo-main._

rsc on 4 Dec 2013

I'm running into high allocations converting []byte to string to use with the ParseXXX functions. However, I also agree that we shouldn't add new functions just to handle bytes.

According to gcflags=-m, the ParseXXX functions leak the input str to err. Unfortunately, this means that ParseBool(string(bytes)) still incurs an allocation even in the event that compiler can reason that ParseBool never mutates the input.

Given that errors are expected to be rare, maybe we could copy the input string in the error message. This way, the escape analysis can properly conclude that the input str does not escape.

The advantage is that ParseBool(string(bytes)) could incur no extra allocations.
The disadvantage is that every time an error occurs, a copy of the input is being made (the string could potentially be very large)..

dsnet on 28 Aug 2015

Nearly 4 years and still a problem.

People just work around it in their own code with unsafe hacks, or making their own forks of the Parse functions that take []byte.

I propose adding four new functions to the strconv package:

func ParseBoolBytes(b []byte) (value bool, err error)
func ParseFloatBytes(b []byte, bitSize int) (f float64, err error)
func ParseIntBytes(b []byte, base int, bitSize int) (i int64, err error)
func ParseUintBytes(b []byte, base int, bitSize int) (n uint64, err error)

Thoughts?

/cc @rsc @griesemer @ianlancetaylor @adg

bradfitz on 1 Oct 2015

👍13

FWIW, a similar problem (in more severe form) exists now in math/big where we mimic the strconv API. We have already Int.Append, Float.Append (matching strconv.AppendInt, strconv.AppendFloat) working on []byte buffers, and int.Text, Float.Text (matching strconv.FormatInt, strconv.FormatFloat) producing string results (exactly for performance reasons).

There's also big.ParseFloat and Float.Parse (matching strconv.ParseFloat) working on strings. big.ParseInt, etc. is planned for completeness, and so are matching functions for big.Rats.

The numbers/strings tend to be longer for math/big which may make this even more important.

If we extend the strconv API, the matching math/big API would probably extend by

Factory functions:

ParseIntBytes
ParseRatBytes
ParseFloatBytes

Methods:

Int.ParseBytes
Rat.ParseBytes
Float.ParseBytes

Apropos smart compilation: The compiler can know in some cases that an incoming []byte argument is never modified (this information can be available in the binary interface and export data). If the argument is a string s, the compiler could be clever about not allocating memory for []byte(s) when passing []byte(s) as the respective parameter. (Compiler people: do we do this already?) But this would require our API to work on []byte slices rather than strings as is the case now.

The other way round is more tricky: If we have a []byte b, we cannot avoid making a copy when passing string(b) because (in general) it's much harder to prove that b is not changed underfoot, e.g. by another goroutine.

/cc @randall77

griesemer on 1 Oct 2015

We've been making the argument that we could theoretically improve the compiler and escape analysis for years to no avail.

bradfitz on 1 Oct 2015

@bradfitz The point I am making is that the API would have to be []byte oriented in the first place for the (relatively simple) compiler optimization to work. That is, maybe we need to bite the bullet and add these functions. The string versions would be trivial wrappers (which in turn could benefit from the compiler optimization once present).

griesemer on 1 Oct 2015

@griesemer re: the other way around is more tricky:
ISTM that for the string case you should be able to discount other goroutines as long as there are no synchronisation points, as then there would be a race condition (the read to convert string to []byte vs the write to the slice), and with a race condition we don't guarantee correctness so it may still be a viable technique.

You'd also have to make sure that the function in question didn't make any external calls to code that might modify the slice.

rogpeppe on 2 Oct 2015

@rogpeppe It's much harder to know statically that there's (globally) no goroutines with synchronization than to know whether a single function doesn't modify an incoming bytes. External calls by that single function could tell (via export data) whether they modify or store incoming byte arguments elsewhere.

griesemer on 2 Oct 2015

As I was arguing somewhere else, we don't have to know globally that there are no goroutines with synchronization. We only have to know locally that the function we are calling has no synchronization points. We can't know that if there are any calls through an interface method, but otherwise we can.

ianlancetaylor on 2 Oct 2015

@ianlancetaylor good point!

griesemer on 2 Oct 2015

@ianlancetaylor exactly.

rogpeppe on 2 Oct 2015

CL https://golang.org/cl/23787 mentions this issue.

gopherbot on 6 Jun 2016

Can't happen until Go 1.8.

adg on 6 Jun 2016

@dsnet raised a great point in https://go-review.googlesource.com/#/c/23787
screen shot 2016-08-21 at 1 08 31 am

Unless the compiler guys say this ain't happening, I'm personally still willing to wait for the compiler to address this.

I kindly wanted to ask @josharian @randall77 @griesemer and other compiler folks what y'all think about this; do y'all think we should wait for the compiler optimizations or should we be biting the bullet and implement these functions as suggested by @griesemer here https://github.com/golang/go/issues/2632#issuecomment-144846932, then later the string implementation could use these optimizations?

@bradfitz great point in https://github.com/golang/go/issues/2632#issuecomment-66061061, I was checking with sample usage for the number of allocations when writing the CL, we should in deed include tests to make sure the strconv package never regresses.

odeke-em on 21 Aug 2016

Also cc @dr2chase @dvyukov

josharian on 21 Aug 2016

I don't think anyone is working on this at the moment. It's doable though. We'd have to add some data to the export info about whether any synchronization is done inside functions.

I'd recommend we wait and do this in the compiler. But that's not a promise on timeline...

randall77 on 21 Aug 2016

👍1

/cc @griesemer for export data.

bradfitz on 21 Aug 2016

@randall77 It's trivial to add additional information about functions in the export data, even in a backward-compatible way. We just need to have that information.

griesemer on 22 Aug 2016

From the current direction, it seems to me that this issue is a subset or perhaps even duplicate of https://github.com/golang/go/issues/2205.

odeke-em on 28 Aug 2016

I would like to avoid doing this. I still think we might be able to do something to make these conversions free and avoid 2x API bloat everywhere. I'd rather hold out for that. People who are super-sensitive to these allocations today can easily copy+modify the current code and ship that in their programs.

rsc on 10 Oct 2016

e.g. https://godoc.org/go4.org/strutil#ParseUintBytes

bradfitz on 12 Oct 2016

FWIW, here is working (but tricky) patches to make ParseXXX no escaping.
string([]byte(s)) is used just for cloning strings. I don't know better way to do this though.
(I realized that extending escape analysis isn't an easy path for this.)

diff --git a/src/strconv/atoi.go b/src/strconv/atoi.go
index 66df149..d6f645a 100644
--- a/src/strconv/atoi.go
+++ b/src/strconv/atoi.go
@@ -24,11 +24,11 @@ func (e *NumError) Error() string {
 }

 func syntaxError(fn, str string) *NumError {
-   return &NumError{fn, str, ErrSyntax}
+   return &NumError{fn, string([]byte(str)), ErrSyntax}
 }

 func rangeError(fn, str string) *NumError {
-   return &NumError{fn, str, ErrRange}
+   return &NumError{fn, string([]byte(str)), ErrRange}
 }

 const intSize = 32 << (^uint(0) >> 63)
@@ -134,7 +134,7 @@ func ParseUint(s string, base int, bitSize int) (uint64, error) {
    return n, nil

 Error:
-   return n, &NumError{"ParseUint", s, err}
+   return n, &NumError{"ParseUint", string([]byte(s)), err}
 }

 // ParseInt interprets a string s in the given base (2 to 36) and
@@ -180,7 +180,7 @@ func ParseInt(s string, base int, bitSize int) (i int64, err error) {
    un, err = ParseUint(s, base, bitSize)
    if err != nil && err.(*NumError).Err != ErrRange {
        err.(*NumError).Func = fnParseInt
-       err.(*NumError).Num = s0
+       err.(*NumError).Num = string([]byte(s0))
        return 0, err
    }
    cutoff := uint64(1 << uint(bitSize-1))

hirochachacha on 1 Nov 2016

@hirochachacha, what are you trying to accomplish there?

bradfitz on 1 Nov 2016

@bradfitz I'm not sure this is deserved to send a CL.

given:

package main

import "strconv"

func main() {
    s := []byte("1234")
    strconv.ParseFloat(string(s[:2]), 64)
    strconv.Atoi(string(s[2:4]))
}

$ go build -gcflags -m a.go

before:

./a.go:7: string(s[:2]) escapes to heap
./a.go:8: string(s[2:4]) escapes to heap
./a.go:6: main ([]byte)("1234") does not escape

after:

./a.go:6: main ([]byte)("1234") does not escape
./a.go:7: main string(s[:2]) does not escape
./a.go:8: main string(s[2:4]) does not escape

hirochachacha on 1 Nov 2016

I believe @hirochachacha is making the same point I made in issuecomment-135883394. Suppose there was some compiler optimization that could prove that a function never mutates an input []byte and that there are no synchronization events, then it could use a string in its place. This optimization still wouldn't be helpful to the strconv package since the input escapes through the error value (and thus making it really difficult for the compiler to prove anything about the input).

Since this compiler optimization doesn't exist yet, I vote that we don't address this right now. Always copying the input in the error case may lead to unexpectedly bad performance in situations where a string is parsed in a tight loop and expected to fail.

dsnet on 1 Nov 2016

@dsnet I agree with you at all points. Sorry for the duplication, I didn't notice that.

hirochachacha on 2 Nov 2016

Was this page helpful?

0 / 5 - 0 ratings