Go: proposal: provide std lib pkg Copy function to copy a string efficiently without memory sharing

Created on 14 Jul 2020  ·  14Comments  ·  Source: golang/go

Goal

Provide a fast, allocation efficient and guaranteed way to copy a string without the new string sharing memory with the old string, that does not rely on code assuming string([]byte(s)) works for this.

Proposal

Add a Copy(s string) string function (not a builtin) to a std lib package that is guaranteed to return a string with the same content as the input string, does not share memory (for the backing) array with the input string and minimises memory allocated to do so.

Note that it this function does not need to be in the strings package but could be located in a package more fitting for its special use character.

Use cases

  • creating a new string for e.g. use as a map key or substring of a string that should be guaranteed not pin more memory in the backing array then needed (which currently cant be freed by the garbage collector)
  • creating a new string from a string for which the backing array is e.g. an mmap region that should be unmapped

Existing alternatives:

  • string([]byte(s)) is currently not guaranteed by the language standard and thereby all conforming Go compilers to not just return the same backing array. While the current Go Gc compiler does not elide the double copy a future Go compiler could optimize this to a noop. If the compiler becomes smart enough to recognise the safe versions of string to byte to string conversions (#6714) it would be fragile to try to forbid some of them again. Being explicit about the intent of the copy semantics seems cleaner and a better intent communication.

  • (s + " ")[:len(s)] is as far as I understand not guaranteed by the language standard and thereby all conforming Go compilers to not just return the same backing array.

  • strings.Builder can be used to only cause one allocation currently for the copy but has no guarantees to not be worse in allocations or even avoid copying if only one string is written and retrieved again.

Implementation is simple:

func Copy(s string) string {
    b := make([]byte, len(s))
    copy(b, s)
    return *(*string)(unsafe.Pointer(&b))
}

https://go-review.googlesource.com/c/go/+/242259/1/src/strings/copy.go

The use of unsafe headers in the strings library is already present in the strings.Builder code.

Downsides

  • The garbage collector might be getting smarter in the future making the pinning of large memory for small substrings a non issue.
  • The compiler/runtime might choose to automatically dedupe strings needing to make strings.Copy special to keep semantics intact.
  • Having the Copy function could lead to confusion with (s1 = s2) and application in contexts where it is not needed.
  • Inner details as does not reuse memory can be to low level and considered to much of an implementation detail to expose.
  • It only provides a tool for some very special use cases.
  • The function can just be implemented without guarantees of compatibility by users needed such a function.

For those reasons and others I expect this proposal will be easily rejected or voted down. That is fine as I think its still useful to have some documented discussion about agreement whether such functionality or implementation details about strings are not ok to be exposed.

Proposal

Most helpful comment

Avoiding the unnecessary pinning of large strings is a general problem that would be nice to solve for the general case. It would mean that some programs would get better memory usage without requiring people to understand this detailed issue and requiring them to know about a relatively obscure library function.

And if we are able to fix that general problem, then we don't need the library function.

This is just my opinion, of course, others may feel differently.

All 14 comments

Why do you want this? Are you worried about extracting a substring so the main string can be collected? Because if that's the use case, I don't think confusing the simple existing model (s1 = s2) is justified.

Yes that is one use case as written in the proposal. Another application I have seen is to make a string that is safe to use going forward since the orginal string was created through an unsafe process of using a memory region e.g. mmaped that should be released again.

I fully understand if the proposal is getting rejected on grounds that this is something that the garbage collector can solve in the future without any new API or that users can implement as a helper themselfs. I have seen the issue come up before but never a proposal. Even if rejected I think it will serve as a good reference point cite that this wont be implemtented/support or that these details are to low level to base an API on.

It is also in context of me experimenting with the compiler to optimze some string->byte->string... conversion chains which could unless explicitly prevented make the 'strings([]byte(s))' version a noop that some Go programs may rely on to actually make a copy due to either memory concerns or having used unsafe to create the string in the first place. I understand that issue is not yet reality and can be defered until that is the case to offer an easy escape hatch to refactor then "broken" code.

What about using the following code?

func CopyString(s string) string {
    return string(append([]byte(nil), s...))
}

The compiler may optimize it to a single string copy and a single memory allocation.

If the outcome is to change the compiler to recognise a pattern for low overhead copy I do not see the advantage of string(append([]byte(nil), s...)) over the compiler recognising the simpler string([]byte(s)) and using that for single memory allocation copy.

Since concatenation creates a new string, the simplest copy-this construct would be s + "" or "" + s.

@networkimprov Right now the compiler can just turn s + "" into simply s. So to make that work as a string copy we would have to define it in the language. And it would be odd to define a specific meaning for s + "", and it would be odd to define string concatenation as always making a copy in general.

@martisch described two use cases. The first is to avoid pinning a large string. As @robpike suggests, that is something we can look into implementing in the garbage collector. I don't know how hard that would be--probably pretty hard--but it does seem like that should be the first step.

@martisch also suggests the case of a string defined in mmap'ed memory that needs to move to the regular heap. I'm probably missing something but I think getting a string in mmap'ed memory must involve some sort of unsafe operation, so I'm not sure how interesting this case is. I expect that we can move a string into the heap using an unsafe operation.

So maybe we need to decide whether the garbage collector option is possible for unpinning larage strings, and maybe we need to decide whether there is an unsafe operation that will do the copy operation. E.g., see the discussion in #19367.

Why not let us explicitly copy a string, by whatever API/syntax?

Why complicate the runtime to fix a single use case of a general purpose stdlib or language feature?

Avoiding the unnecessary pinning of large strings is a general problem that would be nice to solve for the general case. It would mean that some programs would get better memory usage without requiring people to understand this detailed issue and requiring them to know about a relatively obscure library function.

And if we are able to fix that general problem, then we don't need the library function.

This is just my opinion, of course, others may feel differently.

It's easy to do yourself, although of course there is no guarantee the compiler won't optimize this away. It doesn't today.

https://play.golang.org/p/gLXMWXD1SNL

package main

import (
    "fmt"
    "reflect"
    "unsafe"
)

func main() {
    x := "qwertyuiopasdfghjklzxcvbnm"
    y := x[3:5]
    fmt.Println(x, (*reflect.StringHeader)(unsafe.Pointer(&x)))
    fmt.Println(y, (*reflect.StringHeader)(unsafe.Pointer(&y)))
    y = mycopy(y)
    fmt.Println(y, (*reflect.StringHeader)(unsafe.Pointer(&y)))
}

func mycopy(x string) string {
    return ("x" + x)[1:]
}

Note that mycopy using concatenation my allocate using the next bigger sizeclass currently which is not needed.

I agree that deciding to have the garbage collector not pin extra memory together with an unsafe operation to copy memory will make the string copy function uneccessary and are better more general solution if they are going to be choosen.

Every GC'ed language has this problem. I don't think that we are going to solve it in the GC when none of the rest have.

When Java was faced with this problem they made the (in my opinion incredibly unwise) decision to change the substring operation to start making copies. That is, Java x.substring(i, j) used to be like Go's x[i:j], sharing memory with the parent, but starting in Java 1.7 it makes a copy. This kind of change makes formerly linear algorithms go quadratic, so I'm stunned they thought this was a good idea. But if nothing else it shows how much people needed a solution to the unwanted sharing pinning data.

Given the likely absence of a magic bullet for the foreseeable future, it does seem reasonable to have something like strings.Copy(string) string, with appropriate documentation.

In the past I've written something that does s[:1] + s[1:] for non-empty strings and returns "" for empty strings. Lots of other ways to do it too (like Rob's), but I think it is telling that we all have our versions of this.

To avoid some confusion of what Copy does maybe a longer name like DeepCopy might be used if the function is in the common strings package. Otherwise unsafe.StringCopy might better convey the special use case of this function even if its not an unsafe operation itself but might need to be used with other unsafe operations to have created a string that needs a copy of the backing array.

@martisch, I don't think unsafe.StringCopy is a good name for this function, because there is nothing unsafe about it.

(strings.Copy seems like a clear enough name to me.)

@martisch

Another application I have seen is to make a string that is safe to use going forward since the orginal string was created through an unsafe process of using a memory region e.g. mmaped that should be released again.

I think the problem there is using the string type in the first place. Go programs can (and often do) assume that variables of type string are immutable and have unbounded, garbage-collected lifetimes. The appropriate type for “a sequence of bytes that exists for a while but may not be safe to write or to access in the future” is []byte, not string.

[]byte already has this meaning in general due to concurrency. Because other goroutines may read the same data concurrently, a function that receives an arbitrary []byte cannot assume that it is safe to mutate. Because other goroutines may write to the data after the call has returned, a function that receives an arbitrary []byte cannot assume that it is safe to retain for future reading.

So I don't think “copying an external string into the Go heap” is an appropriate use-case for this function, because I don't agree that external strings should be represented as type string in the first place.

Was this page helpful?
0 / 5 - 0 ratings