Go: proposal: strings: add Cut

Created on 9 Jul 2020 · 41Comments · Source: golang/go

In many projects I've seen code like this:

idx = strings.Index(username, "@")
if idx != -1 {
  name = username[:idx]
} else {
  name = username
}

idx = strings.LastIndex(address, "@")
if idx != -1 {
  host = address[idx+1:]
} else {
  host = address
}

I think this operation—getting a prefix or suffix based on a substring—is common enough to think about adding helpers for them. So I propose to add functions PrefixUntil and SuffixAfter (names up for debate) to package strings. Something like:

// PrefixUntil returns the prefix of the string s until the first
// occurrence of the substring substr, excluding the substring itself.
// It returns s if substr was not found.
func PrefixUntil(s, substr string) (prefix string) {
    var idx = Index(s, substr)
    if idx == -1 {
        return s
    }

    return s[:idx]
}

// SuffixAfter returns the suffix of the string s after the last
// occurrence of the substring substr, excluding the substring itself.
// It returns s if substr was not found.
func SuffixAfter(s, substr string) (suffix string) {
    var idx = LastIndex(s, substr)
    if idx == -1 {
        return s
    }

    return s[idx+len(substr):]
}

So that the code could be rewritten as:

name = strings.PrefixUntil(username, "@")

host = strings.SuffixAfter(address, "@")

Proposal Proposal-Hold

Source

ainar-g

👍10 👎4 👀1

Most helpful comment

We certainly write code to do this operation all the time. Even just grepping for 'Index.* >=' in the Go tree I find plenty of code like:

if i := strings.Index(line, "#"); i >= 0 {
    line = line[:i]
}

and

if i := strings.Index(raw, "@"); i >= 0 {
    pattern, rawVers = raw[:i], raw[i+1:]
    if strings.Contains(rawVers, "@") || rawVers == "" {
        return nil, fmt.Errorf("invalid module version syntax %q", raw)
    }
}

If we had a function like:

func Cut(s, sep string) (before, after string, ok bool) {
    if i := strings.Index(s, sep); i >= 0 {
        return s[:i], s[i+len(sep):], true
    }
    return s, "", false
}

then snippets like the above would become:

line, _, _ = strings.Cut(line, "#")

if pattern, rawVers, ok := strings.Cut(raw, "@"); ok {
    // no need for: pattern, rawVers = raw[:i], raw[i+1:]
    ...
}

It does seem like it comes up a lot. What do people think? Should we add this (single) Cut function?

Update: Edited to return (before, after, ok) instead of (before, sep, after), as suggested by @nightlyone.

rsc on 11 Nov 2020

👍57 👎1

All 41 comments

It would help if you could identify some cases in the standard library, and/or in other popular Go packages, where the new functions would be used. That is, give some supporting information for "in many projects I've seen...." Thanks.

ianlancetaylor on 9 Jul 2020

👍1

@ianlancetaylor A simple grep query shows lots, actually. Here's one from package encoding/xml:

if i := strings.LastIndex(prefix, "/"); i >= 0 {
    prefix = prefix[i+1:]
}

Here's one from path:

if i := strings.LastIndex(path, "/"); i >= 0 {
    path = path[i+1:]
}

net/url:

i := strings.LastIndex(authority, "@")
if i < 0 {
    host, err = parseHost(authority)
} else {
    host, err = parseHost(authority[i+1:])
}

net/http (IndexByte, but the meaning is the same):

if i := strings.IndexByte(resp.Status, ' '); i != -1 {
    statusCode = resp.Status[:i]
}

go/types:

if i := strings.LastIndex(name, "/"); i >= 0 {
    name = name[i+1:]
}

test/fixedbugs/issue32778.go, a copy of the code from the original issue, which in turn, I assume, is a version of the code the reporter of that issue saw:

if i := strings.LastIndexByte(string(n), '.'); i >= 0 {
    return Name(n[i+1:])
}
return Name(n)

Lots of that in test/run.go.

I discovered these using:

$ grep -A 5 -e 'strings.\(Last\)\?Index(' -r

And then sifting the results manually.

ainar-g on 9 Jul 2020

👍7

This looks like a variation on strings.SplitN(str, "@", 2). So maybe SplitFirst & SplitLast?

networkimprov on 9 Jul 2020

👍4

@networkimprov Those functions don't really split (turn one into many) strings. And the proposed logic is different from that of the Split.* functions. Perhaps FirstPrefix and LastSuffix.

ainar-g on 10 Jul 2020

@ainar-g I believe these would be equivalent

name = strings.PrefixUntil(username, "@")
name, _ = strings.SplitFirst(username, "@")

jimmyfrasche on 10 Jul 2020

👍3

If I should rather Split off the below into its own proposal I can but it seems the frequency of both patterns (prefix, suffix and partition) together makes for a potential better outlook to be relevant enough for addition.

Adding more general strings.SplitFirst and strings.SplitLast:

Much code I have optimised in parsing functions involves splitting around an identifier and uses Index and LastIndex.
so a key, value, ok = strings.SplitFirst(kv, separator) (and same for Last) would be nice and would cover the use cases of strings.PrefixUntil and SuffixAfter too.

Why ok as a return? To distinguish the case of separator appearing in the source string or not. Knowing value == "" does not provide that.

I can give Googlers references to internal CLs. Here is one for std lib:
https://go-review.googlesource.com/c/go/+/231738/1/src/encoding/asn1/common.go

Why not strings.Split(N)? these allocate a return slice and there is no strings.SplitLastN.

Alternative: strings.Partition like in Python:

Another take on this could be to align this with str.partition from Python:
https://docs.python.org/3/library/stdtypes.html#str.partition

str.partition(sep)
Split the string at the first occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing the string itself, followed by two empty strings.

So we could discuss strings.Partition(s, sep string) (first, split, tail) and strings.PartitionLast(s, sep string) (head, split, last string) as an alternative with similar semantics as SplitFirst and SplitLast.

martisch on 10 Jul 2020

👍6

If something like this is accepted (and on balance I'm in favour), I'd prefer at least one variant that has a single return value so that it can be used directly in expressions.

rogpeppe on 21 Aug 2020

There may be an opportunity here, but it's a pretty subtle thing to get right as a general function.
For example, the example above is a bit difficult if the string has no @ sign:

name = strings.PrefixUntil(username, "@")
host = strings.SuffixAfter(address, "@")

parsing "rsc" will end up with name="rsc", host="rsc", which is not what you want.

The Python partition function is more like what you want, in that it distinguishes "separator is present" from "separator is not". But it's a bit awkward to use too since there are no tuples to index like in Python, and allocating a slice defeats the purpose.

There are many helper functions in strings already. Why is this one important, and - even more importantly - can we handle all the needs with a single function and still be convenient?

rsc on 4 Nov 2020

👍1

The reason why I like a function that does a pair split is that I have optimized many parsing functions where you want to split off one element of a delimited list in a string without allocating and at the same time getting the rest of the string without the delimiter and split of element. Having a std library function do this that doesnt allocate a return slice would help readability:
https://go-review.googlesource.com/c/go/+/231738/1/src/encoding/asn1/common.go

returning elem, separator, tail or elem, tail, ok as 3 return values does not need a slice or allocation.

martisch on 4 Nov 2020

👍1

We certainly write code to do this operation all the time. Even just grepping for 'Index.* >=' in the Go tree I find plenty of code like:

if i := strings.Index(line, "#"); i >= 0 {
    line = line[:i]
}

and

if i := strings.Index(raw, "@"); i >= 0 {
    pattern, rawVers = raw[:i], raw[i+1:]
    if strings.Contains(rawVers, "@") || rawVers == "" {
        return nil, fmt.Errorf("invalid module version syntax %q", raw)
    }
}

If we had a function like:

func Cut(s, sep string) (before, after string, ok bool) {
    if i := strings.Index(s, sep); i >= 0 {
        return s[:i], s[i+len(sep):], true
    }
    return s, "", false
}

then snippets like the above would become:

line, _, _ = strings.Cut(line, "#")

if pattern, rawVers, ok := strings.Cut(raw, "@"); ok {
    // no need for: pattern, rawVers = raw[:i], raw[i+1:]
    ...
}

It does seem like it comes up a lot. What do people think? Should we add this (single) Cut function?

Update: Edited to return (before, after, ok) instead of (before, sep, after), as suggested by @nightlyone.

rsc on 11 Nov 2020

👍57 👎1

Since we pass in the separator, one could also return an OK bool instead of the separator.

For this I would suggest

Cut(s, sep string) (before, after string, ok bool)

```

nightlyone on 11 Nov 2020

👍28 👎1

@nightlyone indeed, that signature is much better! Thanks.

rsc on 11 Nov 2020

Since when does Go favour functions to save a line or two? 😞 Who wants to remember all those names, including the argument order? There are quite a few non-trivial things missing in the strings package. If there's room for growth then this cannot be the priority. 🙏

pascaldekloe on 11 Nov 2020

👎4 👍2

@pascaldekloe can you list the issue numbers of some open string package proposals that should be prioritized?

martisch on 11 Nov 2020

👍3

I can think of some right now @martisch.

// NRunes returns the first n runes in s. If s has less that n runes the return is s.
// Illegal byte sequences count as one rune per byte
func NRunes(s string, n int) string

// IndexNth returns the index of the n-th occurence of substr in s.
func IndexNth(s, substr string) int

// SplitSeq returns the bytes in between the seps if and only if they occur in s in that very order.
func SplitSeq(s string, seps ...string) []string

Don't take these examples too literally. I'm making these up just now. The last one might even be quite useful like SplitSeq(kv, ": ", "\r\n"). The point is that you really need something to be generic and widely useful. Millions of developers will have to learn this function, even if they personally don't use it.

pascaldekloe on 12 Nov 2020

Also Split2 may be more consitent than Cut as a name.

pascaldekloe on 12 Nov 2020

@pascaldekloe, I've never written code that would have used any of those three. On the other hand, I can point to tons of code I've written using Index and manual slicing that would be helped by Cut. Just one data point, but those seem like pretty rare needs that could easily be handled outside the standard library.

rsc on 12 Nov 2020

👍3

If the usage metric is important then In(a []string, match string) bool perhaps. MinInt(a, b) int? I'll be quit now. 😄

pascaldekloe on 12 Nov 2020

@pascaldekloe I'd encourage you to file separate proposals if you have a strong case for any new API in particular. I don't think it's helpful to continue in this thread. This proposal should be considered on its own merits and drawbacks alone, not competing for priority against a handful of other potential proposals.

mvdan on 12 Nov 2020

👍3

If something like this is accepted (and on balance I'm in favour), I'd prefer at least one variant that has a single return value so that it can be used directly in expressions.

I think the latest design still reasonably supports this:

// can't be a single line, but not too bad
line, _, _ = strings.Cut(line, "#")
use(line)

// with the ok check, plus scoping of the variable
if pattern, _, ok = strings.Cut(raw, "@"); ok {
    use(pattern)
}

I agree with Russ's point in https://github.com/golang/go/issues/40135#issuecomment-721906268, though I agree in some cases it would be nice to not have to declare an extra variable.

mvdan on 12 Nov 2020

My point was *not to add any of these functions @mvdan 🙃

pascaldekloe on 12 Nov 2020

Then, if you point is "we shouldn't add this func because we shouldn't add any of those others either", I disagree. You're assuming that they're all equally important. If you argue that it's impossible to gauge how important they are relative to each other, then we would never add any new APIs because we would be stuck in "what about" and "what if" discussions.

mvdan on 12 Nov 2020

This idea

if a, b, ok := strings.Cut(s, sep); ok { ... }

looks like a special case of
``if list := strings.SplitN(s, sep, 2); len(list) == 2 { ... }

networkimprov on 12 Nov 2020

@networkimprov please see the response that @martisch already provided:

Why not strings.Split(N)? these allocate a return slice and there is no strings.SplitLastN.

mvdan on 12 Nov 2020

👍2

He asked for SplitFirst & SplitLast, but Cut is only SplitFirst.

Is the overhead of returning a slice vs two strings a problem much of the time?

EDIT: also your tone may imply that I hadn't read the thread.

networkimprov on 12 Nov 2020

An allocation is generally significantly more expensive than returning two extra values; allocations create GC work too.

EDIT: also your tone may imply that I hadn't read the thread.

I am simply pointing to previous replies to prevent the thread from going in circles. I don't imply anything else by it.

mvdan on 12 Nov 2020

Any possibility of optimizing creation of very short non-escaping slices, e.g. represent it as an array internally?

networkimprov on 12 Nov 2020

@networkimprov, Yes, avoiding the allocation keeps people from using strings.Split or SplitN here.
And while we could possibly fix the allocation in SplitN for small constant n, from an API point of view it is still far nicer to write:

if pattern, rawVers, ok := strings.Cut(raw, "@"); ok {
    // no need for: pattern, rawVers = raw[:i], raw[i+1:]
    ...
}

than:

if x := strings.SplitN(raw, "@", 2); len(x) == 2 {
    pattern, rawVers := x[0], x[1]
    ...
}

rsc on 12 Nov 2020

👍6

This thread is starting to escalate, and it feels like the kind of trivial change that leads to classic bikeshed arguments.

Before you post, please think about:
(1) whether the point you are about to make is really important for evaluating this simple API, and
(2) whether someone has already made the point - in that case, just thumbs up the existing comment.

Thanks!

rsc on 12 Nov 2020

👍7 ❤2

I'd disagree that "SplitN" isn't nice enough. I name a short, constant slice to indicate both items:
pattern_rawver := strings.SplitN(s, sep, 2)

I'd instead suggest adding "SplitLastN" and optimizing slice creation. Both have wider applicability, and the latter improves existing programs.

networkimprov on 12 Nov 2020

👎3

Cut has quite a nice API, and a nice name. But I'm sympathetic to the concern about adding too many new functions to the package. I buy the argument that Cut is a worthwhile addition, but I would like to understand better where we draw the line, and would be curious how others in this thread think about that. After all, adding new APIs does add complexity.

hherman1 on 13 Nov 2020

@hherman1 I think Cut can be evaluated on its own without knowing exactly where the general threshold is for inclusion. There are many dimensions and the existing Go ecosystem and how developers write Go code to consider. If there is a general discussion needed of inclusion guidelines then I would suggest to open a new separate github issue about that to keep the discussion here focused on the specific proposal and use cases.

martisch on 13 Nov 2020

I'd instead suggest adding "SplitLastN" and optimizing slice creation. Both have wider applicability, and the latter improves existing programs.

Performance:

Returning a variable sized value (slice backing array) on the stack requires upstack allocation and variable stack allocation (alloca). Both add considerable runtime complexity and can also affect performance of code not using those features as all Go code now has to deal with the possibility of the callee allocating a variable sized slice on the stack.

While doing performance optimizations in dozens of different projects my experience strings.Split/SplitN has been often the one where the return slice allocation was the easiest to replace unfortunately with reduced readability of the new code. (unless writing Cut outside of std lib into its own library). Having a replacement in the std library would be nice to improve performance and readability and could likely also lead to Go programers using the more performant function from the start.

As an anecdote to understand the potential magnitude of the performance difference of avoiding strings.Split and allocation. I have made at least one replacement CL for strings.Split in a large production code base that saved more resources than any other string function in the string package uses in absolute resources in the production environment. One reason maybe that strings.Split is often used in parsing strings (e.g. key value pairs) and parsing strings happens a lot without writing full custom hand optimized parsers. strings.Split is one of the common string functions that allocates and allocation is more compute and lock heavy vs the more lightweight computations that are usually needed for strings (Index, Trim, HasPrefix, ...).

Readability:

To me also the utility of strings.Cut to strings.SplitN(..., ..., 2) is a bit like strings.Contains to strings.Index. One could have said strings.Contains is easy and can be written outside of std library. But its so common and readable that it warrants inclusion to be there from the start. Now strings.Cut is less common and has a more complex signature to start but the replacement code with strings.SplitN(..., ..., 2) is also more complex.

Side note for function type signature

I had proposed key, value, ok = strings.SplitFirst(kv, separator) in my first comment on the issue as alternative to introducing PrefixUntil and this matches the type signature of the current thumbs up favorited version with function name Cut later proposed by @nightlyone. As far as I remember this would have fitted all use cases I had for the Cut function well which was why it immediately came to my mind as alternative for the original proposal.

martisch on 13 Nov 2020

👎1

Returning a variable sized value (slice backing array) on the stack requires upstack allocation and variable stack allocation (alloca).

I'm not v familiar with the runtime, but you'd only create a slice on the stack when it's small, can't escape, has a cap known at compile time, and can't be assigned anything of len > cap.

I'm not opposed to SplitFirst & SplitLast; I'd use them. I don't think they should be added instead of optimization.

networkimprov on 13 Nov 2020

👎1

I'm not v familiar with the runtime, but you'd only create a slice on the stack when it's small, can't escape, has a cap known at compile time, and can't be assigned anything of len > cap.

If that is the constraints of the optimization that is supposed to optimize the allocation in Split and SplitN it wont apply since the returned slice is variable length (Update: also capacity, inside SplitN: N ist not const) and it escapes those functions. Leaving what happens after inlining aside for a moment IF its inlined which is not always the case.

martisch on 13 Nov 2020

😕1

Creating it on the stack to return it means it can't escape the caller.

cap(s) must be known at compile time (i.e. SplitN with const N), but len(s) can vary.

networkimprov on 13 Nov 2020

@networkimprov I think you've made your points clear - you don't think Cut helps readability, and you think some SplitN calls could avoid allocations. For that last point, you should file an issue with the details on how exactly that would work with the compiler. I don't think it's a good idea to continue that discussion here, like Russ mentioned in his last comment.

mvdan on 13 Nov 2020

👍6

Sorry to belabor the point. @martisch claimed it wasn't feasible, so I tried to explain why it might be.

Someone more familiar with compiler and runtime should file that issue. And maybe this could be placed on hold until that's decided.

networkimprov on 13 Nov 2020

👎2

There's far too much discussion about this for right now. Putting on hold. Sorry for such an attractive bikeshed.

rsc on 18 Nov 2020

👍2 🎉1

cc @josharian @mdempsky re filing an issue to consider optimizing small, constant-cap slices; discussion began at https://github.com/golang/go/issues/40135#issuecomment-726561975

networkimprov on 18 Nov 2020

Is there a reason to even consider the second example as a justification for this API?

Can't this snippet

idx = strings.LastIndex(address, "@")
if idx != -1 {
  host = address[idx+1:]
} else {
  host = address
}

Simply be rewritten as:

host := address[1+strings.LastIndex(address, "@"):]

as on 18 Nov 2020

❤2 👀1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

proposal: cmd/vet: vet should warn when time.Time type (or types embed it) is used as map keys.

go101 · 3Comments

cmd/compile: regression in 69a7c15

OneOfOne · 3Comments

all: T.FailNow used in goroutines in standard library tests

dominikh · 3Comments

x/build/cmd/coordinator: ssh proxy should support scp

bradfitz · 3Comments

cmd/go: cross compile failed without useful error message

mingrammer · 3Comments