In many projects I've seen code like this:
idx = strings.Index(username, "@")
if idx != -1 {
name = username[:idx]
} else {
name = username
}
idx = strings.LastIndex(address, "@")
if idx != -1 {
host = address[idx+1:]
} else {
host = address
}
I think this operation鈥攇etting a prefix or suffix based on
a substring鈥攊s common enough to think about adding helpers for them.
So I propose to add functions PrefixUntil
and
SuffixAfter
(names up for debate) to package strings.
Something like:
// PrefixUntil returns the prefix of the string s until the first
// occurrence of the substring substr, excluding the substring itself.
// It returns s if substr was not found.
func PrefixUntil(s, substr string) (prefix string) {
var idx = Index(s, substr)
if idx == -1 {
return s
}
return s[:idx]
}
// SuffixAfter returns the suffix of the string s after the last
// occurrence of the substring substr, excluding the substring itself.
// It returns s if substr was not found.
func SuffixAfter(s, substr string) (suffix string) {
var idx = LastIndex(s, substr)
if idx == -1 {
return s
}
return s[idx+len(substr):]
}
So that the code could be rewritten as:
name = strings.PrefixUntil(username, "@")
host = strings.SuffixAfter(address, "@")
It would help if you could identify some cases in the standard library, and/or in other popular Go packages, where the new functions would be used. That is, give some supporting information for "in many projects I've seen...." Thanks.
@ianlancetaylor A simple grep
query shows lots,
actually. Here's one from package encoding/xml
:
if i := strings.LastIndex(prefix, "/"); i >= 0 {
prefix = prefix[i+1:]
}
Here's one from path
:
if i := strings.LastIndex(path, "/"); i >= 0 {
path = path[i+1:]
}
net/url
:
i := strings.LastIndex(authority, "@")
if i < 0 {
host, err = parseHost(authority)
} else {
host, err = parseHost(authority[i+1:])
}
net/http
(IndexByte
, but the meaning is the
same):
if i := strings.IndexByte(resp.Status, ' '); i != -1 {
statusCode = resp.Status[:i]
}
go/types
:
if i := strings.LastIndex(name, "/"); i >= 0 {
name = name[i+1:]
}
test/fixedbugs/issue32778.go
, a copy of the code from
the original issue, which in turn, I assume, is a version of the code
the reporter of that issue saw:
if i := strings.LastIndexByte(string(n), '.'); i >= 0 {
return Name(n[i+1:])
}
return Name(n)
Lots of that in test/run.go
.
I discovered these using:
$ grep -A 5 -e 'strings.\(Last\)\?Index(' -r
And then sifting the results manually.
This looks like a variation on strings.SplitN(str, "@", 2)
. So maybe SplitFirst & SplitLast?
@networkimprov Those functions don't really split (turn one
into many) strings. And the proposed logic is different from that of
the Split.*
functions. Perhaps FirstPrefix
and LastSuffix
.
@ainar-g I believe these would be equivalent
name = strings.PrefixUntil(username, "@")
name, _ = strings.SplitFirst(username, "@")
If I should rather Split off the below into its own proposal I can but it seems the frequency of both patterns (prefix, suffix and partition) together makes for a potential better outlook to be relevant enough for addition.
Much code I have optimised in parsing functions involves splitting around an identifier and uses Index and LastIndex.
so a key, value, ok = strings.SplitFirst(kv, separator)
(and same for Last) would be nice and would cover the use cases of strings.PrefixUntil and SuffixAfter too.
Why ok
as a return? To distinguish the case of separator appearing in the source string or not. Knowing value == ""
does not provide that.
I can give Googlers references to internal CLs. Here is one for std lib:
https://go-review.googlesource.com/c/go/+/231738/1/src/encoding/asn1/common.go
Why not strings.Split(N)
? these allocate a return slice and there is no strings.SplitLastN
.
Another take on this could be to align this with str.partition from Python:
https://docs.python.org/3/library/stdtypes.html#str.partition
str.partition(sep)
Split the string at the first occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing the string itself, followed by two empty strings.
So we could discuss strings.Partition(s, sep string) (first, split, tail)
and strings.PartitionLast(s, sep string) (head, split, last string)
as an alternative with similar semantics as SplitFirst
and SplitLast
.
If something like this is accepted (and on balance I'm in favour), I'd prefer at least one variant that has a single return value so that it can be used directly in expressions.
There may be an opportunity here, but it's a pretty subtle thing to get right as a general function.
For example, the example above is a bit difficult if the string has no @ sign:
name = strings.PrefixUntil(username, "@")
host = strings.SuffixAfter(address, "@")
parsing "rsc" will end up with name="rsc", host="rsc", which is not what you want.
The Python partition function is more like what you want, in that it distinguishes "separator is present" from "separator is not". But it's a bit awkward to use too since there are no tuples to index like in Python, and allocating a slice defeats the purpose.
There are many helper functions in strings already. Why is this one important, and - even more importantly - can we handle all the needs with a single function and still be convenient?
The reason why I like a function that does a pair split is that I have optimized many parsing functions where you want to split off one element of a delimited list in a string without allocating and at the same time getting the rest of the string without the delimiter and split of element. Having a std library function do this that doesnt allocate a return slice would help readability:
https://go-review.googlesource.com/c/go/+/231738/1/src/encoding/asn1/common.go
returning elem, separator, tail or elem, tail, ok as 3 return values does not need a slice or allocation.
We certainly write code to do this operation all the time. Even just grepping for 'Index.* >=' in the Go tree I find plenty of code like:
if i := strings.Index(line, "#"); i >= 0 {
line = line[:i]
}
and
if i := strings.Index(raw, "@"); i >= 0 {
pattern, rawVers = raw[:i], raw[i+1:]
if strings.Contains(rawVers, "@") || rawVers == "" {
return nil, fmt.Errorf("invalid module version syntax %q", raw)
}
}
If we had a function like:
func Cut(s, sep string) (before, after string, ok bool) {
if i := strings.Index(s, sep); i >= 0 {
return s[:i], s[i+len(sep):], true
}
return s, "", false
}
then snippets like the above would become:
line, _, _ = strings.Cut(line, "#")
if pattern, rawVers, ok := strings.Cut(raw, "@"); ok {
// no need for: pattern, rawVers = raw[:i], raw[i+1:]
...
}
It does seem like it comes up a lot. What do people think? Should we add this (single) Cut function?
Update: Edited to return (before, after, ok) instead of (before, sep, after), as suggested by @nightlyone.
Since we pass in the separator, one could also return an OK bool instead of the separator.
For this I would suggest
Cut(s, sep string) (before, after string, ok bool)
```
@nightlyone indeed, that signature is much better! Thanks.
Since when does Go favour functions to save a line or two? 馃槥 Who wants to remember all those names, including the argument order? There are quite a few non-trivial things missing in the strings package. If there's room for growth then this cannot be the priority. 馃檹
@pascaldekloe can you list the issue numbers of some open string package proposals that should be prioritized?
I can think of some right now @martisch.
// NRunes returns the first n runes in s. If s has less that n runes the return is s.
// Illegal byte sequences count as one rune per byte
func NRunes(s string, n int) string
// IndexNth returns the index of the n-th occurence of substr in s.
func IndexNth(s, substr string) int
// SplitSeq returns the bytes in between the seps if and only if they occur in s in that very order.
func SplitSeq(s string, seps ...string) []string
Don't take these examples too literally. I'm making these up just now. The last one might even be quite useful like SplitSeq(kv, ": ", "\r\n")
. The point is that you really need something to be generic and widely useful. Millions of developers will have to learn this function, even if they personally don't use it.
Also Split2
may be more consitent than Cut
as a name.
@pascaldekloe, I've never written code that would have used any of those three. On the other hand, I can point to tons of code I've written using Index and manual slicing that would be helped by Cut. Just one data point, but those seem like pretty rare needs that could easily be handled outside the standard library.
If the usage metric is important then In(a []string, match string) bool
perhaps. MinInt(a, b) int
? I'll be quit now. 馃槃
@pascaldekloe I'd encourage you to file separate proposals if you have a strong case for any new API in particular. I don't think it's helpful to continue in this thread. This proposal should be considered on its own merits and drawbacks alone, not competing for priority against a handful of other potential proposals.
If something like this is accepted (and on balance I'm in favour), I'd prefer at least one variant that has a single return value so that it can be used directly in expressions.
I think the latest design still reasonably supports this:
// can't be a single line, but not too bad
line, _, _ = strings.Cut(line, "#")
use(line)
// with the ok check, plus scoping of the variable
if pattern, _, ok = strings.Cut(raw, "@"); ok {
use(pattern)
}
I agree with Russ's point in https://github.com/golang/go/issues/40135#issuecomment-721906268, though I agree in some cases it would be nice to not have to declare an extra variable.
My point was *not to add any of these functions @mvdan 馃檭
Then, if you point is "we shouldn't add this func because we shouldn't add any of those others either", I disagree. You're assuming that they're all equally important. If you argue that it's impossible to gauge how important they are relative to each other, then we would never add any new APIs because we would be stuck in "what about" and "what if" discussions.
This idea
if a, b, ok := strings.Cut(s, sep); ok { ... }
looks like a special case of
``
if list := strings.SplitN(s, sep, 2); len(list) == 2 { ... }
@networkimprov please see the response that @martisch already provided:
Why not
strings.Split(N)
? these allocate a return slice and there is nostrings.SplitLastN
.
He asked for SplitFirst & SplitLast, but Cut is only SplitFirst.
Is the overhead of returning a slice vs two strings a problem much of the time?
EDIT: also your tone may imply that I hadn't read the thread.
An allocation is generally significantly more expensive than returning two extra values; allocations create GC work too.
EDIT: also your tone may imply that I hadn't read the thread.
I am simply pointing to previous replies to prevent the thread from going in circles. I don't imply anything else by it.
Any possibility of optimizing creation of very short non-escaping slices, e.g. represent it as an array internally?
@networkimprov, Yes, avoiding the allocation keeps people from using strings.Split or SplitN here.
And while we could possibly fix the allocation in SplitN for small constant n, from an API point of view it is still far nicer to write:
if pattern, rawVers, ok := strings.Cut(raw, "@"); ok {
// no need for: pattern, rawVers = raw[:i], raw[i+1:]
...
}
than:
if x := strings.SplitN(raw, "@", 2); len(x) == 2 {
pattern, rawVers := x[0], x[1]
...
}
This thread is starting to escalate, and it feels like the kind of trivial change that leads to classic bikeshed arguments.
Before you post, please think about:
(1) whether the point you are about to make is really important for evaluating this simple API, and
(2) whether someone has already made the point - in that case, just thumbs up the existing comment.
Thanks!
I'd disagree that "SplitN" isn't nice enough. I name a short, constant slice to indicate both items:
pattern_rawver := strings.SplitN(s, sep, 2)
I'd instead suggest adding "SplitLastN" and optimizing slice creation. Both have wider applicability, and the latter improves existing programs.
Cut has quite a nice API, and a nice name. But I'm sympathetic to the concern about adding too many new functions to the package. I buy the argument that Cut is a worthwhile addition, but I would like to understand better where we draw the line, and would be curious how others in this thread think about that. After all, adding new APIs does add complexity.
@hherman1 I think Cut can be evaluated on its own without knowing exactly where the general threshold is for inclusion. There are many dimensions and the existing Go ecosystem and how developers write Go code to consider. If there is a general discussion needed of inclusion guidelines then I would suggest to open a new separate github issue about that to keep the discussion here focused on the specific proposal and use cases.
I'd instead suggest adding "SplitLastN" and optimizing slice creation. Both have wider applicability, and the latter improves existing programs.
Returning a variable sized value (slice backing array) on the stack requires upstack allocation and variable stack allocation (alloca). Both add considerable runtime complexity and can also affect performance of code not using those features as all Go code now has to deal with the possibility of the callee allocating a variable sized slice on the stack.
While doing performance optimizations in dozens of different projects my experience strings.Split/SplitN has been often the one where the return slice allocation was the easiest to replace unfortunately with reduced readability of the new code. (unless writing Cut outside of std lib into its own library). Having a replacement in the std library would be nice to improve performance and readability and could likely also lead to Go programers using the more performant function from the start.
As an anecdote to understand the potential magnitude of the performance difference of avoiding strings.Split and allocation. I have made at least one replacement CL for strings.Split in a large production code base that saved more resources than any other string function in the string package uses in absolute resources in the production environment. One reason maybe that strings.Split is often used in parsing strings (e.g. key value pairs) and parsing strings happens a lot without writing full custom hand optimized parsers. strings.Split is one of the common string functions that allocates and allocation is more compute and lock heavy vs the more lightweight computations that are usually needed for strings (Index, Trim, HasPrefix, ...).
To me also the utility of strings.Cut
to strings.SplitN(..., ..., 2)
is a bit like strings.Contains
to strings.Index
. One could have said strings.Contains
is easy and can be written outside of std library. But its so common and readable that it warrants inclusion to be there from the start. Now strings.Cut
is less common and has a more complex signature to start but the replacement code with strings.SplitN(..., ..., 2)
is also more complex.
I had proposed key, value, ok = strings.SplitFirst(kv, separator)
in my first comment on the issue as alternative to introducing PrefixUntil
and this matches the type signature of the current thumbs up favorited version with function name Cut
later proposed by @nightlyone. As far as I remember this would have fitted all use cases I had for the Cut
function well which was why it immediately came to my mind as alternative for the original proposal.
Returning a variable sized value (slice backing array) on the stack requires upstack allocation and variable stack allocation (alloca).
I'm not v familiar with the runtime, but you'd only create a slice on the stack when it's small, can't escape, has a cap known at compile time, and can't be assigned anything of len > cap.
I'm not opposed to SplitFirst & SplitLast; I'd use them. I don't think they should be added instead of optimization.
I'm not v familiar with the runtime, but you'd only create a slice on the stack when it's small, can't escape, has a cap known at compile time, and can't be assigned anything of len > cap.
If that is the constraints of the optimization that is supposed to optimize the allocation in Split and SplitN it wont apply since the returned slice is variable length (Update: also capacity, inside SplitN: N ist not const) and it escapes those functions. Leaving what happens after inlining aside for a moment IF its inlined which is not always the case.
Creating it on the stack to return it means it can't escape the caller.
cap(s) must be known at compile time (i.e. SplitN with const N), but len(s) can vary.
@networkimprov I think you've made your points clear - you don't think Cut helps readability, and you think some SplitN calls could avoid allocations. For that last point, you should file an issue with the details on how exactly that would work with the compiler. I don't think it's a good idea to continue that discussion here, like Russ mentioned in his last comment.
Sorry to belabor the point. @martisch claimed it wasn't feasible, so I tried to explain why it might be.
Someone more familiar with compiler and runtime should file that issue. And maybe this could be placed on hold until that's decided.
There's far too much discussion about this for right now. Putting on hold. Sorry for such an attractive bikeshed.
cc @josharian @mdempsky re filing an issue to consider optimizing small, constant-cap slices; discussion began at https://github.com/golang/go/issues/40135#issuecomment-726561975
Is there a reason to even consider the second example as a justification for this API?
Can't this snippet
idx = strings.LastIndex(address, "@")
if idx != -1 {
host = address[idx+1:]
} else {
host = address
}
Simply be rewritten as:
host := address[1+strings.LastIndex(address, "@"):]
Most helpful comment
We certainly write code to do this operation all the time. Even just grepping for 'Index.* >=' in the Go tree I find plenty of code like:
and
If we had a function like:
then snippets like the above would become:
It does seem like it comes up a lot. What do people think? Should we add this (single) Cut function?
Update: Edited to return (before, after, ok) instead of (before, sep, after), as suggested by @nightlyone.