Go: proposal: regexp: add (*Regexp).SubexpIndex

Created on 4 Jun 2019  路  9Comments  路  Source: golang/go

Regular expressions are handy in a variety of simple string parsing situations. Using named capture groups is a good way to document the structure of a regular expression and to eliminate bugs due to the introduction of additional capture groups. Mapping the name to the capture group index is currently quite heavy-weight. I regularly find myself writing code like namedSubexp in the below toy example when using regular expressions to parse strings. What's worse is that I often don't write this code and instead just rely on a brittle hard-coded index.

The rejected proposal in https://github.com/golang/go/issues/24208 argued in favour of a much more heavyweight interface change that also does not appeal to me. github.com/ghemawat/re.Scan uses slices and reflection and thus is too inefficient for anything performance critical. I understand that the bar for changes here is high Furthermore, I do hear an argument in favour of using an external library as the regexp package is already quite large but I think it's exactly in those cases where I'd avoid writing this helper function that I'd also avoid pulling in a new dependency. The proposal here is compact, useful and encourages more maintainable code so I figured I'd float it and see if the it resonates.

This issue proposes a new method on *regexp.Regexp called NamedSubexp which takes a string and returns an integer. The open-ended portion of this proposal is how to deal with the case where no such capture group exists. I'm quite open to that integer being -1 if no such named capture group exists (as in the strings package) or to augmenting the method signature to additionally return a boolean. My inclination for the panic comes from a tendency to use this pattern to define global vars and it's probably bad practice to search the list of strings at runtime.

package main

import (
    "fmt"
    "regexp"
)

var (
    re     = regexp.MustCompile("foo (?P<bar>[0-9]+) (?P<baz>0x[0-9A-Fa-f]+)")
    barIdx = namedSubexp(re, "bar")
    bazIdx = namedSubexp(re, "baz")
)

func namedSubexp(re *regexp.Regexp, name string) int {
    for i, exp := range re.SubexpNames() {
        if exp == name {
            return i
        }
    }
        panic(fmt.Errorf("%v does not have a capture group named %s", re, name))
}

func main() {
    matches := re.FindStringSubmatch("foo 12 0xA123")
        // if matches == nil { ... }
    bar, baz := matches[barIdx], matches[bazIdx]
    fmt.Println(baz, bar)
}

https://play.golang.org/p/WT8dFyp1TCE

Proposal Proposal-Accepted

Most helpful comment

I just pushed a simple implementation of this, with tests and an example.

All 9 comments

Perhaps a more palatable solution would be to augment the strings library instead of adding another method to Regexp. The following function or something like it in the strings package would make me similarly happy:

package strings

// SliceIndex returns the index of the first instance of needle in haystack, or -1 if needle is not present in haystack.
// Slice index performs a linear search of haystack for needle.
func SliceIndex(haystack []string, needle string) int {
   for i, s := range haystack {
      if s == needle {
         return i
      }
   }
   return -1
}

Of course this then begs the question about the need for at least (Last)?SliceIndex(Func)? functions for symmetry with the regular string functions. It's also not obvious that this slice searching belongs in the strings package though I can't think of a more suitable package.

What do people think of this?

// SubexpIndex returns the index of the first subexpression with the given name,
// or else -1 if there is no subexpression with that name.
//
// Note that multiple subexpressions can be written using the same name, as in
// (?P<bob>a+)(?P<bob>b+), which declares two subexpressions named "bob".
// In this case SubexpIndex returns the index of the leftmost such subexpression
// in the regular expression.
func (*Regexp) SubexpIndex(name string) int

Leftmost match or leftmost in the expression?

(?Pa*)|(?Pb+) matching b

Leftmost in the expression, because the method is on Regexp. There's no input text involved when you ask the question. (Hopefully people will just not name the same subexpression twice but there has to be a clear rule.)

Based on mild happiness and no negativity, accepting for API in https://github.com/golang/go/issues/32420#issuecomment-505621472.

Sounds good to me. I'm hopeful this will encourage people to name subexpressions. I'm especially on board if the lookup will be O(1). I worry a bit that a linear scan might be something of a footgun for somebody if they call it in a hot loop. The thought occurred to me that a detailed example on Regexp.SubexpNames which demonstrates looking up indexes at init time with a small function one could just copy out of the godoc might be good enough. Having this method is more likely to change behavior and thus is better. Thanks for taking this up!

Change https://golang.org/cl/187919 mentions this issue: regexp: add (*Regexp).SubexpIndex

I just pushed a simple implementation of this, with tests and an example.

I went a little further than this but I certainly appreciate that support for named capture groups is being added! Here's what I created - https://github.com/PennState/subexp.

Was this page helpful?
0 / 5 - 0 ratings