Go: regexp: backreference to capturing group breaks if followed by underscore

Created on 15 Jun 2020 · 3 Comments · Source: golang/go

What version of Go are you using (go version)?

$ go version
1.14.4

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/a/.cache/go-build"
GOENV="/home/a/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/a/.local/share/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build462357482=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Minimal case: Play

What did you expect to see?

I expect the replacement template "$2_$1" to work without having to write it as "${2}_$1", as in Python etc.
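The original playground link is not preserved here, so the following is only a minimal sketch of the kind of program that shows the behavior. The input string, the regular expression, and the helper names swapUnbraced and swapBraced are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// pair matches two comma-separated words, e.g. "foo,bar".
var pair = regexp.MustCompile(`(\w+),(\w+)`)

// swapUnbraced uses the template from the report. Go reads "$2_" as a
// reference to a group named "2_", which does not exist, so it expands
// to the empty string.
func swapUnbraced(s string) string {
	return pair.ReplaceAllString(s, "$2_$1")
}

// swapBraced delimits the group number explicitly with braces.
func swapBraced(s string) string {
	return pair.ReplaceAllString(s, "${2}_$1")
}

func main() {
	fmt.Println(swapUnbraced("foo,bar")) // foo
	fmt.Println(swapBraced("foo,bar"))   // bar_foo
}
```

The unbraced form silently drops the second group and the underscore instead of reporting an error, which is what makes the difference easy to miss.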

NeedsInvestigation

Most helpful comment

JavaScript

console.log('foo,bar'.replace(/(\w+),(\w+)/, '$2_$1'));

Result is bar_foo

Perl

my $a = 'foo,bar';
$a =~ s/(\w+),(\w+)/\2_\1/;
warn $a;

Result is bar_foo

Ruby

puts 'foo,bar'.sub(/(\w+),(\w+)/, '\2_\1')

Result is bar_foo

So, I propose to fix the behavior of Go.

diff --git a/src/regexp/all_test.go b/src/regexp/all_test.go
index be7a2e7111..7d944d4844 100644
--- a/src/regexp/all_test.go
+++ b/src/regexp/all_test.go
@@ -227,6 +227,7 @@ var replaceTests = []ReplaceTest{
    {"(a)(((b))){0}c", ".$1.", "xacxacx", "x.a.x.a.x"},
    {"((a(b){0}){3}){5}(h)", "y caramb$2", "say aaaaaaaaaaaaaaaah", "say ay caramba"},
    {"((a(b){0}){3}){5}h", "y caramb$2", "say aaaaaaaaaaaaaaaah", "say ay caramba"},
+   {"(Hello)_(World)", "$2_$1", "Hello_World!", "World_Hello!"},
 }

 var replaceLiteralTests = []ReplaceTest{
diff --git a/src/regexp/regexp.go b/src/regexp/regexp.go
index b547a2ab97..7bab7a5d81 100644
--- a/src/regexp/regexp.go
+++ b/src/regexp/regexp.go
@@ -981,12 +981,24 @@ func extract(str string) (name string, num int, rest string, ok bool) {
        str = str[1:]
    }
    i := 0
-   for i < len(str) {
-       rune, size := utf8.DecodeRuneInString(str[i:])
-       if !unicode.IsLetter(rune) && !unicode.IsDigit(rune) && rune != '_' {
-           break
+   b := str[0]
+   if !brace && '0' <= b && b <= '9' {
+       i++
+       for i < len(str) {
+           rune, size := utf8.DecodeRuneInString(str[i:])
+           if !unicode.IsLetter(rune) && !unicode.IsDigit(rune) {
+               break
+           }
+           i += size
+       }
+   } else {
+       for i < len(str) {
+           rune, size := utf8.DecodeRuneInString(str[i:])
+           if !unicode.IsLetter(rune) && !unicode.IsDigit(rune) && rune != '_' {
+               break
+           }
+           i += size
        }
-       i += size
    }
    if i == 0 {
        // empty name is not okay

All 3 comments

@rsc

This may be counter-intuitive, but if I interpret the documentation correctly, I think this is the way it is supposed to work:

In the template, a variable is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores.
...
In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.

So, the template in the example "$2_$1" is the same as "${2_}${1}", not "${2}_${1}".
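The documented rules quoted above can be checked with a small program. This is a sketch under assumptions: the input "foo,bar", the group name last_part, and the helper expand are invented for illustration. The named group is included to show a case where an underscore inside a name is exactly what the author of a template wants.

```go
package main

import (
	"fmt"
	"regexp"
)

// re has two capturing groups; the second also carries the
// illustrative name "last_part".
var re = regexp.MustCompile(`(\w+),(?P<last_part>\w+)`)

// expand applies a replacement template to the fixed input "foo,bar".
func expand(template string) string {
	return re.ReplaceAllString("foo,bar", template)
}

func main() {
	fmt.Printf("%q\n", expand("$1x"))        // "": read as ${1x}, no such group
	fmt.Printf("%q\n", expand("${1}x"))      // "foox"
	fmt.Printf("%q\n", expand("$10"))        // "": read as ${10}, no such group
	fmt.Printf("%q\n", expand("${1}0"))      // "foo0"
	fmt.Printf("%q\n", expand("$last_part")) // "bar"
	fmt.Printf("%q\n", expand("$2_$1"))      // "foo": read as ${2_}${1}
	fmt.Printf("%q\n", expand("${2}_${1}"))  // "bar_foo"
}
```

References to groups that do not exist expand to the empty string rather than causing an error, so a name that is read longer than intended simply disappears from the output.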
