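What version of Go are you using (go version)?
$ go version
go version go1.15.3 darwin/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output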
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/sethvargo/Library/Caches/go-build"
GOENV="/Users/sethvargo/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/sethvargo/Development/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/sethvargo/Development/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/sethvargo/.homebrew/Cellar/go/1.15.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/sethvargo/.homebrew/Cellar/go/1.15.3/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cs/jc9pj94x493gb8jr49ys7cnc00gy5b/T/go-build188500672=/tmp/go-build -gno-record-gcc-switches -fno-common"
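What did you do?
https://play.golang.org/p/V3JHSB7kQX9
package main

import (
	"fmt"
	"strings"
	"unicode"
)

func main() {
	// "hi" followed by ZERO WIDTH NO-BREAK SPACE (3 bytes in UTF-8).
	s := "hi\uFEFF"
	fmt.Println(len(s))

	// TrimSpace only removes runes for which unicode.IsSpace is true.
	s = strings.TrimSpace(s)
	fmt.Println(len(s))

	fmt.Printf("%t", unicode.IsSpace('\uFEFF'))
}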
What did you expect to see?
5
2
true
What did you see instead?
5
5
false
As documented at https://golang.org/pkg/unicode/#IsSpace, this is determined by Unicode. Unicode character feff is not in the "space" category. The characters in that category can be found at http://www.fileformat.info/info/unicode/category/Zs/list.htm. So this seems like an issue to raise with the Unicode consortium.
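For anyone who wants to check the classification locally, here is a minimal sketch. The unicode package documents IsSpace in terms of Unicode's White Space property, so the snippet prints that property next to the Zs and Cf general categories for a few runes. The helper name classify is made up for this example and is not part of any package.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper: it reports how the tables in Go's
// unicode package categorize r.
func classify(r rune) (isSpace, whiteSpace, zs, cf bool) {
	return unicode.IsSpace(r),
		unicode.Is(unicode.White_Space, r), // Unicode White_Space property
		unicode.Is(unicode.Zs, r), // general category "Separator, space"
		unicode.Is(unicode.Cf, r) // general category "Other, format"
}

func main() {
	// SPACE, TAB, NO-BREAK SPACE, ZERO WIDTH SPACE, ZERO WIDTH NO-BREAK SPACE.
	for _, r := range []rune{' ', '\t', '\u00A0', '\u200B', '\uFEFF'} {
		isSpace, ws, zs, cf := classify(r)
		fmt.Printf("%U IsSpace=%t White_Space=%t Zs=%t Cf=%t\n", r, isSpace, ws, zs, cf)
	}
}
```

With the tables shipped in current Go releases, both zero-width characters should show up as format characters (Cf) rather than as white space, which is why TrimSpace leaves them alone.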
@ianlancetaylor would you be open to a docs update to clarify this? I understand it's the spec, but I don't expect most Go developers to have read and fully understood the latest Unicode spec. The character has the word "space" in its name, and developers would incorrectly assume that TrimSpace removes it. Adding something like the following to IsSpace could save a future developer a lot of time without much maintenance overhead for the Go team:
Despite their name, the characters ZERO WIDTH SPACE (\u200B) and ZERO WIDTH NO-BREAK SPACE (\uFEFF) are not classified as space characters in Unicode.
CC @mpvl for thoughts.
For reference, there are 71 Unicode characters that have "SPACE" in their name but for which IsSpace returns false. 62 of them actually contain "MONOSPACE" (e.g. 0x1d670 MATHEMATICAL MONOSPACE CAPITAL A); the other 9 are:
0x1361 ETHIOPIC WORDSPACE
0x200b ZERO WIDTH SPACE
0x2408 SYMBOL FOR BACKSPACE
0x2420 SYMBOL FOR SPACE
0x303f IDEOGRAPHIC HALF FILL SPACE
0xfeff ZERO WIDTH NO-BREAK SPACE
0x1da7f SIGNWRITING LOCATION-WALLPLANE SPACE
0x1da80 SIGNWRITING LOCATION-FLOORPLANE SPACE
0xe0020 TAG SPACE
package main

import (
	"fmt"
	"strings"
	"unicode"

	"golang.org/x/text/unicode/runenames"
)

func main() {
	// Walk the code points and report those whose name contains "SPACE"
	// but that unicode.IsSpace does not treat as a space.
	for r := rune(0); r < unicode.MaxRune; r++ {
		name := runenames.Name(r)
		if !unicode.IsSpace(r) && strings.Contains(name, "SPACE") {
			fmt.Printf("%#0x %s\n", r, name)
		}
	}
}
I'm not suggesting we enumerate _all_ of them, but ZERO WIDTH NO-BREAK SPACE is especially problematic because it frequently appears if you copy a value from Microsoft Excel 😐
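In the meantime, a possible workaround for anyone hitting this is to trim with a custom predicate. This is only a sketch: isSpaceOrZeroWidth and trimSpaceAndZeroWidth are hypothetical names, and the choice to treat only U+200B and U+FEFF as extra trimmable characters is an assumption based on the two characters named above. strings.TrimFunc and unicode.IsSpace are the standard library functions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSpaceOrZeroWidth is a hypothetical predicate: everything
// unicode.IsSpace accepts, plus ZERO WIDTH SPACE and
// ZERO WIDTH NO-BREAK SPACE.
func isSpaceOrZeroWidth(r rune) bool {
	return unicode.IsSpace(r) || r == '\u200B' || r == '\uFEFF'
}

// trimSpaceAndZeroWidth removes leading and trailing runes matched by
// isSpaceOrZeroWidth. Characters in the middle of s are left untouched.
func trimSpaceAndZeroWidth(s string) string {
	return strings.TrimFunc(s, isSpaceOrZeroWidth)
}

func main() {
	s := "\uFEFF hi\u200B\n"
	fmt.Printf("%q\n", strings.TrimSpace(s))
	fmt.Printf("%q\n", trimSpaceAndZeroWidth(s))
}
```

Only the ends of the string are affected, so a zero-width character pasted into the middle of a value would still need separate handling.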