go version)?$ go version go version devel +c870d56f98 Wed Oct 10 02:08:36 2018 +0000 darwin/amd64
Yes it does, with Go 1.11.1.
go env)?go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dmitri/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/dmitri/Dropbox/Work/2013/GoLanding:/Users/dmitri/Dropbox/Work/2013/GoLand"
GOPROXY=""
GORACE=""
GOROOT="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go"
GOTMPDIR=""
GOTOOLDIR="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/b8/66r1c5856mqds1mrf2tjtq8w0000gn/T/go-build878012089=/tmp/go-build -gno-record-gcc-switches -fno-common"
I tried scanning over the following minified Go code containing illegal tokens:
a..b
With the following program:
package main
import (
"fmt"
"go/scanner"
"go/token"
)
func main() {
// src is the input that we want to tokenize.
src := []byte("a..b")
// Initialize the scanner.
var s scanner.Scanner
fset := token.NewFileSet() // positions are relative to fset
file := fset.AddFile("", fset.Base(), len(src)) // register input "file"
s.Init(file, src, nil /* no error handler */, scanner.ScanComments)
// Repeated calls to Scan yield the token sequence found in the input.
for {
pos, tok, lit := s.Scan()
if tok == token.EOF {
break
}
fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
}
}
(Playground: https://play.golang.org/p/No77Yx0D4Ki.)
go/scanner.Scanner.Scan documentation says:
If the returned token is token.ILLEGAL, the literal string is the offending character.
In all other cases, Scan returns an empty literal string.
Perhaps I'm misunderstanding the documentation, but I expected that if returned token is token.ILLEGAL, then the literal string should not be empty. I expected to see the literal string contain the offending character or characters.
1:1 IDENT "a"
1:2 ILLEGAL ".."
1:4 IDENT "b"
1:5 ; "\n"
A token.ILLEGAL token and empty literal string:
1:1 IDENT "a"
1:2 ILLEGAL ""
1:4 IDENT "b"
1:5 ; "\n"
/cc @griesemer
The ELLIPSIS token is missing its last character, so the the "offending character" in this case is empty. The leading ".." sequence is ok, but the only valid character after that is a third '.'.
@cznic The offending character here is the 'b' - it if were a '.' it would be ok. This is clearly a bug in the scanner.
Heh, used a .. b in my test by mistake. Still, I'm not sure if the offending character, when ir does not start a token, should be consumed as it may start a valid token.
@cznic My apologies - I concluded a bit too quickly. It's not clear here what exactly the offending character should be. Your suggestion of an empty character might be just as valid. It's certainly problematic that the 'if' starting on scanner.go:740 has not 'else' branch and thus silently exits the big 'switch' statement. Because the result 'tok' is not set at all, it returns the initial (invalid) token.
For reference: #28128
Actually, while this is a bug in the scanner (and there are others, related to this), the spec is pretty clear:
While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.
So the longest valid sequence of characters here is a dot . followed by another one. That is, the scanner should not return an ILLEGAL token at all. That is what cmd/compile's scanner does.
I vaguely remember having been aware of this for a long time but never bothered to get it 100% correct in go/scanner because it doesn't matter in practice. There's no Go code containing two dots ".." that is valid, so as long as we get an error, we're mostly ok.
There are probably other such cases such as invalid octal or hex numbers, say: 07a. Is that the octal literal 07 followed by an a, or an invalid octal constant? Following the word of the spec exactly, it should be the former. On the other hand, an error complaining about an invalid octal constant may be clearer to a reader than an error due to invalid syntax because the octal 07 is incorrectly followed by the identifier a. Maybe the spec should be less demanding in those cases.
I will try to address the specific problem here if I can find an easy fix (which doesn't slow down the scanner everywhere because we need to keep more state around) but leave the more general problem open.
Change https://golang.org/cl/141337 mentions this issue: go/scanner: don't return token.INVALID for ".." sequence
There's no Go code containing two dots ".." that is valid,
In case you're curious how I discovered this, I ran into this in issue #28082. You left a comment there containing a fenced code block marked up with "Go" syntax, but it was actually a _diff_ of Go code.
I was viewing that comment with some software that performs syntax highlighting for Go using a highlighter implemented on top of go/scanner, and it was behaving incorrectly on the index 1de7cd81b2..38b50f2596 100644 line because I made the wrong assumption about literal being non-empty when token is token.ILLEGAL.
@dmitshur Interesting. The scanner should always return the offending character (lit) if there's an ILLEGAL token. I believe this should be the case now.
Most helpful comment
The
ELLIPSIStoken is missing its last character, so the the "offending character" in this case is empty. The leading".."sequence is ok, but the only valid character after that is a third'.'.