go/scanner: Scanner.Scan sometimes returns token.ILLEGAL and empty literal string

Created on 10 Oct 2018  路  9Comments  路  Source: golang/go

What version of Go are you using (go version)?

$ go version
go version devel +c870d56f98 Wed Oct 10 02:08:36 2018 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes it does, with Go 1.11.1.

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dmitri/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/dmitri/Dropbox/Work/2013/GoLanding:/Users/dmitri/Dropbox/Work/2013/GoLand"
GOPROXY=""
GORACE=""
GOROOT="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go"
GOTMPDIR=""
GOTOOLDIR="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/b8/66r1c5856mqds1mrf2tjtq8w0000gn/T/go-build878012089=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I tried scanning over the following minified Go code containing illegal tokens:

a..b

With the following program:

package main

import (
    "fmt"
    "go/scanner"
    "go/token"
)

func main() {
    // src is the input that we want to tokenize.
    src := []byte("a..b")

    // Initialize the scanner.
    var s scanner.Scanner
    fset := token.NewFileSet()                      // positions are relative to fset
    file := fset.AddFile("", fset.Base(), len(src)) // register input "file"
    s.Init(file, src, nil /* no error handler */, scanner.ScanComments)

    // Repeated calls to Scan yield the token sequence found in the input.
    for {
        pos, tok, lit := s.Scan()
        if tok == token.EOF {
            break
        }
        fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
    }
}

(Playground: https://play.golang.org/p/No77Yx0D4Ki.)

What did you expect to see?

go/scanner.Scanner.Scan documentation says:

If the returned token is token.ILLEGAL, the literal string is the offending character.

In all other cases, Scan returns an empty literal string.

Perhaps I'm misunderstanding the documentation, but I expected that if returned token is token.ILLEGAL, then the literal string should not be empty. I expected to see the literal string contain the offending character or characters.

1:1 IDENT   "a"
1:2 ILLEGAL ".."
1:4 IDENT   "b"
1:5 ;   "\n"

What did you see instead?

A token.ILLEGAL token and empty literal string:

1:1 IDENT   "a"
1:2 ILLEGAL ""
1:4 IDENT   "b"
1:5 ;   "\n"

/cc @griesemer

FrozenDueToAge NeedsInvestigation

Most helpful comment

The ELLIPSIS token is missing its last character, so the the "offending character" in this case is empty. The leading ".." sequence is ok, but the only valid character after that is a third '.'.

All 9 comments

The ELLIPSIS token is missing its last character, so the the "offending character" in this case is empty. The leading ".." sequence is ok, but the only valid character after that is a third '.'.

@cznic The offending character here is the 'b' - it if were a '.' it would be ok. This is clearly a bug in the scanner.

Heh, used a .. b in my test by mistake. Still, I'm not sure if the offending character, when ir does not start a token, should be consumed as it may start a valid token.

@cznic My apologies - I concluded a bit too quickly. It's not clear here what exactly the offending character should be. Your suggestion of an empty character might be just as valid. It's certainly problematic that the 'if' starting on scanner.go:740 has not 'else' branch and thus silently exits the big 'switch' statement. Because the result 'tok' is not set at all, it returns the initial (invalid) token.

For reference: #28128

Actually, while this is a bug in the scanner (and there are others, related to this), the spec is pretty clear:

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

So the longest valid sequence of characters here is a dot . followed by another one. That is, the scanner should not return an ILLEGAL token at all. That is what cmd/compile's scanner does.

I vaguely remember having been aware of this for a long time but never bothered to get it 100% correct in go/scanner because it doesn't matter in practice. There's no Go code containing two dots ".." that is valid, so as long as we get an error, we're mostly ok.

There are probably other such cases such as invalid octal or hex numbers, say: 07a. Is that the octal literal 07 followed by an a, or an invalid octal constant? Following the word of the spec exactly, it should be the former. On the other hand, an error complaining about an invalid octal constant may be clearer to a reader than an error due to invalid syntax because the octal 07 is incorrectly followed by the identifier a. Maybe the spec should be less demanding in those cases.

I will try to address the specific problem here if I can find an easy fix (which doesn't slow down the scanner everywhere because we need to keep more state around) but leave the more general problem open.

Change https://golang.org/cl/141337 mentions this issue: go/scanner: don't return token.INVALID for ".." sequence

There's no Go code containing two dots ".." that is valid,

In case you're curious how I discovered this, I ran into this in issue #28082. You left a comment there containing a fenced code block marked up with "Go" syntax, but it was actually a _diff_ of Go code.

I was viewing that comment with some software that performs syntax highlighting for Go using a highlighter implemented on top of go/scanner, and it was behaving incorrectly on the index 1de7cd81b2..38b50f2596 100644 line because I made the wrong assumption about literal being non-empty when token is token.ILLEGAL.

@dmitshur Interesting. The scanner should always return the offending character (lit) if there's an ILLEGAL token. I believe this should be the case now.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

stub42 picture stub42  路  3Comments

OneOfOne picture OneOfOne  路  3Comments

natefinch picture natefinch  路  3Comments

rsc picture rsc  路  3Comments

dominikh picture dominikh  路  3Comments