go/scanner: Scanner.Scan sometimes returns token.ILLEGAL and empty literal string

Created on 10 Oct 2018 · 9Comments · Source: golang/go

What version of Go are you using (`go version`)?

$ go version
go version devel +c870d56f98 Wed Oct 10 02:08:36 2018 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes it does, with Go 1.11.1.

What operating system and processor architecture are you using (`go env`)?

go env Output



$ go env

GOARCH="amd64"

GOBIN=""

GOCACHE="/Users/dmitri/Library/Caches/go-build"

GOEXE=""

GOFLAGS=""

GOHOSTARCH="amd64"

GOHOSTOS="darwin"

GOOS="darwin"

GOPATH="/Users/dmitri/Dropbox/Work/2013/GoLanding:/Users/dmitri/Dropbox/Work/2013/GoLand"

GOPROXY=""

GORACE=""

GOROOT="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go"

GOTMPDIR=""

GOTOOLDIR="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go/pkg/tool/darwin_amd64"

GCCGO="gccgo"

CC="clang"

CXX="clang++"

CGO_ENABLED="1"

GOMOD=""

CGO_CFLAGS="-g -O2"

CGO_CPPFLAGS=""

CGO_CXXFLAGS="-g -O2"

CGO_FFLAGS="-g -O2"

CGO_LDFLAGS="-g -O2"

PKG_CONFIG="pkg-config"

GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/b8/66r1c5856mqds1mrf2tjtq8w0000gn/T/go-build878012089=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I tried scanning over the following minified Go code containing illegal tokens:

a..b

With the following program:

package main

import (
    "fmt"
    "go/scanner"
    "go/token"
)

func main() {
    // src is the input that we want to tokenize.
    src := []byte("a..b")

    // Initialize the scanner.
    var s scanner.Scanner
    fset := token.NewFileSet()                      // positions are relative to fset
    file := fset.AddFile("", fset.Base(), len(src)) // register input "file"
    s.Init(file, src, nil /* no error handler */, scanner.ScanComments)

    // Repeated calls to Scan yield the token sequence found in the input.
    for {
        pos, tok, lit := s.Scan()
        if tok == token.EOF {
            break
        }
        fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
    }
}

(Playground: https://play.golang.org/p/No77Yx0D4Ki.)

What did you expect to see?

go/scanner.Scanner.Scan documentation says:

If the returned token is token.ILLEGAL, the literal string is the offending character.

In all other cases, Scan returns an empty literal string.

Perhaps I'm misunderstanding the documentation, but I expected that if returned token is token.ILLEGAL, then the literal string should not be empty. I expected to see the literal string contain the offending character or characters.

1:1 IDENT   "a"
1:2 ILLEGAL ".."
1:4 IDENT   "b"
1:5 ;   "\n"

What did you see instead?

A token.ILLEGAL token and empty literal string:

1:1 IDENT   "a"
1:2 ILLEGAL ""
1:4 IDENT   "b"
1:5 ;   "\n"

/cc @griesemer

FrozenDueToAge NeedsInvestigation

Source

dmitshur

Most helpful comment

The ELLIPSIS token is missing its last character, so the the "offending character" in this case is empty. The leading ".." sequence is ok, but the only valid character after that is a third '.'.

cznic on 10 Oct 2018

👍2

All 9 comments

cznic on 10 Oct 2018

👍2

@cznic The offending character here is the 'b' - it if were a '.' it would be ok. This is clearly a bug in the scanner.

griesemer on 10 Oct 2018

Heh, used a .. b in my test by mistake. Still, I'm not sure if the offending character, when ir does not start a token, should be consumed as it may start a valid token.

cznic on 10 Oct 2018

@cznic My apologies - I concluded a bit too quickly. It's not clear here what exactly the offending character should be. Your suggestion of an empty character might be just as valid. It's certainly problematic that the 'if' starting on scanner.go:740 has not 'else' branch and thus silently exits the big 'switch' statement. Because the result 'tok' is not set at all, it returns the initial (invalid) token.

griesemer on 10 Oct 2018

For reference: #28128

griesemer on 10 Oct 2018

Actually, while this is a bug in the scanner (and there are others, related to this), the spec is pretty clear:

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

So the longest valid sequence of characters here is a dot . followed by another one. That is, the scanner should not return an ILLEGAL token at all. That is what cmd/compile's scanner does.

I vaguely remember having been aware of this for a long time but never bothered to get it 100% correct in go/scanner because it doesn't matter in practice. There's no Go code containing two dots ".." that is valid, so as long as we get an error, we're mostly ok.

There are probably other such cases such as invalid octal or hex numbers, say: 07a. Is that the octal literal 07 followed by an a, or an invalid octal constant? Following the word of the spec exactly, it should be the former. On the other hand, an error complaining about an invalid octal constant may be clearer to a reader than an error due to invalid syntax because the octal 07 is incorrectly followed by the identifier a. Maybe the spec should be less demanding in those cases.

I will try to address the specific problem here if I can find an easy fix (which doesn't slow down the scanner everywhere because we need to keep more state around) but leave the more general problem open.

griesemer on 10 Oct 2018

👍1

Change https://golang.org/cl/141337 mentions this issue: go/scanner: don't return token.INVALID for ".." sequence

gopherbot on 10 Oct 2018

There's no Go code containing two dots ".." that is valid,

In case you're curious how I discovered this, I ran into this in issue #28082. You left a comment there containing a fenced code block marked up with "Go" syntax, but it was actually a _diff_ of Go code.

I was viewing that comment with some software that performs syntax highlighting for Go using a highlighter implemented on top of go/scanner, and it was behaving incorrectly on the index 1de7cd81b2..38b50f2596 100644 line because I made the wrong assumption about literal being non-empty when token is token.ILLEGAL.

dmitshur on 11 Oct 2018

@dmitshur Interesting. The scanner should always return the offending character (lit) if there's an ILLEGAL token. I believe this should be the case now.

griesemer on 11 Oct 2018

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

x/tools/cmd/stringer: cannot import vendored dependencies

stub42 · 3Comments

cmd/compile: regression in 69a7c15

OneOfOne · 3Comments

hcigvjrjir

natefinch · 3Comments

cmd/compile: testing/quick misbehaves on Nexus 9 linux/arm64

rsc · 3Comments

all: T.FailNow used in goroutines in standard library tests

dominikh · 3Comments

go/scanner: Scanner.Scan sometimes returns token.ILLEGAL and empty literal string

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

Most helpful comment

All 9 comments

Related issues

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?