Go: proposal: spec: disallow Hangul filler codepoints in Go identifiers

Created on 12 Aug 2020  Â·  6Comments  Â·  Source: golang/go

The Hangul filler codepoints (U+115F, U+1160, U+3164) are rendered as zero-width white space as specified by the Unicode standard. And they are allowed in Go identifiers.
https://twitter.com/omengue/status/1293476660268998656
They are only useful for obfuscation. Despites that would be a breaking change I propose to forbid them in Go 1 spec as I only see nasty uses.

What version of Go are you using (go version)?

$ go version
1.4.6

Does this issue reproduce with the latest release?

yes

What did you do?

var á…Ÿ, á… , ã…¤ = 0, 1, 2

Go Playground: https://play.golang.org/p/bOmUvRfC3Co

What did you expect to see?

Compile time failure.

What did you see instead?

Compiles fine.

Proposal

Most helpful comment

I suggest closing this as a duplicate of #20115.

All 6 comments

Apart from code golf, can these been used maliciously?

@davecheney I see much potential for confusion in code review by hijacking names of basic types:

Go playground: https://play.golang.org/p/EYIrCh9XtI_u

func main() {
        var v interface{}
    fmt.Println(v)
    v = interfaceá…Ÿ{}
    fmt.Println(v)
}

type interfaceá…Ÿ struct{}

Output:

<nil>
{}

I'd hope that widespread use of this style of identifier would be self correcting via natural selection.

Said another way, rather than changing the language, can this problem be solved by telling people that this isn't a good way to write code?

Relevant issue about homographs attacks with go/vet: https://github.com/golang/go/issues/20115 Seems this would be a similar use case.

Referencing rsc in https://github.com/golang/go/issues/20209#issuecomment-335273273

I think we agree that the first step is to add a vet check and only after building experience with it think about actual language restrictions (or not). Marking this proposal-hold until there is a proposal (a short list would be fine) of what a vet "unicode" check would check.

The Hangul filler codepoints (U+115F, U+1160, U+3164) are rendered as zero-width white space as specified by the Unicode standard.

Looks like you missed U+FFA0.

But where in the Unicode standard does it say these are zero-width spaces?
They are classified as Lo (Letter, other).

I do see that they (including U+FFA0) have the Default_Ignorable_Code_Point property, and they are the only Letter-classified code points that do. So we should probably include them in the vet checks being contemplated in #20115.
Those vet checks are on hold for higher priority work.

Also, at least in the font I'm using, there's a clear indication that dodgy Unicode is happening:

Screen Shot 2020-08-12 at 10 35 52 AM

I suggest closing this as a duplicate of #20115.

Was this page helpful?
0 / 5 - 0 ratings