Go: x/net/html: 'noscript' inner HTML parsing issue

Created on 11 Jul 2016 · 23 Comments · Source: golang/go

What version of Go are you using (go version)?
```
go1.6 windows/amd64
```

What operating system and processor architecture are you using (go env)?

set GOARCH=amd64                                  
set GOBIN=                                        
set GOEXE=.exe                                    
set GOHOSTARCH=amd64                              
set GOHOSTOS=windows                              
set GOOS=windows                                  
set GOPATH=F:\go\path64                           
set GORACE=                                       
set GOROOT=F:\go\root16                           
set GOTOOLDIR=F:\go\root16\pkg\tool\windows_amd64 
set GO15VENDOREXPERIMENT=1                        
set CC=gcc                                        
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0  
set CXX=g++                                       
set CGO_ENABLED=1

What did you do?

package main

import (
   "bytes"
   "log"

   "golang.org/x/net/html"
)

func main() {
   data := "<noscript><img src='https://golang.org/doc/gopher/frontpage.png' /></noscript><p><img src='https://golang.org/doc/gopher/doc.png' /></p>"
   if document, err := html.Parse(bytes.NewReader([]byte(data))); err == nil {
       var parser func(*html.Node)

       parser = func(n *html.Node) {
           if n.Data == "img" {
               log.Println(n.Attr[0].Val)
           }
           if n.Data == "noscript" {
               // here is the issue - noscript tag inner html is represented as TextNode and can't be used as ElementNode
               log.Println("noscript", n.FirstChild.Type == html.TextNode, n.FirstChild.Data)
           }
           for c := n.FirstChild; c != nil; c = c.NextSibling {
               parser(c)
           }
       }

       parser(document)
   } else {
       log.Panicln("Parse html error", err)
   }
}

What did you expect to see?

2016/07/11 13:47:33 noscript false img
2016/07/11 13:47:33 https://golang.org/doc/gopher/frontpage.png
2016/07/11 13:47:33 https://golang.org/doc/gopher/doc.png

What did you see instead?

2016/07/11 13:39:57 noscript true <img src='https://golang.org/doc/gopher/frontpage.png' />
2016/07/11 13:39:57 https://golang.org/doc/gopher/doc.png

FrozenDueToAge NeedsDecision

bearburger

👍11

Most helpful comment

Here is the fix, but I need someone who completely understands how the parser and tokenizer work, as I'm not sure whether my changes have side effects. I'm also not sure whether it behaves correctly in edge-case tests like <noscript><iframe></noscript>. All tests are updated to the new version, so it shouldn't be hard for one of the core package maintainers to see what has changed and check that it works.

bearburger on 11 Jul 2016

👍3

All 23 comments

Same here

Mac OS 10.11.5

What version of Go are you using (go version)?

go version go1.6.2 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/Georg/Develop/Go"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.6.2/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.6.2/libexec/pkg/tool/darwin_amd64"
GO15VENDOREXPERIMENT="1"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"

Detailed explanation with code and source html

roma86 on 11 Jul 2016

bearburger on 11 Jul 2016

👍3

According to W3C, <noscript> may contain other elements, so this is clearly incorrect behavior. Can we please apply @bearburger's commit and get this fixed?

nathan-osman on 5 Dec 2016

👍1

@bearburger the test changes for that seem wrong: I'm pretty sure that

<!doctype html><noscript><iframe></noscript>X

should be parsed as

| <!DOCTYPE html>
| <html>
|   <body>
|     <noscript>
|       <iframe>
|     "X"

riking on 3 May 2017

@riking you are right - iframe can't include other tags.

bearburger on 3 May 2017

I meant that the </noscript> should force closing of all unclosed tags until the noscript open tag. There's a mechanism for that, right? Like how closing a </table> would close still-open <td>s.

<!doctype html><noscript><div></noscript>X
should give the same parse structure.

riking on 3 May 2017

Is there a workaround I can use until this is resolved? Should I recursively tokenize z.Text() for a <noscript>? And if so, would NewTokenizerFragment("noscript") just do the same thing as this bug?

andlabs on 2 Sep 2017

reg := regexp.MustCompile("</?noscript>")
safeHTMLBytes := reg.ReplaceAll(data, []byte{})

hryamzik on 2 Sep 2017

👎1

That is such a bad solution that I am tempted to file an issue on the project linked about it. It will not capture every case (uppercase, attributes, whitespace, etc.) and it will capture false positives. A solution must exist that does not attempt to transform input outside the scope of x/net/html.
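
The failure modes listed above can be reproduced with the standard library alone. The following is only a sketch: `stripNoscriptTags` is a hypothetical helper name wrapping the pattern from the previous comment, and the input strings are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTag is the pattern from the workaround quoted above.
var noscriptTag = regexp.MustCompile("</?noscript>")

// stripNoscriptTags is a hypothetical helper that deletes every match of the pattern.
func stripNoscriptTags(s string) string {
	return noscriptTag.ReplaceAllString(s, "")
}

func main() {
	inputs := []string{
		"<noscript><img src='a.png'></noscript>",            // lowercase, no attributes: both tags removed
		"<NOSCRIPT><img src='a.png'></NOSCRIPT>",            // uppercase: not matched
		"<noscript class='x'><img src='a.png'></noscript >", // attribute and whitespace: not matched
		"<!-- <noscript> --><p>text</p>",                    // inside a comment: matched anyway
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, stripNoscriptTags(in))
	}
}
```

Only the first input is rewritten as intended. The second and third pass through unchanged, and the fourth loses text that was never a real tag.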

andlabs on 2 Sep 2017

I have a package affected by this issue as well. The simplest solution is to just feed the content back into the parser:

https://github.com/nathan-osman/go-sechat/blob/563300db9224ed01797dfe1f89f489e8eb9dd5ba/auth.go#L65

Assuming this bug is eventually fixed, I'll need to come up with a way to determine whether the recursive parsing is necessary at runtime.

nathan-osman on 2 Sep 2017

Is this accepted as a valid bug? I just found that noscript children are not parsed and I've found this issue.

winteraz on 3 Nov 2017

Discovered this yesterday working with goquery. While I could parse the noscript tag contents (get document.Find("noscript").Text(), pass it to strings.NewReader(), then goquery.NewDocumentFromReader()), this is clumsy.

opennota on 1 Dec 2017

I stumbled upon this as well, everything under