Go: x/net/html: 'noscript' inner HTML parsing issue

Created on 11 Jul 2016  路  23Comments  路  Source: golang/go

  1. What version of Go are you using (go version)?

    go1.6 windows/amd64
    
  2. What operating system and processor architecture are you using (go env)?

    set GOARCH=amd64                                  
    set GOBIN=                                        
    set GOEXE=.exe                                    
    set GOHOSTARCH=amd64                              
    set GOHOSTOS=windows                              
    set GOOS=windows                                  
    set GOPATH=F:\go\path64                           
    set GORACE=                                       
    set GOROOT=F:\go\root16                           
    set GOTOOLDIR=F:\go\root16\pkg\tool\windows_amd64 
    set GO15VENDOREXPERIMENT=1                        
    set CC=gcc                                        
    set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0  
    set CXX=g++                                       
    set CGO_ENABLED=1                                 
    
  3. What did you do?

    package main
    
    import (
       "golang.org/x/net/html"
       "log"
       "bytes"
    )
    
    
    func main() {
       data := "<noscript><img src='https://golang.org/doc/gopher/frontpage.png' /></noscript><p><img src='https://golang.org/doc/gopher/doc.png' /></p>"
       if document, err := html.Parse(bytes.NewReader([]byte(data))); err == nil {
           var parser func(*html.Node)
    
           parser = func(n *html.Node) {
               if n.Data == "img" {
                   log.Println(n.Attr[0].Val)
               }
               if n.Data == "noscript" {
                   // here is the issue - noscript tag inner html is represented as TextNode and can't be used as ElementNode
                   log.Println("noscript", n.FirstChild.Type == html.TextNode, n.FirstChild.Data)
               }
               for c := n.FirstChild; c != nil; c = c.NextSibling {
                   parser(c)
               }
           }
    
           parser(document)
       } else {
           log.Panicln("Parse html error", err)
       }
    }
    
  4. What did you expect to see?

    2016/07/11 13:47:33 noscript false img
    2016/07/11 13:47:33 https://golang.org/doc/gopher/frontpage.png
    2016/07/11 13:47:33 https://golang.org/doc/gopher/doc.png
    
  5. What did you see instead?

    2016/07/11 13:39:57 noscript true <img src='https://golang.org/doc/gopher/frontpage.png' />
    2016/07/11 13:39:57 https://golang.org/doc/gopher/doc.png
    
FrozenDueToAge NeedsDecision

Most helpful comment

Here is the fix but I need someone who completely understand how parser and tokenizer works as I'm not sure if there is no side effects of my changes. Also I'm not sure if it works as it should in edge case tests like <noscript><iframe></noscript>. All tests are updated to new version so I think for one of the core package maintainers shouldn't be hard to understand what has been changed and check if it works OK.

All 23 comments

Same here

Mac OS 10.11.5

  1. What version of Go are you using (go version)?
go version go1.6.2 darwin/amd64
  1. What operating system and processor architecture are you using (go env)?
OARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/Georg/Develop/Go"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.6.2/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.6.2/libexec/pkg/tool/darwin_amd64"
GO15VENDOREXPERIMENT="1"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"

Detailed explanation with code and source html

Here is the fix but I need someone who completely understand how parser and tokenizer works as I'm not sure if there is no side effects of my changes. Also I'm not sure if it works as it should in edge case tests like <noscript><iframe></noscript>. All tests are updated to new version so I think for one of the core package maintainers shouldn't be hard to understand what has been changed and check if it works OK.

According to W3C, <noscript> may contain other elements, so this is clearly incorrect behavior. Can we please apply @bearburger's commit and get this fixed?

@bearburger the test changes for that seem wrong: I'm pretty sure that

<!doctype html><noscript><iframe></noscript>X

should be parsed as

| <!DOCTYPE html>
| <html>
|   <body>
|     <noscript>
|       <iframe>
|     "X"

@riking you are right - iframe can't include other tags.

I meant that the </noscript> should force closing of all unclosed tags until the noscript open tag. There's a mechanism for that, right? Like how closing a </table> would close still-open <td>s.

<!doctype html><noscript><div></noscript>X
should give the same parse structure.

Is there a workaround I can use until this is resolved? Should I recursively tokenize z.Text() for a <noscript>? And if so, would NewTokenizerFragment("noscript") just do the same thing as this bug?

reg, err := regexp.Compile("</?noscript>")
safeHTMLBytes := reg.ReplaceAll(data, []byte{})

That is such a bad solution that I am tempted to file an issue on the project linked about it. It will not capture every case (uppercase, attributes, whitespace, etc.) and it will capture false positives. A solution must exist that does not attempt to transform input outside the scope of x/net/html.

I have a package affected by this issue as well. The simplest solution is to just feed the content back into the parser:

https://github.com/nathan-osman/go-sechat/blob/563300db9224ed01797dfe1f89f489e8eb9dd5ba/auth.go#L65

Assuming this bug is eventually fixed, I'll need to come up with a way to determine whether the recursive parsing is necessary at runtime.

Is this accepted as a valid bug? I just found that noscript children are not parsed and I've found this issue.

Discovered this yesterday working with goquery. While I could parse the noscript tag contents (get document.Find("noscript").Text(), pass it to strings.NewReader(), then goquery.NewDocumentFromReader()), this is clumsy.

I stumbled upon this as well, everything under

It's odd that 12.2.6.4.5 isn't implemented in parse.go
https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inheadnoscript

Ah, upon further inspection, parser assumes scripting is always enabled. There is no support for scripting = false (that I can see). Therefore, this behavior (to treat noscript contents as raw text) is correct.

https://html.spec.whatwg.org/multipage/scripting.html#the-noscript-element

The ask, then, is that parse.go supports scripting disabled mode.

Even if the parser does assume scripting is enabled the behaviour is still wrong. The spec says "it represents nothing if scripting is enabled", and it's being represented as a TextNode by the parser.

The documentation for the package also doesn't state anywhere that it acts as if scripting is enabled (which is the opposite of what most people would for assume for a Go HTML parser given the lack of javascript engines written in Go.)

Change https://golang.org/cl/172557 mentions this issue: html: add "in head noscript" im support

This behavior can be considered correct given that scripting is always true.
On the other hand, we have now incorporated the implementation of in head noscript im into the x/net/html package.
There is no way to use this im at this time, but I think it would be nice to provide some way to configure the scripting property as a public API.
I ask @nigeltao for making decision on this.

Isn't scripting being true itself a deviation from the spec?

Scripting enabled/disabled should be configurable in order to support emulating different browser contexts.

https://html.spec.whatwg.org/multipage/webappapis.html#enabling-and-disabling-scripting

Yes, I'm sorry if my previous comment made you confusing, but I was thinking about how to make the scripting property configurable.

Yeah, we should have some way to configure scripting. I'd also like to keep API compatibility, and room for future configuration options. Perhaps some new functions like

func ParseWithOptions(r io.Reader, opts ...ParseOptions) (*Node, error) { etc }
func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOptions) (*Node, error) { etc }

type ParseOption func(*parser)

func ParseOptionEnableScripting(enable bool) ParseOption {
  return func(p *parser) {
    p.scripting = enable
  }
}

Call sites would look like:

doc, err := html.ParseWithOptions(r, html.ParseOptionEnableScripting(false))

The "options are functions" pattern here is the same as e.g. https://github.com/golang/text/blob/master/internal/ucd/ucd.go#L134

Change https://golang.org/cl/174157 mentions this issue: html: implement ParseWithOptions and ParseFragmentWithOptions

@nigeltao
Thank you for the fix for this issue and the work done.

I just ran into this issue and investigated it and if I see it right, the fix is not complete yet. It just fixes parsing noscript elements in the head, but not outside of head. Outside of head the content of the noscript element is still treated as text.

According to the specs the content of noscript elements should be parsed normally outside of head.

Outside of head elements, if scripting is disabled for the noscript element
The noscript element's content model is transparent, with the additional restriction that a noscript element must not have a noscript element as an ancestor (that is, noscript can't be nested).

I may create a pull request to fix this, but have to work through the Contribution Guide first next week.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

natefinch picture natefinch  路  3Comments

enoodle picture enoodle  路  3Comments

Miserlou picture Miserlou  路  3Comments

gopherbot picture gopherbot  路  3Comments

dominikh picture dominikh  路  3Comments