What version of Go are you using (go version)?
go1.6 windows/amd64
What operating system and processor architecture are you using (go env)?
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=F:\go\path64
set GORACE=
set GOROOT=F:\go\root16
set GOTOOLDIR=F:\go\root16\pkg\tool\windows_amd64
set GO15VENDOREXPERIMENT=1
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1
What did you do?
package main
import (
"golang.org/x/net/html"
"log"
"bytes"
)
func main() {
data := "<noscript><img src='https://golang.org/doc/gopher/frontpage.png' /></noscript><p><img src='https://golang.org/doc/gopher/doc.png' /></p>"
if document, err := html.Parse(bytes.NewReader([]byte(data))); err == nil {
var parser func(*html.Node)
parser = func(n *html.Node) {
if n.Data == "img" {
log.Println(n.Attr[0].Val)
}
if n.Data == "noscript" {
// here is the issue - noscript tag inner html is represented as TextNode and can't be used as ElementNode
log.Println("noscript", n.FirstChild.Type == html.TextNode, n.FirstChild.Data)
}
for c := n.FirstChild; c != nil; c = c.NextSibling {
parser(c)
}
}
parser(document)
} else {
log.Panicln("Parse html error", err)
}
}
What did you expect to see?
2016/07/11 13:47:33 noscript false img
2016/07/11 13:47:33 https://golang.org/doc/gopher/frontpage.png
2016/07/11 13:47:33 https://golang.org/doc/gopher/doc.png
What did you see instead?
2016/07/11 13:39:57 noscript true <img src='https://golang.org/doc/gopher/frontpage.png' />
2016/07/11 13:39:57 https://golang.org/doc/gopher/doc.png
Same here
Mac OS 10.11.5
go version go1.6.2 darwin/amd64
OARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/Georg/Develop/Go"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.6.2/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.6.2/libexec/pkg/tool/darwin_amd64"
GO15VENDOREXPERIMENT="1"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"
Here is the fix but I need someone who completely understand how parser and tokenizer works as I'm not sure if there is no side effects of my changes. Also I'm not sure if it works as it should in edge case tests like <noscript><iframe></noscript>. All tests are updated to new version so I think for one of the core package maintainers shouldn't be hard to understand what has been changed and check if it works OK.
According to W3C, <noscript> may contain other elements, so this is clearly incorrect behavior. Can we please apply @bearburger's commit and get this fixed?
@bearburger the test changes for that seem wrong: I'm pretty sure that
<!doctype html><noscript><iframe></noscript>X
should be parsed as
| <!DOCTYPE html>
| <html>
| <body>
| <noscript>
| <iframe>
| "X"
@riking you are right - iframe can't include other tags.
I meant that the </noscript> should force closing of all unclosed tags until the noscript open tag. There's a mechanism for that, right? Like how closing a </table> would close still-open <td>s.
<!doctype html><noscript><div></noscript>X
should give the same parse structure.
Is there a workaround I can use until this is resolved? Should I recursively tokenize z.Text() for a <noscript>? And if so, would NewTokenizerFragment("noscript") just do the same thing as this bug?
reg, err := regexp.Compile("</?noscript>")
safeHTMLBytes := reg.ReplaceAll(data, []byte{})
That is such a bad solution that I am tempted to file an issue on the project linked about it. It will not capture every case (uppercase, attributes, whitespace, etc.) and it will capture false positives. A solution must exist that does not attempt to transform input outside the scope of x/net/html.
I have a package affected by this issue as well. The simplest solution is to just feed the content back into the parser:
https://github.com/nathan-osman/go-sechat/blob/563300db9224ed01797dfe1f89f489e8eb9dd5ba/auth.go#L65
Assuming this bug is eventually fixed, I'll need to come up with a way to determine whether the recursive parsing is necessary at runtime.
Is this accepted as a valid bug? I just found that noscript children are not parsed and I've found this issue.
Discovered this yesterday working with goquery. While I could parse the noscript tag contents (get document.Find("noscript").Text(), pass it to strings.NewReader(), then goquery.NewDocumentFromReader()), this is clumsy.
I stumbled upon this as well, everything under
It's odd that 12.2.6.4.5 isn't implemented in parse.go
https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inheadnoscript
Ah, upon further inspection, parser assumes scripting is always enabled. There is no support for scripting = false (that I can see). Therefore, this behavior (to treat noscript contents as raw text) is correct.
https://html.spec.whatwg.org/multipage/scripting.html#the-noscript-element
The ask, then, is that parse.go supports scripting disabled mode.
Even if the parser does assume scripting is enabled the behaviour is still wrong. The spec says "it represents nothing if scripting is enabled", and it's being represented as a TextNode by the parser.
The documentation for the package also doesn't state anywhere that it acts as if scripting is enabled (which is the opposite of what most people would for assume for a Go HTML parser given the lack of javascript engines written in Go.)
Change https://golang.org/cl/172557 mentions this issue: html: add "in head noscript" im support
This behavior can be considered correct given that scripting is always true.
On the other hand, we have now incorporated the implementation of in head noscript im into the x/net/html package.
There is no way to use this im at this time, but I think it would be nice to provide some way to configure the scripting property as a public API.
I ask @nigeltao for making decision on this.
Isn't scripting being true itself a deviation from the spec?
Scripting enabled/disabled should be configurable in order to support emulating different browser contexts.
https://html.spec.whatwg.org/multipage/webappapis.html#enabling-and-disabling-scripting
Yes, I'm sorry if my previous comment made you confusing, but I was thinking about how to make the scripting property configurable.
Yeah, we should have some way to configure scripting. I'd also like to keep API compatibility, and room for future configuration options. Perhaps some new functions like
func ParseWithOptions(r io.Reader, opts ...ParseOptions) (*Node, error) { etc }
func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOptions) (*Node, error) { etc }
type ParseOption func(*parser)
func ParseOptionEnableScripting(enable bool) ParseOption {
return func(p *parser) {
p.scripting = enable
}
}
Call sites would look like:
doc, err := html.ParseWithOptions(r, html.ParseOptionEnableScripting(false))
The "options are functions" pattern here is the same as e.g. https://github.com/golang/text/blob/master/internal/ucd/ucd.go#L134
Change https://golang.org/cl/174157 mentions this issue: html: implement ParseWithOptions and ParseFragmentWithOptions
@nigeltao
Thank you for the fix for this issue and the work done.
I just ran into this issue and investigated it and if I see it right, the fix is not complete yet. It just fixes parsing noscript elements in the head, but not outside of head. Outside of head the content of the noscript element is still treated as text.
According to the specs the content of noscript elements should be parsed normally outside of head.
Outside of head elements, if scripting is disabled for the noscript element
The noscript element's content model is transparent, with the additional restriction that a noscript element must not have a noscript element as an ancestor (that is, noscript can't be nested).
I may create a pull request to fix this, but have to work through the Contribution Guide first next week.
Most helpful comment
Here is the fix but I need someone who completely understand how parser and tokenizer works as I'm not sure if there is no side effects of my changes. Also I'm not sure if it works as it should in edge case tests like
<noscript><iframe></noscript>. All tests are updated to new version so I think for one of the core package maintainers shouldn't be hard to understand what has been changed and check if it works OK.