I found that RSS feeds that Hugo generates can contain characters that are invalid for XML.
The XML 1.0 spec defines valid characters: https://www.w3.org/TR/2006/REC-xml-20060816/#charsets
One I encountered in the wild in a blog using Hugo is U+000B (\v, vertical tab). (It was this blog, if you're interested: https://blog.hypriot.com/)
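For reference, the Char production in the linked spec can be written as a small predicate. This is only a sketch of the spec's ranges; the name isValidXMLChar is made up for illustration and is not part of Hugo or the Go standard library.

```go
package main

import "fmt"

// isValidXMLChar (hypothetical helper) reports whether r matches the
// XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
func isValidXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

func main() {
	// Tab is allowed, vertical tab is not, ordinary letters are.
	for _, r := range []rune{'\t', '\v', 'a'} {
		fmt.Printf("%U valid: %t\n", r, isValidXMLChar(r))
	}
}
```

Run as-is, this should report U+0009 and U+0061 as valid and U+000B as invalid.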
Trying to parse such XML raises an error with Go's decoder (which is how I noticed this in the first place):
XML syntax error on line 10: illegal character code U+000B
My environment:
Here are two small sample Go programs to help demonstrate the problem:
Create a post with an invalid character:
package main

import "fmt"

func main() {
	post := `+++
date = "2017-04-02T16:11:58+05:30"
draft = false
title = "New post"
+++
Hi there
`
	post += "\u000bsudo apt-get update\u000b"
	fmt.Println(post)
}
Use like this: $ ./create-problem-post > ~/t/bookshelf/content/post/newpost.md
Then re-generate the site: $ hugo
Then try to decode the RSS feed with this program:
package main

import (
	"encoding/xml"
	"io/ioutil"
	"log"
	"os"
)

func main() {
	buf, err := ioutil.ReadAll(os.Stdin)
	if err != nil {
		log.Fatalf("Reading from stdin: %s", err)
	}

	type TestStruct struct {
		Blah string
	}

	t := TestStruct{}
	if err := xml.Unmarshal(buf, &t); err != nil {
		log.Fatalf("Unmarshal XML: %s", err)
	}
}
Like so:
$ cat ~/t/bookshelf/public/index.xml | ./read-problem-post
2017/04/02 21:27:43 Unmarshal XML: XML syntax error on line 22: illegal character code U+000B
Thank you!
This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.
The problem still exists. I just tested with the latest master branch:
$ hugo version
Hugo Static Site Generator v0.32-DEV linux/amd64 BuildDate: 2017-12-10T12:38:16-08:00
I had to change the sample program I provided that creates the problem post slightly to account for front matter changes:
$ cat create-problem-post/main.go
package main

import "fmt"

func main() {
	post := `---
title: "New post"
date: "2017-04-02T16:11:58+05:30"
draft: false
---
Hi there
`
	post += "\u000bsudo apt-get update\u000b"
	fmt.Println(post)
}
It is also generating &ldquo and &rdquo which no browser or validator accepts as valid.
The problem still exists with master as of this moment:
will@snorri:~/t/hugosite$ hugo version
Hugo Static Site Generator v0.42-DEV linux/amd64
will@snorri:~/t/hugosite$ ~/go/src/github.com/horgh/hugo-rss-test/read-problem-post/read-problem-post < public/index.xml
2018/06/09 09:01:41 Unmarshal XML: XML syntax error on line 20: illegal character code U+000B
This is still a problem with current master:
will@snorri:~/t/hugosite$ hugo version
Hugo Static Site Generator v0.50-DEV linux/amd64 BuildDate: unknown
will@snorri:~/t/hugosite$ rm public/index.xml
will@snorri:~/t/hugosite$ hugo
[snip]
will@snorri:~/t/hugosite$ ~/go/src/github.com/horgh/hugo-rss-test/read-problem-post/read-problem-post < public/index.xml
2018/10/07 10:28:55 Unmarshal XML: XML syntax error on line 20: illegal character code U+000B
Notes from a short investigation:
I attempted to use <![CDATA[ ... ]]> around the <description> contents, but that didn't fix the issue. The illegal character within the CDATA block still violates the XML spec. (Additionally, we use html/template for the RSS feed, so the <![CDATA[ gets escaped if we try to use that in the RSS template, anyway.)
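A quick way to check the CDATA point with Go's decoder is sketched below. The decodeDescription helper and the minimal one-element document are made up for illustration; this is not how Hugo renders the feed.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// decodeDescription (hypothetical helper) tries to decode a minimal
// document and returns whatever error the decoder reports.
func decodeDescription(doc string) error {
	var v struct {
		Text string `xml:",chardata"`
	}
	return xml.Unmarshal([]byte(doc), &v)
}

func main() {
	// The vertical tab sits inside a CDATA section.
	bad := "<description><![CDATA[a\u000bb]]></description>"
	fmt.Println("with U+000B:", decodeDescription(bad))

	// The same document without the vertical tab, for comparison.
	good := "<description><![CDATA[ab]]></description>"
	fmt.Println("without:", decodeDescription(good))
}
```

As far as I can tell, the decoder checks every rune of character data, CDATA or not, so the first call should return an error and the second should not.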
I then added a transform.XMLEscape template function that essentially calls xml.EscapeText. That doesn't work by itself because xml.EscapeText will bail out when it finds an illegal character.
So, it looks like we'd need to add a sanitizeXML function to strip illegal characters prior to escaping.
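A minimal sketch of that idea, assuming the function simply drops every rune outside the XML 1.0 Char ranges before handing the text to xml.EscapeText. The sanitizeXML name comes from the comment above; escapeXML and the exact behavior are assumptions, not the code that ended up in Hugo.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"strings"
)

// sanitizeXML drops runes that are not allowed by the XML 1.0 Char
// production. Sketch only; not Hugo's implementation.
func sanitizeXML(s string) string {
	return strings.Map(func(r rune) rune {
		if r == 0x9 || r == 0xA || r == 0xD ||
			(r >= 0x20 && r <= 0xD7FF) ||
			(r >= 0xE000 && r <= 0xFFFD) ||
			(r >= 0x10000 && r <= 0x10FFFF) {
			return r
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

// escapeXML (hypothetical helper) sanitizes first, then escapes markup.
func escapeXML(s string) (string, error) {
	var buf bytes.Buffer
	if err := xml.EscapeText(&buf, []byte(sanitizeXML(s))); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := escapeXML("\u000bsudo apt-get update\u000b <now>")
	fmt.Println(out, err)
}
```

Stripping before escaping means the vertical tabs disappear entirely instead of being replaced, and the remaining markup characters are still escaped.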
Thanks for looking at this! I was actually going to take a stab at fixing it too.
Your XMLEscape idea seems great! Regarding the error from xml.EscapeText: I tested with U+000B and it output the Unicode replacement character for it rather than erroring. What character did you test with that caused an error?
A sanitizeXML function seems okay too if there are indeed errors.
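The xml.EscapeText behavior described here can be checked with a few lines. The escape helper is made up for illustration; the substitution of U+FFFD for characters outside the XML range is my understanding of the standard library and is worth confirming against the Go version in use.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// escape (hypothetical helper) runs xml.EscapeText over s and returns
// the escaped text together with any error.
func escape(s string) (string, error) {
	var buf bytes.Buffer
	err := xml.EscapeText(&buf, []byte(s))
	return buf.String(), err
}

func main() {
	out, err := escape("a\u000bb")
	// %+q prints non-ASCII runes as \u escapes, which makes the
	// replacement character easy to spot.
	fmt.Printf("%+q %v\n", out, err)
}
```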
I looked at the xml.EscapeText result rather quickly at the end, so you may be right. I'll take another look.
I have a branch that uses the EscapeText template method: https://github.com/gohugoio/hugo/compare/master...horgh:horgh/rss-invalid-chars?expand=1
I had some trouble with the tests. For some reason, in the tests the vertical tab disappears altogether. If I build and run a test against a hugo directory it works fine though. Any ideas? Or maybe you're making a branch anyway!
Edit: And I don't understand that Travis failure!