Hugo: RSS XML can contain invalid characters

Created on 3 Apr 2017 · 11 Comments · Source: gohugoio/hugo

I found that RSS feeds that Hugo generates can contain characters that are invalid for XML.

The XML 1.0 spec defines valid characters: https://www.w3.org/TR/2006/REC-xml-20060816/#charsets
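
For reference, the Char production in that section of the spec can be written as a small Go predicate. This is only a sketch; the name isValidXMLChar is made up for illustration and is not part of Hugo or the Go standard library.

```go
package main

import "fmt"

// isValidXMLChar (hypothetical helper) reports whether r is allowed by the
// XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
func isValidXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

func main() {
	// Tab and newline are allowed; vertical tab (U+000B) is not.
	for _, r := range []rune{'\t', '\n', '\v', 'a'} {
		fmt.Printf("%U valid=%t\n", r, isValidXMLChar(r))
	}
}
```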

One I encountered in the wild, on a blog built with Hugo, is U+000B (\v, vertical tab). (It was this blog, if you're interested: https://blog.hypriot.com/)

Trying to parse such XML raises an error with Go's decoder (which is how I noticed this in the first place):

XML syntax error on line 10: illegal character code U+000B

My environment:

  • Hugo Static Site Generator v0.20-DEV linux/amd64 BuildDate: 2017-04-02T17:42:16-07:00
  • Debian Linux (testing) 64bit

Here are two small sample Go programs to help demonstrate the problem:

Create a post with an invalid character:

package main

import "fmt"

func main() {
    post := `+++
date = "2017-04-02T16:11:58+05:30"
draft = false
title = "New post"

+++

Hi there
`

    // Append a line wrapped in vertical tabs (U+000B), which XML 1.0 forbids.
    post += "\u000bsudo apt-get update\u000b"

    fmt.Println(post)
}

Use like this: $ ./create-problem-post > ~/t/bookshelf/content/post/newpost.md

Then re-generate the site: $ hugo

Then try to decode the RSS feed with this program:

package main

import (
    "encoding/xml"
    "io/ioutil"
    "log"
    "os"
)

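func main() {
    // Read the whole RSS feed from stdin.
    buf, err := ioutil.ReadAll(os.Stdin)
    if err != nil {
        log.Fatalf("Reading from stdin: %s", err)
    }

    // The target struct is a placeholder; we only care whether decoding fails.
    type TestStruct struct {
        Blah string
    }

    t := TestStruct{}

    // Unmarshal returns a syntax error if the feed contains an illegal character.
    if err := xml.Unmarshal(buf, &t); err != nil {
        log.Fatalf("Unmarshal XML: %s", err)
    }
}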

Like so:

$ cat ~/t/bookshelf/public/index.xml | ./read-problem-post 
2017/04/02 21:27:43 Unmarshal XML: XML syntax error on line 22: illegal character code U+000B

Thank you!

Bug Keep

All 11 comments

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

The problem still exists. I just tested with the latest master branch:

$ hugo version
Hugo Static Site Generator v0.32-DEV linux/amd64 BuildDate: 2017-12-10T12:38:16-08:00

I had to change the sample program I provided that creates the problem post slightly to account for front matter changes:

$ cat create-problem-post/main.go 
package main

import "fmt"

func main() {
    post := `---
title: "New post"
date: "2017-04-02T16:11:58+05:30"
draft: false
---

Hi there
`

    // Append a line wrapped in vertical tabs (U+000B), which XML 1.0 forbids.
    post += "\u000bsudo apt-get update\u000b"

    fmt.Println(post)
}

It is also generating &ldquo and &rdquo which no browser or validator accepts as valid.

The problem still exists with master as of this moment:

will@snorri:~/t/hugosite$ hugo version
Hugo Static Site Generator v0.42-DEV linux/amd64
will@snorri:~/t/hugosite$ ~/go/src/github.com/horgh/hugo-rss-test/read-problem-post/read-problem-post < public/index.xml
2018/06/09 09:01:41 Unmarshal XML: XML syntax error on line 20: illegal character code U+000B

This is still a problem with current master:

will@snorri:~/t/hugosite$ hugo version
Hugo Static Site Generator v0.50-DEV linux/amd64 BuildDate: unknown
will@snorri:~/t/hugosite$ rm public/index.xml
will@snorri:~/t/hugosite$ hugo
[snip]
will@snorri:~/t/hugosite$ ~/go/src/github.com/horgh/hugo-rss-test/read-problem-post/read-problem-post < public/index.xml
2018/10/07 10:28:55 Unmarshal XML: XML syntax error on line 20: illegal character code U+000B

Notes from a short investigation:

I attempted to use <![CDATA[ ... ]]> around the <description> contents, but that didn't fix the issue: an illegal character inside a CDATA section still violates the XML spec. (Additionally, we use html/template for the RSS feed, so the <![CDATA[ gets escaped if we try to use it in the RSS template anyway.)
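
The CDATA point can be checked against Go's decoder with a small sketch like the one below. The helper name decodeDescription and the one-element test documents are made up for illustration; they are not taken from the Hugo RSS template.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// decodeDescription (hypothetical helper) parses a single-element document
// and returns whatever error encoding/xml reports.
func decodeDescription(doc string) error {
	var v struct {
		Text string `xml:",chardata"`
	}
	return xml.Unmarshal([]byte(doc), &v)
}

func main() {
	// The same vertical tab, once as plain character data and once in CDATA.
	plain := "<description>a\vb</description>"
	cdata := "<description><![CDATA[a\vb]]></description>"

	fmt.Println("plain:", decodeDescription(plain))
	fmt.Println("cdata:", decodeDescription(cdata))
}
```

Both calls should report an error, since wrapping the text in CDATA changes how markup is interpreted but not which characters are legal.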

I then added a transform.XMLEscape template function that essentially calls xml.EscapeText. That doesn't work by itself because xml.EscapeText will bail out when it finds an illegal character.

So, it looks like we'd need to add a sanitizeXML function to strip illegal characters prior to escaping.
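
A minimal sketch of what such a function might look like, assuming it simply drops every rune outside the XML 1.0 Char ranges before calling xml.EscapeText. The names sanitizeXML, isValidXMLChar and escapeForXML are illustrative only; this is not the code that was proposed for Hugo.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"strings"
)

// isValidXMLChar reports whether r is allowed by the XML 1.0 Char production.
func isValidXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// sanitizeXML drops every rune that XML 1.0 does not allow.
// strings.Map removes a rune when the mapping function returns a negative value.
func sanitizeXML(s string) string {
	return strings.Map(func(r rune) rune {
		if isValidXMLChar(r) {
			return r
		}
		return -1
	}, s)
}

// escapeForXML strips illegal characters first, then escapes markup characters.
func escapeForXML(s string) (string, error) {
	var buf bytes.Buffer
	if err := xml.EscapeText(&buf, []byte(sanitizeXML(s))); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := escapeForXML("\vsudo apt-get update\v <b>")
	fmt.Printf("%q %v\n", out, err)
}
```

Stripping is one choice; replacing the character with a space or U+FFFD would keep the text length visible to readers, at the cost of showing a stray glyph.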

Thanks for looking at this! I was actually going to take a stab at fixing it too.

Your XMLEscape idea seems great! Regarding the error from xml.EscapeText: I tested with U+000B and it output the Unicode replacement character for it rather than erroring. What character did you test with that caused an error?

A sanitizeXML function seems okay too if there are indeed errors.
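
The replacement-character behaviour described in this comment can be reproduced with a short program along these lines. The wrapper name escapeText is made up here; xml.EscapeText is the standard library function.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// escapeText (hypothetical wrapper) runs xml.EscapeText over s and returns
// the escaped text together with any error.
func escapeText(s string) (string, error) {
	var buf bytes.Buffer
	err := xml.EscapeText(&buf, []byte(s))
	return buf.String(), err
}

func main() {
	// The same input as the problem post: text wrapped in vertical tabs.
	out, err := escapeText("\vsudo apt-get update\v")
	// %+q prints non-ASCII runes as \u escapes so U+FFFD is easy to spot.
	fmt.Printf("%+q err=%v\n", out, err)
}
```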

I looked at the xml.EscapeText result rather quickly at the end, so you may be right. I'll take another look.

I have a branch that uses the EscapeText template method: https://github.com/gohugoio/hugo/compare/master...horgh:horgh/rss-invalid-chars?expand=1

I had some trouble with the tests. For some reason, in the tests the vertical tab disappears altogether. If I build Hugo and run it against a Hugo site directory, it works fine though. Any ideas? Or maybe you're making a branch anyway!

Edit: And I don't understand that Travis failure!
