Given a couple of bugs reported by @tr3ee from malformed/incomplete tags, whose reproducers are quite simple and have caused runtime panics or infinite hangs, perhaps fuzzing could help us discover such cases and whatever else lurks beyond them.
/cc @namusyaka @dgryski @dvyukov @bradfitz @nigeltao
This bug has been fixed.
See: https://go-review.googlesource.com/136875
The current implementation seems to be incomplete; it will be fixed by conforming to the latest spec.
@yorelog that CL fixed #27702 in particular, but I agree with @odeke-em that, in general, it could be useful to fuzz x/net/html, over and above reacting to specific bugs.
That's easy for me to say, though. I don't have time to work on this myself.
What would you suggest as a fuzzing strategy? I could run domato against this lib and report crashes/hangs if this makes any sense.
The idea is to use go-fuzz.
I would happily use go-fuzz but I'm not sure fuzzing an html parser with just random data would cover all interesting paths. It's hard to produce stuff like <math><template><mo><template> (one of the bugs listed above was triggered by that) with a random sequence generator.
Maybe we could use both?
It's not completely random. You can specify initial corpus data, and go-fuzz will take it from there.
It's not just random data, see:
https://go-talks.appspot.com/github.com/dvyukov/go-fuzz/slides/go-fuzz.slide
https://go-talks.appspot.com/github.com/dvyukov/go-fuzz/slides/fuzzing.slide
Also you can pre-bootstrap corpus with some meaningful inputs.
Actually I think I already did this in 2015:
https://github.com/dvyukov/go-fuzz-corpus/blob/master/html/html.go
But the corpus is not checked in.
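For anyone who wants to rebuild a seed corpus, here is a minimal sketch of one way to do it. It assumes the fuzzer treats every file in a corpus directory as one input. The names `SeedName` and `WriteSeeds` and the seed strings are made up for illustration; they are not taken from the go-fuzz-corpus repository.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// SeedName derives a stable file name from the seed's content, so the same
// seed is never stored twice. (Hypothetical helper, for illustration only.)
func SeedName(seed string) string {
	sum := sha1.Sum([]byte(seed))
	return hex.EncodeToString(sum[:])
}

// WriteSeeds stores each distinct seed as its own file under dir and returns
// how many distinct seeds it wrote.
func WriteSeeds(dir string, seeds []string) (int, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return 0, err
	}
	seen := make(map[string]bool)
	for _, s := range seeds {
		name := SeedName(s)
		if seen[name] {
			continue // duplicate seed, already on disk
		}
		if err := os.WriteFile(filepath.Join(dir, name), []byte(s), 0o644); err != nil {
			return len(seen), err
		}
		seen[name] = true
	}
	return len(seen), nil
}

func main() {
	// Assumed seeds: nested-tag fragments in the spirit of the reported bugs.
	seeds := []string{
		"<math><template><mo><template>",
		"<table><tr><td><select>",
		"<svg><template></template></svg>",
	}
	// A temporary directory keeps the example side-effect free; in practice
	// this would be the fuzzer's corpus directory.
	dir, err := os.MkdirTemp("", "corpus")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	n, err := WriteSeeds(dir, seeds)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("seeds written:", n)
}
```

The existing x/net/html test cases would probably make better seeds than the hand-written fragments above.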
Update:
I've been running go-fuzz and haven't found anything so far (except for the already reported bugs), but I will leave it running for a while.
Ran domato against the patched html library and found 3 crashes with a sample size of 10K files. Is anyone interested in looking into the causes of the crashes? (The files are big and messy to inspect, so it will probably take me some time to go through them.)
"I'm not sure fuzzing an html parser with just random data would cover all interesting paths"
There are at least a couple of approaches to addressing this.
One is to use a "fuzzing dictionary" and/or "seed corpus", described in https://github.com/google/oss-fuzz/blob/master/docs/ideal_integration.md
Two is to accept arbitrary random bytes as input, and map each byte to a string, a string more likely to tickle interesting code paths in the HTML parser. For example: https://play.golang.org/p/3QE4960bHsa
Doing the reverse map from the existing HTML test cases to this "compressed" format is left as an exercise for the reader.
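A rough sketch of the byte-to-string mapping could look like the following. This is not the code behind the playground link. The `tokens` dictionary and the `Expand` function are assumptions made up for illustration, and a real dictionary would be much larger.

```go
package main

import (
	"fmt"
	"strings"
)

// tokens is a hypothetical dictionary of HTML fragments; the entries are
// illustrative only.
var tokens = []string{
	"<math>", "<template>", "<mo>", "<table>", "<tr>", "<td>",
	"<select>", "<svg>", "</template>", "</table>", "</math>",
	"<a href=\"", "\">", "<!--", "-->", "x",
}

// Expand maps every input byte to one dictionary entry, so any byte sequence
// the fuzzer produces decodes to a string built from HTML-looking fragments.
func Expand(data []byte) string {
	var b strings.Builder
	for _, c := range data {
		b.WriteString(tokens[int(c)%len(tokens)])
	}
	return b.String()
}

func main() {
	// Four "compressed" bytes expand to the nested-tag input mentioned earlier
	// in the thread.
	fmt.Println(Expand([]byte{0, 1, 2, 1})) // prints <math><template><mo><template>
}
```

The fuzz function would then hand `Expand(data)` to the parser instead of the raw bytes.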
Once you have a dense mapping like this, where each raw input byte is relatively independent, it might be relatively straightforward to minimize the repro case, if go-fuzz doesn't already help you do so: cut out random sub-slices of the "compressed" input, backing off if it no longer crashes.
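The cut-and-back-off loop could be sketched roughly as below. `Minimize` and `hasTrigger` are hypothetical names, and `hasTrigger` only stands in for "expand the bytes, feed them to the parser, and see whether it panics or hangs"; a real predicate would need a recover and a timeout around the parser call.

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
)

// hasTrigger is a stand-in crash predicate (an assumption for this sketch):
// it reports true when the input still contains a specific byte sequence.
func hasTrigger(b []byte) bool {
	return bytes.Contains(b, []byte{0, 1, 2, 1})
}

// Minimize repeatedly removes a random sub-slice from the input and keeps the
// shorter version only if crashes still reports true for it.
func Minimize(input []byte, crashes func([]byte) bool, rounds int, seed int64) []byte {
	rng := rand.New(rand.NewSource(seed))
	cur := append([]byte(nil), input...) // work on a copy
	for i := 0; i < rounds && len(cur) > 0; i++ {
		start := rng.Intn(len(cur))
		n := 1 + rng.Intn(len(cur)-start) // cut between 1 and len(cur)-start bytes
		candidate := append(append([]byte(nil), cur[:start]...), cur[start+n:]...)
		if crashes(candidate) {
			cur = candidate // still reproduces: keep the smaller input
		}
		// Otherwise back off: drop the candidate and try a different cut.
	}
	return cur
}

func main() {
	input := []byte{9, 9, 0, 1, 2, 1, 7, 7, 7, 3}
	out := Minimize(input, hasTrigger, 500, 1)
	fmt.Println(len(input), "->", len(out), "bytes, still crashes:", hasTrigger(out))
}
```

Because the cuts are random, this only approximates a minimal input; more rounds give it more chances to shrink further.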
Nice idea, though that is probably going to take a while longer. I'll add info when I have news. Thanks for this.
Just to give a quick update: I gave this a shot a couple of months ago and didn't find any relevant crashes in a couple of weeks of fuzzing.
My plan now is to wait and see how support for oss-fuzz and first-class citizenship for fuzzing discussions will unfold. If fuzzing becomes part of the testing flow in Go I'll provide the needed FuzzXyz functions and write the necessary configurations to have it run on some beefy hardware and cover it properly.
Otherwise I'll set up some machines to fuzz it in other ways.