Cheerio: Guarantee a Cheerio.load(dom) overload

Created on 25 Dec 2017 · 8 Comments · Source: cheeriojs/cheerio

Since there is no built-in stream-reading method in Cheerio (see the discussion), I have built my own:

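// Assumes the htmlparser2 package, which provides Parser and DomHandler.
const cheerio = require('cheerio');
const htmlparser = require('htmlparser2');

// Parses an HTML stream and resolves with a Cheerio instance for its DOM.
function fromStream(stream) {
    return new Promise((resolve, reject) => {
        // DomHandler invokes the callback once the whole input has been parsed.
        const parser = new htmlparser.Parser(new htmlparser.DomHandler((err, dom) => {
            if (err) {
                reject(err);
            } else {
                resolve(cheerio.load(dom)); // <-- Not public API!
            }
        }));

        // Forward errors from both the source stream and the parser.
        stream.on('error', reject)
            .pipe(parser)
            .on('error', reject);
    });
}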

Even though the call cheerio.load(dom) works*, it actually does not conform to Cheerio's public API, which states that load only accepts a string (cf. README, code).
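
For comparison, here is a minimal sketch of a fallback that stays within the documented string-only API: buffer the whole stream into a string first and hand that string to load. It gives up the memory benefit of streaming, so it only illustrates the trade-off. The helper name streamToString is hypothetical and not part of Cheerio.

```javascript
// Sketch only: `streamToString` is a hypothetical helper, not part of Cheerio.
function streamToString(stream) {
    return new Promise((resolve, reject) => {
        const chunks = [];
        stream.on('error', reject);
        // Normalize string chunks to Buffers so Buffer.concat always works.
        stream.on('data', chunk => chunks.push(Buffer.from(chunk)));
        // Decode once at the end so multi-byte characters that were split
        // across chunk boundaries are not corrupted.
        stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
    });
}

// Usage (assumes the cheerio package is installed):
//   const $ = require('cheerio').load(await streamToString(stream));
```

The whole document ends up in memory here, which is exactly what the fromStream approach above tries to avoid.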

Could the public API be extended to include a Cheerio.load(dom) overload, where dom is a DOM tree compatible with the output produced by htmlparser.DomHandler?

*) see https://github.com/IonicaBizau/scrape-it/issues/83#issuecomment-353850115.

ComFreek

👍5

Most helpful comment

@coryarmbrecht The streams I mentioned above and (afaik) parse5's ParserStream only address the problem that, without a streaming approach, you would need to hold all the HTML in memory. Why would you need to store all the HTML in memory if you were feeding it into the parser chunk by chunk anyway?

What you are describing is called SAX parsing in the XML world, for example. A quick search turned up sax-js, but I have no idea how up to date it is.

ComFreek on 8 Feb 2018

👍3

All 8 comments

This would need to use http://inikulin.github.io/parse5/classes/parserstream.html for HTML. Otherwise, happy to add this as an additional method (.stream or something)!

fb55 on 25 Dec 2017

Great to hear!

I think .stream should also support streaming HTML fragments. This is something parse5 seems to be missing at the moment in ParserStream (cf. parseFragment), see https://github.com/inikulin/parse5/issues/227.

PS: I've just realized that I kept referring to the "old" master branch in my previous comment. Maybe it would be a good idea to link directly from NPM to the v1.0.0 branch or to mention it in master's README.

ComFreek on 26 Dec 2017

Glad to see there's development here! I just hit this snag while changing my sync Node script to use streams. @ComFreek I have looked at your nested links, but they are beyond my knowledge.

Is fragment support a requirement for using Cheerio selectors on a stream, like $('a.new-link').each? I guess it comes down to how chunks are separated, and it makes sense that you need to wait for certain tags (large containers) to be closed.

If I wanted to start going in your direction and try to get Cheerio to work with streams (I was thinking of a through stream), where should I start? It sounds like without fragment support, I can't just do something like:

const fs = require('fs');

const links = [];
const readStream = fs.createReadStream(htmlFile);
const chunks = [];

// Listen for data
readStream.on('data', chunk => {
    //chunks.push(chunk)
    // `$` is not bound to anything here; this is the wished-for usage
    $('a.new-link').each(function(i, elem) {
        links[i] = elem;
    });
});

coryarmbrecht on 7 Feb 2018

👍2

@ComFreek, ok I think I figured out my disconnect. I was thinking that if a single chunk has an opening element tag <span> but doesn't have the closing tag </span>, then Cheerio (or another DOM selector lib) can't read it properly, and you're going to need to keep that element in memory until it closes with </span>. Once it finally closes, then you can use a selector func. I was thinking I would need the entire containing element to finish streaming in order to retrieve the children. Silly me.

But I guess all you really need is the opening tag, and the closing tag is just a sign of where to stop. I was thinking of the chunks as needing to be complete objects in order to parse correctly, rather than realizing that I just need the beginning tag.
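
To illustrate the point about opening tags, here is a toy sketch of an event-style scanner that reports link targets as soon as each opening <a> tag is complete. It buffers only a tag that was cut off by a chunk boundary. It is deliberately not a real HTML parser, and the names createLinkScanner and onLink are hypothetical. It also assumes string chunks (for example after stream.setEncoding('utf8')) and double-quoted href values. A real solution would use a proper SAX-style parser such as sax-js or parse5's SAXParser.

```javascript
// Toy sketch, NOT a real HTML parser: it ignores comments, <script> bodies
// and '>' characters inside attribute values.
function createLinkScanner(onLink) {
    let pending = ''; // text of a tag that was cut off by a chunk boundary

    return {
        write(chunk) {
            const text = pending + chunk;
            pending = '';
            let pos = 0;
            while (true) {
                const open = text.indexOf('<', pos);
                if (open === -1) break;      // no more tags in this chunk
                const close = text.indexOf('>', open);
                if (close === -1) {          // tag continues in the next chunk
                    pending = text.slice(open);
                    break;
                }
                const tag = text.slice(open, close + 1);
                // Only the opening tag is needed to report the link.
                const match = /^<a\s(?:[^>]*\s)?href="([^"]*)"/i.exec(tag);
                if (match) onLink(match[1]);
                pos = close + 1;
            }
        }
    };
}

const links = [];
const scanner = createLinkScanner(href => links.push(href));
scanner.write('<div><a class="new-link" hr');           // tag split mid-attribute
scanner.write('ef="/one">first</a><a href="/two">sec');
scanner.write('ond</a></div>');
// links is now ['/one', '/two']
```

Memory use is bounded by the longest single tag rather than by the document size, which is the property the SAX approach is after.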

coryarmbrecht on 9 Feb 2018

There is also the parse5.SAXParser option. Should we try to create a streaming solution based on this? Anyone up for it?