(I know @mikegeyser said he would look today but raising so we don't forget).
As discussed in https://github.com/HTTPArchive/almanac.httparchive.org/pull/394#discussion_r344455485, generating the chapters using npm run generate adds extra <p></p> lines to both <img> and <table> figures:
table:
</table>
</div>
</div>
<p></p>
<figcaption>Figure 4. HTTP version usage for home pages.</figcaption>
<p></p>
</figure>
img:
<figcaption>Figure 9. TCP connections per page. (Source: <a href="https://httparchive.org/reports/state-of-the-web#tcp">HTTP Archive</a>)</figcaption>
<p></p>
</figure>
The generated output fails HTML validation.
At least some of these look to be caused by the call to wrap_tables, as commenting it out makes them go away.
A simple fix is to add a regex replace in generate_chapters.js to remove these spurious tags:
body = generate_figure_ids(body);
body = wrap_tables(body);
body = body.replace(/<p><\/p>/g,"");
const toc = generate_table_of_contents(body);
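For anyone who wants to try the replace in isolation, here is a minimal standalone sketch. The helper name strip_empty_paragraphs and the sample markup are made up for illustration and are not in generate_chapters.js; the regex is the same one as above.

```javascript
// Hypothetical helper (not in the repository): removes every literal "<p></p>".
const strip_empty_paragraphs = (html) => html.replace(/<p><\/p>/g, "");

// Sample markup modelled on the table figure output shown earlier.
const sample = [
  "<figure>",
  "<table></table>",
  "<p></p>",
  "<figcaption>Figure 4. HTTP version usage for home pages.</figcaption>",
  "<p></p>",
  "<p>Real paragraph.</p>",
  "</figure>",
].join("\n");

const cleaned = strip_empty_paragraphs(sample);

// The empty paragraphs are gone; paragraphs with content are left alone.
console.log(cleaned.includes("<p></p>"));
console.log(cleaned.includes("<p>Real paragraph.</p>"));
```

Note that this only matches the exact string <p></p>. It would not catch variants such as <p> </p> or a <p> with attributes, and it leaves the now-empty lines in place, so it is a stopgap rather than a fix for whatever is emitting the tags.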
Will see if @mikegeyser has a better fix to prevent them happening in the first place before we go this route.
As @bazzadp pointed out, I think it's the wrap_tables functionality. That uses JSDOM rather than regex, which relies on serializing the jsdom document back to a string. We had some uncontrollable behaviour in the generate_figure_ids function while using that approach, and eventually abandoned it in favour of regex. I think this is probably a similar situation, which is why the problem disappears when you comment out wrap_tables.
I'll carry on looking into it, though, and see if there's an expedient fix.
Actually, I think we should put in the fix that @bazzadp recommended while I keep working on a proper solution for the next release.