Almanac.httparchive.org: Generate an ebook

Created on 6 Jun 2019 · 30 Comments · Source: HTTPArchive/almanac.httparchive.org

@HTTPArchive/developers curious to hear thoughts from others about this, might be crazy.

I'd like to see the _entire contents of the Almanac_ on a single web page, formatted similar to a book. It would also have a print stylesheet to handle things like page breaks and page numbers, so one could print to PDF and it'd just work™ as a fully formed e-book. It'd also be a PWA that could be added to home screen and read offline.

There are concerns like lazy loading, history state management, deep linking, etc but I think these are all solvable problems.

I'm excited about this idea because a report on the state of the web should ideally maximize the web's capabilities for a great UX.

WDYT?

Requirements (edit by @mikegeyser):

Structure:

  • [x] Table of contents
  • [x] Page numbers
  • [x] Header/footer in the margins
  • [x] Cover page
  • [x] Methodology section
  • [x] Contributors section

Rendering:

  • [x] Set page metadata correctly (title etc.)
  • [x] Mirror page margins.
  • [x] Solve the problem of urls when printed.
  • [x] Table wrappers aren't rendering properly
  • [x] Problem with 'that z-index figure' in the CSS chapter
  • [x] Weird black chart in the markup chapter
  • [x] Prevent breaking a figure over multiple pages (separating the caption from the image)
  • [x] Figure out what to do with tables that span multiple pages (particularly font chapter)
  • [x] Figure out what to do with charts that are too big for a page
  • [x] Handle internal links (in-chapter references that are repeated, e.g. #fig1, #conclusion).
  • [x] Handle internal links (cross-chapter references).
  • [x] Handle internal links (Author links top).
  • [x] Author Avatars missing in PDF.
  • [x] Internationalise book name (e.g. ebook-en.pdf)
  • [x] Internationalise CSS content (e.g. title, pg...etc.) - maybe with inline <style> tags in ebook template?
  • [x] ToC page numbers are misaligned in the Japanese PDF.

Tooling:

  • [ ] Try and come up with a solution that doesn't need the html to be served from flask
  • [ ] Integrate weasyprint into the generate script
  • [ ] Make the rendering config more dynamic
  • [ ] Currently have to call weasyprint per year and language - script this.
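
The last tooling item (calling weasyprint once per year and language) could be scripted roughly as follows. This is only a sketch: the URL template, the output directory, and the lists of years and languages are assumptions for illustration, not values confirmed in this thread. The only external piece it relies on is the `weasyprint <input> <output>` command line.

```python
import subprocess
from pathlib import Path

# Hypothetical values: adjust to the years/languages that actually exist.
YEARS = ["2019"]
LANGUAGES = ["en", "ja"]

# Assumed URL of the locally served ebook page (not confirmed by the thread).
URL_TEMPLATE = "http://127.0.0.1:8080/{lang}/{year}/ebook"


def build_jobs(years, languages, out_dir="static/pdfs"):
    """Return one (url, output_path) pair per year and language."""
    jobs = []
    for year in years:
        for lang in languages:
            url = URL_TEMPLATE.format(lang=lang, year=year)
            # File name follows the "ebook-en.pdf" pattern from the checklist.
            output = Path(out_dir) / year / f"ebook-{lang}.pdf"
            jobs.append((url, output))
    return jobs


def render_all(jobs, dry_run=True):
    """Build one weasyprint command per job; run them unless dry_run is set."""
    commands = []
    for url, output in jobs:
        command = ["weasyprint", url, str(output)]
        commands.append(command)
        if not dry_run:
            output.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(command, check=True)
    return commands
```

Calling `render_all(build_jobs(YEARS, LANGUAGES), dry_run=False)` would then produce every PDF in one go, provided the pages are being served at the assumed URLs.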

Please feel free to add any more, and we can see if they're ~possible~ feasible. :)

development question

All 30 comments

Do you see it as being horizontally navigable, or vertical? Swapping documents instead of scrolling would help with lazy-loading, state-management, unique routes etc. Scrolling could be emulated using animations.

E-book + lazy-loading is going to take some skill.

Would it be similar to https://tympanus.net/Development/FlipboardPageLayout/ (src at: https://tympanus.net/codrops/2012/05/07/experimental-page-layout-inspired-by-flipboard/)? It's an old example I had bookmarked way back when researching something, but the concepts should still apply.

A newer example leveraging the CSS Grid can be found at https://tympanus.net/Development/PageFlipLayout/ (src at: https://tympanus.net/codrops/2018/11/12/page-flip-layout/)

Lazy loading can be done using the IntersectionObserver when a page or a piece of content comes into view, or on any particular user interaction we choose. We can also fetch a couple of pages eagerly (and decide whether all of their content is needed or only some).

We should verify that all is OK from the SEO side of things: https://developers.google.com/search/docs/guides/lazy-loading

Presenting the Quick page view can be done similar to the https://tympanus.net/Development/GridLayoutScrollableContent/ (src at: https://tympanus.net/codrops/2018/09/19/grid-layout-scrollable-content-view/).

Kinda like this approach as a mix with the above.

This was inspired by working backwards from the goal of being able to Ctrl+P some page and be able to get a print version of the entire Almanac (save as PDF or whatever). It's not a mission critical use case but something I thought would be a nice touch.

Thinking more about this, we should focus on the straightforward task of serving each chapter as a standalone page. Then as a bonus if there's time (or even after launch) we could think about combining all of the contents into one document.

My main hesitation with making this the primary experience is the technical complexity in order to provide a great UX. And I'm not sure the amount of work is worth it compared to building a more traditional document structure.

I'll close this issue for now and we can revisit it later in the project depending on resources.

So, considering that the content is all static (more or less) and we already generate html for the jinja templates, we can generate a single 'print friendly' view that can be a single index.html page with minimal styling. While not trivial, it would be pretty simple to do.
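
To make the single-page idea concrete, here is a minimal sketch of stitching already-rendered chapters into one document with a linked table of contents. It uses plain string formatting as a stand-in for the project's Jinja templates, and the function name, the `slug`/`title`/`body_html` keys, and the `toc`/`chapter` class names are all hypothetical.

```python
from html import escape


def build_ebook_html(title, chapters):
    """Combine rendered chapters into one HTML page.

    chapters: list of dicts with 'slug', 'title' and 'body_html' keys
    (hypothetical structure; body_html is assumed to be trusted markup).
    """
    toc_items = []
    sections = []
    for chapter in chapters:
        anchor = f"chapter-{chapter['slug']}"
        heading = escape(chapter["title"])
        # One ToC entry per chapter, linking to the section below.
        toc_items.append(f'<li><a href="#{anchor}">{heading}</a></li>')
        sections.append(
            f'<section id="{anchor}" class="chapter">\n'
            f"<h1>{heading}</h1>\n"
            f"{chapter['body_html']}\n"
            "</section>"
        )
    toc = "\n".join(toc_items)
    body = "\n".join(sections)
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head>\n<title>{escape(title)}</title>\n</head>\n<body>\n"
        f'<nav class="toc">\n<ol>\n{toc}\n</ol>\n</nav>\n'
        f"{body}\n"
        "</body>\n</html>\n"
    )
```

Prefixing each anchor with the chapter slug is one way to keep in-page links unique once every chapter lives in the same document.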

Making the almanac an offline-eable PWA will just mean having a service worker with some aggressive content caching. We can even have a nice UX on the page, prompting users to 'download' the content or 'make the content available offline' and have that trigger the fetch and cache.

I don't see it being easy to combine the two, however, into a single experience. What do you think?

75

The MVP for this issue can be to build a single page containing all of the chapters without PWA functionality. In theory, saving this page to a PDF would effectively be downloading the entire ebook. Per #520 it should default to static images for the figures.

I'd also love to explore the possibilities with:

  • a master table of contents at the top of the page

    • containing the part numbers, chapter titles, and section headings

    • with page numbers that correspond to the actual page they'd appear on when printed (dynamically generated?)

  • header/footer in the margins of each printed page, similar to the margins of a book that have the title, part/chapter, and page number
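
As a rough illustration of the two ideas above, the helper below builds a print stylesheet with CSS paged-media rules: margin boxes for the running header and page number, mirrored left/right pages, and `target-counter()` for table-of-contents page numbers. It assumes the PDF renderer implements these rules (browsers generally do not). The label dictionary, the `.toc` selector, and the translated strings are placeholders, not the project's actual values.

```python
from string import Template

# Placeholder labels per language; real translations would come from the project.
LABELS = {
    "en": {"title": "The Web Almanac", "page": "Page"},
    "ja": {"title": "Web Almanac", "page": "ページ"},
}

_PRINT_CSS = Template("""\
@page :left {
  margin: 2cm 3cm 2cm 2cm;
  @top-left { content: $title; }
  @bottom-left { content: $page_label " " counter(page); }
}
@page :right {
  margin: 2cm 2cm 2cm 3cm;
  @top-right { content: $title; }
  @bottom-right { content: $page_label " " counter(page); }
}
/* Page number of each link target, shown after the ToC entry. */
.toc a::after { content: " " target-counter(attr(href), page); }
/* Keep a figure and its caption on the same page. */
figure { break-inside: avoid; }
""")


def css_string(text):
    """Quote text as a CSS string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_print_css(lang):
    """Return the print stylesheet for a language, falling back to English."""
    labels = LABELS.get(lang, LABELS["en"])
    return _PRINT_CSS.substitute(
        title=css_string(labels["title"]),
        page_label=css_string(labels["page"]),
    )
```

The output could be dropped into an inline `<style>` tag in the ebook template, which keeps the translated strings out of the shared CSS file.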

I'd love to work on that next, if that's alright? It sounds fun. :)

Just an update: I'm still planning on working on this, will have a bit more breathing room over the next week. Thanks for understanding! :)

No probs. Btw might want to reuse some of https://github.com/HTTPArchive/almanac.httparchive.org/pull/566

@mikegeyser have you made any progress on this?

Hey!

I've written and deleted a bunch of code, so I might be spinning my wheels a little bit. Getting all of the content generated in a single file is the easy part, but it's huge (like 1mb raw markup) and doesn't look great. It increasingly feels like book and SPA may be mutually exclusive?

Do you have an example of what you were hoping for? Perhaps something like this? I've been trying to emulate the guidance from this article (by Rachel Andrew) but I'm not sure if I'm off course.

(Also, welcome back! I hope you had a great break.)

Hey Mike!

Yeah don't worry about SPA functionality. Everything from the TOC, all 20 chapters, and the methodology should all be rendered to the same (extremely long) page so there's no need for any extra client/server-side rendering of secondary pages. 1 MB sounds like a lot so maybe we can optimize the loading of figures. In any case, do you have a branch I could play with to see how it looks?

The link to Addy's ebook is similar to what I had in mind, but when I look at the print preview it doesn't seem to apply any print-specific formatting. The documentation in Rachel's article is way more complicated than I thought but much closer to the kind of formatting I had in mind, with page numbers and chapter titles in the margins, control of page breaks, etc.

It seems like you're on the right path and I'm excited to see the Almanac in one document.

The biggest problem I have (for page numbers and the TOC) is that the CSS Generated Content for Paged Media Module (css-gcpm) isn't supported in browsers yet. We could look at generating straight to PDF using Prince (as mentioned in Rachel's article) but I haven't committed us to anything yet.

If I'm on the right track, let me tidy some stuff up and I'll push a branch for you to look at.

I've pushed what I have so far to a branch called ebook. It's still very rough, but please feel free to let me know what you think.

Okokok this is cool. :D

I've had some success with weasyprint, and while huge (17mb!) you can see the results here.

What do you think?

This is VERY cool!!! Table of Contents is even hyperlinked to pages!

Presume this is server-side generated, and would be part of npm run generate?

Some nits I've spotted (and know this is just an early version and you've probably got some of these on your to do list, but thought I'd list them anyway cause I'm annoying that way):

  • Title is set to "None | 2019 | The Web Almanac". This is what shows in tab in Chrome.
  • Front page could do some work
  • Doesn't have Methodology (before chapters?) nor Contributors (at end after HTTP/2 chapter?)
  • If this is printed then links won't show URLs. Not sure if this is a concern, or if possible to add them in parenthesis after each link or as footnotes to each page (would make this even bigger!)
  • Again if printing then might want headers and footers alternatingly mirrored (so page numbers appear on left hand side on even pages and right hand side on odd pages).
  • Again if printing then might want to start page 1 after cover sheet, or even have page i from ToC and then page 1 as first chapter. On flip side if more for viewing online as a PDF then people might prefer the page numbers to map to the actual page numbers.

But don't take that as criticism of this - it's AWESOME!!

Yes, you're right about it being a part of the generation process. Those points are all valid, and will work through them. :D

We could also generate an appendix for all of the urls, with a page reference?

> We could also generate an appendix for all of the urls, with a page reference?

Possibly, would there be a link after each one? So something like this:

This meant that even those without the skills and resources to concentrate on web performance [243] would suddenly have performant websites...

With 243 being the appendix reference to the URL? Might be easier (for you and the reader) to just show the URL after the link to be honest...
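
For comparison, the appendix variant could look something like this sketch. It appends a bracketed reference number after each external link and collects the URLs for a list at the end. It uses a regular expression rather than a real HTML parser, only matches double-quoted absolute `http(s)` hrefs (so in-page and cross-chapter links are left alone), and all function and class names are made up for illustration.

```python
import re
from html import escape

# Matches <a ... href="http(s)://...">...</a>; relative and #fragment links are skipped.
_LINK = re.compile(r'(<a\s[^>]*href="(https?://[^"]+)"[^>]*>.*?</a>)', re.DOTALL)


def number_external_links(html):
    """Append [n] after each external link; return (new_html, urls)."""
    urls = []

    def add_reference(match):
        url = match.group(2)
        if url not in urls:
            urls.append(url)
        # Repeated URLs reuse the number they were first given.
        return f"{match.group(1)} [{urls.index(url) + 1}]"

    return _LINK.sub(add_reference, html), urls


def render_appendix(urls):
    """Render the collected URLs as an ordered list for the end of the book."""
    items = "\n".join(f"<li>{escape(url)}</li>" for url in urls)
    return f'<ol class="link-appendix">\n{items}\n</ol>'
```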

And on that note, by happy coincidence just got this into my inbox: https://www.sitepoint.com/css-printer-friendly-pages/ and it suggests the following to add links:

/* print.css */
a::after {
  content: " (" attr(href) ")";
}

Though should limit this to just the chapter text (add an article element selector?) and probably not the author, reviewer, translators so may want exceptions. Also need to consider whether to show for cross reference chapter links.

Also this might be a bit much and annoying when viewing as a PDF. Maybe we need a "PDF" and a "Print Friendly PDF" version? I presume there's no such thing as "only display this text in print mode" for PDFs?

This looks awesome @mikegeyser! 467 pages!!!

More to iterate on but this is a great start.

I've updated the original issue to consolidate a list of what we feel needs to be done for us to consider this complete.

Would it be alright for me to open up a long-running PR for this, so that people can see the extent of the changes and chime in?

Go for it! Just set the PR to "draft".

Added some requirements.

I think we need to decide on the purpose and scope of this. Is the intention to have a viewable PDF? Or a printable PDF? Or Both?

As I alluded to earlier, I think those have slightly different requirements (for example whether to display links, whether to alternate headings from left to right, whether to use real page numbers or start at page 1 for the first chapter, etc.) and the requirements contradict each other. So we might drop some of those requirements completely if only targeting a PDF to be read on a computer. Or we might want two versions.

To be honest at 467 pages I don't see people printing this themselves. Individual chapters yes - but not the whole thing. If we did ever want to publish this as a "real book" then that might come into play but then would have to deal with the requirements of the publisher then so would raise a separate issue for that if that ever comes to play.

So personally I would limit the scope/intention of this to have an online PDF for now and drop some of those print requirements. Though possibly we should raise a separate issue to re-look at the basic print.css I added in #566 to perhaps add URLs when printing chapters individually?

Of course those people who did decide to print it off completely would still get a pretty professional looking result, just not quite as optimised to include URLs, left/right alternating headings and footers...etc. that we might do if targeting that medium primarily.

Thoughts?

@mikegeyser why did you choose weasyprint over puppeteer and did you look at puppeteer as an option?

We're trying to do something similar to this at work and it looks like Chrome doesn't support repeating headers on pages, so I will have a look at weasyprint, but I'm wondering if that was the reason for your choice as per your comment in https://github.com/HTTPArchive/almanac.httparchive.org/issues/37#issuecomment-571955659 ?

Nice that there are work benefits to side projects as well, as I immediately thought of this! 😊

That's it @bazzadp. Chrome (and thus puppeteer) doesn't support css-gcpm, so the choices were weasyprint (free) or princexml (super expensive).

This is too cool to sit around unused. I'm looking into merging master with the ebook branch to get all the latest updates, but unexpectedly running into a few merge conflicts and other procedural incompatibilities. I'll try to work through those and will update this issue with any progress on that or the feature wishlist. @bazzadp @mikegeyser LMK if you want to help!

> This is too cool to sit around unused.

I agree! Was going to get to this eventually but didn't want to step on @mikegeyser's toes and found other things to amuse myself with.

> I'm looking into merging master with the ebook branch to get all the latest updates, but unexpectedly running into a few merge conflicts and other procedural incompatibilities. I'll try to work through those and will update this issue with any progress on that or the feature wishlist. @bazzadp @mikegeyser LMK if you want to help!

I found it easier to go the other way, as @mikegeyser only had a few commits. So forked off a new ebook2 branch from master and merged those into that.

I also moved this to the base templates, internationalised it (mostly; headers and footers are still to do), and added the Japanese version. The French and Spanish versions won't generate as they are not complete. I've not looked into why yet, but it's probably not worth it until they are complete anyway.

Still a few things that need to be done to tidy up the PDFs but it's close.

@mikegeyser let us know if you plan on coming back to finish this, otherwise will work on it myself at some point.
