Pandoc: Feature request: Integrate a native PDF renderer

Created on 19 Nov 2020 · 12 Comments · Source: jgm/pandoc

This is the one big thing I'm still missing from Pandoc: an easy, cross-platform way to generate PDFs without having to rely on any external dependencies.
I understand that this will be a massive undertaking, but I think even a simple implementation that only supports printing some simple text or graphics would already be really helpful.

Rasterific (https://github.com/Twinside/Rasterific) looks like it could be a good library to achieve something like this.

ad-si

Most helpful comment

I agree it's out of scope for pandoc – better to leave this concern to a separate program – and we already support quite a few pdf-engines.

And automatic layout and typesetting is indeed a very difficult problem (kerning, widows, orphans, hyphenation using language dictionaries, etc.), which is part of the reason TeX is still in use (of all the open source engines, it still produces the best typographic output).

While pandoc happily supplies the semantic markup to those programs, people will always want to send layout instructions along as well. That's where a custom LaTeX template or CSS comes in. Personally, I feel CSS is a much nicer way to declaratively instruct a pdf engine on layout customizations – but browser vendors don't care about pages and care more about not doing too many passes (CSS flex-box takes 2 passes to lay out, CSS grid 3) than about optimal typography, and the other open source implementations are currently all still somewhat lacking.

Anyway, I guess that's not the OP's use-case either. So yes, what's wrong with pdfroff if you just want "some PDF" and don't care much how it looks?

mb21 on 20 Nov 2020

👍2

All 12 comments

Well, even with this library, it's a pretty massive undertaking you're talking about -- manually laying out text in a PDF. Not to mention math layout and the complexities that brings.

jgm on 19 Nov 2020

This is out of scope.

PDF is a different case than every other format Pandoc handles. Needing external dependencies to handle it makes perfect sense. From another perspective, all document formats that Pandoc handles require external dependencies to render.

docx requires Word or LibreOffice to lay out and typeset the document
odt requires LibreOffice or another word processor to lay out and typeset the document
HTML requires some form of browser to lay out and typeset the document
rst requires Wordpad or some other reStructuredText viewer to lay out and typeset the document
mediawiki requires the MediaWiki engine or comparable plus usually a browser to lay out and typeset the document

Even markdown requires some form of text editor that handles things like line wrap (layout and typesetting), or conversion to another format for rendering.

Why should PDF be any different? The only difference is expecting a pre-rendered output with layout and typesetting work done already. Just because the final viewing step is separated from the layout and typesetting steps doesn't mean it should get special treatment. Pandoc is not a layout engine and does not do typesetting. It is a document format conversion tool. Trying to make it do layout and typesetting would be wildly out of scope, out of character, and frankly just not that feasible.

If you want lightweight PDF renderers that do layout and typesetting, there are lots to choose from. They all have different strengths and weaknesses because this is a huge job with lots of decisions to make that are not part of the document content. Take it from someone who writes layout and typesetting tools: this is not something that should be shoehorned into Pandoc.

alerque on 19 Nov 2020

Ok, I see your point.
But what about outputting PostScript then? You would need Ghostscript or similar to render it.
So it'd be more similar to the languages you enumerated.

My usecase would be converting from Markdown to PDF. So just some headings and text blocks. Think contracts, letters, text only ebooks, …. I'd be happy with even the most basic implementation.

ad-si on 19 Nov 2020

PDF files are basically just PostScript with some fancy trappings. The same argument would apply to PostScript: in order to generate it you would have to convert raw document content (the Pandoc AST) to a rendered form that has all the physical shape (layout and typesetting) done. This requires things like canvas size, fonts, text shaping, line breaking, styling, and so on and so forth. None of these things are the purview of a document conversion tool.
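
To make the point concrete, here is a minimal sketch of the simplest of those steps, line breaking. It is a toy under a loud assumption: every character is taken to be equally wide, which no real font satisfies. The name breakLines is hypothetical and is not part of pandoc or any PDF library.

```haskell
-- Toy greedy line breaker. Assumption: all characters have the same
-- width; real engines measure glyphs, kerning, and stretchable glue.

-- | Break a paragraph into lines of at most `width` characters.
-- A word longer than `width` still gets a line of its own.
breakLines :: Int -> String -> [String]
breakLines width = map unwords . go [] 0 . words
  where
    -- `acc` holds the words of the current line in reverse order,
    -- `used` is the number of characters already on that line.
    go acc _ [] = [reverse acc | not (null acc)]
    go acc used (w : ws)
      | null acc = go [w] (length w) ws
      | used + 1 + length w <= width = go (w : acc) (used + 1 + length w) ws
      | otherwise = reverse acc : go [] 0 (w : ws)

main :: IO ()
main = mapM_ putStrLn (breakLines 10 "the quick brown fox")
```

Even this version already needs a canvas width as input, and it ignores fonts, hyphenation, and justification entirely.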

> My usecase would be converting from Markdown to PDF. So just some headings and text blocks. Think contracts, letters, text only ebooks, …. I'd be happy with even the most basic implementation.

Okay, so use a lightweight layout engine. I don't think you realize how complex the "simple" cases you are talking about can be, but there are a number of options for doing page layout and typesetting, whether from Markdown directly or from one of the many formats that Pandoc converts to.

alerque on 19 Nov 2020

Ok, I thought PS might also have some more high level constructs.

> there are a number of options for doing page layout and typesetting whether from Markdown directly or from one of many formats that Pandoc converts to.

I think I tried out most of them by now, and all of them have some considerable issues. I guess pdfroff is probably the most lightweight and robust solution at the moment.
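
For reference, a minimal sketch of driving that route from Haskell. It assumes pandoc and groff (which provides pdfroff) are installed and on the PATH; the file names letter.md and letter.pdf and the helper pdfroffArgs are placeholders for illustration.

```haskell
import System.Process (callProcess)

-- | Command-line arguments that ask pandoc to go through groff's ms
-- macros and render the PDF with pdfroff.
pdfroffArgs :: FilePath -> FilePath -> [String]
pdfroffArgs input output =
  [ input
  , "--to=ms"
  , "--pdf-engine=pdfroff"
  , "--output=" ++ output
  ]

main :: IO ()
main = callProcess "pandoc" (pdfroffArgs "letter.md" "letter.pdf")
```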

ad-si on 19 Nov 2020

I think the most promising Haskell library for this purpose is HPDF, which includes some functions that fill boxes with text.
https://hackage.haskell.org/package/HPDF-1.5.1/docs/Graphics-PDF-Documentation.html
I've thought about this before. The problem is that even something as simple as handling a two-page document, where we'll need to split the text of a paragraph into two boxes, is still pretty complex.

jgm on 20 Nov 2020

Well, maybe fillContainer from HPDF can be used to fill to the end of page and return a new container and the remaining text. I might have to try it out.

jgm on 20 Nov 2020

I fooled around a bit and got some text laid out with this:

{-# LANGUAGE OverloadedStrings #-}
module Main where
import Graphics.PDF
import qualified Data.Text as T
import Data.List (intersperse)
import Debug.Trace

main :: IO ()
main = do
  let rect = PDFRect 0 0 600 400
  Just timesRoman <- mkStdFont Times_Roman
  runPdf "test.pdf" standardDocInfo rect $ do
    theDoc timesRoman

theDoc :: AnyFont -> PDF ()
theDoc font = do
  page1 <- addPage Nothing
  drawWithPage page1 $ drawing font

drawing :: AnyFont -> Draw ()
drawing font = do
  let black = Rgb 0 0 0
  let white = Rgb 1 1 1
  let hsty = Font (PDFFont font 26) white black
  let hrect = Rectangle (100 :+ 320) (500 :+ 360)
  displayFormattedText hrect NormalParagraph hsty $ heading
  let psty = Font (PDFFont font 16) white black
  let prect = Rectangle (100 :+ 100) (500 :+ 300)
  let vboxes = getBoxes NormalParagraph psty para
  let verstate =
          VerState { baselineskip = (12, 0.17, 0.0)
                   , lineskip = (3.0, 0.33, 0.0)
                   , lineskiplimit = 2
                   , currentParagraphStyle = NormalParagraph }
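  -- Fill the first container as far as the boxes fit. fillContainer returns
  -- the drawing action, the updated container, and the boxes left over.
  let (dr, newc, vboxes') = fillContainer
         verstate
         (mkContainer 50 300 100 100 1)
         vboxes
  dr
  -- Debug output: the height actually used by the first container.
  trace (show $ containerContentHeight newc) (return ())
  -- Continue with the leftover boxes in a second container, positioned
  -- from the first container's content height plus a 20-unit gap.
  let (dr', _, _) = fillContainer
         verstate
         (mkContainer 50 (300 - containerContentHeight newc - 20) 200 100 1)
         vboxes'
  dr'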
  -- displayFormattedText prect NormalParagraph psty $ para

heading  :: TM StandardParagraphStyle StandardStyle ()
heading = do
  paragraph $ do
    startPara
    sequence $ intersperse (glue 5 2 2)
                (map txt $ take 2 $ T.words lorem)
    endPara


para  :: TM StandardParagraphStyle StandardStyle ()
para = do
  setJustification FullJustification
  setBaseLineSkip 20 1 1
  paragraph $ do
    startPara
    sequence $ intersperse (glue 4 4 4 >> txt " ") (map txt $ T.words lorem)
    endPara

lorem :: T.Text
lorem = "Nisi cömmodo arcu, vitae cursus neque ante sed elit. Sed sit amet erat. Phasellus luctus cursus risus. Phasellus ac felis. Proin nec eros quis ipsum pellentesque congue. Curabitur et diam sed odio accumsan cursus.  Pellentesque ultricies. Quisque aliquam. Sed nisi velit, consectetuer eget, dictum ac, molestie a, magna. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Curabitur consequat leo et dui.  Aenean ligula mi, dignissim ut, imperdiet tristique, interdum a, dolor."

This shows how you can fill a rectangle as much as possible, and get a list of the remaining vboxes to fill another rectangle (which is what you need to do at a page break).
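
The same fill-then-continue pattern, stripped of all PDF details, can be sketched in a few lines. This is a toy model only: Line, fillBox, and paginate are hypothetical names, not HPDF's API, and a box here simply holds a fixed number of already-broken lines.

```haskell
-- Toy model of "fill a box, hand back what did not fit".
type Line = String

-- | Put as many lines as fit into a box that holds `capacity` lines.
-- Returns the placed lines and the leftover lines.
fillBox :: Int -> [Line] -> ([Line], [Line])
fillBox capacity = splitAt (max 0 capacity)

-- | Keep filling fresh boxes of equal capacity until nothing is left.
-- A non-positive capacity yields no pages, to avoid looping forever.
paginate :: Int -> [Line] -> [[Line]]
paginate capacity ls
  | capacity <= 0 || null ls = []
  | otherwise =
      let (page, rest) = fillBox capacity ls
      in page : paginate capacity rest

main :: IO ()
main = mapM_ print (paginate 2 ["l1", "l2", "l3", "l4", "l5"])
```

A real page breaker would also have to deal with widows, orphans, and boxes of varying height, which is where the complexity mentioned earlier in the thread comes from.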

jgm on 20 Nov 2020

👍2

feels to me like re-inventing tex ... starts easy and limited and in the end we get a pandoc-latex ...

cagix on 20 Nov 2020

👍2

I think it is great that there is Haskell development in this area. It is good to have options. Thank you for pointing the Haskell library out.
I am still pretty new to TeX, and even though it is a great tool, probably still unparalleled, it has some sore spots that most probably won't be solved in the near future (a few of which I have read about: grid typesetting, dealing with whitespace "rivers," dealing with repeated words at the end or beginning of a line, or creating a different line-breaking or page-building algorithm).
Not that I understand any of this deeply.
But maybe a new tool could take a different look at these issues, someday ...
However, for the pandoc project this also looks out of scope to me. Maybe, if someday there would be a big enough "library" to just integrate in pandoc ... ?

Delanii on 20 Nov 2020

I think it could be in-scope, potentially. I can see the advantages to being able to render PDF without external tools.
Note that HPDF uses the hyphenation library, which implements the Knuth-Liang hyphenation algorithm, so its output is not too bad. It offers full control over kerning and glue, like low-level tex. If we did want to go in this direction, it would probably be worth creating a library that handles some of the lower-level details.
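
For readers unfamiliar with Knuth-Liang hyphenation, here is a toy sketch of the pattern-matching idea. It is not the hyphenation library's API: parsePattern, gapScores, and hyphenate are hypothetical names, and the eight patterns are the ones commonly cited for the word "hyphenation", whereas real pattern tables contain thousands of entries. Minimum prefix and suffix lengths, which real implementations enforce, are left out.

```haskell
import Data.Char (digitToInt, isDigit)
import Data.List (intercalate, isPrefixOf)

-- | Split a pattern such as "hy3ph" into its letters and the score in
-- each gap (including before the first and after the last letter).
parsePattern :: String -> (String, [Int])
parsePattern = go
  where
    go s = case s of
      (d : rest) | isDigit d ->
        let (ls, scores) = afterGap rest in (ls, digitToInt d : scores)
      _ -> let (ls, scores) = afterGap s in (ls, 0 : scores)
    afterGap [] = ([], [])
    afterGap (c : rest) = let (ls, scores) = go rest in (c : ls, scores)

-- | Gap scores for ".word.": the maximum over all matching patterns.
gapScores :: [String] -> String -> [Int]
gapScores patterns word = foldl apply initial matches
  where
    dotted = "." ++ word ++ "."
    initial = replicate (length dotted + 1) 0
    matches =
      [ (offset, scores)
      | (letters, scores) <- map parsePattern patterns
      , offset <- [0 .. length dotted - length letters]
      , letters `isPrefixOf` drop offset dotted
      ]
    apply acc (offset, scores) =
      let (before, rest) = splitAt offset acc
          (window, after) = splitAt (length scores) rest
      in before ++ zipWith max window scores ++ after

-- | Break the word at every gap whose final score is odd.
hyphenate :: [String] -> String -> [String]
hyphenate patterns word = go (zip word breaksBefore)
  where
    -- Score index k + 1 is the gap before letter k (0-based);
    -- never break before the first letter.
    breaksBefore = False : map odd (drop 2 (gapScores patterns word))
    go [] = []
    go ((c, _) : rest) =
      let (chunk, remaining) = break snd rest
      in (c : map fst chunk) : go remaining

examplePatterns :: [String]
examplePatterns =
  ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io"]

main :: IO ()
main = putStrLn (intercalate "-" (hyphenate examplePatterns "hyphenation"))
-- prints hy-phen-ation
```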

One worry is that the original creator of HPDF hasn't done anything on the project since 2016. Someone else has taken it over and seems to be maintaining it now, so maybe that's okay (though I note they've disabled issues and PRs on the repository, not a great sign). But one might worry about depending on it.

In my experimentation, the main stumbling block I see is with fonts. Using the built-in Times Roman, Helvetica, and Courier (which probably only support the Latin-1 glyphs) is too limiting. I tried loading a Type 1 font with the included functions, but have had no success yet. This also requires file paths to .pfb and .afm files; we'd need something higher level that gets system fonts on all the major platforms.
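
As a rough illustration of where such a higher-level layer would have to start, here is a sketch that maps the platform to conventional system-wide font directories. The directory lists are common defaults and an assumption, not an exhaustive or authoritative set, and fontDirsFor is a hypothetical helper; per-user font directories and actual font file discovery are left out.

```haskell
import System.Info (os)

-- | Conventional system-wide font directories, keyed by the value
-- that System.Info.os reports for the platform.
fontDirsFor :: String -> [FilePath]
fontDirsFor platform = case platform of
  "linux"   -> ["/usr/share/fonts", "/usr/local/share/fonts"]
  "darwin"  -> ["/System/Library/Fonts", "/Library/Fonts"]
  "mingw32" -> ["C:\\Windows\\Fonts"]
  _         -> []

main :: IO ()
main = mapM_ putStrLn (fontDirsFor os)
```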

jgm on 20 Nov 2020

👍1