WeasyPrint: Generating PDF/A-conforming PDFs

Created on 15 May 2018  ·  14 Comments  ·  Source: Kozea/WeasyPrint

Is it possible to generate PDFs that conform to PDF/A using WeasyPrint?
From Wikipedia:

Other key elements to PDF/A compatibility include:

  • Audio and video content are forbidden.
  • JavaScript and executable file launches are forbidden.
  • All fonts must be embedded and also must be legally embeddable for
    unlimited, universal rendering. This also applies to the so-called
    PostScript standard fonts such as Times or Helvetica.
  • Colorspaces specified in a device-independent manner.
  • Encryption is disallowed.
  • Use of standards-based metadata is mandated.
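
As a rough illustration of the list above, here is a minimal sketch that scans the raw bytes of a PDF for markers of a few forbidden features (encryption, JavaScript, launch actions, audio/video). The function name and the marker table are hypothetical, chosen for this example. This is only a heuristic: it cannot see names inside compressed object streams, it may flag the same byte sequences inside ordinary content, and it says nothing about fonts, color spaces, or metadata, so it is not a substitute for a real PDF/A validator.

```python
import re

# Byte patterns for PDF names commonly associated with features that PDF/A
# forbids (see the list above). Assumption: the names appear uncompressed in
# the file; anything inside a compressed object stream will be missed.
FORBIDDEN_MARKERS = {
    "encryption": re.compile(rb"/Encrypt\b"),
    "javascript": re.compile(rb"/(?:JavaScript|JS)\b"),
    "launch action": re.compile(rb"/Launch\b"),
    "audio or video": re.compile(rb"/(?:Sound|Movie|RichMedia)\b"),
}


def find_forbidden_features(pdf_bytes: bytes) -> list[str]:
    """Return the labels of forbidden features whose markers occur in the bytes."""
    return [
        label
        for label, pattern in FORBIDDEN_MARKERS.items()
        if pattern.search(pdf_bytes)
    ]
```

For a file on disk, something like `find_forbidden_features(open("out.pdf", "rb").read())` would return a list of labels; an empty list only means that none of these markers were found in the uncompressed bytes.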

Many thanks

feature

All 14 comments

I opened a ticket on PDF/X-3 compliance: https://github.com/Kozea/WeasyPrint/issues/640

Perhaps to start the discussion on what direction WeasyPrint should take, it may be worthwhile to collect the purpose of the different standards:

PDF/A -> a standard used predominantly for document archiving
PDF/X -> a standard used predominantly for professional print (e.g. offset print)

For detailed differences between the two standards, see page 17 of this document: https://www.impressed.de/DOWNLOADS/pdfToolbox_Server/callas_pdfEngine_Reference.pdf

I attach great importance to PDF/X, as I believe achieving full print-compliance is an absolute necessity for a mature PDF creation/conversion tool.

I've tried to give Acrobat various PDF files generated by WeasyPrint… It's awful, there are many, many, many things to fix before reaching PDF/A or PDF/X conformance.

> I attach great importance to PDF/X, as I believe achieving full print-compliance is an absolute necessity for a mature PDF creation/conversion tool.

I agree, but we have a long way to go.

Hi - opening this can of worms - can we list the things needed to conform to PDF/A?
@liZe You mention giving Acrobat various PDF files generated by WeasyPrint, how does it tell you what's amiss?
I thought it was mostly about not referencing any outside files and embedding all fonts - is WeasyPrint not complying with that already?

> opening this can of worms

🐛🐛🐛🐛🐛🐛🐛🐛

> can we list the things needed to conform to PDF/A?

That would be really useful.

> @liZe You mention giving Acrobat various PDF files generated by WeasyPrint, how does it tell you what's amiss?

I don’t really remember, but I think that there’s a PDF validator in Acrobat (not in Reader, it’s not free :cry:).

Does anyone know an open source (or at least free) tool to check PDF/A and PDF/X conformance?

> I thought it was mostly about not referencing any outside files and embedding all fonts - is WeasyPrint not complying with that already?

As far as I can remember, there were lots of errors, and most of them were just impossible to fix with Cairo. I think that we need a dedicated PDF generator for that (see #841).

I seem to recall Apache PDFBox having some features, I'll have to check better though.

> I think that we need a dedicated PDF generator for that

Maybe this is another use for a post-processor that would parse through the PDF and do what is needed. It seems like a massive undertaking, though, if it is supposed to support changing everything to be PDF/A compliant. It might be smart to start by being able to convert simple PDFs that don't include edge cases like embedded video.

edit: I was looking through #841 and must say I somewhat disagree about getting rid of external dependencies unless they're proving to be severe limiting factors (maybe Cairo is?). They're literally what makes big open-source projects viable and not just a massive liability to the developers.

> It might be smart to start by being able to convert simple PDFs that don't include edge cases like embedded video.

The current post-processor only knows how to parse PDF files generated by Cairo. It removes a lot of edge cases.

> edit: I was looking through #841 and must say I somewhat disagree about getting rid of external dependencies unless they're proving to be severe limiting factors (maybe Cairo is?). They're literally what makes big open-source projects viable and not just a massive liability to the developers.

Of course, removing all external dependencies is not a goal per se. But there are some reasons why it would be interesting to consider getting rid of some of them:

  • Having non-Python dependencies is the source of many, many, many installation problems, at least on Windows and macOS.
  • We’ve had many problems with Cairo. More than 20% of the reported issues have the word "Cairo" in their comments.
  • Cairo releases are … sometimes late. #278 is a good example of why it’s been really frustrating to work with its dev team.
  • Cairo does a lot of things WeasyPrint’s not interested in. Generating PNG is useful for WeasyPrint, but it could be done with a PDF-to-PNG converter. Cairo is complex, and it will probably not get new PDF-only features anytime soon (the latest stable version is the first one providing metadata and links, for example).
  • Pango should be unnecessary for us. We use it to break lines, but HTML has requirements that are really different from "normal" use cases. That’s why we have a lot of workarounds for texts. We should use HarfBuzz instead, and break lines using a custom algorithm, just as other browsers do. See #301, for example.

So. Here’s what I think.

  • Using a "real" PDF generator would be hard but not impossible. I don’t really like ReportLab for many reasons, but something like that would be really useful.
  • Having a real line-breaking algorithm would make Pango useless.
  • FontConfig is really convenient for Pango, but it should be used only on Linux where it’s the standard library. We could probably rely on macOS and Windows APIs to find fonts (what do other browsers do?).
  • We have to keep HarfBuzz.

Ok, I understand and agree with your points.

> I don’t really like ReportLab for many reasons

I agree that we should steer away from any freemium solutions, as they tend to become a liability down the road when they refuse to push features to their "community" versions.

Do you see this new PDF generator as a separate project or would it be part of WeasyPrint?

> I agree that we should steer away from any freemium solutions, as they tend to become a liability down the road when they refuse to push features to their "community" versions.

:+1:

> Do you see this new PDF generator as a separate project or would it be part of WeasyPrint?

It can be a separate project, with a quite low-level API. The hard part is probably to handle fonts, by creating a PangoCairo equivalent.

(If anyone knows how to convert PDF to PNG in pure Python, that would be useful too :unamused:.)

I found an open-source PDF/A conformance checker that is pretty cool: https://verapdf.org/
(Download here: https://verapdf.org/software/)
There's both a simple GUI for checking individual files and a command-line tool that can be used for automated testing.
It gives you a simple breakdown report that links to details about each error (they're all hosted in this list: https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFA-Part-1-rules).
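
To make the automated-testing idea a bit more concrete, here is a minimal sketch of how a test suite might call the veraPDF command-line tool from Python. It assumes that the launcher is installed on `PATH` under the name `verapdf` and that it accepts the `--flavour` and `--format` options; please check `verapdf --help` for your version, since the helper names below are hypothetical and the options are quoted from memory rather than from this thread.

```python
import shutil
import subprocess


def build_verapdf_command(pdf_paths, flavour="1b", executable="verapdf"):
    """Build the argument list for one veraPDF run.

    Assumption: the CLI accepts --flavour (e.g. "1b" for PDF/A-1b) and
    --format text; verify against `verapdf --help` before relying on it.
    """
    return [executable, "--flavour", flavour, "--format", "text", *pdf_paths]


def run_verapdf(pdf_paths, flavour="1b"):
    """Run veraPDF on the given files and return (return code, report text)."""
    if shutil.which("verapdf") is None:
        # Assumption: the launcher is called "verapdf" and is on PATH.
        raise RuntimeError("verapdf executable not found on PATH")
    result = subprocess.run(
        build_verapdf_command(pdf_paths, flavour),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout
```

A test could then generate a PDF with WeasyPrint, pass its path to `run_verapdf`, and inspect the returned report text for failures.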

> I found an open-source PDF/A conformance checker that is pretty cool: https://verapdf.org/

That’s really cool, thanks!

> It gives you a simple breakdown report that links to details about each error (they're all hosted in this list: https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFA-Part-1-rules).

That’s really impressive.

Having PDF/A conformance is probably one of the best features we can get once we have a new PDF generator. I’m currently working on that :wink:. (That = the generator, not the PDF/A conformance yet)

> I’m currently working on that

Cool, do you have an open repo for it yet? I had been pondering the same.
Thinking out loud: wouldn't PDF/A conformance have to be an option, as it would impact speed and available features?

@liZe is teasing a lot about this new generator. If you need help let me know 😄

How is it going?

> How is it going?

Pretty well! The new PDF generator (called pydyf) is now used in master, and almost everything works fine (missing support for SVG images is the major point we have to fix).

An online PDF validator thinks that many PDF files we generate are already PDF/A compliant, but I suppose that we still have a lot of work (for tags at least, I think). We have to check with veraPDF too.

As explained in #1232, the next step is to have a master branch with the same features as the current stable versions. When it’s done, we’ll have to implement features that people want, but we can’t do everything at the same time :wink:. And before that, we’re currently implementing sponsored features (#247, #1057) that may also be useful for you!
