Pdf.js: Include support for tagged PDFs

Created on 25 Jul 2015 · 14 Comments · Source: mozilla/pdf.js

While working on a feature to show outlines for documents without outlines, I found that the PDF format supports a standard way to attach semantics to the structure of the PDF (sections 14.6, 14.7 and 14.8 of the PDF spec). This could be used to improve text selection, searching and accessibility.

This is a complex feature, and probably not going to be resolved soon. However, we can incrementally add support for smaller features that are under the umbrella of tagged PDFs. I'm now developing the minimal internal data structures and parsers (NumTree, StructTree, StructElem) for the use case of extracting outlines from PDFs, which could be used as a basis for further improvements related to tagged PDFs.
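For readers unfamiliar with number trees, here is a minimal sketch of a lookup. It is not the parser mentioned above: it assumes the tree has already been parsed into plain objects, the function name numTreeGet is made up for illustration, and the sample values are placeholders. The node layout (Kids, Nums, Limits) follows the number tree description in the PDF spec, and the structure tree root's ParentTree entry is one such tree.

```javascript
// Hypothetical plain-object model of a PDF number tree node:
//   { Limits: [min, max], Kids: [node, ...] }                     intermediate node
//   { Limits: [min, max], Nums: [key1, value1, key2, value2, ...] } leaf node
// The root node carries no Limits entry.
function numTreeGet(node, key) {
  if (node.Limits && (key < node.Limits[0] || key > node.Limits[1])) {
    return undefined; // key is outside the range covered by this subtree
  }
  if (node.Nums) {
    // Leaf: keys and values alternate in one flat array.
    for (let i = 0; i + 1 < node.Nums.length; i += 2) {
      if (node.Nums[i] === key) return node.Nums[i + 1];
    }
    return undefined;
  }
  for (const kid of node.Kids || []) {
    const value = numTreeGet(kid, key);
    if (value !== undefined) return value;
  }
  return undefined;
}

// Placeholder data standing in for a parsed ParentTree.
const parentTree = {
  Kids: [
    { Limits: [0, 1], Nums: [0, "structElemsForPage0", 1, "structElemsForPage1"] },
    { Limits: [5, 5], Nums: [5, "structElemsForPage5"] },
  ],
};

console.log(numTreeGet(parentTree, 5)); // "structElemsForPage5"
console.log(numTreeGet(parentTree, 3)); // undefined
```

A real implementation would also need to resolve indirect references and guard against cyclic Kids arrays, which this sketch ignores.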

Relevant bugzilla bugs:

https://bugzilla.mozilla.org/show_bug.cgi?id=727819 "Make PDF.js accessible"
https://bugzilla.mozilla.org/show_bug.cgi?id=861157 "Support tagged PDFs in pdf.js"

External resources:

http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf (section 14.8 Tagged PDF, but also 14.6 Marked Content and 14.7 Logical Structure)
http://www.aiim.org/Research-and-Publications/standards/committees/PDFUA/Technical-Implementation-Guide-32000-1 "PDF/UA Technical Implementation Guide: Understanding ISO 32000-1 (PDF 1.7)"

1-core 2-feature

Rob--W

Most helpful comment

Edge has touted native support for tagged PDFs. Chrome also now supports it, and has also touted its coming ability to export tagged PDFs from web pages.

Today, Firefox does not expose the tagging in PDFs to the accessibility tree / accessibility APIs. However, this text is on the features list for Firefox 80:

Firefox can now be set as the default system PDF viewer.

If a user who relies on AT (assistive technology) does this, or a system administrator who does not know the make-up of their users does this, it can be problematic for those users who otherwise relied on Edge, Chrome, or Adobe's Reader to parse tagged PDFs for them.

I strongly suggest that this advice be struck from the release notes for 80, and that the priority of this bug be bumped up. I understand Mozilla is resource constrained now, but promoting an inaccessible feature that is better served in competing browsers is not a good look.

aardrian on 27 Aug 2020

All 14 comments

Added [triage-needed] label. Do we need a new label (4-tagged-pdf) for development related to tagged PDFs?

Rob--W on 25 Jul 2015

Do we have example PDFs? I personally have never seen such PDFs. How often are they used in practice?

timvandermeij on 25 Jul 2015

Yes, we have a couple of those:

$ cd test/pdfs/
$ grep -rla '/Marked true'
i9.pdf
fips197.pdf
issue1169.pdf
smaskdim.pdf
issue3879.pdf
bug816075.pdf
pdf.pdf
issue1709.pdf
f1040.pdf
wdsg_fitc.pdf
annotation-border-styles.pdf
ecma262.pdf
bug887152.pdf
issue1133.pdf
issue2442.pdf
issue1796.pdf
type4psfunc.pdf

If you need more, https://encrypted.google.com/search?q=filetype%3Apdf+"%2FMarkInfo"+"%2FMarked+true"

Rob--W on 25 Jul 2015

Thank you! In that case, it is definitely interesting to look into this.

timvandermeij on 25 Jul 2015

I think it might be relatively easy to implement this using a mix of HTML and ARIA attributes - no changes to the rendering required - just add some new attributes.

The PDF tagging info is stored in the StructTreeRoot tree, which contains structure elements with accessibility info like alt text, language and semantic type (H1, TH, LI, etc). The structure elements contain references to objects in the page content stream. There's a graphic showing this here:
https://stackoverflow.com/a/34047585

I think you can inject the PDF tagging info in _layoutText(textDiv) using something like this:

1) Look up the corresponding structure element in the StructTreeRoot tree for the PDF object being rendered
2) Add a role attribute to the div if the structure element has a structure type like H1, H2, LI etc.
3) Add an aria-label attribute to the div if the structure element has an /Alt entry
4) Add an aria-level attribute to the div corresponding to heading level for the structure types H1-H6

This should make headings, lists and images accessible to a screen reader. Tables might be more complicated to implement.
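The four steps above could be sketched roughly as follows. This is only an illustration, not pdf.js code: the function name ariaAttributesFor, the plain-object shape of the structure element ({ S, Alt }) and the partial role table are all assumptions made for the example.

```javascript
// Assumed, partial mapping from PDF standard structure types to ARIA roles.
const ROLE_BY_STRUCT_TYPE = new Map([
  ["H", "heading"],
  ["L", "list"],
  ["LI", "listitem"],
  ["Figure", "img"],
]);

// structElem is a hypothetical plain object: S is the structure type,
// Alt is the optional alternate text from the /Alt entry.
function ariaAttributesFor(structElem) {
  const attrs = {};
  const headingMatch = /^H([1-6])$/.exec(structElem.S);
  if (headingMatch) {
    // Steps 2 and 4: H1-H6 become a heading role plus a level.
    attrs.role = "heading";
    attrs["aria-level"] = headingMatch[1];
  } else if (ROLE_BY_STRUCT_TYPE.has(structElem.S)) {
    // Step 2: other known structure types get a role only.
    attrs.role = ROLE_BY_STRUCT_TYPE.get(structElem.S);
  }
  if (typeof structElem.Alt === "string") {
    // Step 3: alternate text becomes the accessible label.
    attrs["aria-label"] = structElem.Alt;
  }
  return attrs;
}

console.log(ariaAttributesFor({ S: "H1" }));
// role "heading", aria-level "1"
console.log(ariaAttributesFor({ S: "Figure", Alt: "Company logo" }));
// role "img", aria-label "Company logo"
```

Step 1, finding the structure element for the object being rendered, is left out here because it depends on how the structure tree is parsed.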

The PDF structure types are listed in section 14.8.4.3. of
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

For a heading the rendering would change from this:

<span style="left: 173.529px; top: 237.049px; 
font-size: 5.99874px; font-family: sans-serif; 
transform: scaleX(1.05905);">
7.  Evaluation
</span>

to this:

<span style="left: 173.529px; top: 237.049px; 
font-size: 5.99874px; font-family: sans-serif; 
transform: scaleX(1.05905);" 
role="heading" aria-level="1">
7.  Evaluation
</span>

That would then be read by a screen reader as "7. Evaluation, heading level 1" and, more importantly, would let the user skip between headings using the 'next heading' key (which makes large documents much easier to navigate).

dd8 on 10 Jan 2020

I've noticed the 4-tagged-pdf label has been removed. Is this still something that's being pursued?

blackdrago on 27 Mar 2020

The issue being open indicates that we're considering it. This is a feature, and the labels have been reordered a bit.

timvandermeij on 27 Mar 2020

Wow, that's great! Does this feature-being-considered include support for generating tagged PDFs? It could facilitate implementing something like a parser/analyzer for existing PDFs, but it would also provide support for generating 508c PDFs.

Core Functionality Required for Generating 508c PDFs:

tag the document (with a language and a title, possibly other tags)
tag the structural objects inside the PDF (header, table, th, td, lists, etc.)
add alternate text to visual media (images, video, figures, etc.)
create/maintain a tab order of elements

If core functionality existed for these 4 things, it would be possible to implement logic in the PDF generation process that produces 508c PDFs. That, to be honest, would be huge, since I've not yet found any open-source JavaScript tool that supports this.

After having written this, I'm not sure if this would qualify as a separate feature request or not... I'd happily create a new issue if that is the case.

blackdrago on 1 Apr 2020

I've been working with @cuhaller to provide better compliance with SC 2.4.10 and 1.1.1 of WCAG 2.0 for use cases specific to the app his team is working on.

I believe the changes should be sufficient for a subset of what this issue is requiring to be done. I'll have a PR up in the next week or so following the contributing guidelines. I'll update this thread when I submit.

trjohnst on 17 May 2020

I have changes in a fork from 2.3.200 of PDF.js to provide heading levels and alternative image text (without positioning) located in the headings-and-img-alt-text branch of this repo.

I'm hesitant to open a PR as there are merge conflicts against master and I currently do not have the time to resolve them.

If anyone has availability to bring this branch up to date with master, let's get in touch!

trjohnst on 21 May 2020

Our organization is looking to implement an accessible PDF solution for users of assistive technologies. We’ve come to the conclusion that previewing a PDF with PDF JS is not accessible as semantic markup is missing. The lack of semantic information creates barriers for users who interact with screen reader software. While the PDF does display in plain text and announce annotations, markup is not supplied for headings, tables, images or links.

The use case surrounding tables is particularly difficult for screen reader users. Tables that lack semantic markup provide no context to users and make it impossible for screen reader users to fully understand the information presented in the PDF.

Links are announced as URLs instead of the specific link text which makes understanding the purpose of the link difficult. We would recommend that links use the visible link text instead of the link URL, so that users will understand the link in context.

Without this support, we have concerns about implementing PDF JS broadly. Is there any update or timeline around supporting a feature to provide semantic markup? We ask that this issue be considered a higher priority as it impacts a user's ability to perceive and interact with content.

samsmith-workday on 24 Sep 2020

As far as I know, contributions are more than welcome.

fgilio on 24 Sep 2020

Thanks @trjohnst for your work on this.

I started manually rebasing @trjohnst's branch on pdf.js master. This approach works well for tags which only need a single level; e.g. headings or images with alt text. When walking the content stream, if it encounters a marked-content sequence, it looks up the associated structure element and places the appropriate ARIA role on the text span in the HTML output by the pdf.js text layer.

Unfortunately, this isn't sufficient for anything that needs nested tags; e.g. lists or tables. I don't think the approach can be extended to cover those, at least not without a lot of tricky edge cases. Furthermore, in order to properly support links and form fields (and note that form fields weren't supported by pdf.js at the time of @trjohnst's contribution), we need to be able to consider the annotation layer, not just the text layer. Thinking even further forward, it'd be good to be able to implement heuristics to try to detect (and correctly position) headings, links, tables, form fields, etc. in untagged PDFs.

Rather than trying to do this in the text layer, I think we're going to need to walk the structure tree and render nodes based on that, setting ARIA properties on the elements we output. The structure tree can reference data in both the text and annotation layers. We can either reorder the text and annotation layer DOM nodes based on the structure tree (might be tricky without breaking the visual rendering?) or use aria-owns to reorder just the a11y tree without reordering the DOM.

Architecturally, this is tricky because the text and annotation layers are already rendered separately, and now we need to look at a third layer (or at least source of truth), the structure tree, which can move (or reference) nodes in both of the other layers. The simplest way to do this is probably to attach an id to every marked-content sequence (in the text layer) and link/form field (in the annotation layer). I see form fields already have a data attribute specifying an id. If we're going to use aria-owns, we need to set the id attribute anyway, so this might feed two birds with one scone. The id would need to be something we can calculate from outside of the text and annotation layers, from within our new structure layer. When we're handling the structure tree, we'd then output elements for the structure elements, moving/owning elements from the text/annotation layers based on their ids.
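To make the idea concrete, here is a rough sketch of a structure layer that outputs one element per structure element and uses aria-owns to pull in content by id. Everything in it is an assumption for illustration: the id scheme (contentId), the plain-object tree shape ({ S, K } with { page, mcid } leaves), the partial role table, and the choice to emit an HTML string rather than DOM nodes.

```javascript
// Hypothetical id scheme: computable from the page index and MCID alone,
// so the text layer and the structure layer can derive it independently.
const contentId = (ref) => `p${ref.page}_mc${ref.mcid}`;

// Assumed, partial structure-type to ARIA role table.
const ROLES = new Map([
  ["L", "list"],
  ["LI", "listitem"],
  ["Table", "table"],
  ["TR", "row"],
  ["TD", "cell"],
]);

// node.K holds nested structure elements and/or marked-content references.
function renderStructNode(node) {
  const owned = [];
  const nested = [];
  for (const kid of node.K) {
    if ("mcid" in kid) {
      owned.push(contentId(kid)); // content rendered elsewhere, owned by id
    } else {
      nested.push(renderStructNode(kid)); // nested structure element
    }
  }
  const role = ROLES.has(node.S) ? ` role="${ROLES.get(node.S)}"` : "";
  const owns = owned.length ? ` aria-owns="${owned.join(" ")}"` : "";
  return `<span${role}${owns}>${nested.join("")}</span>`;
}

const list = {
  S: "L",
  K: [
    { S: "LI", K: [{ page: 0, mcid: 3 }] },
    { S: "LI", K: [{ page: 0, mcid: 4 }] },
  ],
};
console.log(renderStructNode(list));
```

One limitation of this sketch: as far as I understand aria-owns, owned elements are ordered after an element's own DOM children, so a structure element that mixes nested elements and marked content in a particular order would need extra handling.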

Going beyond tagged PDF to heuristics, we'd need to be able to do things like: given a link or form field annotation, does its rectangle encompass something in the text layer? If it does, the annotation should be associated with its text (aria-owns or DOM move). Again, that's architecturally tricky because the text and annotation layers (and their inputs) are separate and I don't think we have any cached state from those layers we can use. However, we can potentially look at the bounds of the nodes rendered by the text and annotation layers, though that starts to blur the architectural boundaries between content and presentation processing.
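The rectangle test itself is simple; a minimal sketch follows. The function names, the { left, top, right, bottom } rectangle shape (with top less than bottom, as in CSS coordinates), the one-unit tolerance and the sample data are all assumptions for illustration, not existing pdf.js state.

```javascript
// Returns true when `inner` lies within `outer`, allowing a small tolerance
// because annotation rectangles rarely line up exactly with text bounds.
function encloses(outer, inner, tolerance = 1) {
  return (
    inner.left >= outer.left - tolerance &&
    inner.top >= outer.top - tolerance &&
    inner.right <= outer.right + tolerance &&
    inner.bottom <= outer.bottom + tolerance
  );
}

// textItems is a hypothetical list of { id, rect } entries from the text layer.
function textIdsInsideAnnotation(annotationRect, textItems) {
  return textItems
    .filter((item) => encloses(annotationRect, item.rect))
    .map((item) => item.id);
}

const linkRect = { left: 100, top: 200, right: 180, bottom: 212 };
const textItems = [
  { id: "p0_mc3", rect: { left: 100.4, top: 200.2, right: 179.8, bottom: 211.5 } },
  { id: "p0_mc4", rect: { left: 190, top: 200, right: 260, bottom: 212 } },
];
console.log(textIdsInsideAnnotation(linkRect, textItems)); // [ 'p0_mc3' ]
```

The resulting ids could then feed aria-owns on the annotation's element, or drive a DOM move.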

While an initial implementation of tagged PDF doesn't necessarily need to support heuristics, I'd strongly encourage this to be considered as part of the architectural design. The reality is that untagged PDFs are unfortunately very prevalent and it'd be sad to be locked into an architecture which doesn't allow these to be made more accessible. (Note that Acrobat Reader, and to a much lesser extent Chromium, use heuristics to try to make untagged PDFs more accessible.)