Pdf.js: Badly rendered Times New Roman PS

Created on 22 Mar 2019 · 3 Comments · Source: mozilla/pdf.js

Attach (recommended) or Link to PDF file here:
EMG8 -Cambridge Essentail Gold Maths 8-B 117.pdf

Configuration:

Web browser and its version: Chrome 73
Operating system and its version: macOS 10.14.1
PDF.js version: 2.2.91
Is a browser extension: No

Steps to reproduce the problem:

Open the PDF (it's a single page extracted from a book) and observe the badly rendered fonts.

What is the expected behavior? (add screenshot)
In any other PDF reader it renders fine.

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):
Current github pages viewer: https://mozilla.github.io/pdf.js/web/viewer.html

Notes:
We've managed to make it work by editing the fonts within the PDF. The font is currently "Times New Roman PS"; re-rendering the PDF with plain "Times New Roman" seems to fix it.

There are no console errors, nor any other visible hints at the cause, such as missing CMaps.

Unfortunately we are not allowed to alter PDFs, so re-rendering each one is not a viable solution.

If anyone can give any insight into this and a possible solution, that would be massively appreciated 🤠

4-font-conversion


dmisdm

All 3 comments

The PDF defines the same fonts many times. For example, font LULQLP+TimesLTStd-Roman is defined nine times. Each one refers to the same FontDescriptor and the same embedded CFF data stream.

Here is a reduced PDF that contains two fonts from the original PDF
issue10665_reduced.pdf

janpe2 on 26 Mar 2019

👍4

> There is a font hash computation in PartialEvaluator.prototype.preEvaluateFont() in core/evaluator.js. It adds entries Encoding, ToUnicode, and Widths in the hash. Some fonts in the PDF get identical hash codes because all the mentioned entries are identical, even Widths. Only entries FirstChar and LastChar differ.

Really excellent analysis, thank you; this made the bug easy to fix!

> If fonts get identical hash codes, could it cause a font to be skipped so that it won't be converted to OpenType?

In some badly generated PDF files there can be a huge number of identical fonts, and the purpose of preEvaluateFont was simply to avoid having to load/parse duplicates. Hence loadFont compares hashes and, if possible, reuses an already loaded/parsed font.
Obviously this all hinges on the hashes actually being correct/unique, but fortunately there have been relatively few bugs in that code over the years.
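
To make the failure mode concrete, here is a minimal sketch of a hash-keyed font cache. It is not the pdf.js code: `fontKey`, `makeLoader`, `widthOf`, the field lists, and the font dictionaries with their width values are all hypothetical, invented only for illustration. It shows how a key built from Encoding, ToUnicode, and Widths alone treats two fonts as the same when only FirstChar and LastChar differ, and how adding those two entries to the key (one plausible remedy, assumed here) keeps them apart.

```javascript
// Hypothetical font dictionaries; the names mirror PDF font entries,
// but the values are made up for this sketch.
const fontA = {
  Encoding: "WinAnsiEncoding",
  ToUnicode: null,
  Widths: [250, 500, 444],
  FirstChar: 32,
  LastChar: 34,
};
// Same Encoding/ToUnicode/Widths, but a different character range.
const fontB = { ...fontA, FirstChar: 65, LastChar: 67 };

// Simplified cache key: serialize only the chosen fields.
// (An assumption for illustration, not how pdf.js computes its hash.)
function fontKey(font, fields) {
  return JSON.stringify(fields.map((f) => [f, font[f] ?? null]));
}

// A tiny loader that reuses an already "parsed" font when the keys match.
function makeLoader(fields) {
  const cache = new Map();
  let parses = 0;
  return {
    load(font) {
      const key = fontKey(font, fields);
      if (!cache.has(key)) {
        parses++; // stands in for the expensive load/parse step
        cache.set(key, { firstChar: font.FirstChar, widths: font.Widths });
      }
      return cache.get(key);
    },
    get parses() {
      return parses;
    },
  };
}

// Width lookup relative to the FirstChar stored on the parsed font.
function widthOf(loaded, charCode) {
  return loaded.widths[charCode - loaded.firstChar];
}

const NARROW_FIELDS = ["Encoding", "ToUnicode", "Widths"];
const WIDE_FIELDS = [...NARROW_FIELDS, "FirstChar", "LastChar"];

// Narrow key: fontB is served the object that was parsed for fontA.
const narrow = makeLoader(NARROW_FIELDS);
const narrowA = narrow.load(fontA);
const narrowB = narrow.load(fontB);
console.log("narrow key reuses font:", narrowA === narrowB);
console.log("width of char 65 via narrow key:", widthOf(narrowB, 65));

// Wide key: the two fonts are parsed separately.
const wide = makeLoader(WIDE_FIELDS);
const wideA = wide.load(fontA);
const wideB = wide.load(fontB);
console.log("wide key reuses font:", wideA === wideB);
console.log("width of char 65 via wide key:", widthOf(wideB, 65));
```

With the narrow key, the second font inherits the first font's character range, so width lookups for its own characters land on the wrong index. That is consistent with the kind of misrendering described in this issue, though the sketch does not claim to reproduce the exact pdf.js behaviour.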