Pdf.js: Japanese characters incorrectly rendered

Created on 11 Dec 2017 · 11 Comments · Source: mozilla/pdf.js

Attach (recommended) or Link to PDF file here:
pdfjs-error
Sample file_from_docbase.pdf

Configuration:

  • Web browser and its version: Latest Chrome, also IE 11
  • Operating system and its version: Mac and Windows
  • PDF.js version: Latest demo site - 1.9
  • Is a browser extension: No

Steps to reproduce the problem:

  1. Go to https://mozilla.github.io/pdf.js/web/viewer.html
  2. Open the attached PDF in the viewer, and also open it in Acrobat Reader
  3. Compare the text: some characters are different (see the attached screenshot)

What is the expected behavior?
Characters rendered correctly as in Acrobat

What went wrong?
Incorrect character in viewer - Acrobat shows the correct one

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):
https://mozilla.github.io/pdf.js/web/viewer.html

4-font-conversion 4-font-truetype

All 11 comments

The following is printed in the console:

PDF 91d7d110e0edd2aeed3b25db4e68f818 [1.7 www.adlibsoftware.com: CTP (5.4.0.30881) OS (Windows 2012,2,0,64); modified using iText 2.1.7 by 1T3XT / Microsoft Word(15.0)] (PDF.js: 2.0.203)   viewer.js:1607:7
Warning: FormatError: Required "loca" table is not found   pdf.worker.js:340:5

The error suggests either that the PDF file includes broken TrueType fonts that are missing tables required for glyph mapping to work, or that the PDF file contains inconsistent font data which "lies" about the type of the included font files.
Another possible explanation, based on a cursory look at the actual font files, would be that a number of the TrueType fonts contain bogus file header information.
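
To illustrate what "Required "loca" table is not found" refers to, here is a minimal Python sketch (not PDF.js code) that reads the table directory of a single sfnt/TrueType font and reports which tables are absent. The function names and the short list of required tags are assumptions chosen for illustration; the byte layout follows the OpenType font file format.

```python
import struct


def read_table_directory(data: bytes, offset: int = 0) -> dict:
    """Return {tag: (table_offset, length)} for the sfnt font starting at `offset`."""
    # Offset table: sfntVersion (4 bytes), then numTables, searchRange,
    # entrySelector and rangeShift (uint16 each), 12 bytes in total.
    _version, num_tables = struct.unpack_from(">4sH", data, offset)
    tables = {}
    for i in range(num_tables):
        # Each table record is 16 bytes: tag, checksum, offset, length.
        tag, _checksum, table_offset, length = struct.unpack_from(
            ">4sIII", data, offset + 12 + 16 * i
        )
        tables[tag.decode("latin-1")] = (table_offset, length)
    return tables


def missing_tables(data: bytes, required=("head", "loca", "glyf")) -> list:
    """List the tags from `required` (an illustrative subset) that are not present."""
    present = read_table_directory(data)
    return [tag for tag in required if tag not in present]
```

If the bytes are not laid out as a single font, the directory read this way will likely contain none of the expected tags, which would be consistent with the warning above.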

@Snuffleupagus can you explain what that comment means? :) Thank you.

PDFBox shows a different error, "head is mandatory". That table is needed too. I can see "head" and "loca" in the byte sequence. Saving MSMincho and opening it with DTL OTMaster fails. However, it succeeded when I renamed the file to *.ttc. It's a font collection. (I noticed that the byte sequence started with "ttcf").

@generiscorp The software you have used ("www.adlibsoftware.com: CTP (5.4.0.30881) OS (Windows 2012,2,0,64); modified using iText 2.1.7 by 1T3XT") has embedded the whole TrueType collection, instead of embedding just a single TrueType font, or better, a font subset. Check the options or contact their support. Font subsetting makes the files much smaller.

@THausherr thank you, but the PDF should still be displayed correctly in PDF.js as in Acrobat, right?

> It's a font collection. (I noticed that the byte sequence started with "ttcf").

The specification can be found at https://www.microsoft.com/typography/otspec/otff.htm, under the "Font Collections" heading.
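
As a rough illustration of the "Font Collections" layout, the following Python sketch checks for the "ttcf" tag and, if present, returns the byte offset of each font in the collection. It is an assumption-laden sketch rather than what PDF.js or PDFBox actually do; `ttc_font_offsets` is a hypothetical helper name.

```python
import struct


def ttc_font_offsets(data: bytes) -> list:
    """Return the offsets of the fonts in a TrueType Collection, or [] if not one."""
    if data[:4] != b"ttcf":
        return []
    # TTC header: tag (4 bytes), majorVersion (uint16), minorVersion (uint16),
    # numFonts (uint32), followed by numFonts uint32 offsets from file start.
    _tag, _major, _minor, num_fonts = struct.unpack_from(">4sHHI", data, 0)
    return list(struct.unpack_from(f">{num_fonts}I", data, 12))
```

Each returned offset points at an ordinary single-font table directory, so a reader that expects one font would need to pick one of these offsets instead of starting at byte 0.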

@generiscorp Adobe displays a lot of broken files. Nevertheless, your file is incorrect. You put a font collection at a place where one font is expected. And having a one-page PDF with no image and just a few lines of text grow to a size of 23MB should give you a hint that something is wrong and that you should fix your PDF production. Such a file should have a size of less than 100KB.

@THausherr I understand. However, this file is generated by the Adlib rendering tool and we have no influence on its format. Anyway, since the font is embedded, all the characters should be displayed correctly, right? Is this going to be fixed in PDF.js, or is the only way to fix it to fix the fonts being embedded?

@generiscorp You could complain to Adlib, or check whether you're using the correct options with their software, maybe also ask them why they are using iText 2.1.7 which is from 2009.

I don't speak for PDF.js. I work with a different project (pdfbox) and that one has the same problem.

The Chrome PDF viewer and Ghostscript have the same problem.

I haven't had time to run any tests yet, but it looks like this ought to work: https://github.com/mozilla/pdf.js/compare/master...Snuffleupagus:TrueType-Collection
One thing to note though is that the performance isn't so great, but that's probably to be expected since we're forced to parse more than 20 MB of font data for just one page (see also https://github.com/mozilla/pdf.js/issues/9262#issuecomment-351336064).

Edit: Also, @THausherr I just wanted to say thank you for helping to narrow down the root cause of this issue :-)
