PhantomJS: PhantomJS 2.x renders PDF with huge file size compared to 1.x

Created on 11 Feb 2016 · 43 Comments · Source: ariya/phantomjs

When I create a PDF using PhantomJS 2.x, the file is almost ten times larger than with older versions, i.e. 1.x.

Are there any parameters to control file size or pdf quality?

Thanks in advance.

Bug Confirmed QOther Regression Unix stale

All 43 comments

Unfortunately, no, there is no way to control this.

Is it a known issue? And are you going to fix this in the upcoming release?

No, we will not fix this issue. I don't see problems with it (unless the result file is really huge in comparison with the file created with 1.x).

We could eventually fix it, if there is a problem with Qt itself. You probably should search their bug tracker for similar issues.

@roshan-lrn also, could you please attach your PDF files?

I have to say that I wouldn't have closed this summarily - a 10x file size increase from 1.x to 2.x for the same website does, in my book, qualify as a regression worth at least digging into a little. However, we really do need to see the PDFs you got, and you should be aware that we may not be able to do anything about it. PDF generation happens deep inside Qt, not in our code.

@zackw yes, it's hard to do anything without examples and additional info (the OS, for example), because I don't see the regression on Windows. I believe it happens on other platforms.

UPD
I quickly checked the Qt bug tracker. I didn't find anything related to the PDF file size. So, it might be a new issue.

On Windows platform, it's working fine.

I am using CentOS 6, 64 bit, PhantomJS 2.1.1, and qt 4.6.2.

Kindly find attached PDFs.

phantomjs-2.1.1.pdf - 4682kb
phantomjs-1.9.8.pdf - 577kb
phantomjs-1.9.8 (update html zoom).pdf - 607kb

OK, this is definitely a real regression. The problem is that _nearly all of the text is being emitted as vector graphics._ In other words, each glyph is being converted to a series of line drawing commands. This bulks up the file, but it also makes the text non-searchable, non-copyable, etc.

I think this probably does need to get reported upstream to Qt, but first we need to nail down a reproducer and figure out why it doesn't happen on Windows. It would be nice to have an example HTML file -- rendering as just one page, and without the huge background images, please! -- that shows the effect. You can confirm the effect by trying to select text in a PDF viewer. In your phantomjs-2.1.1.pdf, only the text in the footer of each page is selectable, whereas in -1.9.8.pdf all of the text is selectable.

I think the problem is in embedded fonts.
PhantomJS 2.1.1: [image not preserved]

PhantomJS 1.9.8: [image not preserved]

DejaVu Sans is obviously larger than other 3 fonts.
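For anyone who would rather check the fonts from a script than from a viewer's properties dialog, here is a rough Python sketch. It scans the raw PDF bytes for /FontName entries, so it assumes the font descriptors are stored uncompressed; a PDF that packs them into compressed object streams will show nothing. The function name font_names is made up for illustration.

```python
import re
import sys

# /FontName entries appear in font descriptors. Subset fonts usually carry
# a prefix such as "ABCDEF+", which is stripped below.
FONT_NAME = re.compile(rb"/FontName\s*/([^\s/<>\[\]()]+)")


def font_names(pdf_bytes):
    """Return the sorted, de-duplicated font names found in raw PDF bytes."""
    names = set()
    for match in FONT_NAME.finditer(pdf_bytes):
        name = match.group(1).decode("latin-1")
        names.add(name.split("+", 1)[-1])
    return sorted(names)


if __name__ == "__main__":
    # Usage: python font_names.py phantomjs-2.1.1.pdf phantomjs-1.9.8.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, font_names(handle.read()))
```

If the web fonts you asked for are missing from the list and only system fallbacks show up, the text set in those web fonts was most likely drawn as outlines.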

I am using 'ProximaNova-Regular', 'ProximaNova-SemiBold', and 'ProximaNova-RegularBold' as primary fonts and 'Arial' as secondary font. I have added 'eot', 'woff2', 'woff', 'ttf', and 'svg' files for fonts.

I am NOT using 'NimbusSanL-Regular' and 'DejaVuSans' fonts.

On the landing page I am using a background image; otherwise all images are in 'img' tags. The footer is added through the paperSize API.

One more thing I would like to mention: I generated the above PDF files on the same machine, using the same files, with the two different PhantomJS versions.

This is a recurring issue reported by users.

As mentioned in astefanutti/decktape#3, installing the TTF or OTF font on the local file system seems to overcome the issue, though that's not a solution obviously.

There used to be a similar problem on Mac OS X, until QTBUG-10094 got fixed and integrated into PhantomJS. So maybe something equivalent is needed on Unix.

Anyway, +1 to keeping this issue open.

Is there a trick to making installed ubuntu fonts work with a consumed phantomjs pdf? I assume disable webfonts as well?

@astefanutti We need a minimized test case as I said above.

It would be nice to have an example HTML file -- rendering as just one page, and without the huge background images, please! -- that shows the effect.

It sounds like this test case needs to use webfonts, but ideally there should be just one of them, and we need the actual font file, not a reference to a service.

@miganga Please take questions about how to work around this bug to a support forum, such as

The bug tracker is only for figuring out how to fix the bug.

@zackw agreed. I'll try to produce such a minimal test case ASAP.

@zackw, @Vitallium, here is a _minimal_ test case test.html for that issue:

<html>
<style>
    @font-face {
        font-family: 'Ubuntu Mono';
        font-style: normal;
        font-weight: 400;
        src: url(http://astefanutti.io/further-cdi/fonts/UbuntuMono-Regular.woff) format('woff');
    }
    body {
        font-family: 'Ubuntu Mono', monospace;
    }
</style>
<body>
    <div>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
        tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
        quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
        consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
        cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat
        non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    </div>
</body>
</html>

When executing phantomjs rasterize.js test.html test.pdf, the output PDF is 6 KB on Mac OS X and Windows while it's 122 KB on Debian 7. Besides, the text is not selectable in the latter.

@astefanutti great!

@astefanutti's example is correct. v2.1.x, on Debian and Ubuntu at least, handles _local_ fonts correctly, i.e., they are embedded in the PDF, so the text is rendered on screen in high quality, is searchable and selectable, and the file size stays low. For _remote_ webfonts, as in the example above, the text is converted to outlines and the fonts are not embedded, so the text is neither searchable nor selectable and the file size becomes huge for longer documents.

It affects remote woff, otf and ttf fonts alike.

Probably Qt related, but it works as intended in wkhtmltopdf's latest alpha versions.
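A quick way to see which fonts on a page are likely to be affected is to list the url() sources in its @font-face rules. Here is a minimal Python sketch, assuming plain CSS with no nested braces inside @font-face; font_face_urls is a made-up helper name. Fonts referenced only through local() are not reported.

```python
import re

# Body of each @font-face rule (these rules do not contain nested braces).
FONT_FACE_BLOCK = re.compile(r"@font-face\s*\{(.*?)\}", re.DOTALL | re.IGNORECASE)
# A url(...) source, with or without quotes around the address.
URL_SOURCE = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", re.IGNORECASE)


def font_face_urls(css):
    """Return every url() source declared inside @font-face rules."""
    urls = []
    for block in FONT_FACE_BLOCK.finditer(css):
        urls.extend(URL_SOURCE.findall(block.group(1)))
    return urls
```

Running it on the stylesheet from the minimal test case above returns the single UbuntuMono-Regular.woff address.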

@martent thanks a lot. Good to know that it works in wkhtmltopdf as it may help identify proper fix for this. I'll try to dig into that direction ASAP.

Yes, I have the same issue @astefanutti has demonstrated with .ttf fonts. FYI.

@martent I've tried exporting the above example using wkhtmltopdf's latest alpha on Debian 7 (wkhtmltox-0.13.0-alpha-7b36694_linux-wheezy-amd64.deb from http://wkhtmltopdf.org/downloads.html) and faced the issue as well.

What version / OS have you been using to produce a correct PDF? That may help identify the fix...

@astefanutti Unfortunately, I get the same results as you when I test it now. Might be that I had the 0.12 version on the system level in a test box with the patched QT/WK. v0.13 will have those patches, including the one for selectable text, applied in a later stage according to the docs https://github.com/wkhtmltopdf/wkhtmltopdf/blob/0.13/README.md#013-rc

@martent I've tried with the latest stable version of wkhtmltopdf (wkhtmltox-0.12.3_osx-cocoa-x86-64.pkg) and it works. So it's either the patches yet to be applied in v0.13 or it's a regression in Qt introduced between Qt 4.8 and 5.x. I would favour the latter assumption, as most of the WOFF support patches are already in PhantomJS 2.x and there has been a major refactoring of the QFontEngine platform implementations in Qt 5.0. I'll try to test the above sample in Qt directly to validate that assumption.

I've isolated a minimal Qt application that demonstrates this is a regression from Qt 4.8 to Qt 5.x. I've created QTBUG-52417 and I'll try to dig into Qt to eventually have a fix.

@astefanutti Great isolation of the problem.

I might add to the earlier discussion above that this is not just a problem of file size: there is no real text in the PDF, just vector images of glyphs, meaning that you can't search the file or select and copy text. The reason the fonts are not embedded is simply that they are not used in the final PDF, only in the conversion stage, where the glyphs are converted to vector images.

Has anyone found a workaround to this issue? I tried base64 encoding our fonts, to no avail.

I pinpointed the code that's causing this issue and came up with a hacky fix, but I don't quite know how to solve it properly.

https://github.com/qtproject/qtbase/blob/fb6000a74f57bd3c096f6a10142477bf2faf0ff2/src/gui/text/qfontengine.cpp#L2284 indicates that remote WebFonts do not have a filename (no surprise here)

The code that disables font embedding is https://github.com/qtproject/qtbase/blob/f34e73a16a3d757057e007874cb5008f16e20f02/src/gui/painting/qpdf.cpp#L2601. It disables font embedding if the font does not have a filename (which remote fonts do not have, as documented above).
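To make the condition easier to follow, here is a simplified Python model of it, based only on the checks visible in the qpdf.cpp snippets quoted in this thread. It is not Qt code, and will_embed and its parameter names are hypothetical.

```python
def will_embed(embed_fonts, filename, fs_type):
    """Model of the embedding decision: True means the font would be embedded."""
    if not embed_fonts:
        return False  # embedding switched off altogether
    if not filename:
        return False  # no file name, which is the case for web fonts
    if fs_type & 0x200:
        return False  # fsType says "bitmap embedding only"
    if fs_type == 2:
        return False  # fsType says "no embedding allowed"
    return True
```

With an empty filename the function returns False whatever the licensing bits say, which matches the behaviour seen with web fonts on Linux.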

A hack I came up with that is woefully horrible is:

diff --git a/src/gui/painting/qpdf.cpp b/src/gui/painting/qpdf.cpp
index 7e90d81..6318313 100644
--- a/src/gui/painting/qpdf.cpp
+++ b/src/gui/painting/qpdf.cpp
@@ -2576,6 +2576,19 @@ void QPdfEnginePrivate::drawTextItem(const QPointF &p, const QTextItemInt &ti)
     QFontEngine *fe = ti.fontEngine;

     QFontEngine::FaceId face_id = fe->faceId();
+    QByteArray filename = face_id.filename;
+
+    if (filename.isEmpty()) {
+        // HACKHACK (miller): Since webfonts don't have a file name, they can't be embedded on Linux.
+        // To allow for embedding web fonts, concatenate the family and styleName.
+        // This is probably not "strictly" legal (see https://github.com/qtproject/qtbase/blob/fb6000a74f57bd3c096f6a10142477bf2faf0ff2/src/gui/text/qfontengine.cpp#L2284),
+        // but it gets the job done for our purpose.
+
+        QFontDef fontDef = fe->fontDef;
+        QString fakeFilenameString = fontDef.family + fontDef.styleName;
+        face_id.filename = fakeFilenameString.toUtf8();
+    }
+
     bool noEmbed = false;
     if (!embedFonts
         || face_id.filename.isEmpty()

@michaelgmiller your fix works fine

@zowers yup, I'm using it ;) but it's (probably) not something that can be committed to Qt. Would be interested to know if a Qt person has a better idea for a fix.

For those with limited use cases looking for a temporary workaround, adding font files to the PhantomJs server's ~/.fonts folder and referencing them in CSS with the src: local('Font Name') syntax seems to work.

Any progress on getting a fix for this? This is not in my skill set but any help toward fixing this in Qt and then in Phantomjs would be greatly appreciated.

@toddhickerson if you want to get a fix, comment at https://bugreports.qt.io/browse/QTBUG-52417 and post in the Qt forums / IRC chat room. The fix lies entirely in the Qt project.

I've tested the following change (basically removing the checks on the filename existence) and have been able to produce correct results for all my test samples:

diff --git a/src/gui/painting/qpdf.cpp b/src/gui/painting/qpdf.cpp
index d746ab9..e3cfa6c 100644
--- a/src/gui/painting/qpdf.cpp
+++ b/src/gui/painting/qpdf.cpp
@@ -2581,16 +2581,29 @@ void QPdfEnginePrivate::drawTextItem(const QPointF &p, const QTextItemInt &ti)
     QFontEngine *fe = ti.fontEngine;

     QFontEngine::FaceId face_id = fe->faceId();
     bool noEmbed = false;
     if (!embedFonts
-        || face_id.filename.isEmpty()
+        /*|| face_id.filename.isEmpty()*/
         || fe->fsType & 0x200 /* bitmap embedding only */
         || fe->fsType == 2 /* no embedding allowed */) {
         *currentPage << "Q\n";
         q->QPaintEngine::drawTextItem(p, ti);
         *currentPage << "q\n";
-        if (face_id.filename.isEmpty())
-            return;
+        /*if (face_id.filename.isEmpty())
+            return;*/
         noEmbed = true;
     }

I've tried to search the qpdf.cpp source history to understand the rationale for this check, but it has been there for as long as the available history goes back.

This leads to the following points / questions:

  • This is a regression from Qt4 to Qt5 and the related part of qpdf.cpp hasn't changed, so presumably the filename used to be set?
  • As the related part of qpdf.cpp isn't platform specific, what is the filename value on the other platforms, and what value was set in Qt4?
  • Are these checks actually needed?

Before it gets integrated, a statically linked Linux binary (tested on Centos 6, 7, Debian 7, 8, Ubuntu 14.04, 16.04 and ArchLinux 2015.06.01) that contains the fix above is available here: https://github.com/astefanutti/decktape/releases/download/v1.0.0/phantomjs-linux-x86-64.

Any idea on when this will get integrated?

@astefanutti - The library you built works great for fonts, but only produces the first page of the PDF. I'm on Ubuntu 16.04. Are you able to produce multi-page PDFs with your library?

@johnjarrard I haven't tried producing multi-page PDFs. I suspect this is caused by #14268.

Wanted to post my solution to this problem. It turns out that loading a web font from a URL causes PhantomJS to draw the text as glyph outlines in the PDF instead of embedding the font. The text then cannot be highlighted, since it is no longer text, and the PDF file size grows about 10 times.

We were using Proxima Nova, and our CSS file looked like this:

@font-face
    font-family ProximaNovaReg
    font-style normal
    font-weight 100
    src url("/assets/fonts/ProximaNova-ThinWeb.woff") format("woff")

body
   font-family ProximaNovaReg

To fix the issue, we installed the Proxima Nova TTF files directly onto our Ubuntu box. This means copying the TTF files to /usr/share/fonts/truetype, and running fc-cache -fv.
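If you need to repeat those two steps on several machines, they can be scripted. Here is a minimal Python sketch, assuming fc-cache is on the PATH and that the script may write to the target directory (the system directory normally needs root). install_fonts and its parameters are made-up names.

```python
import shutil
import subprocess
from pathlib import Path


def install_fonts(font_files, target_dir="/usr/share/fonts/truetype", refresh_cache=True):
    """Copy font files into target_dir and optionally rebuild the fontconfig cache."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    installed = []
    for font in font_files:
        destination = target / Path(font).name
        shutil.copy2(font, destination)
        installed.append(destination)
    if refresh_cache:
        # Same command as in the manual steps: force a verbose cache rebuild.
        subprocess.run(["fc-cache", "-fv"], check=True)
    return installed
```

Passing a per-user directory such as ~/.fonts (expanded to a full path) as target_dir avoids the need for root, at the cost of installing the fonts only for that user.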

Now we can change our CSS to just the following:

   font-family "Proxima Nova"

PhantomJS now treats Proxima Nova as a natively installed font, and renders a smaller sized PDF with selectable text. This is the right solution.

Note: I only encountered this problem on Linux. Mac OS worked fine.

@robinfhu Thanks, excellent fix!

@robinfhu Amazing!

Due to our very limited maintenance capacity (see #14541 for more details), we need to prioritize our development focus on other tasks. Therefore, this issue will be automatically closed. In the future, if we see the need to attend to this issue again, then it will be reopened. Thank you for your contribution!

Has anybody been able to solve this?
