Pdf.js: Random / apparent "bolding" / corruption

Created on 31 Mar 2016 · 46 Comments · Source: mozilla/pdf.js

Link to PDF file:
default.pdf

Configuration:

  • Web browser and its version: Version 49.0.2623.87 (64-bit) [problem is not browser specific]
  • Operating system and its version: Linux Ubuntu 15.10 [not OS specific]
  • PDF.js version: [all / 1.3.91]
  • Is an extension: pdf.js embedded in application

Steps to reproduce the problem:

  1. Repeatedly open the viewer
  2. Sometimes the display is OK, sometimes there is random "bolding"
  3. Frequency appears to be random
  4. This is CAUSED by setting "PDFJS.disableWorker = true;" (removing this fixes issue)
  5. I cannot "not" disable the worker because of the massive download this causes on _every_ view
  6. Content is loaded from an in-memory string
  7. I've verified the string contents are consistent between Ok and corrupt views
  8. On multi-page documents, moving forward a page and then back always fixes the issue
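One way to make the check in step 7 repeatable is to log a short checksum of the decoded bytes on every view and compare the values for good and corrupt renders. This is a minimal sketch, assuming the PDF data is available as a Uint8Array. `fnv1a` is a hypothetical helper name, and FNV-1a is just a convenient non-cryptographic hash, not something pdf.js provides.

```javascript
// Hypothetical helper: 32-bit FNV-1a hash of a byte array, returned as hex.
function fnv1a(bytes) {
  var hash = 0x811c9dc5;                 // FNV offset basis
  for (var i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193);  // multiply by the FNV prime, keep 32 bits
  }
  return (hash >>> 0).toString(16);
}
```

Logging `fnv1a(bytes)` just before the data is handed to getDocument gives one short value per view; identical values for a good and a corrupt render suggest the input data is not the variable.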

What is the expected behavior? (add screenshot)
ok

What went wrong? (add screenshot)
corrupt

1-other 4-chrome-specific

All 46 comments

I cannot "not" disable the worker because of the massive download this causes on every view

Could you explain this part?

If you are calling getDocument multiple times, use a single PDFWorker instance. It's hard to tell what's causing the fonts to be corrupted, but a link to a working example might shed some light. Can you create and publish an example that reproduces the issue?
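The single-worker suggestion could look roughly like the sketch below. It assumes a pdf.js build that exposes PDFWorker and whose getDocument accepts a `worker` option; older builds, possibly including the 1.3.x mentioned in the report, may not. `createSharedLoader` is a hypothetical helper, the string argument to PDFWorker is an assumption that varies between versions, and the library object is passed in as `pdfjs` rather than assumed to be a global.

```javascript
// Sketch only: reuse one PDFWorker for every document opened on the page.
// `pdfjs` is whatever object exposes the library (PDFJS in 1.x builds).
function createSharedLoader(pdfjs) {
  // Constructor argument is an assumption; check the version in use.
  var worker = new pdfjs.PDFWorker('shared-pdf-worker');
  return {
    // `data` is the PDF as a typed array, e.g. the decoded in-memory string.
    open: function (data) {
      return pdfjs.getDocument({ data: data, worker: worker });
    },
    // Call once when the page no longer needs to display PDFs.
    destroy: function () {
      worker.destroy();
    }
  };
}
```

With this shape, every `open()` call goes through the same worker, so the worker script is instantiated once per page rather than once per document.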

Ok, I can't easily publish a link. If I run with workers enabled, it works 100%. If I set the disable flag, you see the problem shown above, randomly, maybe 1 instance of 4.

I was heading off the response of "just enable the workers" by detailing "why" I'm disabling the worker, which is that I'm displaying lots of small PDF's fairly rapidly, and adding a 1.2Mb download for pdf_worker.js for each display is not practical. I've been looking at the web-worker code to see if there is an option for workers to cache the .js script they're called on, but I've not managed to find anything.

My initial guess (based on the effect) is that there is something somewhere with a global scope that is cleared correctly if the script is loaded for each instance, but that causes a problem if the worker script is repeatedly re-used. However (!) given the issue can occur on the FIRST pdf display, I'm at a bit of a loss to know what to look at.

I was heading off the response of "just enable the workers" by detailing "why" I'm disabling the worker, which is that I'm displaying lots of small PDF's fairly rapidly, and adding a 1.2Mb download for pdf_worker.js for each display is not practical.

@oddjobz I still don't understand the concern. With the worker disabled you are _still_ downloading pdf.worker.js, but it happens on the main thread. A properly set up web server allows the static JavaScript file to be cached, which avoids the 1.2Mb download on each request without additional effort. PDFWorker helps with caching instances of the web worker on a single page (e.g. when multiple getDocument calls are performed).
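As an illustration of what "properly set up" can mean, here is a minimal sketch of a Node static server that sends a long-lived Cache-Control header for the worker script. The one-year max-age, the `cacheHeadersFor` and `createStaticServer` names, and the use of Node at all are assumptions; the thread does not say which web server is in use, and the same header can be set in any server's configuration.

```javascript
// Sketch: a tiny Node static server that marks pdf.worker.js as cacheable.
var http = require('http');
var fs = require('fs');
var path = require('path');

// Hypothetical policy: long-lived caching for the worker script,
// revalidation for everything else.
function cacheHeadersFor(urlPath) {
  if (/pdf\.worker\.js$/.test(urlPath)) {
    return {
      'Content-Type': 'application/javascript',
      'Cache-Control': 'public, max-age=31536000'
    };
  }
  return { 'Cache-Control': 'no-cache' };
}

function createStaticServer(rootDir) {
  var root = path.resolve(rootDir);
  return http.createServer(function (req, res) {
    var urlPath = req.url.split('?')[0];
    var file = path.join(root, urlPath);
    if (file.indexOf(root + path.sep) !== 0) {   // refuse paths outside rootDir
      res.writeHead(403);
      return res.end();
    }
    fs.readFile(file, function (err, body) {
      if (err) {
        res.writeHead(404);
        return res.end();
      }
      res.writeHead(200, cacheHeadersFor(urlPath));
      res.end(body);
    });
  });
}
```

Something like `createStaticServer('./public').listen(8080)` would start it; the directory and port are placeholders.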

Not sure if this issue is complete without the code for your solution. By default, the web-worker code is cached when PDF.js is used on a standard web server with a standard browser (in default configurations).

Well, I don't know what the difference is between the code I'm using and the code you have, but there is a difference somewhere. Firstly, with the worker disabled, the pdf_worker.js is loaded once on the first hit "only". With workers enabled, it loads the code on every new document and nothing I do seems to have any effect on caching. (i.e. it's not cached) I rather suspect that because Chrome devs were having problems with web workers and cached code, they've switched off caching. As far as I can see all my headers are as they should be for caching, but caching isn't happening. (whereas other stuff _is_ cached)

I have three relevant bits;
a. Main code block with a script tag
b. Onload that sets the global variables
c. A stand alone class that displays a PDF in a DIV

main code block;

<script type="text/javascript" src="js/compatibility.js"></script>
<script type="text/javascript" src="js/pdf.js"></script>

onload code;

    PDFJS.disableWorker = true;
    PDFJS.verbosity = PDFJS.VERBOSITY_LEVELS.debug;
    PDFJS.workerSrc = "/js/pdf.worker.js";

class definition;

JS.require('JS.Class',function(Class) {

    CLASS.PDFViewer = new Class({

        locked  : false,
        page        : 0,
        pages   : 0,
        pdf     : null,
        doc_id  : null,

        initialize: function(prefix) {
            this.canvas   = prefix+'-canvas';           // Canvas element ID we'll be rendering to
            this.prefix   = prefix;
            this.id_page  = '#'+this.canvas+'-page';    // Ident of page number
            this.id_pages = '#'+this.canvas+'-pages';   // Ident of page count
            this.setfocus(null);                        // Element to focus after render
        },
        reset:      function() { this.now_showing = null; console.log("PDF Reset")},
        set:        function(doc_id) { this.doc_id = doc_id; console.log("Docid:",doc_id) },
        load:       function() { this.fetch(this.doc_id); },
        set_doc:    function() {},
        setfocus: function(field_id) { this.focuson = field_id; },

        decode: function(base64) {
            var raw = atob(base64);
            var uint8Array = new Uint8Array(raw.length);
            for (var i = 0; i < raw.length; i++) {
                uint8Array[i] = raw.charCodeAt(i);
            }
            return uint8Array;
        },

        full_screen: function() {
            if( $('#'+this.prefix+'-hide-me').is(':visible') ) {
                $('#'+this.prefix+'-hide-me').hide();
                $('#'+this.prefix+'-full-screen').removeClass("col-sm-7");
                $('#'+this.prefix+'-full-screen').addClass("col-sm-12");
            } else {
                $('#'+this.prefix+'-hide-me').show();
                $('#'+this.prefix+'-full-screen').removeClass("col-sm-12");
                $('#'+this.prefix+'-full-screen').addClass("col-sm-7");
            }
            this.turn_page();
        },
        focus: function() {
            if(this.focuson) {
                console.log("SetFocus>>",this.focuson);
                setTimeout("$('"+this.focuson+"').focus()",100);
                this.focuson = null;
            }
        },
        display: function(pdf) {
            this.pdf = pdf;
            $(this.id_pages).text(this.pdf.numPages);
            this.pages = this.pdf.numPages;
            this.page = 1;
            this.turn_page();
        },
        fetch: function(rid) {
            if(this.locked) return false;
            var self = this;
            var src = '/images/default.pdf';
            function success(data) {
                if(!LIB.check_error(data)) return false;
                if(data.pdf) src = self.decode(data.pdf);
                self.locked = true;
                PDFJS.getDocument(src).then(function(pdf){ self.display(pdf); });
                return true;
            }
            ionman.call('nac.rpc.pdf_spec',{'rid': rid},success)
            return true;
        },
        turn_page: function() {
            var self = this;
            self.pdf.getPage(self.page).then(function(page) {
                var canvas = document.getElementById(self.canvas);
                var ctx = canvas.getContext('2d');
                var unscaledViewport = page.getViewport(1.0);
                canvas.width = $('#'+self.canvas).width();
                var scale = canvas.width / unscaledViewport.width;
                var viewport = page.getViewport(scale);
                canvas.height = viewport.height;
                var renderContext = { canvasContext: ctx, viewport: viewport };
                page.render(renderContext).promise.then(function(){
                    setTimeout(function(){
                        self.locked = false;
                        self.focus();
                    },250);
                });
                $(self.id_page).text(self.page);
            });
        },
        next: function() {
            if( this.page == this.pages) return;
            this.page += 1;
            this.turn_page();
        },
        prev: function() {
            if( this.page == 1) return;
            this.page -= 1;
            this.turn_page();
        }
    });
});

So I do;

var viewer = CLASS.PDFViewer('pdf');
viewer.fetch();

And I get the default document in a DIV with ID "pdf-canvas".

Let's try this:

  1. Open http://mozilla.github.io/pdf.js/web/viewer.html in Chrome
  2. Open devtools on Network page and show split console (hit 'esc' on this tab)
  3. Make sure execution is not paused on an exception (disable exception break and refresh otherwise)
  4. Make sure "Disable Cache" is off (uncheck and refresh otherwise)
  5. In the console, execute PDFJS.getDocument('http://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf')
  6. Notice second "pdf.worker.js" has "200 OK (from cache)" status
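The same check can be done from the console without reading the Network panel, using the Resource Timing API. This is a sketch under assumptions: it presumes the worker script request shows up in the page's resource timing entries, and it treats a `transferSize` of 0 with a non-zero `decodedBodySize` as "served from cache", which is the usual reading for same-origin resources. `summarizeWorkerLoads` is a hypothetical helper name.

```javascript
// Sketch: summarise how pdf.worker.js was loaded, from Resource Timing entries.
// In a browser console the entries would come from
// performance.getEntriesByType('resource').
function summarizeWorkerLoads(entries) {
  return entries
    .filter(function (e) { return /pdf\.worker\.js/.test(e.name); })
    .map(function (e) {
      return {
        url: e.name,
        // Nothing went over the network, yet a body was decoded.
        fromCache: e.transferSize === 0 && e.decodedBodySize > 0
      };
    });
}
```

In the browser, `summarizeWorkerLoads(performance.getEntriesByType('resource'))` would then list one entry per load of the worker script.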

Code snippets are not helpful. Please deploy a small example somewhere (e.g. on GitHub Pages).

Yep, I see it .. this appears to be a more recent version of PDF.js than I'm using .. assuming the difference is not the issue, I will compare the headers Varnish is putting out against my web server to see if I can spot why it's not caching.

I rather suspect that because Chrome devs were having problems with web workers and cached code, they've switched off caching.

I don't see a connection between this and the disableWorker option. The pdf.worker.js is requested regardless of whether it is true or false. So the problem must be unrelated to caching.

For simplicity, I assume it's connected with how the messaging works in disableWorker mode (which is not really tested, and was created only to support legacy browsers and ease debugging). To narrow down the issue, it will help to have a minimal test case where the problem is visible (preferably accessible online).

Ok, so this is interesting .. testing against localhost:8443 on a dummy cert (where the hostname != localhost) , it doesn't cache. When I test against the live server, port 443 with a valid commercial certificate, it does cache (!) ... not quite sure what to make of that .. will do some more testing when I get a bit of time, but for now I will enable the web workers and see what happens. (but I think there's still a problem in there somewhere ...)

I'm not entirely sure I believe me .. so I've added some screenshots ...

live

dev

Disable cache is definitely _not_ ticked ...
(web server config is identical)

Is there anything else to be done here? From what I understand this does not sound like a bug in PDF.js, but rather in the custom implementation.

It would be nice to have a test case with which we can reproduce this (intermittent?) failure.

It's a bug, I will produce an online demonstration, but it will take some coding and a little time ...

Hi, I'm having the same issue.

I've written nothing custom here, just downloaded the repo from GitHub.

@subhadip-codeclouds I don't think you have the same issue. Please open separate issue and provide requested details.

@subhadip-codeclouds Where can I find this pdf? I'm having a similar issue and would like to use it as a test case.

I believe I am having the same font render problem on Ubuntu with Chrome (haven't tested other platforms). I am using the latest pdf.js from master, and sometimes a PDF will look like @oddjobz's PDF and sometimes it will look like @subhadip-codeclouds's PDF. This seems to happen randomly on random PDFs.

I don't really know what is wrong or how to reproduce it reliably. However this is the scenario. I am using React to build a dynamic, single page website. Users will often click on a tab and that will create an iframe and display pdf.js within the iframe. Given the way React and my website works, an iframe is created and destroyed over and over. It may take awhile but eventually I will always get font render corruption. And once it happens for one PDF, it will start happening to other PDFs randomly. Some are always fine, and some are not.

Is there anything (e.g. debug flags) I can turn on or do to help figure out what is going on? I don't see any errors or warnings in the console.

Here is a PDF that almost always ends up with font render corruption when it starts.
https://datalanche-sec-public.s3.amazonaws.com/filings/0001047469-15-008315/a2226328zex-31_2.htm.pdf

One more thing. If I open a new tab in Chrome with the same URL then the PDFs are fixed. However if I stay in the same tab, navigate to a completely different website, and then navigate to my website (not using back button) the PDFs with corrupt fonts are still corrupt. It almost seems like whatever is happening is corrupting the tab's memory and/or cache.

It is possibly a caching problem in Chrome (see https://github.com/mozilla/pdf.js/issues/7751#issuecomment-256683285 for more details).

Any update on this? having the same issue

Although I still see this (inexplicably) on occasion, it's very rare and, indeed, sufficiently rare that I've really stopped worrying about it. The problem I seemed to be having was overlapping operations. It "seems" to be possible to operate on a pdf document (page next for example) while another operation is still in progress, and it's this that seems to cause the problem. My solution was to wrap all operations in a class, then insert a master lock on the entry and exit points so no pdf related operations could conflict - this "seems" to have fixed things for me. I'm vaguely assuming that pdf stuff runs in a separate thread or worker process, hence the possibility of a conflict. It was a little while ago, but from memory I think the threading thing is an option and I discovered the solution by disabling it, which had a negative impact on performance, but did solve the issue.

It happens to me all the time, but it is random enough that I cannot create a test case. However it is quite possible that it is memory corruption from threading in my case too, but I thought Javascript was single threaded?

I thought Javascript was single threaded

It is. I think what @oddjobz means (?) is that there might be a bug in Chrome: when you paint on more than one HTML5 canvas at a time, the defect has a high chance of occurring. But without a reproducible test case it's hard to speculate, or to create a meaningful report for the Chromium bug tracker.

I think (from memory) it employs the option to use a new browser feature called "web workers", which effectively allows you to create javascript threads .. if you turn this feature off, then try to view a "large" PDF, you see "why" this feature is in use ... :)

it employs the option to use a new browser feature called "web workers"...
if you turn this feature off, then try to view a "large" PDF, you see "why" this feature is in use

Please notice that the OP instructs to turn this feature off, which means Web Workers are not used. That moves the blame to the browser, and has nothing to do with JavaScript "threading".

It's a bit subtle, Javascript is single threaded, but Chrome is multi-threaded, and web workers allows you to run two Chrome processes and facilitates communication between them. I think only the master gets DOM access, but you can use sub-threads for processor intensive stuff without blocking the UI thread. It gets more fun when you find you can create web workers that aren't attached to a specific thread or tab, so they effectively survive a page reload (i.e. they're persistent). I see a lot of problems coming down the line stemming from this ...

Sure - but my comment is that without threading (i.e. by implementing my own thread level locking), 99% of this problem goes away. (for me).

@oddjobz , @rpedela try disabling hardware/gpu acceleration and see if problem still occurs.

@yurydelendik, yup, that was obvious, one of the first things I tried. (no difference)

@yurydelendik, my application has been live 6+ months, I'm happy that any remaining issue is "different" and most likely user-error or the occasional weird document. THE problem I was having, which although not consistent was 100% reproducible, that's gone. It was (IMHO) caused by overlap between document operations, i.e. a process starting before the previous one had finished - threading or not, putting in a manual lock to prevent operations starting before the previous one had finished fixed it. The easily reproduced example was quickly scanning forward through a document and having the "next" start processing before the previous page had completely finished rendering.
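The "no overlapping operations" idea described above can also be expressed as a promise queue instead of a boolean lock, so that a rapid series of "next" clicks is processed one at a time rather than dropped. This is a minimal sketch; `createSerialQueue` is a hypothetical helper, not part of pdf.js, and it only serialises tasks that return a promise covering their whole duration.

```javascript
// Sketch: run PDF operations strictly one after another.
function createSerialQueue() {
  var tail = Promise.resolve();
  return function enqueue(task) {
    // Start this task only after everything queued before it has settled.
    var run = tail.then(function () { return task(); });
    // Keep the chain alive even if a task fails.
    tail = run.then(function () {}, function () {});
    return run;
  };
}
```

A caller would wrap each page turn, for example `enqueue(function () { return renderPage(n); })`, where `renderPage` is a placeholder for a function that returns the promise of the render task.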

Javascript is single threaded, but Chrome is multi-threaded, and web workers allows you to run two Chrome processes and facilitates communication between them. I think only the master gets DOM access, but you can use sub-threads for processor intensive stuff without blocking the UI thread.

This statement with the "web worker" term is confusing. Can you provide references to verify the statements above? Web Workers have no access to the DOM by design, and PDF.js performs painting on the main thread. Do you mean Chrome's rendering process? Still, the only way to update the DOM is from the main thread and not from web workers.

a process starting before the previous one had finished - threading or not, putting in a manual lock to prevent operations starting before the previous one had finished fixed it.

What exactly do you mean by "operations" in this context? Is it the lifetime of the API's render() call?

@oddjobz I just read the thread again and there are lots of conflicting statements at different periods of time. The configuration section is conflicting as well, e.g. I cannot reproduce this locally on any browser on Mac OSX. I'm still not sure if you can reproduce it with the standard viewer (not your custom one). I'll close this bug as invalid/incomplete so as not to confuse other thread participants.

@oddjobz, @rpedela, @badams, @pholisma, @subhadip-codeclouds can you provide a separate bug report with the exact configuration(s) in which you are experiencing the issue and the exact steps with which you can reproduce it (including the PDF)? If it's a custom solution, provide a public link to it.

Ok, this is the code in question - you can see the fix in place.

Specifically for locking, I have a routine like so;


this.locked = true;
PDFJS.getDocument(path+doc_id).then(function(pdf) {
    $('#pdf-canvas-pages').text(pdf.numPages);
    self.pages = pdf.numPages;
    self.page = 1;
    self.pdf = pdf;
    pdf.getPage(1).then(function(page) { self.turnpage(); });
})

turnpage: function() {
    var self = this;
    self.pdf.getPage(self.page).then(function(page) {
        var canvas = document.getElementById('pdf-canvas');
        var ctx = canvas.getContext('2d');
        var unscaledViewport = page.getViewport(1);
        canvas.width = $('#pdf-canvas').width();
        var scale = canvas.width / unscaledViewport.width;
        var viewport = page.getViewport(scale);
        canvas.height = viewport.height;
        var renderContext = { canvasContext: ctx, viewport: viewport };
        page.render(renderContext);
        $('#pdf-canvas-page').text(self.page);
        self.locked = false;
    });
},

Yes, this is why I didn't push the point at the time, there seems to be great reluctance to even accepting that there is a problem - let alone that it needs to be fixed.

The problem with snippet above was addressed by https://github.com/mozilla/pdf.js/pull/6571

Well, what can I say, I was using the latest version mid 2016 and still had the problem. I do seem to recall being told at the time (repeatedly) that it was fixed. (and like a number of others, I could see it demonstrably wasn't)

Anyway, for anyone out there seeing the same issue, try sticking a lock in as above and see if it makes any difference .. it's only two lines ...
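Note that in the snippet as posted, `self.locked = false` runs as soon as `page.render()` has been called, not when rendering has finished. A variant that holds the lock until the render task settles might look like the sketch below. It relies on the `.promise` property of the object returned by `render()`, which the class listing earlier in the thread already uses; `renderLocked` and the `state` object are hypothetical names.

```javascript
// Sketch: hold `state.locked` until the render task has settled.
// `page` is expected to behave like a pdf.js page, whose render()
// returns an object with a `promise` property.
function renderLocked(state, page, renderContext) {
  if (state.locked) {
    return Promise.resolve(false);   // a render is already in flight, skip
  }
  state.locked = true;
  var task;
  try {
    task = page.render(renderContext);
  } catch (err) {
    state.locked = false;            // do not leave the lock stuck
    return Promise.reject(err);
  }
  return task.promise.then(
    function () { state.locked = false; return true; },
    function (err) { state.locked = false; throw err; }
  );
}
```

The returned promise resolves to `true` when the page was rendered and `false` when the call was skipped because another render was still running.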

@yurydelendik This is happening pretty consistently for our users (mainly using Windows 7), but I've been able to reproduce it on OSX with the latest version, but not 100% consistently.

We are not using any custom code, simply doing the following

<iframe
    style="height: 650px; width: 600px"
    src="/path/to/pdfjs/web/viewer.html?file=/path/to/file.pdf"
/>

It seems to depend on a number of factors, such as whether there is any other bold text in the document, and the initial zoom level (zooming in and out will sometimes fix it) I've also noticed that this only affects the preview, and printing, when downloading the pdf it appears to render perfectly (I guess that's because pdf.js is just passing the provided file along to the user).

We have decided to move away from using this solution and to download the file to the user's machine directly, but I will try to make some time to come up with a reproducible test-case, though I already spent my entire day yesterday chasing this bug down..

@badams, I can confirm zoom was also a fix for me, as was a page next/prev. I was also under the impression that bolding made the problem more likely.

I will try make some time to come up with a reproducible test-case though

@badams thanks, anything that takes out the variation when contributors try to reproduce the issue on their own computers will work, and complete examples published online work best (you can publish a complete example on a GitHub repo's gh-pages branch).

Hi guys,

I did not understand this whole story right.
Is there a fix already? Or some kind of implementation that I should do?

Regards,
Tarcisio Pereira

I did not understand this whole story right.

I don't think anybody does.

This thread is closed since it does not provide exact steps to reproduce the issue (and due to some misleading recommendations in the comments). We don't expect the problem to be reproducible 100% of the time, but making it appear at least once in 10 attempts would be great.

Possible items that can cause PDF.js to perform this way or have a bug in its code:

  • An HTTP server or browser does not properly handle HTTP range requests
  • A browser is not properly handling font loading or canvas operations
  • A custom solution conflicts with the operations above
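For the first item, a quick way to sanity-check a server is to request a byte range and inspect the response. This is a sketch of the check only, assuming the requested range lies inside the file and that header names are lower-cased as in Node's `res.headers`; `checkRangeResponse` is a hypothetical helper and says nothing about how pdf.js itself validates responses.

```javascript
// Sketch: does a response to "Range: bytes=<start>-<end>" look like a
// valid partial response?
function checkRangeResponse(status, headers, start, end) {
  if (status !== 206) {
    return { ok: false, reason: 'expected 206 Partial Content, got ' + status };
  }
  var match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(headers['content-range'] || '');
  if (!match) {
    return { ok: false, reason: 'missing or malformed Content-Range' };
  }
  if (Number(match[1]) !== start || Number(match[2]) !== end) {
    return { ok: false, reason: 'server returned a different range' };
  }
  var length = headers['content-length'];
  if (length !== undefined && Number(length) !== end - start + 1) {
    return { ok: false, reason: 'Content-Length does not match the range' };
  }
  return { ok: true, reason: 'range honoured' };
}
```

The inputs can be collected with any HTTP client by sending a `Range: bytes=0-1023` header for the PDF URL and passing the status code and response headers to the function.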

FWIW I've been seeing this corruption in rare situations with our deployment of pdf.js (v1.7.376). Our range request handling seems correct. Will report back if I can find reliable repro steps...

We only had this issue on Chrome, after changing the zoom it disappears. So we set showPreviousViewOnLoad to false and never had this problem again.

@TZanke could you clarify why removing showPreviousViewOnLoad would change the zoom? Thanks!

@tonyjin pdf.js autozoom calculates a zoom value and saves it to the local storage. After reloading the page the autozoom is not used, instead the previous calculated zoom value is used. And it looks like there is a problem loading this value again.

So when showPreviousViewOnLoad is disabled, the autozoom feature kicks in every time, zooms the page correctly, and no render problems occur.

@TZanke -- I tried your approach, but unfortunately, the issue still pops up sometimes.. :(
