Pdf.js: "Invalid or corrupted PDF files" is displayed

Created on 12 Mar 2019 · 11 Comments · Source: mozilla/pdf.js

Attach (recommended) or Link to PDF file here:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.7135&rep=rep1&type=pdf

Configuration:

Web browser and its version: Chrome 72.0.3621.121 (Official Build) (64-bit)
Operating system and its version: MacOS
PDF.js version: PDF.js v2.0.673 (build: 31012570)
Is a browser extension: Yes

Steps to reproduce the problem:

1. Click the above link

What is the expected behavior? (add screenshot)
This is what I see when I paste the chrome-extension-prefixed URL or reload the error page.
(chrome-extension://oemmndcbldboiebfnladdacbdfmadadm/https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.7135&rep=rep1&type=pdf)
[Screenshot: Screen Shot 2019-03-12 at 17 20 26]

What went wrong? (add screenshot)
This is what I see when I click the above link.
[Screenshot: Screen Shot 2019-03-12 at 17 19 38]

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):
https://chrome.google.com/webstore/detail/pdf-viewer/oemmndcbldboiebfnladdacbdfmadadm

1-other 4-chrome-specific

simonhong

Most helpful comment

I found the cause of this issue.
The server receives two requests within a very short interval.
For the second request, the server redirects to downloadsexceeded.html instead of returning the PDF content,
so pdf.js reports an invalid or corrupted PDF file.
I think this may be the server's DDoS protection.
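To illustrate why an HTML page served in place of the PDF cannot be read as a PDF, here is a minimal sketch that checks for the "%PDF-" header marker at the start of the data. It is not pdf.js's actual parsing code. The function name looksLikePdf, the 1024-byte search window, and the sample HTML text are assumptions made for this example.

```javascript
// Sketch only: not pdf.js's actual parsing code.
// Returns true if the "%PDF-" header marker appears near the start of the data.
function looksLikePdf(bytes, searchLimit = 1024) {
  const marker = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  const end = Math.min(bytes.length, searchLimit) - marker.length;
  for (let i = 0; i <= end; i++) {
    if (marker.every((b, j) => bytes[i + j] === b)) {
      return true;
    }
  }
  return false;
}

// Hypothetical sample data: a PDF header and an HTML error page.
const encoder = new TextEncoder();
const pdfBytes = encoder.encode('%PDF-1.4\n...');
const htmlBytes = encoder.encode('<!DOCTYPE html><html><body>Download limit exceeded</body></html>');
console.log(looksLikePdf(pdfBytes));  // true
console.log(looksLikePdf(htmlBytes)); // false
```

A check like this could tell "the server sent something that is not a PDF" apart from "the PDF itself is damaged", which would make the error message less misleading in the redirect case.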

Why are two requests issued when the user clicks that link?
The first one is issued by the browser in response to the click.
The pdf.js extension then intercepts the response headers and redirects to the extension URL.
After that, pdf.js issues a second request for the same URL.

I think we can improve this.
How about reusing the content received from the first request instead of requesting it again?
This is just an idea.
(Sorry if this idea doesn't make sense. I don't fully understand the pdf.js extension implementation yet.)
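The general idea of "request once, reuse the result" can be sketched as follows. This is only an illustration in plain JavaScript, not an extension implementation: createFetchOnce and fakeDownload are hypothetical names, the example.com URL is a placeholder, and whether the extension can actually get at the body of the browser's first response is a separate question this sketch does not address.

```javascript
// Hypothetical helper, not part of pdf.js or any extension API.
// Wraps a download function so each URL is requested at most once.
function createFetchOnce(download) {
  const pending = new Map(); // URL -> Promise of the downloaded bytes
  return function fetchOnce(url) {
    if (!pending.has(url)) {
      pending.set(url, download(url));
    }
    return pending.get(url);
  };
}

// Usage with a stand-in downloader that counts how often it is called.
let requestCount = 0;
const fakeDownload = (url) => {
  requestCount += 1;
  return Promise.resolve(new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]));
};
const fetchOnce = createFetchOnce(fakeDownload);
const first = fetchOnce('https://example.com/paper');
const second = fetchOnce('https://example.com/paper');
console.log(requestCount);     // 1
console.log(first === second); // true
```

Both callers share the same promise, so a server that rate-limits repeated downloads would only see a single request.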

WDYT? @timvandermeij @Rob--W

simonhong on 15 Mar 2019

All 11 comments

Possibly a duplicate of #10562.

Snuffleupagus on 12 Mar 2019

@Snuffleupagus I think this is a different issue from #10562.
This issue is about some PDFs not opening properly, whereas #10562 is about PDFs being opened by Chrome's internal PDF viewer (PDFium?) instead of the pdf.js extension.

simonhong on 12 Mar 2019

@simonhong
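Would this work as a temporary fix for this problem?

// Temporary workaround: point links ending in ".pdf" directly at the PDF.js extension viewer.
document.querySelectorAll('a[href]').forEach(function (a) {
    // Escape the dot so only a literal ".pdf" suffix matches (case-insensitive).
    if (/\.pdf$/i.test(a.href)) {
        a.setAttribute('href', 'chrome-extension://oemmndcbldboiebfnladdacbdfmadadm/' + a.href);
    }
});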

shge on 17 Mar 2019

@shge Good try. I think it would work for links that end with the .pdf suffix.
However, it is easy to find PDF links that don't have that suffix.
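The link from this very issue is one such case. A small sketch of a suffix check, assuming a hypothetical helper named hasPdfSuffix and placeholder example.com URLs, shows that it matches ordinary .pdf links but not the CiteSeerX download URL:

```javascript
// Sketch: test the URL path (not the whole href) for a ".pdf" suffix.
function hasPdfSuffix(href) {
  const { pathname } = new URL(href);
  return /\.pdf$/i.test(pathname);
}

console.log(hasPdfSuffix('https://example.com/papers/report.pdf')); // true
console.log(hasPdfSuffix('https://example.com/papers/report.pdf?download=1')); // true
console.log(hasPdfSuffix('http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.7135&rep=rep1&type=pdf')); // false
```

Because the path of the CiteSeerX URL is /viewdoc/download, no suffix-based rule can recognize it as a PDF link.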

simonhong on 19 Mar 2019

Did you try reloading the web page after this error is displayed? That should resolve the issue. The exception is thrown when the PDF data is loaded for the first time.

ghost on 25 Mar 2019

@864534182 Did you try reloading the web page after this error is displayed? That should resolve the issue. The exception is thrown when the PDF data is loaded for the first time.

Yes, but it does not work on some pages that require referer information.

I came across a website that blocks the plugin's second request because it is "a direct request".
It only lets me download the file when I reach it by clicking a link on a specific web page (the referer has to be that page).
In any case, the file should be requested only once, with the referer information.
https://github.com/brave/brave-browser/issues/3474#issuecomment-473666538

shge on 26 Mar 2019

The referrer thing is a regression caused by a change in Chrome - see https://github.com/mozilla/pdf.js/issues/10645

Rob--W on 26 Mar 2019

I will post this on the brave-browser repository too:
Browser: Brave-browser.
In this link:
https://projecteuclid.org/euclid.rmjm/1181072068
there is a button linking to a PDF file. When I click the button, it shows the already mentioned "Invalid or corrupted PDF file" message.

The "reloading" workaround does not work.
Even after reloading the page with Ctrl+R, the download button in the upper right corner only lets me download an HTML file, not a PDF file.
When I tried to load this supposedly "direct" link to the PDF file:
https://projecteuclid.org/download/pdf_1/euclid.rmjm/1181072068
it redirected me to the original link given earlier in this post. In other words, there is no real direct link to the PDF file.
Exactly the same problem occurs with the following link:
https://projecteuclid.org/euclid.rmjm/1181069828