Pdf.js: Browser caching does not work for the first range request on Chrome and Safari

Created on 24 Feb 2020  路  11Comments  路  Source: mozilla/pdf.js

Dear community,

When one activates range requests with disableAutoFetch: true and disableStream: true and http server configured to cache contents with public, max-age=31536000, immutable, browser caching does work on Chrome, nor on Safari.

But the first range request return a code 200 and is not cached by the browser, until all the ranges has been download.

Attach (recommended) or Link to PDF file here: not file specific, but one can use https://public.fays.io/public/580af700-5f79-44c4-8a0c-f7e976ee528f.pdf

The server configuration is:

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, HEAD, PUT, POST
Access-Control-Expose-Headers: Accept-Ranges, Content-Range, Content-Length, x-amz-request-id, x-amz-id-2, ETag
Access-Control-Max-Age: 31536000
Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
Cache-Control: public, max-age=31536000, immutable

Configuration:

Web browser and its version: Chrome Version 80.0.3987.116. Firefox does not yet cache range requests, so it is out of scope.
Operating system and its version: mac os x 10.15.3 
PDF.js version: 2.2.228
Is a browser extension: no

Steps to reproduce the problem:

  1. Clear the browser cache
  2. Go to https://app.fays.io/bf205bf6-43d2-4140-8140-ab4b6f48c2a5 and open the developper tool. This viewer should work like the integrated viewer of PDF.JS
  3. Refresh the tab

What is the expected behavior?
The expected behavior is that the browser caches all the HTTP requests, including the first one with response code 200.

What went wrong?
After step 3., one can see in the developer tool that the range requests have been cached but not the first request.
Capture d鈥檈虂cran 2020-02-24 a虁 13 21 15

The first request has specific header for CORS:

:authority: public.fays.io
:method: GET
:path: /public/2ce2766d-b491-43a0-a0a9-96c56c2bd667.pdf
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6
if-modified-since: Sun, 23 Feb 2020 23:43:01 GMT
if-none-match: "dd09a498518c55f91fe48b29bec2ffde"
origin: https://app.fays.io
range: bytes=0-196607
referer: https://app.fays.io/bf205bf6-43d2-4140-8140-ab4b6f48c2a5
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36

The request is not cached by the browser since PDF.JS cancels the first request.
https://github.com/mozilla/pdf.js/blob/93aa613db7a874e6ed964f394241605d5586c142/src/display/fetch_stream.js#L155-L157

*Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension): *
https://app.fays.io/bf205bf6-43d2-4140-8140-ab4b6f48c2a5

Best, A.

All 11 comments

The request is not cached by the browser since PDF.JS cancels the first request ?

Well, if you set disableStream = true then you're explicitly telling the PDF.js code to abort the initial (200) request and instead fetch the PDF document using range (206) requests. Quite frankly, if the initial request wasn't aborted there'd be no point in actually having the disableStream option.

Still, when the browser has downloaded all the ranges (step 4.), then the request is finally cached On Chrome (but not on Safari):

It seems that, again, your problem is related to the differences in how various browsers have chosen to implement caching of network requests.


All-in-all, I'm afraid that it does not seem to be much (if anything) actionable here from a PDF.js perspective.

Well, if you set disableStream = true then you're explicitly telling the PDF.js code to abort the initial (200) request and instead fetch the PDF document using range (206) requests. Quite frankly, if the initial request wasn't aborted there'd be no point in actually having the disableStream option.

Yes this is the expected behavior of disableStream in this scenario. I am wondering if this is possible to replace the first request by another type of request.

It seems that, again, your problem is related to the differences in how various browsers have chosen to implement caching of network requests.

Yes

I am wondering if this is possible to replace the first request by another type of request.

First of all, you need to dispatch a regular (200) request to check if range requests are actually supported and correctly implemented by the server, see https://github.com/mozilla/pdf.js/blob/dd893d59d99314fc9a377905432139db7b6f1873/src/display/fetch_stream.js#L137-L147, or https://github.com/mozilla/pdf.js/blob/dd893d59d99314fc9a377905432139db7b6f1873/src/display/network.js#L320-L332, and also https://github.com/mozilla/pdf.js/blob/dd893d59d99314fc9a377905432139db7b6f1873/src/display/network_utils.js#L23-L62

Second of all, in the default case where streaming is used the initial (200) request is simply allowed to continue. Not only is that the most efficient solution, but it also avoids errors with servers that don't allow a PDF document to be fetched more than once.

I found that the following is not mandatory on my computer. I skip this part with an if:
https://github.com/mozilla/pdf.js/blob/dd893d59d99314fc9a377905432139db7b6f1873/src/display/fetch_stream.js#L119-L159

And an else condition this._headersCapability.resolve();
It works if the parameter "length" (e.g. { url: URL, length: 1000聽}) is set to the contentLength of the file.

I don't really understand what https://github.com/mozilla/pdf.js/issues/11624#issuecomment-590435245 even suggests removing/skipping, but it's obviously not appropriate to simply remove a fetch call (assuming that's what was actually suggested).

Furthermore, as was clearly mentioned in https://github.com/mozilla/pdf.js/issues/11624#issuecomment-590318199 and is hopefully evident if you carefully read through the validateRangeRequestCapabilities function, there's a fair amount of validation of the server response that needs to happen before range requests can actually be used.

I found that the following is not mandatory on my computer.

It's hopefully easy to see why a single example doesn't really mean anything in general, and why you cannot assume that random servers on the internet are always correctly configured.

And an else condition this._headersCapability.resolve();

It's not really clear what you're suggesting here, but it's obviously not a good idea to purposely add options that would allow users to essentially "shot themselves in the foot" by disabling validation which is necessary for the correct function of the PDF.js library.


All-in-all, I'm really not seeing anything that can be fixed here (from a PDF.js perspective) and would thus suggest that the issue should be closed.

@Snuffleupagus I totally agree with you. I am looking for a specific solution for a specific server setting to cache this preflight 200 and be able to load it offline!

I am looking for a specific solution for a specific server setting to cache this preflight 200 and be able to load it offline!

Given that caching of range requests are not well supported across browsers, have you instead considered not depending so heavily on that and simply load the PDF document with default values for the relevant disable... options.
With modern browsers (i.e. those supporting the Fetch API), that would thus imply that most (if not all) of the PDF document is fetched with the initial (200) request, which browsers should have no trouble caching. Note also that, generally speaking, streaming is usually faster than range requests too.

@Snuffleupagus In fact, I have successfully tested offline with https://github.com/mozilla/pdf.js/issues/11624#issuecomment-590508658 for the following document and range request activated: https://public.fays.io/public/2ce2766d-b491-43a0-a0a9-96c56c2bd667.pdf 馃挴

I try with the PDF.JS integrated viewer. When one turns off Wifi and refreshes the tab 10 times it still loads the document and one may consult the previously requested pages. 馃

Still, I did not yet manage to make it works when one shutdowns the browser. 馃憥 One may try to use a production environment and see what happens with the ServiceWorker.

Of course, as you mentionned, this is not a general solution at all. Just a try. This feature can not be defaulted for every PDF.JS users.

But I find out that the first request (code 200) is not mandatory because the preflight CORS check is also performed by the browser if the request has a "range" header (for all browser, this is in the specs, who knows ?).

As you said:

if you carefully read through the validateRangeRequestCapabilities function, there's a fair amount of validation of the server response that needs to happen before range requests can actually be used.

The first request is mandatory when one does not know if the server accept ranges and to get the size of the document, this is specified on validateRangeRequestCapabilities

Still, the first request is an option if the user can ensure by itself that validateRangeRequestCapabilities is always true.
For example, one can access the size using an alternative strategy (not related to PDF.JS) and if one wants to force range requests and set disableStream and disableAutoFetch.

But all of this requires more investigation 馃敘

馃挴 Therefore, it will be nice to have a way to override this part easily outside PDF.JS but I don't know how to do it properly. Can one extends PDFFetchStreamReader ? 馃挴

have you instead considered not depending so heavily on that and simply load the PDF document with default values for the relevant disable... options.

I confirm that offline already works when disableStream and disableAutoFetch are not set. 馃憤

Given that caching of range requests are not well supported across browsers

In fact, it works well for range requests on Safari, Chrome, Opera (including mobile?), I have not tested it on Edge or any alternatives and they plan to implement the feature on Firefox.

Still, the first request is an option if the user can ensure by itself that validateRangeRequestCapabilities is always true.

You want to trust the server, rather than the user here[1], to avoid a support nightmare later on.
As was clearly mentioned in https://github.com/mozilla/pdf.js/issues/11624#issuecomment-590508658 we definitely do not want any foot guns in the options here, hence that won't happen!

Besides, it's really not at all correct to make the changes you're suggesting since it'd make PDFFetchStreamReader unable to read any data (which is bound to break lots of other things).

Therefore, it will be nice to have a way to override this part easily outside PDF.JS but I don't know how to do it properly. Can one extends PDFFetchStreamReader?

No, that's considered internal functionality and it's consequently not exposed in the public API. (You'd need to fork the code, but as I've tried to explain multiple times the changes you're suggesting aren't really a good idea.)

The "correct" way of handling special cases, with regards to fetching of data, would be to utilize PDFDataRangeTransport instead. That allows completely custom data delivery, that you thus can implement in what ever way you want/need in your case (it's being used in the PDF Viewer that's built-in to the Firefox browser); see https://github.com/mozilla/pdf.js/issues/11453#issuecomment-569347187 for some relevant links.

This issue really ought to remain closed. /cc @timvandermeij


[1] Previous experiences have shown that users may, more often than you'd like, set certain options without always fully understanding the ramifications of doing so.

But all of this requires more investigation 馃敘

The "correct" way of handling special cases, with regards to fetching of data, would be to utilize PDFDataRangeTransport instead. That allows completely custom data delivery, that you thus can implement in what ever way you want/need in your case (it's being used in the PDF Viewer that's built-in to the Firefox browser); see #11453 (comment) for some relevant links.

Thank you for this @Snuffleupagus
Can one extends PDFDataRangeTransport in Typescript ?

Yes, it's part of the public API; see also https://github.com/mozilla/pdf.js/blob/12400489c2f2b3f93fe394806a5f276c1d588161/src/display/api.js#L177

Closing as answered since there is not really anything we can/should do here.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

zerr0s picture zerr0s  路  3Comments

PeterNerlich picture PeterNerlich  路  3Comments

azetutu picture azetutu  路  4Comments

patelsumit5192 picture patelsumit5192  路  3Comments

BrennanDuffey picture BrennanDuffey  路  3Comments