Html: Nosniffing for Worker Scripts

Created on 24 Nov 2017  ·  30 Comments  ·  Source: whatwg/html

8.1.3.2 Fetching scripts says:

  • Under "To fetch a single module script", step 8: "If any of the following conditions are met [...] The result of extracting a MIME type from response's header list (ignoring parameters) is not a JavaScript MIME type."
  • There are no equivalent rules for classic or worker scripts.
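
For readers unfamiliar with the check being discussed, here is a minimal sketch in Python of what "is not a JavaScript MIME type (ignoring parameters)" amounts to. The function names are hypothetical, and the header parsing is deliberately simplified compared to the real "extract a MIME type" algorithm; the essence list follows the JavaScript MIME type definition in the WHATWG MIME Sniffing Standard.

```python
# JavaScript MIME type essences, per the WHATWG MIME Sniffing Standard.
JAVASCRIPT_MIME_ESSENCES = frozenset({
    "application/ecmascript",
    "application/javascript",
    "application/x-ecmascript",
    "application/x-javascript",
    "text/ecmascript",
    "text/javascript",
    "text/javascript1.0",
    "text/javascript1.1",
    "text/javascript1.2",
    "text/javascript1.3",
    "text/javascript1.4",
    "text/javascript1.5",
    "text/jscript",
    "text/livescript",
    "text/x-ecmascript",
    "text/x-javascript",
})


def mime_essence(content_type: str) -> str:
    """Return the lowercased type/subtype with parameters dropped.

    Simplified assumption: a single Content-Type value is passed in, and
    no token validation is performed.
    """
    return content_type.split(";", 1)[0].strip().lower()


def is_javascript_mime_type(content_type: str) -> bool:
    """True if the Content-Type value names a JavaScript MIME type."""
    return mime_essence(content_type) in JAVASCRIPT_MIME_ESSENCES
```

Under this sketch, `text/javascript; charset=utf-8` passes, while `text/html`, `application/json`, and an empty value do not.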

Chrome would like to be more strict about the non-module scripts, too. On Chrome's beta channel, we see:

  • ca. 0.01% of page loads contain worker scripts (workers or scripts loaded from workers) that would fail this check if it were applied.
  • ca. 6% of classic, non-worker page loads contain scripts that would fail this check if applied

    • of these, the vast majority ( ~3/4 ) are text/html

    • ~1/4 text/plain

    • ~1/10 application/octet-stream

    • the rest is noise, <0.01%

These numbers would probably support blocking non-script MIME types for the "fetch a classic worker script" and "fetch a classic worker-imported script" cases, too, but not (yet) for all script types.

Would this make sense?

@mikewest

security/privacy script

All 30 comments

I do think this is a good idea. The numbers we have in Beta right now look pretty reasonable for the narrow case you've outlined (workers themselves, and scripts they import), and it would be nice to unify the requirements for workers and module scripts.

For the broader case of <script> in general, I agree with you that we need to do more work evaluating the status quo before we can make any changes. ~6% is a lot, and though my intuition is that a large chunk of that text/html can be attributed to ads, the number's significantly higher than I was hoping, and we're going to have to dig in a bit to see if there are subsets we can carve out. That said, if text/html, text/plain, and application/octet-stream comprise the majority, perhaps we could invert the checks to explicitly allow the JavaScript types and those three additional types, and block everything else?
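
The "inverted" check floated above could look roughly like the following Python sketch. The names are hypothetical, and the handling of an empty Content-Type is an assumption: the suggestion does not say what should happen in that case, so the sketch simply blocks it.

```python
# JavaScript MIME type essences, per the WHATWG MIME Sniffing Standard.
JAVASCRIPT_MIME_ESSENCES = frozenset({
    "application/ecmascript", "application/javascript",
    "application/x-ecmascript", "application/x-javascript",
    "text/ecmascript", "text/javascript",
    "text/javascript1.0", "text/javascript1.1", "text/javascript1.2",
    "text/javascript1.3", "text/javascript1.4", "text/javascript1.5",
    "text/jscript", "text/livescript",
    "text/x-ecmascript", "text/x-javascript",
})

# The three non-script types that dominate today's classic-script traffic
# according to the numbers quoted in this thread.
LEGACY_ALLOWED_ESSENCES = frozenset({
    "text/html",
    "text/plain",
    "application/octet-stream",
})


def allowed_by_inverted_check(content_type: str) -> bool:
    """Allow JavaScript types plus the three legacy types; block the rest.

    Assumption: an empty or missing Content-Type is blocked, since the
    proposal does not address it.
    """
    essence = content_type.split(";", 1)[0].strip().lower()
    return (essence in JAVASCRIPT_MIME_ESSENCES
            or essence in LEGACY_ALLOWED_ESSENCES)
```

The point of the inversion is that the allowlist stays short and explicit, so types such as `application/json` or `image/gif` are rejected without having to enumerate them.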

I'd be interested in opinions from other folks in @whatwg/security. Is this kind of restriction something other vendors would also be interested in experimenting with?

(@dveditz, @johnwilander, @patrickkettner)

Mozilla is. We've been looking into this as well; @evilpie primarily (added the text/csv restriction on the Fetch side for all scripts).

It would be interesting to find out if Mozilla's users are seeing similar ranges of scripts to Chrome's users. @evilpie, do y'all have any metrics in place for script content types?

I am always excited about restricting those MIME types further.
Sadly we don't currently have fresh metrics. Our old telemetry counter expired. We also didn't differentiate between the workers vs script etc. https://bugzilla.mozilla.org/show_bug.cgi?id=1399990 is going to add new telemetry.
I also got this information from HTTP Archive https://discuss.httparchive.org/t/can-you-get-a-list-of-non-standard-mime-types-used-for-js-scripts/1141
I think the most interesting takeaway is how common text/json seems to be.

It's interesting that the HTTP Archive data look quite different from @otherdaniel's telemetry data. Probably a matter of what happens when you weight by load frequency? How does Chrome's telemetry handle empty Content-Type headers?

> It's interesting that the HTTP Archive data look quite different from @otherdaniel's telemetry data. Probably a matter of what happens when you weight by load frequency?

I expect that frequency weighting will give radically different results than number of resources requested, yes. The data @otherdaniel pointed to is also from beta, not stable, which will certainly shift things a bit (though usually the same order of magnitude).

> How does Chrome's telemetry handle empty Content-Type headers?

Code is here: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/loader/AllowedByNosniff.cpp?rcl=189b14fddd21542f2a1c5d81286cf342bad3b428&l=58

Looks like the data doesn't include empty Content-Type, which is certainly an oversight we should correct.


I dug into this a bit earlier in the week, pulling a list of sites that triggered the metrics we have in stable from HTTP Archive, and loading ~7k of them from my workstation overnight (raw data in a Google Sheets doc) with a build that dumps a bit more detail about the requests. This also doesn't give us frequency-weighted data, but it's at least somewhere to start. Some highlights:

Types

(Random data in the "Analysis" tab of the sheet)

  • Empty Content-Type accounts for ~13% of the requests.
  • text/html accounts for ~50%.
  • application/json for ~23%.
  • text/plain for ~5%.
  • application/octet-stream for ~4%.
  • text/js for ~2%.
  • Everything else is under 1%.

Hosts

(Data in the "Hosts per Type" tab of the sheet)

  • application/json is dominated by graph.facebook.com (e.g. the JSONP endpoint at https://graph.facebook.com/?callback=jQuery2140942788649414716_1511786863676&id=http%3A%2F%2Fwww.theactivetimes.com%2F&_=1511786863677). Perhaps @hillbrad can hook us up with someone who could change that type when a callback is present? There are a few other services with similar characteristics for JSONP endpoints: secure.livechatinc.com, us-ads.openx.net, addthis.com, api.instagram.com, instagram.com, sharethis.com, and many more.
  • Pubmatic serves several JavaScript files as text/html (e.g. https://ads.pubmatic.com/AdServer/js/gshowad.js). They don't appear to be anything other than server misconfigurations.
  • Likewise s4.histats.com serves files like http://s4.histats.com/stats/0.php?2887713&@f16&@g1&@h1&@i1&@j1511787906552&@k0&@l1&@mDAD%20Soft%20Free%20Download%20Full%20Version%20Pc%20Games%20And%20Softwares&@n0&@o1000&@q0&@r0&@s0&@ten-US&@u1200&@vhttp%3A%2F%2Fwww.dadsoft.net%2F&@w as text/html for no discernable reason.
  • ib.adnxs.com serves an empty text/html file (http://ib.adnxs.com/async_usersync?cbfn=AN_async_load).
  • sharethis.com on the other hand serves a hybrid HTML/JavaScript file as text/html (e.g. http://t.sharethis.com/1/d/t.dhj?rnd=1511786912267&cid=c010&dmn=www.stio.com). It would be useful to find someone there who would tell us why. (Also, their HTML version is broken, so.... I'm not terribly worried about breaking it. :) )

I did a bit more research this morning via HTTP Archive: https://groups.google.com/a/chromium.org/d/msg/blink-dev/35t5cJQ3J_Q/jCHygAPuCQAJ

The high-level conclusion there is that Facebook and VK have ~widely used endpoints that serve JSONP as application/json and text/html respectively.

(And that not much else looks like it would break in a way visible to users)

@mikewest Have you talked to any of those JSONP providers about changing their Content-Type to JavaScript?

@evilpie: Facebook fixed their endpoint, VK never responded.

@mikewest It seems like application/json is still the most common wrong MIME type after text/html: https://mzl.la/2Qkxtn9

I am going to try blocking Worker scripts with wrong MIME types in Firefox.

The friendly folks at HTTP Archive turned on Sec-Fetch-Dest, and I did a tiny bit of analysis this morning as I just noticed that the 2019-05-01 dump got imported. The numbers are pretty small and only include public sites, so I'm not sure how representative we can say this is, but, FWIW:

Sec-Fetch-Dest=worker:

Content Type | Count | Share
-- | -- | --
application/javascript | 17076 | 97.5326%
application/x-javascript | 179 | 1.0224%
text/javascript | 167 | 0.9538%
text/html | 44 | 0.2513%
application/octet-stream | 16 | 0.0914%
application/xml | 14 | 0.0800%
(empty) | 9 | 0.0514%
text/plain | 2 | 0.0114%
text/x-js | 1 | 0.0057%
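
As a quick sanity check on the worker table above, the following Python sketch adds up the rows and computes what share of these requests would be blocked if only JavaScript MIME types were accepted. The counts are copied from the table; treating `text/x-js` and the empty type as blocked follows the JavaScript MIME type list in the MIME Sniffing Standard. Note that these are request counts from HTTP Archive, not frequency-weighted page loads.

```python
# Counts copied from the Sec-Fetch-Dest=worker table ("" is the empty type).
WORKER_COUNTS = {
    "application/javascript": 17076,
    "application/x-javascript": 179,
    "text/javascript": 167,
    "text/html": 44,
    "application/octet-stream": 16,
    "application/xml": 14,
    "": 9,
    "text/plain": 2,
    "text/x-js": 1,
}

# The rows in the table that are JavaScript MIME types.
JS_TYPES_IN_TABLE = {
    "application/javascript",
    "application/x-javascript",
    "text/javascript",
}

total = sum(WORKER_COUNTS.values())
blocked = sum(
    count for mime, count in WORKER_COUNTS.items()
    if mime not in JS_TYPES_IN_TABLE
)
blocked_share = blocked / total

print(f"{blocked} of {total} requests ({blocked_share:.2%}) would be blocked")
```

That works out to 86 of 17508 requests, or roughly half a percent.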

Sec-Fetch-Dest=serviceworker:

Content Type | Count | Share
-- | -- | --
application/javascript | 23531 | 76.7958%
text/javascript | 4045 | 13.2013%
application/x-javascript | 1611 | 5.2577%
text/html | 1006 | 3.2832%
(empty) | 408 | 1.3315%
text/plain | 19 | 0.0620%
application/xml | 7 | 0.0228%
application/json | 5 | 0.0163%
application/binary | 3 | 0.0098%
application/javascript�text/javascript | 1 | 0.0033%
application/javascript�application/javascript | 1 | 0.0033%
application/ecmascript | 1 | 0.0033%
image/gif | 1 | 0.0033%
text/css | 1 | 0.0033%
text/ecmascript | 1 | 0.0033%

Sec-Fetch-Dest=sharedworker:

Content Type | Count | Share
-- | -- | --
application/x-javascript | 23 | 47.9167%
application/javascript | 21 | 43.7500%
text/html | 2 | 4.1667%
text/javascript | 2 | 4.1667%


Queries are of the form:

SELECT 
  JSON_EXTRACT_SCALAR(payload, '$.response.content.mimeType') as type,
  count(*) as count
FROM `httparchive.requests.2019_05_01_desktop`
WHERE
  STRPOS(payload, '{"name":"Sec-Fetch-Dest","value":"worker"}') != 0
GROUP BY type
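
Since the three tables differ only in the destination value, the query can be templated. The Python sketch below is a hypothetical helper (the function name and the allowlist are illustrative, not part of any tooling mentioned here) that assumes the `serviceworker` and `sharedworker` queries differ from the one above only in the `Sec-Fetch-Dest` value.

```python
# Destinations covered by the three tables in this comment.
DESTINATIONS = ("worker", "serviceworker", "sharedworker")


def build_query(dest: str,
                table: str = "httparchive.requests.2019_05_01_desktop") -> str:
    """Build the per-destination query text.

    The destination is checked against a fixed allowlist so that no
    arbitrary string ends up inside the SQL.
    """
    if dest not in DESTINATIONS:
        raise ValueError(f"unsupported destination: {dest!r}")
    # Substring that identifies the request header inside the JSON payload.
    needle = '{"name":"Sec-Fetch-Dest","value":"%s"}' % dest
    return (
        "SELECT\n"
        "  JSON_EXTRACT_SCALAR(payload, '$.response.content.mimeType') AS type,\n"
        "  COUNT(*) AS count\n"
        f"FROM `{table}`\n"
        "WHERE\n"
        f"  STRPOS(payload, '{needle}') != 0\n"
        "GROUP BY type"
    )
```

Calling `build_query("sharedworker")` yields the same query with the destination swapped in; running it still requires access to the HTTP Archive dataset.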

The service worker results seem somewhat suspect given that the service worker specification explicitly requires a correct MIME type, if I remember correctly.

These are just examining the headers of outgoing requests and incoming responses, not necessarily successful executions.

That said, I'm just blindly trusting the mimeType that HTTP Archive parses out when evaluating a response. It's probably worth doing more analysis, but spot-checks look reasonable:

I wouldn't be shocked if something about HTTP Archive made some pages respond strangely, but it does seem like the data is pointing in a reasonable direction.

I suspect the non-js SW types are for scripts that are no longer hosted on servers, but there are still clients with registrations asking for updates. Things with text/html are probably 404 pages, etc.

The data I'm pointing to here is HTTP Archive. I don't think they keep clients around with state that could affect future runs? @pmeenan or @rviscomi would know.

Nope. Each page load is done with a completely clean browser profile. More likely there is code in the page still referencing a worker that was removed (or never created).

I am trying to ship this in Firefox (Bug 1523706). What are the future plans for Worker/SharedWorker on the Chrome side, @mikewest? The number of non-JS MIME types for those seems to be quite low, especially compared to importScripts for us.

Any ideas what Safari might do?

@evilpie: I would love to ship the same in Chrome, but I haven't had time to dig into this any more recently. At the moment, the data I have access to (see above) tells me that that's going to have some impact on ~0.5% of sites that use workers. That's not nothing!

If y'all are able to ship it without uproar, I'm happy to push for doing the same. If you need Chromium to ship at the same time, then I'm going to need more time to put together a risk analysis that folks can accept.

Fwiw, we decided to delay blocking worker scripts in release and are limiting this change to beta/nightly. It would be useful to coordinate the blocking in the future.

@mikewest Will you have time to look into this?

I am currently aware of one failure with color.adobe.com: Bug 1583657

I'm willing to try to land this based on your implementation experience. I don't have any additional metrics to point to. So I'll point to yours and see if folks scream.

@mikewest Hey! Nice, I see you landed this back in November. Did you encounter any problems?

We intend to ship this change with Firefox 75. /cc @annevk

I'll work on a spec PR
