Section | Chapter | Author | Reviewers
-- | -- | -- | --
IV. Content Distribution | 16. Caching | @paulcalvano | @yoavweiss @colinbendell
Due date: To help us stay on schedule, please complete the action items in this issue by June 3.
To do:
Current list of metrics:
👉 Optional AI (@paulcalvano): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.
👉 AI (@paulcalvano): Finalize which metrics you might like to include in an annual "state of third parties" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.
The metrics should paint a holistic, data-driven picture of the third party landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.
Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.
Additional resources:
Would be interesting to see metrics on:
- Last-Modified vs. ETag validators
- Cache-Control: max-age vs. Expires
- Vary (how many dimensions, what headers, etc.)
- Cache-Control directives (e.g., public, private, immutable)

Few more ideas
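For anyone sketching these metrics before the real queries are written, here is a rough Python illustration of how the directive and freshness metrics listed above could be derived from a single response's headers. The function names and the lowercase-keyed `headers` dict are assumptions made for this example, not part of any HTTP Archive tooling, and the comma split is a simplification that would mis-handle quoted arguments containing commas.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict.

    Simplification: splits on every comma, so a quoted argument that itself
    contains a comma (e.g. private="a, b") is not handled correctly.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directive names are case-insensitive; valueless directives map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def freshness_source(headers):
    """Report which header would set freshness: max-age wins over Expires.

    `headers` is assumed to be a dict with lowercase header names.
    """
    directives = parse_cache_control(headers.get("cache-control", ""))
    if "max-age" in directives:
        return "max-age"
    if "expires" in headers:
        return "Expires"
    return None
```

Counting the keys returned by `parse_cache_control` across all responses would give directive usage, and tallying `freshness_source` would give the max-age vs. Expires split.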
@paulcalvano @yoavweiss @colinbendell we're hoping to finalize the metrics for each chapter today. Could you edit https://github.com/HTTPArchive/almanac.httparchive.org/issues/18#issue-446806416 and update it with anything that's missing? I see a bunch of other metrics were discussed in the comments. When that's done please tick the last TODO checkbox item and close this issue. Thanks!
Sorry I'm late, and know this is closed, but any thought in measuring whether ETags actually work?
They don't work in Apache, for example, if gzip or br is used (as I would hope they would be!), and you won't ever get 304 responses. Try it at www.apache.org for example - gzipped resources return 200 on refresh but images (which are not gzipped) correctly return a 304. So they should be turned off and Last-Modified should be used instead. Apache is pretty popular, so I imagine this affects a non-trivial number of servers since ETags are enabled by default and most people turn on compression for performance reasons. Other servers may also have similar issues with them not actually working.
Also, in the past ETags were often based on the inode, which caused issues with load-balanced servers, but I'm not aware of anyone doing that anymore so I'm not too worried about that. I'm more worried about other implementation issues like the one Apache has. Though if we can measure both together then why not.
It would require hitting at least one resource twice though (once with no cache, and then again with it cached) to see if 200 or 304 is returned so not sure how doable that is.
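The two-request check described above could look roughly like the following Python sketch. It uses only the standard library; the helper names are invented for this example, and sending only `If-None-Match` (not `If-Modified-Since`) is a choice made here to isolate ETag behaviour. Note that `urllib` raises `HTTPError` for a 304, so the status is read from the exception in that case.

```python
import urllib.error
import urllib.request


def etag_revalidation_headers(first_headers):
    """Build conditional request headers from the first response's ETag, if any."""
    etag = first_headers.get("ETag")
    return {"If-None-Match": etag} if etag else {}


def classify_revalidation(status):
    """Interpret the status code of the second (conditional) request."""
    if status == 304:
        return "validated"      # server honoured the ETag
    if status == 200:
        return "full response"  # ETag was ignored or did not match
    return "other"


def check_etag(url):
    """Fetch `url` twice: once cold, once with If-None-Match (makes network calls)."""
    # Ask for gzip so that compression-related ETag handling is exercised.
    base = {"Accept-Encoding": "gzip"}
    with urllib.request.urlopen(urllib.request.Request(url, headers=base)) as first:
        conditional = etag_revalidation_headers(first.headers)
    if not conditional:
        return "no etag"
    second = urllib.request.Request(url, headers={**base, **conditional})
    try:
        with urllib.request.urlopen(second) as resp:
            return classify_revalidation(resp.status)
    except urllib.error.HTTPError as err:
        # urllib treats 304 as an error response, so it is reported here.
        return classify_revalidation(err.code)
```

Calling `check_etag` on a compressed resource and on an uncompressed image from the same server would show whether the two behave differently, though a crawl would need to do this at scale.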
Not too late to add a metric if @paulcalvano sees fit. Just update the first comment.
Investigating how well ETag validation is supported would be great. Just to note -- that Apache bug is specific to mod_deflate; if you use MultiViews for negotiating encoding, it works fine (e.g., see www.mnot.net). That said, it'd be interesting to see how widespread that is.
Looking over https://cache-tests.fyi for inspiration, a few other things come to mind:
- Cache-Control: public (even though it usually isn't required)?
- Date and Age that don't make sense (see this paper)?
- Pragma in responses (even though it doesn't mean anything)?
- Set-Cookie on cacheable responses?

One additional thought: It might be worth adding an experimental headers section and including in-the-wild uses of Variance or Key (if any).
I think ETag validation would be out of scope for this because we aren't making a repeat request. I agree it would definitely be interesting to explore whether servers are returning 304 status codes to requests with valid ETags.
@mnot - great idea to look at the cache tests. I'll add some of these to the list.
On the topic of valid dates - I ran into many invalid Date and Last-Modified headers in a recent analysis I did, so it would be interesting to explore what is going on there.
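As a starting point for that exploration, a strict validity check might look like the sketch below. It accepts only the preferred HTTP date format (e.g. `Sun, 06 Nov 1994 08:49:37 GMT`) and treats everything else as invalid; the function names are made up for this example, and it assumes an English/C locale because `strptime` day and month names are locale-dependent. A real analysis might also want to count the obsolete date formats separately instead of lumping them in with garbage values.

```python
from datetime import datetime, timezone

# Preferred HTTP date format, e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
HTTP_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"


def parse_http_date(value):
    """Return an aware UTC datetime, or None if `value` is not in the preferred format."""
    try:
        parsed = datetime.strptime(value.strip(), HTTP_DATE_FORMAT)
    except (ValueError, AttributeError):
        return None
    return parsed.replace(tzinfo=timezone.utc)


def last_modified_after_date(date_value, last_modified_value):
    """True if both headers parse and Last-Modified is later than Date."""
    date = parse_http_date(date_value)
    last_modified = parse_http_date(last_modified_value)
    if date is None or last_modified is None:
        return False
    return last_modified > date
```

Running `parse_http_date` over every Date and Last-Modified value and counting the `None` results would give a first estimate of how common invalid values are.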
@colinbendell - do you have an example of Variance or Key headers? I'm not familiar with those.
Hoping we can resolve the open questions about metrics and close this issue ASAP.
Last call for metrics. @paulcalvano please update the final list and close this issue today. (sorry, couldn't think of a caching pun)