almanac.httparchive.org 🚀 - HTTP/2 2020

Question about what this chapter should be called given that HTTP/3 is now here (even if not quite officially signed off yet). Stick with HTTP/2? Change to HTTP? Should we rename the 2019 chapter (with redirect obviously) or leave as is? Probably best to wait until we've got an author/authors to let them help decide.

bazzadp on 27 Jun 2020

That's a great question, I'm not sure what the best name is. Agreed, let's see where the content planning takes us and keep the option open to rename the chapter if needed.

Safe to add you as a reviewer for this chapter, Barry?

rviscomi on 27 Jun 2020

My thoughts: frame it as HTTP, and break it down to an acceptable level - accepting that it gets very complex quickly if you look at HTTP/1.1 vs HTTP/2 (streams, prioritization, etc) vs HTTP/3 (QUIC transport, etc).

A lot of the HTTP semantics users & web developers interact with are consistent across versions, and we should start there, with sub-sections for what HTTP/2 and HTTP/3 bring (and why they exist).

elithrar on 27 Jun 2020

Agree with that.

Still considering what to do on last year's chapter. Do we rename it? Gut feel is no, as it was very HTTP/2 focused (with a quick dip into HTTP/3 at the end), even if that does lead to a slight inconsistency in the naming cross year.

I spent a good part of last year's HTTP/2 chapter giving the basics, as I still think this is a fairly new (even if it was approaching its 5-year anniversary back then) and little-understood technology. I think it would be good to have a similar intro to HTTP/3 this year, and perhaps less on HTTP/2 (we can refer back to the previous year's chapter for that).

However, the main point of the Almanac IMHO is not to act as a reference of the technology (though some background is good, and necessary), but to look at its usage through the HTTP Archive and help explain that to readers. So we need to be conscious not to spend too much time on background/theory. I may have overdone it last year but, as I say, I think it was needed more so than for other chapters given how new the technology is and how niche the expertise is. And given HTTP/3 is even newer, maybe that need is still there this year?

Saying all that, I'm struggling to think what new stats to query for this chapter. But we'll worry about that once we've got authors and reviewers!

And on that subject I'm definitely up for reviewing this year. Can author too if we get really stuck but would prefer to hear from someone new if anyone volunteers! Either way, I'm definitely interested in following how this chapter progresses and in helping in any way I can.

bazzadp on 27 Jun 2020

Hello everyone. I'd like to again be a reviewer for this chapter this year. I could also contribute text on HTTP/3 and QUIC concepts if we go that route.

My 2 cents would be that the almanac should indeed focus more on the practical use of the tech seen over the past year, as measured by the HTTP archive runs. From that perspective, there won't be much to discuss on HTTP/3 yet, as few servers and browsers offer it and it's not ready for prime time (though, by the end of the year, it might be a bit more wide-spread).

For this year, you could look at how many sites offer H3 by looking at the alt-svc headers though. You could also look at TLS 1.3 adoption for H2, as this is kind of related to QUIC (or at least could give an indication of how up-to-date backends are). You could also research coalescing (or at least certificate contents) a bit more, as this will stay highly relevant for QUIC (and 0-RTT!) as well (maybe Matt Hobbs could help with that? Given his in-depth waterfall discussion blog posts on this). Finally, an idea of the measured RTTs to the backends would be useful, as that's where QUIC/H3 will provide most benefits.

To actually test H3 down the line, the HTTP archive runs would have to be adapted to also try a (secondary) load over QUIC after the normal H2 (H1?) connection, which might be something to think about @rviscomi (and probably also needs support from @pmeenan, who's been talking about this on twitter a bit as well).

rmarx on 29 Jun 2020

Hello everyone. I'd like to again be a reviewer for this chapter this year. I could also contribute text on HTTP/3 and QUIC concepts if we go that route.

Great stuff!

My 2 cents would be that the almanac should indeed focus more on the practical use of the tech seen over the past year, as measured by the HTTP archive runs. From that perspective, there won't be much to discuss on HTTP/3 yet, as few servers and browsers offer it and it's not ready for prime time (though, by the end of the year, it might be a bit more wide-spread).

Think you'd be surprised with CDNs starting to offer it. It does seem to be growing. Especially if you include gQUIC.

For this year, you could look at how many sites offer H3 by looking at the alt-svc headers though. You could also look at TLS 1.3 adoption for H2, as this is kind of related to QUIC (or at least could give an indication of how up-to-date backends are). You could also research coalescing (or at least certificate contents) a bit more, as this will stay highly relevant for QUIC (and 0-RTT!) as well

Yeah those are the sorts of things I tried to look at last year too. The new author would be well advised to look at the metrics we settled on last year and the discussions around that (#22 )

(maybe Matt Hobbs could help with that? Given his in-depth waterfall discussion blog posts on this).

Ping @nooshu

Finally, an idea of the measured RTTs to the backends would be useful, as that's where QUIC/H3 will provide most benefits.

To actually test H3 down the line, the HTTP archive runs would have to be adapted to also try a (secondary) load over QUIC after the normal H2 (H1?) connection, which might be something to think about @rviscomi (and probably also needs support from @pmeenan, who's been talking about this on twitter a bit as well).

Reminds me of this discussion on trying to measure impact of HTTP/2

bazzadp on 29 Jun 2020

I should be able to review this chapter.

ibnesayeed on 29 Jun 2020

I'm happy to help review this chapter.

pmeenan on 6 Jul 2020

(maybe Matt Hobbs could help with that? Given his in-depth waterfall discussion blog posts on this).

Thanks @rmarx, I'd be happy to help.

Nooshu on 7 Jul 2020

Some more thoughts on this chapter:

Last year we concentrated on HTTP/2, with a bit of a mention of HTTP/3. Probably should talk a lot about HTTP/3 this year even if usage might be low.

However last year I almost completely ignored the whole topic of the underlying HTTP semantics. Should we add some more of that this year?

For example, how many HTTP Headers are sent? And what size are they? What's the size of headers compared to bodies on requests and responses? Some headers (e.g. CSP) can be quite large and we're adding new headers like feature-policy and with structured headers this could grow over time. This is of course another benefit of HTTP/2 and HTTP/3 as it has header compression.

What else could we consider along those lines for this year?

Do be aware that some of the other HTTP semantics are covered in other chapters:

Caching and Compression chapters cover the respective headers for these - in fact there's a question as to whether they need full chapters again this year or if they should be collapsed into this chapter?
Number, size and type of requests is captured by Page-Weight chapter (as well as individually in Media, Fonts, CSS and JavaScript chapters).
Security covers HTTPS and Cookies, and this year we may add a dedicated Cookies chapter (or at least talk about them in the new Privacy chapter).

bazzadp on 7 Jul 2020

I do agree it would be interesting to have a discussion on HTTP semantics and things like structured headers (and things that have been going wrong with their practical deployments, cc @yoavweiss).

However, like you say, several newer headers and their impact are discussed elsewhere and cutting-edge stuff like feature policy probably won't show up much this year. We then also should definitely re-name the chapter away from HTTP/2 imo.

As you know, I'm also highly skeptical about the practical impact of HPACK/QPACK for the normal web page loading use case. One area where you'd see improvements is with large cookies, but I'm not sure if the current test setup is ideal for measuring those (given that European sites shouldn't be setting cookies on first visit (theoretically) and some high-impact cookies probably only come into play after login/shopping cart stuff). However, this could also be an excellent opportunity to prove me wrong on both counts :) It would probably also unearth some cool/disturbing outliers. Do the WPT results include sizes for _compressed_ headers? If not, we might set up something to run the plaintexts through HPACK and QPACK libraries to compare etc.

rmarx on 7 Jul 2020

However, like you say, several newer headers and their impact are discussed elsewhere and cutting-edge stuff like feature policy probably won't show up much this year. We then also should definitely re-name the chapter away from HTTP/2 imo.

Feature Policy was discussed in the security chapter, though annoyingly it didn't discuss actual adoption (very small - looks to be about 1,000 sites at most looking at the raw data), just which options were used when it was deployed. It's probably grown but not by that much. Referrer Policy looks to be used a lot more. Point is, use of headers is growing and there is lots of innovation in this space.

As you know, I'm also highly skeptical about the practical impact of HPACK/QPACK for the normal web page loading use case. One area where you'd see improvements is with large cookies, but I'm not sure if the current test setup is ideal for measuring those (given that European sites shouldn't be setting cookies on first visit (theoretically) and some high-impact cookies probably only come into play after login/shopping cart stuff). However, this could also be an excellent opportunity to prove me wrong on both counts :) It would probably also unearth some cool/disturbing outliers.

I dunno. Some CSP headers are pretty big! But they're on the response where the files are usually much bigger so maybe you're right.

Do the WPT results include sizes for _compressed_ headers? If not, we might set up something to run the plaintexts through HPACK and QPACK libraries to compare etc.

Discussed last year and not easily available.

bazzadp on 7 Jul 2020

Anyone on this thread interested in taking on the Author role? Or suggestions who could?

@elithrar not sure what role you were thinking of and if you would be interested in authoring?

@bagder @dotjs , as last year's other reviewers any interest here? Or suggestions of Authors?

And @Lpardue any further suggestions on this after our chat the other week given your role on QUIC-WG?

bazzadp on 7 Jul 2020

I'm happy to work on this chapter again. Sounds like a few people are interested in providing some content. I'm happy to pull it all together and to convince @LPardue to join in.

dotjs on 7 Jul 2020

@MikeBishop - any interest in co-authoring this chapter?

paulcalvano on 8 Jul 2020

@siyengar same question to you! 😀

bazzadp on 8 Jul 2020

@dotjs just want to confirm that you've reviewed the authoring commitment and the process works for you. Would love to have you as the lead author :)

obto on 9 Jul 2020

Yes, I'd be happy to help, as author or reviewer.

MikeBishop on 9 Jul 2020

I am willing and able to participate on authoring.

LPardue on 9 Jul 2020

@dotjs just want to confirm that you've reviewed the authoring commitment and the process works for you. Would love to have you as the lead author :)

reviewed and looks fine to me

dotjs on 10 Jul 2020

Hi. I would sign up for either chapter reviewer or analyst

gregorywolf on 10 Jul 2020

@dotjs thank you for agreeing to be the lead author for the HTTP2 chapter! As the lead, you'll be responsible for driving the content planning and writing phases in collaboration with your content team, which will consist of yourself as lead, any coauthors you choose as needed, peer reviewers, and data analysts.

The immediate next steps for this chapter are:

Establish the rest of your content team. Several other people were interested or nominated (see below), so that's a great place to start. The larger the scope of the chapter, the more people you'll want to have on board.
Start sketching out ideas in your draft doc.
Catch up on last year's chapter and the project methodology to get a sense for what's possible.

There's a ton of info in the top comment, so check that out and feel free to ping myself or @rviscomi with any questions!

@MikeBishop @LPardue @rmarx @ibnesayeed @pmeenan @Nooshu I've put you down as reviewers for now, and will leave it to @dotjs to reassign at their discretion

@gregorywolf Put you down as both a reviewer and analyst :)

obto on 10 Jul 2020

With this massive line-up already signed up I can stand down this year.

bagder on 12 Jul 2020

Hey @dotjs, hope you had a great weekend.

As you know, we're trying to have the outline and metrics settled on by the end of the week so we have time to configure the Web Crawler to track everything you need. Anything you need from me to keep things moving forward?

Also, can you remind your team to properly add and credit themselves in your chapter's Google Doc?

obto on 13 Jul 2020

Added myself as a reviewer. Know we have a lot of them, but feel I deserve my place having written last year's chapter 😀 @dotjs you gonna move some of the reviewers to co-authors? Or taking on the full task yourself?

@gregorywolf happy to help out with Analysis here if you need any help. And the awesome @pmeenan being on team HTTP/2 will undoubtedly help if we have any questions as to what the HTTP Archive crawl currently does (or can!) get!

bazzadp on 14 Jul 2020

Thanks all - current thoughts are to use co-authors. Could everyone who has expressed an interest request edit access to the doc? We can start to plan the content there. Let's focus on any potentially interesting metrics/measurements that were not part of last year's run.

@rmarx @LPardue Keen on your thoughts on what interesting properties we can measure for QUIC/H3 etc.
@pmeenan I'm personally interested in quantifying the impact of multiple domains/protocols on resource loading. This could include the impact of connection coalescence. Any thoughts on how we can quantify the 'thunderdome' ? How often is H2 prioritisation even relevant ?

dotjs on 14 Jul 2020

@dotjs I'm not sure if it will be possible with BigQuery, but it might be possible with a script that crawls through the raw HAR files on GCS, since the data includes chunk timings (and sizes), priority and connection info.

In theory you could check to see how often a higher priority response download is interrupted by chunks for a lower priority response (ignoring some small amount for headers). You could detect broken HTTP/2 prioritization when it happens on the same connection or cross-connection contention when it happens on a separate connection.

We'd have to noodle a bit to think of how that should be represented as a summary metric.
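
A minimal sketch of that check, assuming each request has already been reduced to a priority, a connection id and a list of timestamped chunks. The `Request` and `Chunk` types, their field names, the "larger number means higher priority" convention and the 1 KB header allowance are all hypothetical stand-ins, not the actual HAR schema:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    ts: float         # arrival time in ms (hypothetical field name)
    size: int         # bytes received in this chunk
    request_id: str   # request the chunk belongs to


@dataclass
class Request:
    request_id: str
    priority: int     # assumption: larger number = higher priority
    connection: int   # connection/socket id


def find_interruptions(requests, chunks, min_bytes=1024):
    """Return (high_id, low_id, kind, bytes) for each interrupted download."""
    by_id = {r.request_id: r for r in requests}

    # Download window of each request: first to last chunk timestamp.
    windows = {}
    for c in chunks:
        start, end = windows.get(c.request_id, (c.ts, c.ts))
        windows[c.request_id] = (min(start, c.ts), max(end, c.ts))

    # Sum lower-priority bytes that arrive inside a higher-priority window.
    intruding = {}
    for c in chunks:
        low = by_id[c.request_id]
        for high_id, (start, end) in windows.items():
            high = by_id[high_id]
            if high.priority > low.priority and start < c.ts < end:
                key = (high_id, low.request_id)
                intruding[key] = intruding.get(key, 0) + c.size

    results = []
    for (high_id, low_id), size in intruding.items():
        if size < min_bytes:
            continue  # ignore small amounts, e.g. response headers
        same = by_id[high_id].connection == by_id[low_id].connection
        kind = "broken_prioritization" if same else "cross_connection_contention"
        results.append((high_id, low_id, kind, size))
    return results
```

A per-page summary metric could then be something like the share of pages with at least one `broken_prioritization` result, though that choice is still open.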

pmeenan on 14 Jul 2020

@dotjs How is the outline coming along? Want to get that finished up by the end of the week so we have time to get the Web Crawler set up :)

obto on 16 Jul 2020

Have a first pass at an outline. I'm still not sure about going with HTTP with sections for H2 and H3 as suggested by Matt. I've added some thoughts about other things to discuss with regard to HTTP, e.g. semantics, DoH and websockets. Any reviewers/authors, please add to the doc, as I would like as many ideas as possible on what other people would like to see in this chapter. Paging @gregorywolf, @Nooshu, @MikeBishop, @ibnesayeed, @bazzadp, @elithrar, @pmeenan and @LPardue

dotjs on 17 Jul 2020

That's pretty comprehensive @dotjs ! Will rack my brains and see if I can think of anything else but can't at the mo...

bazzadp on 18 Jul 2020

Hello. I am coming up to speed with the project and specifically my task for the HTTP chapter as an analyst. My goal for this weekend is to finish reviewing all key material. I also want to look at all of the 2019 HTTP SQL queries and start retrofitting them to use the sample data that @paulcalvano created. I am new to this process so PLEASE direct me as necessary. I look forward to working with the team.

gregorywolf on 18 Jul 2020

@gregorywolf I made a comment here that might be useful: https://github.com/HTTPArchive/almanac.httparchive.org/issues/914#issuecomment-659205330

And since I’m on this chapter, I’ll update it specifically for this chapter 😀:

Start with the Analysts Guide and set up BigQuery (Good guide on that by our very own @paulcalvano who's leading the Analyst team here on the Web Almanac). Also be aware this can be expensive but there's a generous free tier and Paul will provide credits beyond that for Almanac work. There are also sample tables which are much cheaper to query and it should be difficult to go beyond the free budget with those. Then join the #web-almanac slack and Paul will invite you to the Analysts channel on that.

For this chapter, you can read last year's chapter, look at last year's SQL for this chapter (and the actual results it produced) - both of these are linked at the bottom of the chapter btw. Familiarise yourself with all this, then work with @dotjs and the reviewers to figure out what metrics you want to use this year and then convert them into queries. Would suggest reusing a lot of last year's queries but also adding some to give a fresh take. Liaise with the other Analysts and @paulcalvano if you have any questions on the data set and what's available. I can also help with this as I'm on this chapter, and similarly we’re lucky to have @pmeenan, the God of WebPageTest (which is what our crawler uses), on this chapter if there are any queries on what’s possible or not.

We're planning to run the crawl for the 2020 dataset throughout August, so the critical point is to quickly figure out and implement any custom metrics required for that crawl before it starts. Would hope there shouldn't be too many (if any) as there is quite a lot of detail in the current dataset and we didn't need any for the HTTP/2 chapter last year. Luckily this chapter deals mostly with the headers and metadata rather than stuff in the expensive bodies. Though that may change this year depending on what we want to query.

Hope that helps and gives you something to get started on!

bazzadp on 18 Jul 2020

Hi. Quick update. I have updated all of the HTTP 2019 SQL queries. I have not submitted a PR yet. Once the sample_data tables are completed/finalized, I will start testing to make sure the output looks as expected. At that time I will submit a PR. I would be interested to know if anyone has any ideas on what data would be interesting that is above and beyond what was extracted last year.

gregorywolf on 23 Jul 2020

@gregorywolf check out the analyst workflow doc if you haven't already. It may be helpful to create the PR now as a draft, and use it to keep track of metrics already implemented vs those not yet implemented using a markdown checklist. (steps 4 and 5)

Do any of the 2020 queries require custom metrics? (querying the DOM at runtime)

rviscomi on 23 Jul 2020

There are some interesting ideas that may or may not require some digging into the HARs. @rviscomi Is there any precedent for this? For example, I'm interested in measuring multiplexing concurrency, concurrent connections etc. @gregorywolf Happy to chat through the metrics whenever you are ready.

dotjs on 23 Jul 2020

There are some interesting ideas that may or may not require some digging into the HARs. @rviscomi Is there any precedent for this?

Could you clarify? Not sure if you're asking if any chapter has looked at the HAR data before or only if this is new for the H2 chapter.

rviscomi on 23 Jul 2020

@gregorywolf Took a look over the chapter and it looks like we've got most if not all of the data you need. Can you double check though? Only got a little more time left to make changes to the Crawler to collect extra data

obto on 24 Jul 2020

@rviscomi Hi. I just submitted a draft PR for the sql 2019 queries formatted to use the sample_data tables.

@dotjs I think talking live would be great. Let's communicate via Slack DM to coordinate.

gregorywolf on 25 Jul 2020

@dotjs @gregorywolf for the two milestones overdue on July 27 could you check the boxes if:

the outline has been reviewed and all feasible metrics have been identified
any necessary custom metrics have been created and you've created a draft PR to track which feasible metrics have had their queries implemented (we've updated the milestone description to clarify this)

Keeping the milestone checklist up to date helps us to see at a glance how all of the chapters are progressing. Thanks for helping us to stay on schedule!

obto on 30 Jul 2020

I've updated the chapter metadata at the top of this issue to link to the public spreadsheet that will be used for this chapter's query results. The sheet serves 3 purposes:

Enable authors/reviewers to analyze the results for each metric without running the queries themselves
Generate data visualizations to be embedded in the chapter
Serve as a public audit trail of this chapter's data collection/analysis, linked from the chapter footer

obto on 1 Sep 2020

Hi. I am very close to finalizing all of the queries for the chapter. @rviscomi has been kind enough to run all of my queries so I do not run into a BQ quota issue. I will provide another update once all of the newest results have been generated and transferred to the results Google Sheet

gregorywolf on 22 Sep 2020

All of the query results are posted in the results sheet. I have created pivot tables for all of the tabs. Please take a look at the data and provide feedback. I made a decision to NOT filter out any key fields that contain blanks. I will leave the filtering to the author

gregorywolf on 24 Sep 2020

What's the best venue to provide feedback? Here on the issue, comments in the results sheet, etc.?

MikeBishop on 25 Sep 2020

Personally I’d prefer it here. Or at the very least an “FYI I’ve made a comment on tabs 1, 2 and 5” type comment on this issue.

bazzadp on 25 Sep 2020

First off, thanks for the work you've already put in. This is an immense amount of data to digest, and you've clearly put in a lot of work slicing it into interpretable chunks.

For all of these, the pivot tables you mentioned would be useful to slice things, but I'm not able to actually filter anything in the sheet itself; I'm wondering if that's because I don't have edit access to the sheet? But I can copy the sheet and add filter views, it looks like.

Here's my first pass through the different pages:

Adoption of H2 tab: How do we interpret the blank outcome? I don't want to just discard nearly 4% of requests, but it's not clear that it directly maps to any of the other versions, since they are represented.
Grouped by server:
- Same about the blanks, but it's more sensible here as some servers don't include that header.
- I wonder about spinning these two tabs together, to see whether there are trends of servers more or less likely to serve HTTP/2. I imagine that, exempting those which simply don't implement HTTP/2, it would turn into a statement about default-on vs. default-off.
Alt-Svc headers: I think it would be more useful to break these down into what percentage offer certain things in Alt-Svc, rather than just the discrete header values. (Though I'm very surprised there are enough instances to gather any appreciable percentages on a specific value; when I did a similar query a few years ago, I found that "clear" was the only thing that had enough consistency for that.) For example:
- Percentage that are "clear", the only defined keyword for this header, which we can already see from this table
- Percentage that offer h2
- Percentage that offer various QUIC/H3 versions
- Percentage that refer to same/different host or port
- Distribution of ma values
- How many alternatives per protocol? How many different protocols?
For the Upgrade header, I'd like the ability to filter those by HTTP/HTTPS. Upgrading to h2c is only supposed to be offered on clear-text connections, but a recent article pointed out that some servers that support it will still do the Upgrade within an HTTP/1.1 TLS connection (presumably because something else is terminating TLS and the server sees it as a clear-text connection).
I'm more than a little surprised by the number of HTTP/2 connections returning the Upgrade header. That's... supposed to be illegal. Not feedback on the presentation of the data, just... interesting. Thanks for including that.
Percentage loaded over HTTP: Should I read this as percentage of resources on a page loaded over cleartext, given the protocol used for the base page?
TLS version by HTTP version: What does blank mean here? I assume that we're not considering cleartext HTTP/2, so it's not "no TLS" for that. The sampled QUIC versions are presumably using Google Crypto, so the advertisement of any TLS version is interesting, even though small.
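
To make the Alt-Svc breakdown above concrete, here is a rough sketch of splitting one header value into its alternatives so that protocol, host, port and `ma` can be aggregated separately. The function names are made up for illustration, the parsing is deliberately lenient rather than a strict RFC 7838 implementation (percent-encoded protocol ids are not decoded), and commas inside quoted parameters such as `v="46,43"` are kept together:

```python
def split_outside_quotes(text, sep):
    """Split text on sep, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_alt_svc(value):
    """Return a list of alternatives; the keyword "clear" yields []."""
    if value.strip() == "clear":
        return []
    alternatives = []
    for entry in split_outside_quotes(value, ","):
        fields = split_outside_quotes(entry, ";")
        if not fields:
            continue
        protocol, _, authority = fields[0].partition("=")
        host, _, port = authority.strip().strip('"').rpartition(":")
        params = {}
        for field in fields[1:]:
            name, _, val = field.partition("=")
            params[name.strip().lower()] = val.strip().strip('"')
        ma = params.get("ma", "")
        alternatives.append({
            "protocol": protocol.strip(),
            "host": host,  # empty string means "same host"
            "port": port,
            "ma": int(ma) if ma.isdigit() else None,
        })
    return alternatives
```

For example, `parse_alt_svc('h3-29=":443"; ma=86400')` gives one alternative with protocol `h3-29`, an empty host, port `443` and `ma` of 86400. The percentages in the list would then be counts over these parsed alternatives rather than over raw header strings.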

MikeBishop on 25 Sep 2020

Chiming in to give a couple of unsolicited Sheets tips: don't hesitate to request edit access if it'd help you explore the data, and change the default notification settings from "Only Yours" to "All" to be emailed on all comments even if you're not explicitly mentioned.

rviscomi on 25 Sep 2020

@MikeBishop, I can answer some of these based on my experience last year as author and as the person who came up with a lot of these stat requests, and on investigations I did into some of the same questions on last year's stats.

For all of these, the pivot tables you mentioned would be useful to slice things, but I'm not able to actually filter anything in the sheet itself; I'm wondering if that's because I don't have edit access to the sheet? But I can copy the sheet and add filter views, it looks like.

Could be. Could you request edit permission to see?

Here's my first pass through the different pages:

Adoption of H2 tab: How do we interpret the blank outcome? I don't want to just discard nearly 4% of requests, but it's not clear that it directly maps to any of the other versions, since they are represented.

We had the same last year and investigating showed these to be mostly HTTP/1.1:

Annoyingly, there is a larger percentage where the protocol was not correctly tracked by the HTTP Archive crawl, particularly on desktop. Digging into this has shown various reasons, some of which can be explained and some of which can't. Based on spot checks, they mostly appear to be HTTP/1.1 requests and, assuming they are, desktop and mobile usage is similar.

It's a similar result this year - desktop is ~4% short of mobile and we have ~4% uncategorised.

Even better news is that I spent some time on this afterwards ('cause it bugged me too!), figured out why this is the case and fixed it - unfortunately too late for this year's Almanac month (August), but we can look at October data to confirm this just before we go live. From the work on that fix we know the "protocol" is not always set for HTTP/1.1 and the parsing to try to pull it out from the request and response was broken. I'm pretty confident the vast majority is HTTP/1.1 and think we should assume this, explain it like I did last year, and quickly double check it after the October run to confirm.

Grouped by server:

Same about the blanks, but it's more sensible here as some servers don't include that header.

I wonder about spinning these two tabs together, to see whether there are trends of servers more or less likely to serve HTTP/2. I imagine that, exempting those which simply don't implement HTTP/2, it would turn into a statement about default-on vs. default-off.

Some interesting stats and discussion on that last year. @gregorywolf I added client to some of the pivot tables, as the percentages were wrong without it, since they were being added up across clients (unless Apache really is 95% of server usage 😁)

Alt-Svc headers: I think it would be more useful to break these down into what percentage offer certain things in Alt-Svc, rather than just the discrete header values. (Though I'm very surprised there are enough instances to gather any appreciable percentages on a specific value; when I did a similar query a few years ago, I found that "clear" was the only thing that had enough consistency for that.) For example:

Percentage that are "clear", the only defined keyword for this header, which we can already see from this table

Percentage that offer h2

Percentage that offer various QUIC/H3 versions

Percentage that refer to same/different host or port

Distribution of ma values

How many alternatives per protocol? How many different protocols?

For the Upgrade header, I'd like the ability to filter those by HTTP/HTTPS. Upgrading to h2c is only supposed to be offered on clear-text connections, but a recent article pointed out that some servers that support it will still do the Upgrade within an HTTP/1.1 TLS connection (presumably because something else is terminating TLS and the server sees it as a clear-text connection).

That's why I'm a fan of giving the raw data and letting authors/reviewers slice and dice as they see it in the spreadsheet! Though can revert to SQL if easier once we know what we want. After digging into the data we should decide what stats are interesting and so what to include in the chapter and in what format.

I'm more than a little surprised by the number of HTTP/2 connections returning the Upgrade header. That's... supposed to be illegal. Not feedback on the presentation of the data, just... interesting. Thanks for including that.

Again good discussion on this last year - which is where a lot of these queries came from. Will be interesting to see if it's better or worse than last year.

Percentage loaded over HTTP: Should I read this as percentage of resources on a page loaded over cleartext, given the protocol used for the base page?

Sorry, I don't understand your question or what you are talking about with cleartext. Is this the "percentage_of_resources_loaded_over_HTTP_by_version_per_site" tab? That's any HTTP version regardless of HTTPS status.

TLS version by HTTP version: What does blank mean here? I assume that we're not considering cleartext HTTP/2, so it's not "no TLS" for that. The sampled QUIC versions are presumably using Google Crypto, so the advertisement of any TLS version is interesting, even though small.

Yes we should dig into this more. Suspect it's QUIC and the TLS version is not being recorded correctly, but that's a guess. This is a new stat for this year btw, so there's nothing to compare it with from last year. There's a lot, but Google does account for a lot of traffic when looking at request level (between Google Analytics, Ads and Marketing tags, YouTube, Google Fonts, etc.) so it's possible. Definitely one to dig into @gregorywolf.

bazzadp on 25 Sep 2020

Percentage loaded over HTTP: Should I read this as percentage of resources on a page loaded over cleartext, given the protocol used for the base page?

Sorry, I don't understand your question or what you are talking about with cleartext. Is this the "percentage_of_resources_loaded_over_HTTP_by_version_per_site" tab? That's any HTTP version regardless of HTTPS status.

"Percentage of resources loaded over HTTP" as opposed to what? That is, where the number is less than 100% loaded over HTTP, what were the other resources loaded over? I could read this as HTTP vs. HTTPS, same versus different version used for subresources, network vs. cache, references to data: URLs that don't hit the network, etc.

Or it's something totally different and I'm having a total mental disconnect figuring out what this query is measuring.

MikeBishop on 25 Sep 2020

Ah gotcha now. Yeah I don't understand this stat either. Would expect each line to add up to 100%, so we have for example 30% HTTP/1.1 and 70% HTTP/2. @gregorywolf ?

bazzadp on 25 Sep 2020

All. I have been away for a bunch of days and am just getting back on line. I will take a look at the above comments and comment in the next few days.

gregorywolf on 29 Sep 2020

I have reviewed all of the comments and had a chance to speak live with @MikeBishop and @dotjs. The following is a summary of issues/comments.

adoption_of_http_2_by_site_and_requests
@bazzadp You mentioned that you dug into a concerning issue and "fixed" something but it was not in time for the August crawl. Please outline what was fixed. Should the query be modified and rerun against the now posted September data?

count_of_h2_and_h3_sites_grouped_by_server
I am acknowledging the change @bazzadp made to the associated results tab pivot table

detailed_alt_svc_headers
I am going to see if I can modify the query to pull out the desired data outlined by @MikeBishop:

Percentage that are "clear", the only defined keyword for this header, which we can already see from this table
Percentage that offer h2
Percentage that offer various QUIC/H3 versions
Percentage that refer to same/different host or port
Distribution of ma values
How many alternatives per protocol? How many different protocols?

Once the revised data is posted in results tab I will create a pivot table that filters by HTTP/HTTPS

percentage_of_resources_loaded_over_HTTP_by_version_per_site
Acknowledging the feedback that the resulting data is confusing and ambiguous. I will figure out how to make this data more useful

tls_adoption_by_http_version
Going to dig into the data to figure out why we are getting blanks for TLS version for HTTP 0.9, 1.0, 1.1, and H2

Final item is that in speaking with @dotjs he expressed a desire to incorporate TCP connection information into the results in order to draw conclusions on efficiency of protocol usage based on available bandwidth. I will dig into this but initial feedback I have received is that this may not be possible.

@pmeenan Are you aware of any field or way that we could extract meaningful bandwidth usage out of the HA crawls?
@dotjs Jump in and provide additional context as required

gregorywolf on 4 Oct 2020

http_2_by_site_and_requests
@bazzadp You mentioned that you dug into a concerning issue and "fixed" something but it was not in time for the August crawl. Please outline what was fixed. Should the query be modified and rerun against the now posted September data?

As you're aware, we used the protocol field because the reqHttpVersion and respHttpVersion fields were often blank or contained nonsensical values like (us:). However, the protocol field was also often blank (particularly for HTTP/1.1 messages on Desktop, by the looks of things) so that's not ideal either :-(

Anyway, I discovered that the reqHttpVersion and respHttpVersion fields just try to parse the messages for an HTTP/1-style message (e.g. GET / HTTP/1.1 request or HTTP/1.1 200 OK response). Obviously this was never going to work in an HTTP/2 world, but it also didn't work too well for HTTP/1 messages due to some other bugs in it (which also meant that HTTP/2's status: 200 pseudo-header was parsed as if it was an HTTP/1 response - hence where the us: value came from).

So basically I fixed the reqHttpVersion and respHttpVersion fields with this pull request to fix some of the logic and also fall back to the protocol field when that still doesn't work (e.g. HTTP/2). This should give us the best of both worlds and allow us to revert back to the reqHttpVersion and respHttpVersion fields with more confidence.
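
For illustration only, the parse-then-fall-back idea described above could be sketched in Python as follows. This is not the actual WebPageTest change (that is in the pull request mentioned); the function names and regular expressions are assumptions made for the example.

```python
import re

# Hypothetical patterns for HTTP/1-style start lines (assumptions, not WPT code).
REQUEST_LINE = re.compile(r"^[A-Z]+ \S+ (HTTP/\d(?:\.\d)?)$")      # e.g. GET / HTTP/1.1
STATUS_LINE = re.compile(r"^(HTTP/\d(?:\.\d)?) \d{3}(?: .*)?$")    # e.g. HTTP/1.1 200 OK


def version_from_start_line(first_line, pattern):
    """Return the HTTP version from an HTTP/1-style start line, else None."""
    match = pattern.match(first_line.strip())
    return match.group(1) if match else None


def req_http_version(first_line, protocol_field):
    """Parse the request line; fall back to the browser-reported protocol."""
    return version_from_start_line(first_line, REQUEST_LINE) or protocol_field or ""


def resp_http_version(first_line, protocol_field):
    """Parse the status line; fall back to the browser-reported protocol."""
    return version_from_start_line(first_line, STATUS_LINE) or protocol_field or ""
```

With this shape, an HTTP/2 pseudo-header such as `:status: 200` never matches the status-line pattern, so the value comes from the protocol field instead of producing a fragment like `us:`.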

This will first be available in the October crawl (it wasn't in time for August for Web Almanac, nor September). As mentioned above, I would assume the blanks are HTTP/1.1 (they appeared to be from my investigations last year) and then we can validate this assumption once we have the October data, which should be available just before publication.

I still don't know why the protocol field is sometimes blank, but this appears to be set by the browser so less in our control to fix.

bazzadp on 4 Oct 2020

The original query percentage_of_resources_loaded_over_HTTP_by_version_per_site has been revised and renamed to average protocol requests per page. The intention of this query is to answer the question: on a given page, what is the average number of resources loaded for a given HTTP version?

The original query tls_adoption_by_http_version has been revised and the new results are now contained in TLS versions per page (same-domain). I worked closely with @bazzadp and @tomvangoethem to minimize the NULL entries. More details can be viewed in #1344

The results tab detailed_alt_svc_headers has been modified by adding two additional columns, contains h3? and contains quic?. A new tab has been created called detailed_alt_svc_headers_pivot which contains two pivot tables of the resulting column data. Finally, another tab was created called detailed_alt_svc_headers_unique, which is an extraction of the 'upgrade' column from the detailed_alt_svc_headers tab. From here it is very easy to see the various components laid out in a readable format.

gregorywolf on 7 Oct 2020

Thanks, @gregorywolf. Reading through the results again, here are some updated comments and observations:

count_of_h2_sites_grouped_by_server: Given that the values sum to 100%, I believe this table is showing the distribution by server of the HTTP/2 traffic. That's an interesting metric, particularly to the extent that it differs from the traffic distribution of HTTP/1.1 or overall HTTP traffic. I think the original intent was the percentage of traffic to each server type which is using HTTP/2. We should be able to create those visualizations by combining this with the count_of_non_h2_sites_grouped_by_server tab.
average protocol requests per page: I'm still having trouble parsing this. At first, I thought this was saying that a typical page loads two thirds of its subresources over HTTP/2. But the numbers sum to greater than 100%, so that interpretation doesn't work. Based on the description, I'd be expecting something like a scatterplot or CDF of number of subresources per base page, with a separate graph per protocol used to load the base page.
detailed_alt_svc_headers_pivot et al.: This is a good start, and I think the remainder can be drawn from the data already in the page. Some refinements I'll work on adding if you don't mind me editing your post-processing formulas:
- I think the split between h3 and quic is a little murky, given that we have HTTP/3 over IETF QUIC (h3-29), HTTP/3 over Google QUIC versions (h3-t051, h3-q050), and non-IETF Google QUIC (quic with v= parameters), as well as HTTP/2 (h2). We may want to clarify with someone from Google (@ianswett, @DavidSchinazi?) that we're classifying these tokens correctly, and split them into three buckets rather than two.
- I'd also like to be able to slice by whether the target is same-host or different host, since support for cross-host Alt-Svc varies so much.
- I'd like to get a distribution of the max-age values; I'm assuming they have some common peaks. It would also be interesting to see whether the max-age values ever vary within an instance; that is, does everything advertised always have equal lifetime?
measure_number_of_tcp_connections_per_site: Assuming I'm reading the lower table correctly, the impact of a multiplexed transport on this metric is minor at the median, but very noticeable at both high and low percentiles. The upper table, however, appears to just be the sum of the data in the lower table rather than indicating equivalent data across all protocol versions. I don't think it needs to be rebuilt for that (we can just ignore it), but it's misleading.
adoption_of_http_2_by_site_and_requests (which I assume is more about requests than sites) and measure_of_all_http_versions_for_main_page_of_all_sites: The combination of these two is interesting. It says that roughly half of all main pages are over HTTP/2 (and hardly any QUIC; possibly an artifact of how the run proceeds), but two-thirds of requests are over HTTP/2. That suggests the average HTTP/2-enabled site makes more requests for subresources than an HTTP/1.1-enabled site. Also that some of these subresources are loaded over QUIC. This may tend to inflate the connection number per page, since an initial request may be over HTTP/2 and an Alt-Svc header causes a new QUIC connection to be used for subresources. A second load of the page would presumably run entirely over QUIC and use fewer connections.
number_of_h2_and_h3_pushed_resources_and_avg_bytes: Obviously, these are percentiles out of the subset of connections where pushes are non-zero, which is small. The fact that QUIC appears to push more aggressively is even more notable given the previous bullet showing that QUIC is almost never used on the base page in these runs. That means that requests for subresources are pushing other things. That's not generally how we expect push to work, which is interesting.
number_of_h2_and_h3_pushed_resources_and_bytes_by_content_type: I assume each of these are out of the subset of connections where at least one resource of the given type was pushed. That leads to some interesting looking curves. For example, if any XML was pushed, exactly one XML was pushed because it's 1 at 10th and 90th percentile with the same byte count. Logic suggests that there's probably a sample size of one there. I think it would be more interesting to take these out of all connections that use push, if we can draw it in a way that's not misleading. That is, of connections that use push, what are they pushing? Do some sites push all JS while other sites push a mix of types?

MikeBishop on 9 Oct 2020

h3-t051 is a variant of gQUIC using TLS 1.3 and h3-q050 is a version of gQUIC. Both versions use IETF QUIC invariant headers.

ianswett on 10 Oct 2020

To clarify, here are the Alt-Svc values currently supported by google.com:
1) IETF drafts of HTTP/3: h3-29, h3-27
2) HTTP over Google QUIC versions that use the IETF Alt-Svc format: h3-Q050, h3-Q046, h3-Q043, h3-T051, h3-T050
3) HTTP over Google QUIC versions that use the legacy Google Alt-Svc format: quic; v="46,43" (note that this advertises the same Google QUIC versions that are advertised by h3-Q046, h3-Q043 in the IETF format) (also note that this old format will be removed soon so there's not much need to discuss it apart from documenting history)

DavidSchinazi on 10 Oct 2020

count_of_h2_sites_grouped_by_server: Given that the values sum to 100%, I believe this table is showing the distribution by server of the HTTP/2 traffic. That's an interesting metric, particularly to the extent that it differs from the traffic distribution of HTTP/1.1 or overall HTTP traffic. I think the original intent was the percentage of traffic to each server type which is using HTTP/2.

Be careful with the word "traffic". The HTTP Archive has no concept of traffic and crawls all its sites evenly, so www.google.com will get just as much weighting as barrystinysite.com (assuming that the site meets the minimum threshold to be included in CrUX and so HTTP Archive). Better to think of it as sites rather than traffic.

adoption_of_http_2_by_site_and_requests (which I assume is more about requests than sites) and measure_of_all_http_versions_for_main_page_of_all_sites: The combination of these two is interesting. It says that roughly half of all main pages are over HTTP/2 (and hardly any QUIC; possibly an artifact of how the run proceeds), but two-thirds of requests are over HTTP/2.

Surely it's unsurprising that hardly any pages are loaded over QUIC since (until very, very recently) Chrome (which the HTTP Archive crawler uses) only loaded pages over QUIC for Google-owned properties and not other sites unless a command line flag is used? Except maybe a few origin trials - was that a thing for QUIC support? Google pages are relatively few when compared to the 6.5 million pages we crawl. And I've even checked a few blogspot pages and App Engine sites (as Google-owned properties but with potentially more domains) and they don't appear to be QUIC enabled yet.

In fact I got so curious about what these pages are that I queried all of the QUIC sites and was surprised to see loads of non-Google properties! I've added these as a new tab.

Does anyone know what criteria Chrome used in August (when the crawl ran) to decide whether QUIC was used or not? As I say, I thought it was only used on Google properties, so I'm surprised by this.

Further investigation also shows some oddities in WebPageTest and how it decides whether a request is the main page - particularly for QUIC. It doesn't look entirely accurate to me for QUIC (much more accurate for the other protocols), which might explain why ANY main pages show QUIC since, as you say @MikeBishop, we would have expected the first request to be TCP and only subsequent requests to be QUIC (except maybe for some Google properties if QUIC support is baked into the Chrome code?).

It says that roughly half of all main pages are over HTTP/2 (and hardly any QUIC; possibly an artifact of how the run proceeds), but two-thirds of requests are over HTTP/2. That suggests the average HTTP/2-enabled site makes more requests for subresources than an HTTP/1.1-enabled site. Also that some of these subresources are loaded over QUIC. This may tend to inflate the connection number per page, since an initial request may be over HTTP/2 and an Alt-Svc header causes a new QUIC connection to be used for subresources. A second load of the page would presumably run entirely over QUIC and use fewer connections.

I'm not sure I agree with that first sentence @MikeBishop - are you not considering the impact of third-party subresources here? For example, if example.com loads over HTTP/1.1 but then uses Google Fonts or Google Analytics, then it will have an HTTP/2 (or even QUIC) request for those two subresources, so a measure of HTTP/1.1 and HTTP/2 is too incomplete to make any assumptions here unless we include the home page protocol as well. It would be a similar story for sharded domains if example.com only supported HTTP/1.1 but assets.example.com supported HTTP/2.

bazzadp on 10 Oct 2020

Google QUIC and IETF QUIC are both enabled based on Alt-Svc advertisement. There isn't currently and to my knowledge there has never been an explicit list of 'Google sites' for which QUIC is enabled, but disabled for other sites. Akamai has been supporting Google QUIC for a while and no special configuration was necessary to allow that.

ianswett on 12 Oct 2020

average protocol requests per page: I'm still having trouble parsing this. At first, I thought this was saying that a typical page loads two thirds of its subresources over HTTP/2. But the numbers sum to greater than 100%, so that interpretation doesn't work. Based on the description, I'd be expecting something like a scatterplot or CDF of number of subresources per base page, with a separate graph per protocol used to load the base page.

@MikeBishop In regards to the percentage exceeding 100% I think this is to be expected. Since the calculation is based on average I think the averages for each protocol will get skewed based on the large data size. With that said I will reevaluate the query and figure out how to tighten up the results.

gregorywolf on 14 Oct 2020

I've been looking at this average protocol requests per page query at @gregorywolf's request and think I understand why. I've submitted a pull request in #1368 to fix this, though it needs reviewing. In the meantime I've added the data from that new query, in addition to Greg's, to the spreadsheet and it adds up to 100% (though we still have the null protocol requests we've discussed before).

I've also added a second query showing the percentile of sites using HTTP/2 or above and it makes interesting reading I think:

Did you know that less than 7% of sites make no HTTP/2 or QUIC requests at all? Guess the likes of popular third parties (e.g. Google Analytics, Google Fonts, Facebook/Twitter advertising tracking tags) all supporting HTTP/2 or above mean just about everyone (well 93% of sites), use at least a little of the new protocols.

And 10% of sites make only HTTP/2 or QUIC requests - with no HTTP/1.1 requests at all! Originally I thought that was quite high, but the more I think about it, the more I'm surprised it's not higher since we know about half of home pages are now served over HTTP/2 and you'd think that most popular third-parties would have adopted it by now. Still it's more than the 7% of HTTP/1 only sites 🙂
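
As a rough illustration of how per-site figures like these can be derived, here is a minimal Python sketch that computes each page's share of requests made over HTTP/2 or above, then the fraction of pages at 0% and at 100%. It is not the merged SQL; the `(page, protocol)` input shape and the set of protocol strings treated as "HTTP/2 or above" are assumptions for the example.

```python
from collections import defaultdict

# Assumed protocol values counted as "HTTP/2 or above" (compared lowercased).
H2_PLUS = {"http/2", "http/2+quic/46", "h3-q050", "quic"}


def h2_plus_share_by_page(requests):
    """requests: iterable of (page, protocol) pairs -> {page: fraction in [0, 1]}."""
    totals = defaultdict(int)
    h2_plus = defaultdict(int)
    for page, protocol in requests:
        totals[page] += 1
        if protocol.lower() in H2_PLUS:
            h2_plus[page] += 1
    return {page: h2_plus[page] / totals[page] for page in totals}


def summarize(shares):
    """Fraction of pages with no HTTP/2+ requests and with only HTTP/2+ requests."""
    pages = len(shares)
    none = sum(1 for share in shares.values() if share == 0)
    only = sum(1 for share in shares.values() if share == 1)
    return {"no_h2_plus": none / pages, "only_h2_plus": only / pages}
```

Because every request on the page is counted, a page whose own HTML is served over HTTP/1.1 still gets a non-zero share as soon as one third-party request uses HTTP/2.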

Interesting stats I thought anyway, but would like someone to double check my work to make sure I'd not made a mistake in this. @gregorywolf can you look over the new queries for a start and then will also hopefully get someone else on the analysts team to check too. Will let you all know if they are changed and when merged.

bazzadp on 19 Oct 2020

Thanks @bazzadp. I was looking into this data last night and wondered what we could capture above and beyond the percentage of total requests over HTTP/2. If 50% of first-party HTML is now HTTP/2, how does that compare with last year?
I like the percentiles concept, but as you mention it will reflect the common third-party tags.
@gregorywolf Do we have the data to show resource level distributions ? Interested to see the distribution for common static asset serving domains.

dotjs on 19 Oct 2020

I added the 2019 percentiles for requests by site to the sheet for comparison. Not too different truth be told, though numbers have gone up as expected.

Last year I looked at all home pages (about 36% of home pages were served over HTTP/2) and also HTTPS only, since HTTP/2 is only supported in browsers over HTTPS (about 55% of HTTPS home pages were served over HTTP/2). It looks like we tried to gather that again this year and it looks to be 50% overall and 65% for HTTPS.

We could look at just domains matching the home page, however that will exclude shared asset domains (e.g. assets.example.com). Might be better to look at the Third Party chapter for other ideas to quantify this?

bazzadp on 19 Oct 2020

Hi. I have reviewed the changes made by @bazzadp and agree the new results look good

gregorywolf on 19 Oct 2020

@dotjs in case you missed it, we've adjusted the milestones to push the launch date back from November 9 to December 9. This gives all chapters exactly 7 weeks from now to wrap up the analysis, write a draft, get it reviewed, and submit it for publication. So the next milestone will be to complete the first draft by November 12.

However if you're still on schedule to be done by the original November 9 launch date we want you to know that this change doesn't mean your hard work was wasted, and that you'll get the privilege of being part of our "Early Access" launch.

Please see the link above for more info and reach out to @rviscomi or me if you have any questions or concerns about the timeline. We hope this change gives you a bit more breathing room to finish the chapter comfortably and we're excited to see it go live!

obto on 22 Oct 2020

Yes saw the note. Just had a very busy week or so so the extra time is useful. Will continue with the analysis and draft.

dotjs on 25 Oct 2020

Hi all

In a previous comment above, I'd commented on the fact that 4% of requests did not list the protocol. I'd mentioned that I'd identified one reason and submitted a fix to WebPageTest, and that the results would be available after the October crawl. It now looks like that crawl has finished, so I can share these results with you.

The results are in this sheet but will summarise them for you here.

We have basically three versions of the HTTP protocol:

protocol as reported by Chrome - this is what we used for our analysis but it's sometimes blank.
response protocol as parsed from HTTP/1.1 responses (e.g. HTTP/1.1 200 OK response lines)
request protocol as parsed from HTTP/1.1 requests (e.g. GET / HTTP/1.1 request lines)

The bug was in processing the last two incorrectly meaning it included blank lines, and also bits of the HTTP/2 pseudo headers.

It is also possible to get slightly different versions if the client requests HTTP/1.0 (or even HTTP/0.9) and gets a response back as HTTP/1.1. If we look at them in that order of precedence, we got the below in the August crawl:

| http_version | desktop | mobile |
| -- | -- | -- |
| (blank) | 3.95% | 0.34% |
| 1.1 | 0.00% | 0.00% |
| : / | 0.53% | 0.01% |
| http/0.9 | 0.00% | 0.00% |
| http/1.0 | 0.04% | 0.03% |
| http/1.1 | 30.56% | 34.09% |
| HTTP/2 | 63.70% | 63.78% |
| http/2+quic/46 | 1.20% | 1.70% |
| me: | 0.00% | |
| od: | 0.00% | 0.00% |
| ori | 0.01% | 0.00% |
| Grand Total | 99.99% | 99.95% |

So here we see our problem: 3.95% of desktop requests are unclassified, and there is also some rubbish (1.1, :/, me:, od:, ori - the latter three being the incorrect parsing of the HTTP/2 :status, :method and :origin pseudo headers). As can be seen, it affects Desktop more than Mobile for some reason. It was my opinion that that 3.95% was most likely HTTP/1.1 requests, as then desktop and mobile would be roughly in line, but I wanted to confirm this.

The October crawl results are shown below:

| http_version | desktop | mobile |
| -- | -- | -- |
| (blank) | 0.05% | 0.07% |
| h3-Q050 | 0.95% | 1.33% |
| http/0.9 | 0.00% | 0.00% |
| http/1.0 | 0.03% | 0.03% |
| http/1.1 | 33.28% | 32.93% |
| HTTP/2 | 65.69% | 65.62% |
| QUIC | 0.01% | 0.00% |
| Grand Total | 100.01% | 99.98% |

So, pleasingly, there are now very few unclassified results (0.05% for desktop and 0.07% for mobile) and mobile and desktop are very much in line. Mobile has a few more h3-Q050 results, which started rolling out in Chrome in October, and a few fewer HTTP/2 results, but those h3-Q050 results most likely would have been HTTP/2 if it was not switched on at the time of the desktop crawl, at which point they are very similar.

Looking at the underlying stats in the October sheet, it looks like there is still some gibberish for the request_http_version, which I'll see if I can fix, but as that's used with the lowest precedence it's only picked up for 1 site in each crawl (where it is correctly set!) so that can be ignored for now. Will try to fix it before next year's run!

So I think it's safe to say the unclassified 3.95% is mostly HTTP/1.1. And hopefully next year we'll not have this anomaly in our stats.

Let me know if you have any questions.

bazzadp on 25 Oct 2020

@bazzadp wrote:
Mobile has a few more h3-Q050 results, which started rolling out in Chrome in October

That's not quite right. Chrome rolled out h3-Q050 in June 2020. In October 2020, Chrome rolled out h3-29 in addition to h3-Q050. In other words:

from June 2019 to June 2020: Chrome used http/2+quic/46 (that version is sometimes also referred to as h3-Q046)
from June 2020 to October 2020: Chrome used h3-Q050
since October 2020: Chrome supports both h3-29 and h3-Q050 and uses the one that the server prefers

The above should apply equally to Desktop and Mobile.

DavidSchinazi on 25 Oct 2020

Ah sorry, you're right - difficult to keep up with all these version numbers! Then maybe the difference might be just due to the different (and extra) sites the mobile crawl covers? We crawled 16% more mobile sites than Desktop in October, and some of them are different, so that might explain it (e.g. if desktop sites are more corporate sites with less Google Analytics and Google AdWords, etc.). Anyway, I'm guessing now.

Still, I think the finding stands that the missing 4% on desktop is mostly HTTP/1.1. You can see this by filtering the October sheet on where protocol is blank: 3.63% of the desktop requests fall into this category but with a response version of HTTP/1.1 based on parsing the response itself.

You agree?

Btw I also submitted that further fix to WPT to avoid the weird request versions we still see, and @pmeenan has kindly merged it already. So it should be in a much better state next year.

I do wonder why Chrome fails to set the version for these ~4% of HTTP/1.1 requests in the protocol field though, and so why WPT has to fall back to finding it by parsing the response? I might dig up some examples and raise a bug with the Chrome team if I do figure it out. Unless anyone here has any ideas?

bazzadp on 25 Oct 2020

That's definitely odd. Please do file a bug at https://crbug.com ideally with repro steps (such as an example URL that's causing issues) if possible

DavidSchinazi on 25 Oct 2020

@bazzadp . I took a look at the third parties chapter and I think there is some interesting info if we join against the third_parties table
third_party AS ( SELECT category, domain FROM `httparchive.almanac.third_parties` WHERE date = '2020-08-01' )
I started to dig into the distributions of 1st vs 3rd party by protocol and category, but as I am no longer an analyst it's no longer free to query. Could someone run a query joining this with the protocol request count and possibly content type?

dotjs on 25 Oct 2020

@dotjs do you have the query you want to run to hand? Or didn't get that far?

bazzadp on 25 Oct 2020

@dotjs Please provide some more detail about what you're looking to see and I will get the query run and post the results

gregorywolf on 26 Oct 2020

@dotjs , after our discussion on slack, I stole some queries and adjusted them to include the % of protocol of HTTP/2 and QUIC and came up with the following two metrics: https://docs.google.com/spreadsheets/d/1op_UrJGo7CGRXWy5iK7-aQ1lHALEm4_8gkXM2huyvL0/edit?usp=sharing

Let me know if that's along the lines of what you are thinking and, if so, can work with @gregorywolf to add these queries to the report and run against the full data set (the results are on a 10k random sample set).

Or if there's some other way you'd rather see the data then let us know.

bazzadp on 26 Oct 2020

I've updated the functions for identifying h3 (h3-\d+=) and Google QUIC ((quic|h3-[qt]\d{3})=) in the Alt-Svc page, as well as added two additional columns to identify cross-host entries (="[^":]+:) and to extract the max-age. I'm not trying to handle multiple max-ages used in the same header since it doesn't appear common from a cursory glance.
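
To make the classification above concrete, here is a small Python sketch that applies the same regular expressions to a raw Alt-Svc header value. It is an illustration rather than the spreadsheet formulas themselves: the function name, the returned keys, the `ma=` pattern and the case-insensitive matching (so that tokens such as h3-Q050 are caught) are assumptions. Like the approach described, it only extracts the first max-age it finds.

```python
import re

# Patterns quoted in the comment above; IGNORECASE is an assumption so that
# mixed-case tokens such as h3-Q050 and h3-T051 are matched as well.
H3_IETF = re.compile(r'h3-\d+=', re.IGNORECASE)
GOOGLE_QUIC = re.compile(r'(quic|h3-[qt]\d{3})=', re.IGNORECASE)
CROSS_HOST = re.compile(r'="[^":]+:')   # a host name appears before the port
MAX_AGE = re.compile(r'\bma=(\d+)')     # assumed pattern; first max-age only


def classify_alt_svc(header):
    """Return simple flags and the first max-age for one Alt-Svc header value."""
    max_age = MAX_AGE.search(header)
    return {
        "ietf_h3": bool(H3_IETF.search(header)),
        "google_quic": bool(GOOGLE_QUIC.search(header)),
        "cross_host": bool(CROSS_HOST.search(header)),
        "max_age": int(max_age.group(1)) if max_age else None,
    }
```

A same-host entry such as `h3-29=":443"` does not trigger the cross-host flag because the quoted authority starts with the colon, while `h3-29="alt.example.com:443"` does.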

MikeBishop on 30 Oct 2020

@dotjs , after our discussion on slack, I stole some queries and adjusted them to include the % of protocol of HTTP/2 and QUIC and came up with the following two metrics: https://docs.google.com/spreadsheets/d/1op_UrJGo7CGRXWy5iK7-aQ1lHALEm4_8gkXM2huyvL0/edit?usp=sharing

Let me know if that's along the lines of what you are thinking and, if so, can work with @gregorywolf to add these queries to the report and run against the full data set (the results are on a 10k random sample set).

Or if there's some other way you'd rather see the data then let us know.

@dotjs Please take a look at the results that @bazzadp posted. If this data provides the info you are looking for, let me know and I will have the queries run against the full data set and posted in the Results sheet.

gregorywolf on 31 Oct 2020

@gregorywolf The results describe 1st party vs 3rd party. Is it more accurate to describe this as 'known 3rd party' vs other requests (i.e. first party and static hosts)? Could you split the query for HTTP and non-HTTP sites? I'm interested in whether anything is different for the 3rd party distributions, and I think it will disambiguate the not-known 3rd party. If possible I would like to plot a CDF of the distributions. The current data tells me that under 10% of sites have less than 50% of 3rd party requests over HTTP/2 and over half the sites have 95% or more. It might be interesting to look at some more points below 25%. The same comment applies to the breakdown by content type and category.

The other ask I have is join the HTTP/2 non HTTP/2 firstHTML data with a page rank. See https://github.com/HTTPArchive/almanac.httparchive.org/issues/1378 for further details. I think it will be interesting to show if the non-adoption of HTTP/2 is indeed in the long-tail.

dotjs on 31 Oct 2020

@dotjs I have read your request and I am not really following what you are requesting. Please elaborate. Thanks.

gregorywolf on 4 Nov 2020

@gregorywolf

1. Can I see more percentiles than [10,25,50,75,90], as the data for non-3rd party is [0%, 0%, 60%, 100%, 100%]? There is only 1 data point that isn't 0 or 100. Otherwise it is hard to define the shape of the distribution.
2. I wanted this query split by firstHTML over TLS as I wanted to see if there was a significant difference in other assets possibly being served over TLS and hence maybe H2. For TLS origins it is also interesting to see how 3rd party vs non 3rd party H2 distributions look. This is more important than 1.
3. Ask for a join against a page rank. Just to see if the larger sites have all migrated to H2 and it is the smaller sites on Apache/IIS and H1.

dotjs on 4 Nov 2020

A couple of thoughts:

Most chapters standardize on [10,25,50,75,90] to summarize the distribution. If there's a particular value of interest, you could measure the % below that threshold. But for the sake of communicating the distribution to readers, be cautious with how deeply statistical you get, and simpler/fewer percentiles make the distribution easier to digest.

As for page rank, we don't have a reliable data source for that info, so it's not possible. Other chapters have been interested in this too but for consistency I'm discouraging its use.
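
To illustrate the percentile discussion above, here is a minimal Python sketch of the standard [10,25,50,75,90] summary alongside the "% below a threshold" alternative. The nearest-rank percentile definition and the function names are assumptions for the example; the Almanac queries compute percentiles in SQL and may use a different method.

```python
STANDARD_PERCENTILES = [10, 25, 50, 75, 90]


def percentile(values, p):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, with a minimum rank of 1.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


def summarize(values, percentiles=STANDARD_PERCENTILES):
    """Return the value at each requested percentile, in order."""
    return [percentile(values, p) for p in percentiles]


def share_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(1 for value in values if value < threshold) / len(values)
```

For a sharply bimodal set of per-site percentages, the five standard points can come out as mostly 0s and 100s, which is where a single "share below 50%" figure can say more than extra percentiles.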

rviscomi on 4 Nov 2020

I've updated and rerun the queries in my test sheet against the full dataset and with 10% percentiles (plus 5% and 95%). However, I'm not sure it makes much difference - since third-party and CDN support of HTTP/2 is so high, there tends to be a very big, very quick jump from 0% to 100%. So while I think there is a good reason to move away from the fewer "standard percentiles" most other chapters use, to see when that cut happens, I don't think you're going to see much of a spread here.

I'm not sure what value there is in splitting by firstHTML? By definition third party can't be firstHTML. And we know that most sites (especially third parties, which typically use a CDN) are served over HTTPS (though I admit that protocol-relative URLs are probably still common).

Ultimately we're now very late on in the day to be continually adding new metric requests. We had the chance to suggest metrics previously and these didn't come up so think we have to look at what we've got and seeing what we can use from them. While I want you to have the data you need to write this chapter, I'm just concerned that we can start going down rabbit holes here and continually add to the data.

We need to get these bits of SQL added to git and reviewed as part of that, if you are intending to use them and then copy the data to the proper results spreadsheet - it's entirely possible I've made a mistake in this SQL! - and that will take time and effort from the other almanac analysts. So I would strongly suggest calling it a day on the data we have and seeing if we have enough in that to write the chapter.

bazzadp on 5 Nov 2020

Agreed thanks both for keeping me on track

dotjs on 5 Nov 2020

FYI queries have been merged into repo (looks like no mistakes!) - thanks @gregorywolf for submitting #1419 and for copying data to the real spreadsheet.

bazzadp on 6 Nov 2020

The first draft is close enough for review. There are a few questions already raised regarding the HTTP_VERSION of HTTP/2+gQUIC and QUIC. Could @gregorywolf or @barrypollard please confirm the QUIC version in particular. There is a bit more work on H3 in practice and conclusion. I'll try and review that with Lucas tomorrow.

dotjs on 17 Nov 2020

Could @gregorywolf or @barrypollard please confirm the QUIC version in particular.

We observed two values in the August data:

HTTP/2+QUIC/46
QUIC

For the first, this comment further up from @DavidSchinazi says that http/2+quic/46 is sometimes also referred to as h3-Q046.

For the second I'm not sure what the version is. That's all that was reported. The alt-svc header has the following:

"name": "alt-svc",
"value": "h3-29=\":443\"; ma=2592000,h3-27=\":443\"; ma=2592000,h3-T050=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\""

And you can see, if you scroll to the right, there is an alt-svc type of just quic. @DavidSchinazi any ideas what this is?

bazzadp on 18 Nov 2020

quic was the old Alt-Svc name for QUIC, and it communicated the specific version of QUIC using the v parameter.
In this case, quic=":443"; v="46,43" means the exact same thing as h3-Q046=":443", h3-Q043=":443".
We deprecated that old quic format in version h3-Q047, so newer versions such as h3-Q050 no longer use it.
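
To make that equivalence concrete, here is a minimal Python sketch that parses an Alt-Svc value like the one quoted earlier and rewrites the legacy quic entry into its h3-Qnnn equivalents. The function names parse_alt_svc and expand_legacy_quic are made up for illustration, and the parsing is deliberately simplified (it assumes no semicolons inside quoted strings), so treat it as a sketch rather than a spec-complete parser.

```python
import re

# Commas separate Alt-Svc entries, but a comma can also sit inside a quoted
# string (e.g. v="46,43"), so split only on commas followed by an even number
# of double quotes.
_ENTRY_SPLIT = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def parse_alt_svc(value):
    """Return a list of (protocol_id, params) tuples for an Alt-Svc value.

    Simplifying assumption: parameters never contain ';' inside quotes.
    """
    entries = []
    for raw in _ENTRY_SPLIT.split(value):
        parts = [p.strip() for p in raw.split(";")]
        proto, _, authority = parts[0].partition("=")
        params = {"authority": authority.strip('"')}
        for part in parts[1:]:
            key, _, val = part.partition("=")
            params[key.strip()] = val.strip().strip('"')
        entries.append((proto.strip(), params))
    return entries


def expand_legacy_quic(entries):
    """Rewrite quic=...; v="46,43" entries as h3-Q046 and h3-Q043 entries."""
    expanded = []
    for proto, params in entries:
        if proto == "quic" and "v" in params:
            rest = {k: v for k, v in params.items() if k != "v"}
            for version in params["v"].split(","):
                # Google QUIC version 46 becomes the token h3-Q046.
                expanded.append((f"h3-Q{int(version):03d}", dict(rest)))
        else:
            expanded.append((proto, params))
    return expanded
```

Calling expand_legacy_quic(parse_alt_svc(header)) on the header above would turn the trailing quic entry into h3-Q046 and h3-Q043 entries, which the same header already advertises explicitly.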

DavidSchinazi on 18 Nov 2020

👍1

OK, so based on that, both QUIC and HTTP/2+QUIC/46 are basically IETF QUIC? And probably draft 46. Any reason why we have two values for that?

Also it means that gQUIC is basically not measured - which ties in with last year when we didn't see gQUIC in our stats despite it being used, and saw it captured under HTTP/2.

Is that all correct? Or am I misunderstanding this?

If so, @dotjs, it seems like we should just merge those stats under HTTP/3 (possibly with a caveat that it's not the final version of HTTP/3) and make a note that gQUIC is not measured separately from HTTP/2.

bazzadp on 18 Nov 2020

I wouldn't say that Q046 is entirely the same as IETF QUIC (there's no such thing as draft 46), but it's also not the original gQUIC anymore. It's more of an in-between version, with gQUIC evolving toward IETF QUIC over time (correct me if I'm wrong, @DavidSchinazi).

I don't think that type of nuance necessarily has to be conveyed here though, and we can just name this HTTP/3 (though indeed mentioning that these are experimental versions of H3).

rmarx on 18 Nov 2020

Here's the history: Google QUIC was a project at Google providing an alternative to TLS/TCP. At the time, the mindset was that we would run HTTP/2 over QUIC. When QUIC was brought to the IETF, the group decided to make more changes to the HTTP/2-over-QUIC layer and, after those, decided to rename HTTP/2-over-QUIC to HTTP/3. At that time, the IETF decided that the Alt-Svc for the HTTP/3 RFC would be h3 and that the ALPN for IETF QUIC drafts would be h3-nn where nn is the draft number (e.g., the most widely deployed today is h3-29). After this, Google decided to rename Google QUIC versions to match the IETF format: so Google replaced http/2+quic/46 with h3-Q046 - they're still the same version of HTTP and of Google QUIC, it's just that it has a new name that's more consistent with IETF QUIC.

So, today, we have:

h3-nn (where nn is a number) is IETF QUIC draft nn -- today Google supports only h3-29.
h3-Q0nn or h3-T0nn (where nn is a number) is Google QUIC version nn -- today Google supports h3-Q043, h3-Q046, h3-Q050, and h3-T051.
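
As a rough illustration of that naming scheme, the Python sketch below maps a token to the family it belongs to. The function name classify_token and the label strings are hypothetical, and the patterns only encode the rules described in this comment, nothing more.

```python
import re

# Patterns follow the naming rules described above; nothing else is assumed.
_IETF_DRAFT = re.compile(r"^h3-(\d+)$")             # e.g. h3-29
_GOOGLE_QUIC = re.compile(r"^h3-[QT](\d{3})$")      # e.g. h3-Q046, h3-T051
_OLD_GOOGLE_NAME = re.compile(r"^http/2\+quic/(\d+)$", re.IGNORECASE)


def classify_token(token):
    """Return (family, number) for a token; number is None when absent."""
    token = token.strip()
    if token == "h3":
        # The name reserved for the final HTTP/3 RFC.
        return ("HTTP/3", None)
    match = _IETF_DRAFT.match(token)
    if match:
        return ("IETF QUIC draft", int(match.group(1)))
    match = _GOOGLE_QUIC.match(token)
    if match:
        return ("Google QUIC", int(match.group(1)))
    match = _OLD_GOOGLE_NAME.match(token)
    if match:
        # Older spelling of the same Google QUIC version (later h3-Q0nn).
        return ("Google QUIC", int(match.group(1)))
    return ("unknown", None)
```

For example, classify_token("h3-29") gives ("IETF QUIC draft", 29), while classify_token("h3-Q046") and classify_token("http/2+quic/46") both give ("Google QUIC", 46).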

DavidSchinazi on 18 Nov 2020

Ah ok so I did get it completely the wrong way about 😀 Thanks for explaining.

So we only have gQUIC for the time of the crawl and no IETF QUIC (I'm sure that would be different if we crawled now, but we’re basing our data on the August crawl). And so we should just treat QUIC and HTTP/2+QUIC/46 as the same thing, both as gQUIC.

bazzadp on 18 Nov 2020

Yes, anything that involves http/2+quic/nn or Alt-Svc: quic is guaranteed to be gQUIC
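
If it helps with the analysis, that rule could be applied when grouping the reported protocol values. This is only a sketch in Python: protocol_bucket and count_by_bucket are hypothetical helpers, and the actual Almanac queries are written in SQL, so this just shows the grouping logic.

```python
import re
from collections import Counter

# Both reported spellings are treated as gQUIC, per the rule above.
_GQUIC_VALUE = re.compile(r"^(http/2\+quic/\d+|quic)$", re.IGNORECASE)


def protocol_bucket(reported):
    """Map a reported protocol string to 'gQUIC' or leave it unchanged."""
    value = reported.strip()
    return "gQUIC" if _GQUIC_VALUE.match(value) else value


def count_by_bucket(values):
    """Count reported protocol strings after merging the gQUIC spellings."""
    return Counter(protocol_bucket(v) for v in values)
```

With this, HTTP/2+QUIC/46 and QUIC end up in one gQUIC bucket, and every other value is counted under its own name.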

DavidSchinazi on 18 Nov 2020

@dotjs @MikeBishop @LPardue @rmarx @ibnesayeed @pmeenan @Nooshu @gregorywolf @bazzadp All: this chapter's draft is looking great, thank you all for your hard work! If all reviewers have already read it and left their feedback then we're in great shape to have it ready for the launch in two weeks. If not, please try to submit all of your feedback by the end of the week to keep us on schedule. Thanks!

rviscomi on 25 Nov 2020

The chapter looks to be in very good shape. I have provided some feedback.

ibnesayeed on 27 Nov 2020

Part IV Chapter 22: HTTP/2

Content team

Milestones

0. Form the content team

1. Plan content

2. Gather data

3. Validate results

4. Draft content

5. Publication
