@HTTPArchive/data-analysts
It'd be helpful to have a sample dataset of ~1000 pages to try out queries and play with the data to see what types of metrics are possible.
The dataset should contain tables for each of the different data types:
httparchive.almanac_sample.
blink_featureslighthouse_mobilerequests_ [desktop, mobile]response_bodies_ [desktop, mobile]summary_pages_ [desktop, mobile]summary_requests_ [desktop, mobile]technologies_ [desktop, mobile]A random sample of 1000 pages from the most recent 2019_05_01 crawl for both desktop and mobile should do.
@paulcalvano do you have time to set this up?
Sure - I鈥檒l work on this tonight.
Just missing the summary_requests.2019_05_01_mobile table from the last crawl. Working to get that regenerated now.
Finished regenerating that table. While I was gnawing on that, I got to thinking about ways to make sampling easier and drafted this proposal to utilize partitioning and clustering so that analysts could simply use a clause like LIMIT 1000 to avoid paying the exorbitant cost of querying the entire table. WDYT?
Does Limit 1000 reduce the query cost? Last time I checked, the size of the query remained the same...
It would with clustering, like magic. The way our datasets are organized today, LIMIT does nothing on cost.
I've created a dataset named sample_data. The dataset contains samples of all tables based on 1,000 sites and 10,000 sites. Tables are named httparchive.sample_data.summary_pages_mobile_1k and httparchive.sample_data.summary_pages_mobile_10k respectively.
The sets of sites were selected at random, but I've ensured that each sampleset (ie, all the mobile_1k tables) are based on the same data. The SQL I used for this is below -
-- httparchive.sample_data.summary_pages_mobile_1k
SELECT *
FROM `httparchive.summary_pages.2019_05_01_mobile`
ORDER BY RAND() ASC
LIMIT 1000
-- httparchive.sample_data.summary_requests_mobile_1k
SELECT *
FROM `httparchive.summary_requests.2019_05_01_mobile`
WHERE pageid IN ( SELECT pageid FROM `httparchive.sample_data.summary_pages_mobile_1k` )
-- httparchive.sample_data.lighthouse_mobile_1k
SELECT *
FROM `httparchive.lighthouse.2019_05_01_mobile`
WHERE url IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )
-- httparchive.sample_data.pages_mobile_1k
SELECT *
FROM `httparchive.pages.2019_05_01_mobile`
WHERE url IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )
-- httparchive.sample_data.requests_mobile_1k
SELECT *
FROM `httparchive.requests.2019_05_01_mobile`
WHERE page IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )
-- httparchive.sample_data.response_bodies_mobile_1k
SELECT *
FROM `httparchive.response_bodies.2019_05_01_mobile`
WHERE page IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )
-- httparchive.sample_data.technologies_mobile_1k
SELECT *
FROM `httparchive.technologies.2019_05_01_mobile`
WHERE url IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )
Table sizes -

^ query for that summary:
SELECT table_id, row_count, ROUND(size_bytes/1024/1024,2) Size_GB
FROM httparchive.sample_data.__TABLES__
Thanks Paul! I'm wary about the response bodies sample tables still exceeding the 1TB monthly quota and being prohibitive to casual exploration. It seems like we'd need a super small table (~10 pages) to be reasonably cost effective for multiple queries. It might be necessary to look into partitioning/clustering as a way of optimizing joins and sampling.
Whoops - that last table was mislabled GB, when it was really MB. Here's the corrected table. response_bodies_mobile_10k is 14.93 GB.

Much better! Thanks Paul!