Almanac.httparchive.org: Create an almanac_sample dataset

Created on 4 Jun 2019 · 10Comments · Source: HTTPArchive/almanac.httparchive.org

@HTTPArchive/data-analysts

It'd be helpful to have a sample dataset of ~1000 pages to try out queries and play with the data to see what types of metrics are possible.

The dataset should contain tables for each of the different data types:

httparchive.almanac_sample.

blink_features
lighthouse_mobile
requests_ [desktop, mobile]
response_bodies_ [desktop, mobile]
summary_pages_ [desktop, mobile]
summary_requests_ [desktop, mobile]
technologies_ [desktop, mobile]

A random sample of 1000 pages from the most recent 2019_05_01 crawl for both desktop and mobile should do.

@paulcalvano do you have time to set this up?

analysis

Source

rviscomi

All 10 comments

Sure - I’ll work on this tonight.

paulcalvano on 4 Jun 2019

❤1

Just missing the summary_requests.2019_05_01_mobile table from the last crawl. Working to get that regenerated now.

rviscomi on 4 Jun 2019

Finished regenerating that table. While I was gnawing on that, I got to thinking about ways to make sampling easier and drafted this proposal to utilize partitioning and clustering so that analysts could simply use a clause like LIMIT 1000 to avoid paying the exorbitant cost of querying the entire table. WDYT?

rviscomi on 6 Jun 2019

Does Limit 1000 reduce the query cost? Last time I checked, the size of the query remained the same...

dougsillars on 6 Jun 2019

It would with clustering, like magic. The way our datasets are organized today, LIMIT does nothing on cost.

rviscomi on 6 Jun 2019

I've created a dataset named sample_data. The dataset contains samples of all tables based on 1,000 sites and 10,000 sites. Tables are named httparchive.sample_data.summary_pages_mobile_1k and httparchive.sample_data.summary_pages_mobile_10k respectively.

The sets of sites were selected at random, but I've ensured that each sampleset (ie, all the mobile_1k tables) are based on the same data. The SQL I used for this is below -

-- httparchive.sample_data.summary_pages_mobile_1k
SELECT  *
FROM `httparchive.summary_pages.2019_05_01_mobile` 
ORDER BY RAND() ASC
LIMIT 1000

-- httparchive.sample_data.summary_requests_mobile_1k 
SELECT  *
FROM `httparchive.summary_requests.2019_05_01_mobile` 
WHERE pageid IN ( SELECT pageid FROM `httparchive.sample_data.summary_pages_mobile_1k` )

-- httparchive.sample_data.lighthouse_mobile_1k
SELECT  *
FROM `httparchive.lighthouse.2019_05_01_mobile` 
WHERE url IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )

-- httparchive.sample_data.pages_mobile_1k
SELECT  *
FROM `httparchive.pages.2019_05_01_mobile` 
WHERE url IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )

-- httparchive.sample_data.requests_mobile_1k
SELECT  *
FROM `httparchive.requests.2019_05_01_mobile` 
WHERE page IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )

-- httparchive.sample_data.response_bodies_mobile_1k
SELECT  *
FROM `httparchive.response_bodies.2019_05_01_mobile` 
WHERE page IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )

-- httparchive.sample_data.technologies_mobile_1k
SELECT  *
FROM `httparchive.technologies.2019_05_01_mobile` 
WHERE url IN ( SELECT url FROM `httparchive.sample_data.summary_pages_mobile_1k` )

paulcalvano on 17 Jun 2019

🎉1

Table sizes -

^ query for that summary:

SELECT table_id, row_count, ROUND(size_bytes/1024/1024,2) Size_GB
FROM httparchive.sample_data.__TABLES__

paulcalvano on 17 Jun 2019

Thanks Paul! I'm wary about the response bodies sample tables still exceeding the 1TB monthly quota and being prohibitive to casual exploration. It seems like we'd need a super small table (~10 pages) to be reasonably cost effective for multiple queries. It might be necessary to look into partitioning/clustering as a way of optimizing joins and sampling.

rviscomi on 17 Jun 2019

Whoops - that last table was mislabled GB, when it was really MB. Here's the corrected table. response_bodies_mobile_10k is 14.93 GB.

paulcalvano on 17 Jun 2019

😄1

Much better! Thanks Paul!

rviscomi on 17 Jun 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Create templates for content pages

rviscomi · 5Comments

Join the 2020 Editors team

rviscomi · 3Comments

Typo in JS featured snippet

rviscomi · 3Comments

Logo tweak

rviscomi · 5Comments

Write content: Chapter 13. Ecommerce

rviscomi · 6Comments