I've been doing some thinking and background research on what's out there. I'm taking down my thoughts here as a sounding board so we can narrow down to an MVP.
LH team needs to be able to understand metric variance and overall runtime/latency (bonus if possible: accuracy) in different environments, how changes we make are affecting these attributes, and how we are trending over time.
Need to monitor:
Across:
Use Cases:
Running LH n times in a particular environment on given URLs and storing the results in some queryable format
The good news: we have an awesome community that has built lots of things to look at LH results over time :)
The bad news: their big selling points usually revolve around ease of time series data and abstracting away the environment concerns (which is the one piece we will actually need to change up and have control over the most) :/
Only one of the use cases here is really a timeseries problem (and even then it's not a real-time timeseries, it's a commit-level timeseries). That's not to say we can't repurpose a timeseries DB for our use cases (Grafana still supports histograms and all that), it's just a bit of a shoehorn for some of the things we'll want to do.
The other problem: one of the big things we actually care about most in all of this is differences between versions of Lighthouse. Given that abstracting the environment away and keeping it stable is a selling point of all these solutions, breaking in to make comparing versions our priority is really cutting against the grain. Again, not impossible, but not exactly leveraging the strengths of these solutions.
K-I-S-S, keep it simple stupid. Great advice, hurts my feelings every time.
Simple CLI with 2 commands.
- `run` - handles the "run n times and save" piece, with a single JS file for each connector we need to run, just local and LR to start
- `serve` - serves a site that enables the visualization pieces

These two commands share a CLI config that specifies the storage location. I'm thinking SQLite to start, to avoid any crazy Docker mess and still work with some hypothetical remote SQL server. We can include a field for the raw response so we can always add more columns easily later.
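To make the storage piece concrete, here's a rough sketch of what I'm imagining (assuming `better-sqlite3` purely for illustration; the table and column names are hypothetical placeholders, not a final schema):

```js
// Rough sketch of the shared storage layer, assuming better-sqlite3 for illustration.
// Table and column names are hypothetical placeholders, not a final schema.
const Database = require('better-sqlite3');

const db = new Database('dzl.sqlite');
db.exec(`
  CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    batch_id TEXT,      -- one "run n times" invocation
    started_at TEXT,    -- when the batch kicked off
    git_hash TEXT,      -- LH commit/version under test
    environment TEXT,   -- e.g. 'local' or 'LR'
    url TEXT,
    tti_ms REAL,
    runtime_ms REAL,
    raw_response TEXT   -- full LHR JSON so we can add columns later
  );
`);

// The `run` command would call something like this once per Lighthouse run.
function saveRun(run) {
  db.prepare(`
    INSERT INTO runs (batch_id, started_at, git_hash, environment, url, tti_ms, runtime_ms, raw_response)
    VALUES (@batchId, @startedAt, @gitHash, @environment, @url, @ttiMs, @runtimeMs, @rawResponse)
  `).run(run);
}
```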
Thoughts so far? Did I completely miss what the pain is from others' perspective? Does it sound terrifyingly similar to plots? 😱
I didn't explicitly make my case for why I think going with a time series database won't help us very much, and I realize I didn't put up any of my scribbled drawings either, so here goes a cleaned-up version of what I imagined.
For our "health dashboard" at a minimum I'm thinking we need to see
None of these are time series questions, and all of them involve doing mean/median/std dev on a collection of runs matching some git hash/version ID/run ID.
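To illustrate, with the storage sketch above these boil down to plain aggregations over whatever runs share a hash. SQLite doesn't ship median/std dev functions, so a rough sketch would compute them in JS (names match the hypothetical schema from the earlier sketch):

```js
// Sketch: summarize one metric for all runs matching a git hash.
// Assumes the hypothetical `runs` table from the storage sketch above.
function summarizeMetric(db, gitHash, column = 'tti_ms') {
  // `column` comes from our own code, never user input, so interpolation is ok here.
  const values = db
    .prepare(`SELECT ${column} AS v FROM runs WHERE git_hash = ? ORDER BY v`)
    .all(gitHash)
    .map(row => row.v);

  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const median = values[Math.floor(values.length / 2)];
  const stdDev = Math.sqrt(
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length
  );
  return {count: values.length, mean, median, stdDev};
}
```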
We want to be able to jump around hashes and easily compare the same version of LH in multiple environments (which is non-trivial for LR vs. CLI since they will very rarely line up and we'll need to do some nearest-git-neighbor junk).
All this gave me the idea of something like the below.

Bad drawings aside, does the bulleted list sound way off to other folks?
I like everything here in this second comment and agree these will help to illuminate things.
But I also think variance and runtime plotted over our commits would be really valuable so we can see how our fixes improve things. E.g. "Our median run time was 30s last month and now it's 14s, as you can see, and our 90th percentile run time dropped even further, mostly due to this commit."
> But I also think variance and runtime plotted over our commits would be really valuable so we can see how our fixes improve things. E.g. "Our median run time was 30s last month and now it's 14s, as you can see, and our 90th percentile run time dropped even further, mostly due to this commit."
totally agree plotted over commits would be great 👍
This is likely just my timeseries DB incompetence, but it always appeared to me that there's not a "bucket by hash in custom sorted order" style chart that can reuse all the timeseries magic; it always wants timestamps. By that token, I think a graph by time will lose a decent bit of signal that a graph by commit would show. I will keep digging though.
Update here based on my experiences trying to get this to work with InfluxDB + Grafana:
It gets us 80% of the way there really quickly, but there's definitely a substantial amount of fighting with the timeseries nature going on. Most of it has workarounds with extra management/planning, but I think it's unlikely we can satisfy 100% of the use cases outlined in the initial comment. Maybe that's fine and we're happy with the trade-off, but I'll outline what I found the main challenges to be.
Example: there is no GROUP BY x HAVING y-style clause, which would have helped in a few cases. This isn't a huge deal given that we plan ahead and are flexible with how we look at the output. I was able to work around this limitation by preprocessing the points from each URL in a run and using a table for output instead of a single cell.
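For reference, the workaround is basically this kind of thing: pull the per-URL points out and do the HAVING-style filtering ourselves before handing the result to a table panel (purely illustrative; the point shape here is made up):

```js
// Sketch of the "preprocess instead of GROUP BY ... HAVING" workaround.
// `points` is assumed to look like [{url, value}, ...] pulled from the datasource.
function summarizeByUrl(points, minSamples = 5) {
  const byUrl = new Map();
  for (const {url, value} of points) {
    if (!byUrl.has(url)) byUrl.set(url, []);
    byUrl.get(url).push(value);
  }

  // Emulates `GROUP BY url HAVING COUNT(*) >= minSamples` in plain JS.
  return [...byUrl.entries()]
    .filter(([, values]) => values.length >= minSamples)
    .map(([url, values]) => ({
      url,
      mean: values.reduce((sum, v) => sum + v, 0) / values.length,
    }));
}
```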
Workaround: we fake timestamps for each data point so they're unique. The limitation is super minor; we just can't run different jobs concurrently or we risk losing data.
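Concretely, the hack just offsets each point's timestamp by a nanosecond before writing, e.g. when building InfluxDB line protocol (the measurement/tag/field names here are made up):

```js
// Sketch of the fake-timestamp workaround: every point in a batch gets a unique,
// artificially offset nanosecond timestamp so points that share tags don't
// overwrite each other. Measurement/tag/field names are made up for illustration.
function toLineProtocol(batch) {
  const baseNs = BigInt(Date.now()) * 1000000n;
  return batch.runs
    .map((run, i) =>
      // encodeURIComponent is a crude escape for commas/spaces in tag values
      `lh_runs,hash=${batch.hash},url=${encodeURIComponent(run.url)} ` +
      `tti=${run.tti},runtime=${run.runtime} ${baseNs + BigInt(i)}`
    )
    .join('\n');
}
```

It's exactly that uniqueness hack that falls apart if two jobs write at the same time.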
This one feels like I have to be doing something wrong. Even though I've removed every GROUP BY and time-related grouping option I can find, histograms still don't always want to represent every data point. Strangest part: changing the # of buckets changes the total count every time (?????)
I found a workaround that treats each build like the next data point in a time series, but the visualization options become very limited using this approach and it will not scale with many different builds. Examples of the limitations: it's bar chart only (no line graphs or variance bars), you cannot control the units used, etc. Overall it feels like we'd give up on this and just use an uneven time series.
Overall I think we can live with these or build enough tooling around them to limit their impact, but it does mean a lot of use cases will become more difficult, e.g. comparing two different builds will always mean manually selecting their respective hashes and presenting their reports side-by-side instead of some sort of custom in-line diff. Previously folks seemed to think these limitations were worth the tradeoff.
WDYT?
Grafana is sadly great for timeseries but not for other data. Setting it up and making dashboards is really easy 💯, but maybe it's not the stack we need?
If we don't have timeseries data we might want to look at the ELK stack (Elastic + Kibana)? I'm pretty sure they support line graphs with a custom x-axis. The downside is that you need Elastic as a data source, so we can't use SQLite or whatever. (http://logz.io has a free trial we might want to use to test.)
Another way is to create some custom graphs ourselves but that defeats the purpose of K-I-S-S and is something we don't really want to manage.
One more WTF example with the histograms: here's a side-by-side of changing absolutely nothing but the number of buckets on the x-axis.


OK. I think I'm sufficiently convinced that these tools optimize for metrics happening continuously and requiring grouping. Our use case of discrete metrics every hour or day isn't supported well by them (your 3rd bold point).
I appreciate the attempt to make it work, but agree that grafana isn't a great solution for what we're trying to do here.
The ELK stack is not that big a difference. I could set it up on a private instance and give you guys access to import some data and make the correct graphs. Then you don't have to be bothered with setting up the stack.
OK, so I think we're all on the same page about timeseries solutions not being the best for us. Before jumping into the next attempt, I want to get some clarity on what exactly we want to show up in the dashboard.
This is what the grafana one looked like:

It surfaced:
- Table of mean run time by URL
- Mean TTI std dev over time
@brendankenny you mentioned you had different metrics in mind
@exterkamp you mentioned you really wanted a line graph for the over-time view
Any other feedback on this set of metrics before I go off?
Hey, so yeah, I was thinking that I want to see data commit-over-commit so that I could see if a specific commit has introduced a problem, or see that variance has been reduced.
I am still liking the idea of a candlestick graph with the data like this:

This would allow us to visualize when the variance was narrowing, i.e. the std dev going down over time:

Or if a specific commit increased variance:

So that is kind of how I like to visualize the scores over time, either with a candlestick chart, or with a line chart + shaded area of +/- 1-2 std dev around it to show the variance.
I like the current visualizations, esp. broken down by URL. But personally I want to see line charts/candlestick charts that show me what each metric is doing over time so that I can see if something is getting out of hand or degrading slowly. But for snapshots I like all the called-out percentages and variance boxes coded red/yellow/green.
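To make sure we're picturing the same thing, each commit would get one candle built from that commit's batch of runs. A rough sketch of the mapping I have in mind (the mean +/- 1 std dev body and min/max wicks are just one option, and the run shape here is hypothetical):

```js
// Sketch: turn one commit's batch of runs into a candlestick datapoint.
// Body = mean +/- 1 std dev, wicks = min/max; this mapping is just one option.
function toCandle(gitHash, runs, metric = 'tti_ms') {
  const values = runs.map(run => run[metric]);
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const stdDev = Math.sqrt(
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length
  );
  return {
    gitHash,
    open: mean - stdDev,  // bottom of the candle body
    close: mean + stdDev, // top of the candle body
    low: Math.min(...values),
    high: Math.max(...values),
  };
}
```

The line chart variant would just plot the mean with a shaded band from `open` to `close`.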
hey @patrickhulce have you looked into Superset?
(Disclaimer: I used to work on the airflow DAG ingestion and viz on a tool that used Superset so I like it and it's _python_)
Made some candles with some of the dumped data to show what variance in run duration could look like over multiple commits in candle form.

I totally dig candlesticks (though I must say this name is new to me; is it different in any particular way from a box plot or are they the same thing?). This is also how I imagined visualizing the data over time 👍
I've spent some time with Superset now, and I think it might be overkill for our use case. The basic installation sets up multiple DBs with a message broker and a stateless Python server. We are like the farthest thing from big data and its super-scalable selling points :)
Perhaps it's the incubator status, or the docs don't have the same love as the impl, or I'm just finding all the wrong docs 😅, but I ran into several roadblocks where the setup docs led to an error (even their super simple Docker ones had the wrong commands 😕) and I had to peruse the source to fix it and move on. After struggling with still-broken dashboards once everything else was set up, I took a whack at a bespoke frontend.
In the same time it took to build the docker-compose setup and troubleshoot the dashboard, this is what I got goin':

It can be deployed as two static files to create a new now.sh URL for every PR or w/e, or we can build out a fancier setup with customizable queries and whatnot if need be. Do we envision needing to create lots of exploratory new queries and dashboards such that the nice GUI dashboard creator of Superset would be worth the complexity?
If you guys are swamped and would prefer I just move forward with what I think will help the most and hope it's close enough to what you'd ideally want, you can say that too :)
@patrickhulce you got this. plenty of things to bikeshed but i think we're aligned on the general approach. +1 to moving forward.
So I've been trying to use DZL on a few PRs so far, and I've run into a few issues that really hamper its usability...
This is mostly solved by creating complex static sites we can measure; it's mainly CNN/news sites that change images, ads, etc. within hours. Here's an example of SFGate TTI with 0 Lighthouse changes:

I don't have a great answer for this one. Maybe we try to take a better baseline of each machine? Here's an example with 0 FCP-affecting changes on the localhost fonts smoketest:

This one has an easy answer, I just need to build this! Just didn't expect I'd want it so quickly :)
@Hoten I'd be really eager to hear what you didn't like about the first DZL experience.
What did you want to see? Did you have no clue where to start? etc :)
Do the results stream in? When I first looked at DZL for #6730, there were only 2 sites in the hash-to-hash page. But I see all of them now. At first I figured I broke something.
Since the above PR introduced a new audit, there's not a graph for the variance of that audit's score. Would be nice to still show it, even with nothing to compare it to.
> Since the above PR introduced a new audit, there's not a graph for the variance of that audit's score. Would be nice to still show it, even with nothing to compare it to.
Great point 👍
> When I first looked at DZL for #6730, there were only 2 sites in the hash-to-hash page.
I need to fix the automatic hash selection. It automatically picks the most recent batch, but sometimes that's a batch that just started instead of the most recent one that's done. You can try one of the older official-ci batches to find one that has all the URLs if that happens again.
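The fix I have in mind is roughly "newest batch that has results for every expected URL" instead of just "newest batch", something like this (reusing the hypothetical `runs` table from the earlier storage sketch; `expectedUrlCount` would come from the job config):

```js
// Sketch of the fix: pick the newest batch that actually has all expected URLs
// instead of just the newest batch. Reuses the hypothetical `runs` table/columns
// from the earlier storage sketch; `expectedUrlCount` would come from job config.
function latestCompleteBatch(db, expectedUrlCount) {
  return db
    .prepare(`
      SELECT batch_id
      FROM runs
      GROUP BY batch_id
      HAVING COUNT(DISTINCT url) >= ?
      ORDER BY MAX(started_at) DESC
      LIMIT 1
    `)
    .get(expectedUrlCount);
}
```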
Closing this in favor of a new tracking issue #6775