SPLOMs are coming to plotly.js.
For the uninitiated, docs on the python api scatterplotmatrix figure factory are here. Seaborn calls it a pairplot. Matlab has a plotmatrix function.
Some might say that SPLOMs are already part of plotly.js: all we have to do is generate traces for each combination of variables and plot them on an appropriate axis layout (example).
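For reference, here is a minimal sketch of that manual approach. `axisId` and `buildManualSplom` are hypothetical helper names (not plotly.js functions), and the layout simply splits the plot area evenly with no gutter:

```javascript
// Sketch only: these helpers are hypothetical, not part of plotly.js.

// plotly.js axis ids are 'x', 'x2', 'x3', ... (the first one has no number);
// the matching layout keys are 'xaxis', 'xaxis2', 'xaxis3', ...
function axisId(prefix, i) {
  return prefix + (i ? i + 1 : '');
}

function buildManualSplom(fields) {
  var n = fields.length;
  var data = [];
  var layout = {};

  // one x axis and one y axis per variable, each taking 1/n of the plot area
  for (var i = 0; i < n; i++) {
    layout[axisId('xaxis', i)] = {domain: [i / n, (i + 1) / n]};
    layout[axisId('yaxis', i)] = {domain: [i / n, (i + 1) / n]};
  }

  // one scatter trace per (row, col) pair, so n * n traces in total
  for (var row = 0; row < n; row++) {
    for (var col = 0; col < n; col++) {
      data.push({
        type: 'scatter',
        mode: 'markers',
        x: fields[col],
        y: fields[row],
        xaxis: axisId('x', col),
        yaxis: axisId('y', row)
      });
    }
  }

  return {data: data, layout: layout};
}

// Usage (in a page where plotly.js is loaded and gd is the graph div):
// var fig = buildManualSplom([[1, 2, 3], [4, 5, 6], [7, 8, 9]]);
// Plotly.newPlot(gd, fig.data, fig.layout);
```

Every variable shows up in `2n` traces here, which is where the data duplication comes from once the figure is serialized.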
But, this technique has a few limitations and inconveniences:
Numerous solutions are available. This issue will attempt to spec out the best one.
cc @dfcreative @alexcjohnson @cldougl @chriddyp
Add a new do-it-all splom (and possibly a splomgl too) trace type that generates its own _internal_ scatter traces and its own axes - with an api similar to parcoords:
trace = {
dimensions: [{
values: [/* */],
// some scatter style props ...
// some axis props reused from cartesian axes
}],
// some splom-wide options e.g.:
showdiagonal: true || false,
showupperhalf: true || false,
showlowerhalf: true || false,
direction: 'top-left-to-bottom-right' || 'bottom-left-to-top-right',
// ...
}
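To make the splom-wide options a bit more concrete, here is a rough sketch of how they could translate into a list of drawn cells. `visibleCells` is a hypothetical helper, and it assumes rows and columns are counted from the top-left, so `col > row` is the upper half:

```javascript
// Sketch only: visibleCells is hypothetical and not part of any plotly.js API.
// Assumes (row, col) indices start at the top-left corner of the matrix.
function visibleCells(n, opts) {
  var cells = [];
  for (var row = 0; row < n; row++) {
    for (var col = 0; col < n; col++) {
      var show;
      if (row === col) show = opts.showdiagonal;
      else if (col > row) show = opts.showupperhalf;
      else show = opts.showlowerhalf;

      if (show) cells.push([row, col]);
    }
  }
  return cells;
}

// 4 dimensions, diagonal + lower half only: 4 + 6 = 10 subplots instead of 16
var cells = visibleCells(4, {
  showdiagonal: true,
  showupperhalf: false,
  showlowerhalf: true
});
```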
Port make_subplots and append_traces from the python api to plotly.js (docs). For example:
var Plotly = require('plotly.js')
var fields = [
[/* */],
[/* */],
// ...
]
var layout = Plotly.makeSubplots({rows: fields.length, cols: fields.length})
var data = []
for (var i = 0; i < fields.length; i++) {
for (var j = 0; j < fields.length; j++) {
var trace = {
mode: 'markers',
x: fields[i],
y: fields[j]
}
Plotly.linkToSubplot(trace, i, j)
data.push(trace)
}
}
Plotly.newPlot(gd, data, layout)
This could be combined with solution 2 to solve the data-array-duplication problem. But this would require some backend work for plot.ly support.
In short, we could add a new _top-level_ argument to Plotly.newPlot and Plotly.react:
var columns = [
{name: 'col 0', values: [/* */]},
{name: 'col 1', values: [/* */]},
// ...
]
// unfortunately, in this paradigm columns should really be labeled data,
// and data -> traces
var data = [{
x: 'col 0',
y: 'col 1'
}, {
x: 'col 1',
y: 'col 0'
}]
Plotly.newPlot(gd, {
columns: columns,
data: data,
layout: {}
})
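A rough sketch of what the column lookup could look like on the plotly.js side. `resolveColumns` is a hypothetical name, and this version only treats string values of `x` and `y` as column references:

```javascript
// Sketch only: resolveColumns is hypothetical, not an existing plotly.js function.
// It swaps column names for the shared value arrays without copying them.
function resolveColumns(columns, traces) {
  var byName = {};
  columns.forEach(function(col) {
    byName[col.name] = col.values;
  });

  return traces.map(function(trace) {
    var out = {};
    Object.keys(trace).forEach(function(key) {
      var val = trace[key];
      var isRef = (key === 'x' || key === 'y') &&
        typeof val === 'string' &&
        Object.prototype.hasOwnProperty.call(byName, val);
      // a reference resolves to the very same array object, not a copy
      out[key] = isRef ? byName[val] : val;
    });
    return out;
  });
}

var exampleColumns = [
  {name: 'col 0', values: [1, 2, 3]},
  {name: 'col 1', values: [4, 5, 6]}
];
var resolved = resolveColumns(exampleColumns, [
  {x: 'col 0', y: 'col 1'},
  {x: 'col 1', y: 'col 0'}
]);
```

Both resolved traces point at the same two arrays, so the data is stored once in memory even though it is used four times.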
I think it's clear we want to encapsulate a splom in a single trace, like solution 1. Solution 2 won't give the necessary performance benefits. Solution 3 may give some of the performance we need, and may be useful for more generalized trace linking in the future (for example, things like 2dhistogram_contour_subplots where the x and y data are duplicated in the scatter and histogram2dcontour traces, then x and y each get another copy in the 1D histograms) but will still suffer from duplication at the calc/plot level, that I suspect will be prohibitive for us. Likewise it seems to me it's only reasonable to make this as a WebGL type.
The question in my mind is whether we can do it by linking the splom trace to regular cartesian axes, and using it to tailor the defaults for those axes, or if we need to have even the axes encapsulated in the trace itself. If we can do the former, then we retain the flexibility to display other traces on those same subplots. Extra data that we only have for one attribute pair, for example, or a curve fit, or some different type of display on the diagonal. Or even another splom that might even have a disjoint set of dimensions from the first (might be a huge headache but see below for more thoughts)
trace = {
dimensions: [{
values: [/* */],
name: 'Sepal Width', // used as default x/y axis titles
xaxis: 'x' | 'x2' ... // defaults to ith x axis ID for dimension i
yaxis: 'y' | 'y2' ...
}],
marker: {
// just like scatter, and all the same ones are arrayOk.
// goes outside the `dimensions` array because the same data point should get
// the same marker in all subplots.
},
// domain settings - not used directly, just fed into the defaults for all the
// individual x/y axis domains
domain: {
// total domain to be divided among all x / y axes
x: [0, 1],
y: [0, 1],
// blank space between x axes, as a fraction of the length of each axis
// possibly xgutter and ygutter?
gutter: 0.1
},
// some splom-wide options e.g.:
// maybe turn these into a flaglist 'upper+lower+diagonal'?
// these and related attrs will affect the default x/y axis anchor and/or side attributes
showdiagonal: true || false,
showupperhalf: true || false,
showlowerhalf: true || false,
// maybe xdirection and ydirection?
direction: 'top-left-to-bottom-right' || 'bottom-left-to-top-right',
// ...
};
layout = {
xaxis: { /* overriding any of the defaults set by SPLOM */ },
xaxis2: { /* */ },
xaxis3: { /* */ },
... ,
yaxis: { /* */ },
...
};
One variation that might be nice but I'm not sure: separate the list of axes from the dimensions. This could make it easier for example to reorder the dimensions without having to do all sorts of gymnastics with swapping axis attributes (though we might need to swap axis titles still, if they're not inherited from the dimension names):
trace = {
dimensions: [{
values: [/* */],
name: 'Sepal Width' // used as default x/y axis titles
// some scatter style props ...
}],
xaxes: ['x', 'x2', 'x3', ...], // defaults to the first N x axis IDs. info_array, Not data_array.
yaxes: ['y', 'y2', 'y3', ...],
...
}
Also, it might be nice to move the axis arrangement to layout, but still have splom provide defaults for this. That way we could reuse it for other cases that want a grid of axes, not just splom:
// splom trace would still have axis ids in it but no axis layout info (domain or gutter)
layout = {
grid: {
xaxes: ['x', 'x2', 'x3', ...],
yaxes: ['y', 'y2', 'y3', ...],
domain: { x: [0, 1], y: [0, 1] },
gutter: 0.1
}
}
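As a sketch of how `grid.domain` and `grid.gutter` could be turned into per-axis domains, assuming (as in the comment above) that the gutter is a fraction of the length of each axis. `gridDomains` is a hypothetical helper name:

```javascript
// Sketch only: gridDomains is hypothetical.
// n axes of length len with (n - 1) gaps of gutter * len must fill the total span:
//   n * len + (n - 1) * gutter * len = span
function gridDomains(n, domain, gutter) {
  var span = domain[1] - domain[0];
  var len = span / (n + (n - 1) * gutter);
  var step = len * (1 + gutter);
  var out = [];
  for (var i = 0; i < n; i++) {
    var start = domain[0] + i * step;
    out.push([start, start + len]);
  }
  return out;
}

// three x axes sharing the full width, with gaps of 10% of an axis length
var xDomains = gridDomains(3, [0, 1], 0.1);
```

The same function would be called once for the x axes and once for the y axes, which is one argument for possibly splitting gutter into xgutter and ygutter.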
Cases like splom would use 1D arrays of x/y axes, as all rows share the same x axes and all columns share the same y axes, but we could also allow 2D arrays for when you want a grid of uncoupled axes. And if you put '' in any entry it leaves that row/col/cell blank, and at some point we can make a way to refer to empty cells in other trace/subplot types - so in a pie trace or a 3d scene etc you could add something like gridcell: [1, 2] which would automatically generate the appropriate domain for you.
Actually, this would make it easy to support multiple splom traces regardless of whether they have the same or different dimensions:
- In supplyDefaults, we'd look through all splom traces and find the full set of xaxes and yaxes to use as the defaults in fullLayout.grid (but the user could override these lists if they wanted) as well as to populate the axis and subplot lists in fullLayout._subplots.
- In fullLayout.grid, we'd coerce grid.domain and grid.gutter.
- […] gridcell attributes), default domain values would be generated based on grid.
- After the supplyDefaults step, grid and gridcell attributes would be ignored because the appropriate domain values would have been filled in already.
That way all of this would happen automatically if you just make a splom trace with N dimensions and don't say anything about its layout, but you could alter it all at various stages if you want to.
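The first step could look roughly like the sketch below. `collectGridAxes` and `gridAxesDefaults` are hypothetical names and this is not how plotly.js actually structures its defaults code. It only shows the "union across splom traces, user lists win" idea:

```javascript
// Sketch only: hypothetical helpers illustrating the defaults logic.

// gather the full, de-duplicated set of axis ids used by all splom traces
function collectGridAxes(traces) {
  var found = {xaxes: [], yaxes: []};

  function addUnique(list, ids) {
    (ids || []).forEach(function(id) {
      if (list.indexOf(id) === -1) list.push(id);
    });
  }

  traces.forEach(function(trace) {
    if (trace.type !== 'splom') return;
    addUnique(found.xaxes, trace.xaxes);
    addUnique(found.yaxes, trace.yaxes);
  });

  return found;
}

// user-supplied grid lists override the ones derived from the traces
function gridAxesDefaults(traces, userGrid) {
  var found = collectGridAxes(traces);
  var grid = userGrid || {};
  return {
    xaxes: grid.xaxes || found.xaxes,
    yaxes: grid.yaxes || found.yaxes
  };
}

var gridAxes = gridAxesDefaults([
  {type: 'splom', xaxes: ['x', 'x2'], yaxes: ['y', 'y2']},
  {type: 'splom', xaxes: ['x2', 'x3'], yaxes: ['y2', 'y3']},
  {type: 'scatter', xaxis: 'x4', yaxis: 'y4'}
]);
```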
What I'm trying to avoid above, though it might give even higher performance at the expense of flexibility
since the axis rendering could be tailored to the splom case:
trace = {
dimensions: [{
values: [/* */],
xaxis: { /* all the x axis attributes like title, tick/grid specs, fonts, etc */ },
yaxis: { /* same for y - or these could go in xaxes/yaxes arrays but still in the trace */ }
}]
}
or in trace.xaxes and trace.yaxes which would be arrays of objects rather than arrays of IDs... either way the point is no other traces would be able to use these axes, which means they could use stripped down rendering machinery for better performance but less flexibility.
My hope though is that the SVG axis machinery is fast enough, especially if we avoid having splom contribute to fullLayout._subplots.cartesian or fullLayout._subplots.gl2d (which would scale quadratically with number of dimensions, vs the number of x/y axes, fullLayout._subplots.(x|y)axis, which scale linearly) so we only draw the axes in SVG, and let splom draw gridlines (if required) in WebGL.
Thanks for the :books: @alexcjohnson
I'm a big fan of those xaxes and yaxes info arrays in the traces :+1: Using the plural here is great as they won't conflict with the current xaxis / yaxis trace attributes.
About your grid proposal, I'm curious to see if we could combine the numerous _xy subplot-wide but not graph-wide_ requested settings in them (https://github.com/plotly/plotly.js/issues/1468, https://github.com/plotly/plotly.js/issues/233, https://github.com/plotly/plotly.js/issues/2274 and per subplot plot_bgcolor to name a few).
Now, to give a more concrete example (to e.g. @dfcreative :wink:), the Iris splom (e.g. https://codepen.io/etpinard/pen/Vbzxqa) would be declared as:
var url = 'https://cdn.rawgit.com/plotly/datasets/master/iris.csv'
var colors = ['red', 'green', 'blue']
Plotly.d3.csv(url, (err, rows) => {
var keys = Object.keys(rows[0]).filter(k => k !== 'Name')
var names = rows.map(r => r.Name).filter((v, i, self) => self.indexOf(v) === i)
var xaxes = keys.map((_, i) => 'x' + (i ? i + 1 : ''))
var yaxes = keys.map((_, i) => 'y' + (i ? i + 1 : ''))
var data = names.map((name, i) => {
var rowsOfName = rows.filter(r => r.Name === name)
var trace = {
type: 'splom',
name: name,
dimensions: keys.map(k => ({
// 'label' would be better here than 'name' (parcoords uses 'label')
label: k,
// rows from d3.csv are objects keyed by column name
values: rowsOfName.map(r => r[k])
})),
marker: {color: colors[i]},
// the default (for clarity)
showlegend: true,
xaxes: xaxes,
yaxes: yaxes
}
return trace
})
var layout = {
grid: {
xaxes: xaxes,
yaxes: yaxes,
domain: { x: [0, 1], y: [0, 1] },
gutter: 0.1
}
}
Plotly.newPlot('graph', data, layout)
})
That is, one splom trace per :wilted_flower: type and one dimension per observed field in each trace.
> My hope though is that the SVG axis machinery is fast enough, especially if we avoid having splom contribute to fullLayout._subplots.cartesian or fullLayout._subplots.gl2d (which would scale quadratically with number of dimensions, vs the number of x/y axes, fullLayout._subplots.(x|y)axis, which scale linearly) so we only draw the axes in SVG, and let splom draw gridlines (if required) in WebGL.
Interesting point here about the grid lines. It shouldn't be too hard to draw them in WebGL (much easier than axis labels :wink: at least), if we find SVG too slow.
May I add my 2¢?
Why don't we just use existing scatter trace data/naming convention as
Plotly.newPlot(document.body, [{
type: 'scattermatrix',
x: [[], [], ...xdata],
y: [[], [], [], ...ydata]
}])
That would be familiar already for the users who know trace types and options.
> May I add my 50 cents?
Usually it's 2¢ but we like you so sure :)
> Why don't we just use existing scatter trace data/naming convention
Two things I don't like about this:
Anyway we do have a precedent for the structure I'm proposing, in parcoords. Then the marker attributes would be inherited directly from scatter.
> About your grid proposal, I'm curious to see if we could combine the numerous _xy subplot-wide but not graph-wide_ requested settings in them (https://github.com/plotly/plotly.js/issues/1468, https://github.com/plotly/plotly.js/issues/233, https://github.com/plotly/plotly.js/issues/2274 and per subplot plot_bgcolor to name a few).
I suppose we could let grid provide these settings, the same way grid would be providing domain values for individual axes. But I wouldn't want this to be the only way to provide per-subplot settings, because not every multi-subplot layout can be described as a grid - think of insets, or layouts like
+-------+ +---+
| | | |
| | +---+
| | +---+
| | | |
+-------+ +---+
I guess ^^ could be massaged into the grid format with concepts like colspan / rowspan, and maybe we'll do that, but that would still make it awkward to provide per-subplot attributes, and insets would still be difficult to describe this way.
So I still think we'll need something like https://github.com/plotly/plotly.js/issues/2274#issuecomment-359310606 but perhaps grid would be allowed to provide defaults to that when the layout is conducive to it.
@dfcreative don't worry about grid while implementing splom - just use explicitly positioned x and y axes, and I'll work on grid separately, then once it and splom are both ready we can integrate them.
Branch splom has some preliminary work on the user-attributes-full-attributes side of things (i.e. pretty much everything except the regl-scatter2d calls).
Things to note:
- splom traces have their own basePlotModule (similar to pie, parcoords, ...) that reuses some Cartesian methods
- splom default step _generates_ default xaxes and yaxes lists using the number of dimensions the trace has
- […] grid.xaxes and grid.yaxes defaults
- […] fullLayout._subplots.cartesian and fullLayout._subplots.(x|y)axes so that things _just work_.
Just a couple of clarifying questions:
> splom traces have their own basePlotModule
Sounds great, just as long as this doesn't restrict us from displaying other data (be it splom or some other trace type) on the same axes.
> we'll make one regl-scatter2d (or equivalent) call per splom trace
I'm not really sure what a regl-scatter2d call entails, but the key optimization we need over making a million scattergl subplots is to only upload the values data for each dimension to the GPU once, even though it will appear in somewhere between N-1 and 2N subplots. Does this strategy do that?
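To spell out the optimization being asked about, here is a sketch that deliberately makes no regl calls and assumes nothing about the regl-scatter2d API. `planSplomDraw` is a hypothetical name, and each `Float32Array` stands in for one GPU buffer upload:

```javascript
// Sketch only: planSplomDraw is hypothetical and does not touch WebGL.
// The point is that values are packed once per dimension, and every subplot
// only stores indices into that shared list of buffers.
function planSplomDraw(dimensions) {
  // N typed arrays for N dimensions, regardless of the number of subplots
  var buffers = dimensions.map(function(dim) {
    return new Float32Array(dim.values);
  });

  var passes = [];
  for (var row = 0; row < dimensions.length; row++) {
    for (var col = 0; col < dimensions.length; col++) {
      // one draw pass per subplot, referencing buffers by index
      passes.push({x: col, y: row});
    }
  }

  return {buffers: buffers, passes: passes};
}

var plan = planSplomDraw([
  {values: [1, 2, 3]},
  {values: [4, 5, 6]},
  {values: [7, 8, 9]}
]);
```

With 3 dimensions that is 3 buffers for 9 passes. The naive one-scattergl-trace-per-subplot route would pack 18 arrays instead.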
> Sounds great, just as long as this doesn't restrict us from displaying other data (be it splom or some other trace type) on the same axes.
Yes, for sure :ok_hand:
> I'm not really sure what a regl-scatter2d call entails, but the key optimization we need over making a million scattergl subplots is to only upload the values data for each dimension to the GPU once, even though it will appear in somewhere between N-1 and 2N subplots. Does this strategy do that?
Here's a sneak peek:
Here are some observations on splom-generated cartesian subplots:
Off the splom branch with commits from https://github.com/plotly/plotly.js/pull/2474 and using the following script:
var Nvars = ???
var Nrows = 2e4 // makes no difference for now
var dims = []
for(var i = 0; i < Nvars; i++) {
dims.push({values: []})
for(var j = 0; j < Nrows; j++) {
dims[i].values.push(Math.random())
}
}
Plotly.purge(gd);
console.time('splom')
Plotly.plot(gd, [{
type: 'splom',
dimensions: dims
}])
console.timeEnd('splom')
I got:
where I added console.time / console.timeEnd pairs in the slowest subroutines i.e. the ones that scale with the total number of subplots or Math.pow(dimensions.length, 2)
A few quick hits:
- initInteractions execution can be :hocho: by setting staticPlot: true (duh), but even setting the more obscure config options showAxisDragHandles and showAxisRangeEntryBoxes to false can reduce its execution time by a factor of 4
- lsInner is currently called twice via layoutStyles here and here (and a third time on graphs with margin-pushing things). At 40 dimensions (that's 1600 subplots), it takes a whopping 2700ms to execute. That is, more than half of the total plotting time is spent in there. I'll try to first make sure the slow parts are called only once. But we might need more aggressive optimization at some point
- […] Axes.doTicks speeds up the doAxes step by a factor of 2. That's good because we can probably use regl-line2d to draw those lines more efficiently. That said, we'll also have to speed up the label-drawing step, mostly via https://github.com/plotly/plotly.js/issues/1988 and fixOverlappingLabels.
Work in progress https://dfcreative.github.io/regl-scattermatrix/
Quick update:
- […] lsInner calls was easy enough in https://github.com/plotly/plotly.js/pull/2474/commits/e810c1ee55900132caa009b5f96e1644d272634b. Next, I'll try to _merge_ as much logic as possible from Cartesian.drawFramework with lsInner so that we can hopefully loop over all the <g subplot> only once.
Interesting finding:
- […] Drawing.setClipUrl call can speed up lsInner by 10x at 40 dimensions (or 1600 subplots)! Even when the page has no <base>! I suspect that traversing the DOM when you have 1600 <g subplot> is slow :turtle: (duh!). This should be an easy fix: call d3.select('base') once (i.e. not for every Drawing.setClipUrl call) and stash it somewhere.
There's also document.baseURI. Perhaps we can bypass base and just check if
document.baseURI === window.location.href

too bad. Although :arrow_heading_up: is from w3school :laughing:
https://developer.mozilla.org/en-US/docs/Web/API/Node/baseURI is incomplete:
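Going back to the "look up base once and stash it" idea, a minimal sketch could look like this. `makeCachedLookup` is a hypothetical helper, the lookup function is injected so the sketch makes no claim about what Drawing.setClipUrl does internally, and it assumes the `<base>` element does not change between calls:

```javascript
// Sketch only: makeCachedLookup is hypothetical.
// It wraps an expensive lookup so the underlying work runs a single time.
function makeCachedLookup(lookup) {
  var cached;
  var done = false;
  return function() {
    if (!done) {
      cached = lookup();
      done = true;
    }
    return cached;
  };
}

// Stand-in for the DOM query; in a browser this could be something like
// function() { return document.querySelector('base'); }
var lookupCount = 0;
var getBase = makeCachedLookup(function() {
  lookupCount++;
  return null; // page without a <base> element
});

// simulate one call per subplot at 40 dimensions
for (var k = 0; k < 1600; k++) getBase();
```

The DOM is then traversed once per draw instead of once per subplot. The stash would need to be reset if the page can add or remove `<base>` later.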

New benchmarks post https://github.com/plotly/plotly.js/pull/2474/commits/5887104139256934bbf554bf62685fbec62585d2 (which I pushed to https://github.com/plotly/plotly.js/pull/2474 - hopefully @alexcjohnson won't mind):

Things are looking up :guitar:
Next steps:
- […] Math.pow(dimensions.length, 2))
A first attempt at drawing grid lines using @dfcreative's regl-line2d was positive.
Here are the numbers (in ms) with all axes having the same gridcolor and gridwidth:
| # of dims | SVG | regl-line2d |
| ------------ | ---- | --------------- |
| 10 | 70 | 80-100 |
| 20 | 200 | 140-150 |
| 30 | 500 | 150-200 |
| 40 | 800 | 300 |
| 50 | 1500 | 350 |
In brief, we start to see improvements over SVG at around 15 dimensions (i.e. 15x15 = 225 subplots).