Plotly.js: Transform input data: groupby, filter

Created on 8 Sep 2016 · 10 Comments · Source: plotly/plotly.js

Previously discussed (some lists are from @chriddyp):

A groupby transform should split a trace into separate traces, one per unique value or bin of the groupby dimension. Example:

groupby: ['a', 'a', 'b', 'b']
x: [1, 2, 1, 2]
y: [10, 20, 30, 40]

should generate two traces:

trace 1:
x: [1, 2]
y: [10, 20]

trace 2:
x: [1, 2]
y: [30, 40]
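
A minimal sketch of this split in plain JavaScript. `splitByGroup` is a hypothetical helper written for illustration, not the plotly.js implementation, and it assumes every data array has the same length as the list of group labels. The labels used here are ['a', 'a', 'b', 'b'], so consecutive pairs of points land in the same trace:

```javascript
// Hypothetical helper (not part of plotly.js): split parallel arrays by group label.
function splitByGroup(groups, arrays) {
    const byGroup = new Map(); // a Map keeps groups in first-seen order
    groups.forEach((g, i) => {
        if (!byGroup.has(g)) {
            const trace = { name: String(g) };
            for (const key of Object.keys(arrays)) trace[key] = [];
            byGroup.set(g, trace);
        }
        const trace = byGroup.get(g);
        // copy the i-th datum of every array into the trace of its group
        for (const key of Object.keys(arrays)) trace[key].push(arrays[key][i]);
    });
    return Array.from(byGroup.values());
}

const splitTraces = splitByGroup(
    ['a', 'a', 'b', 'b'],
    { x: [1, 2, 1, 2], y: [10, 20, 30, 40] }
);
console.log(JSON.stringify(splitTraces));
```

With these inputs the trace for 'a' holds x: [1, 2], y: [10, 20] and the trace for 'b' holds x: [1, 2], y: [30, 40].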

Static `groupby` as a means of splitting spatially and/or aesthetically

[ ] distinct categorical values: numbers, strings or datetime strings
[ ] evenly spaced bins based on numerical data or time (datetime strings) in the groupby attribute, reusing logic of the preexisting plotly algorithm for histograms
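
The second item, evenly spaced bins, could be sketched as follows. `binLabels` and its `start` and `size` parameters are assumptions made for illustration; the existing histogram binning logic mentioned above is not reproduced here. Datetime strings would first need converting to numbers, for example with `Date.parse`.

```javascript
// Hypothetical helper: map each numeric value to the label of its evenly spaced bin.
// `start` is the left edge of the first bin and `size` is the bin width (both assumed inputs).
function binLabels(values, start, size) {
    return values.map((v) => {
        const index = Math.floor((v - start) / size);
        const lo = start + index * size;
        return `[${lo}, ${lo + size})`; // half-open interval used as the group label
    });
}

const labels = binLabels([0.5, 3, 7.2, 9.9, 4], 0, 5);
console.log(labels);
```

The resulting labels can then be used as the groupby dimension, exactly like categorical values.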

Functional aspects:

groupby needs to work across numbers, dates, and categories (@chriddyp in the JS context, meaning strings, correct?)
groupby needs to split across all of the arrays or array-like specifications in a trace, not just x and y. For example, marker.color or marker.line.color. Not all array-like specifications in a trace are actual arrays (consider colorscale)
There must be a way of specifying distinct styles for the split apart traces so that they're discernible - example:

transform:
    groupby: ['a', 'b', 'a', 'b']
    marker:
        color:
            a: 'blue'
            b: 'red'

@etpinard found some issues with legend items as he wrote an initial version of transforms: https://github.com/plotly/plotly.js/pull/499#issuecomment-216597436. We'll probably need to modify some of the transforms and API. That's OK - transforms was made for groupby
All relevant denotations for groupby, and the related animation split use (see below) need to be in the JSON format for serializability, fitting in the current declarative structure
The transforms such as groupby must work in the restyle and relayout steps, not just the initial plot step
gd.data is expected to preserve the single trace and the groupby spec as supplied by the user; _fullData, on the other hand, has the individual (split) traces and no longer has the groupby attribute
We must ID traces in _fullData back to groups or styles in data. Styling controls will be populated with the defaults from _fullData (e.g. _fullData[4].marker.color) but they’ll need to update the attributes in the data object (e.g. data[0].transform.marker.color.d). That’s because we serialize and save data, not _fullData.
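
To illustrate the point about nested array attributes, here is a sketch that splits only the attribute paths it is told are per-datum arrays and leaves array-like specifications such as colorscale untouched. `splitTrace`, `getPath`, `setPath` and the hard-coded path list are all hypothetical; how the list of paths is obtained is a separate question.

```javascript
// Hypothetical helpers for reading and writing nested attributes such as 'marker.color'.
function getPath(obj, path) {
    return path.split('.').reduce((o, k) => (o == null ? undefined : o[k]), obj);
}

function setPath(obj, path, value) {
    const keys = path.split('.');
    const last = keys.pop();
    let o = obj;
    for (const k of keys) {
        if (typeof o[k] !== 'object' || o[k] === null) o[k] = {};
        o = o[k];
    }
    o[last] = value;
}

// Split one trace into one trace per group, filtering only the listed array paths.
function splitTrace(trace, groups, arrayPaths) {
    const names = Array.from(new Set(groups)); // unique labels, first-seen order
    return names.map((name) => {
        const out = JSON.parse(JSON.stringify(trace)); // deep copy; assumes plain JSON data
        for (const path of arrayPaths) {
            const src = getPath(trace, path);
            if (!Array.isArray(src)) continue; // unset or scalar: nothing to split
            setPath(out, path, src.filter((_, i) => groups[i] === name));
        }
        out.name = String(name);
        return out;
    });
}

const input = {
    x: [1, 2, 3, 4],
    y: [10, 20, 30, 40],
    marker: {
        color: ['red', 'green', 'blue', 'black'],
        colorscale: [[0, 'white'], [1, 'black']] // array-like, but not per-datum
    }
};
const pieces = splitTrace(input, ['a', 'b', 'a', 'b'], ['x', 'y', 'marker.color', 'marker.size']);
console.log(JSON.stringify(pieces, null, 2));
```

Because the input trace is deep-copied rather than mutated, the user-supplied data stays intact, in the spirit of the gd.data versus _fullData distinction above.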

Preliminary work

Related PR, containing the initial, analogous filter work by @timelyportfolio : https://github.com/plotly/plotly.js/pull/859
groupby: https://github.com/plotly/plotly.js/blob/master/test/jasmine/assets/transforms/groupby.js

Planned `groupby` coverage of the initial sprint

The first sprint would cover an explicit list (whitelist) of groupby attributes, such as x and y, rather than all attributes at once. However, the preferred solution aims for generality: other transforms (e.g. filter) will need a similar approach, and future array-like attributes should be covered without coupling their code to transformations. As a consequence, we'll have to check whether the existing attribute metadata is enough to tell if an attribute is array-like, or whether we need further metadata. We'll also have to check whether there's a programmatic way of separating array-like data that is not represented as an array at input (e.g. colorscale); otherwise we need to handle such attributes one by one. We'll have to come back to this topic after a first round of work.
Initial attributes at least: x, y, marker.color, marker.size (scatter, bar, histogram, box)
Then lat, lon (maps), a, b, c (ternary), ‘z’ (scatter3d), error_y.array
It would cover a set of (initially, non-WebGL) traces
First goalpost is separation by category (JS number or string)

It is expected that the trace separation (and transformations in general) is performed in the supply defaults step.

Subsequent goal: splitting data for animations

Instead of generating n different paths as described above, plotly would arrive at a temporal sequence of n frames
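
As a rough sketch of that idea, the same grouping can emit one frame per group instead of one trace per group. `groupsToFrames` is a hypothetical helper, and the frame shape used here (a `name` plus a `data` array of partial traces) is an assumption for illustration rather than a settled API.

```javascript
// Hypothetical helper: turn group labels into a sequence of frames, one per group.
function groupsToFrames(groups, x, y) {
    const order = Array.from(new Set(groups)); // frame order = first-seen group order
    return order.map((name) => {
        const keep = (arr) => arr.filter((_, i) => groups[i] === name);
        // assumed frame shape: a name plus the data for a single trace
        return { name: String(name), data: [{ x: keep(x), y: keep(y) }] };
    });
}

const frames = groupsToFrames([2015, 2015, 2016, 2016], [1, 2, 1, 2], [10, 20, 30, 40]);
console.log(JSON.stringify(frames));
```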

Possible future items:

Incremental recalculation (e.g. of bins, upon newly arriving data points)
Combine this with a subplots transform for rendering the traces into separate subplots (as small multiples plots)

feature

monfera

👍2

Most helpful comment

closed in https://github.com/plotly/plotly.js/pull/936 and https://github.com/plotly/plotly.js/pull/978

etpinard on 26 Sep 2016

🎉2

All 10 comments

A quick update on progress:

Styling can be hierarchical, such as `{marker: {line: {color: "cyan"}}}`, and users already make a big investment in learning these attributes. In addition, we seek to avoid property-by-property handling of styles (attribute metadata extension or manual additions). We therefore agreed that the styling definitions for groups would look like normal trace attributes. Here's an example:

    var mockData02 = [{
        mode: 'markers',
        x: [1, -1, -2, 0, 1, 2, 3],
        y: [0, 1, 2, 3, 4, 5, 6],
        transforms: [{
            type: 'groupby',
            groups: ['a', 'a', 'b', 'a', 'b', 'b', 'a'],
            styles: {
                a: {
                    marker: {
                        color: "orange",
                        size: 20,
                        line: {
                            color: "red",
                            width: 1
                        }
                    }
                },
                b: {
                    // heterogeneous attributes are OK:
                    // group "a" needn't define e.g. `mode` if defaults are alright
                    mode: "markers+lines",
                    marker: {
                        color: "cyan",
                        size: 15,
                        line: {
                            color: "purple",
                            width: 4
                        },
                        opacity: 0.5,
                        symbol: "triangle-up"
                    },
                    line: {
                        width: 1,
                        color: "purple"
                    }
                }
            }
        }]
    }];

This is what the result looks like. OK, it's decidedly outré, but it serves the point:

The benefits of this solution:

it's very compact to implement (basically a one-line change in the current groupby)
rather powerful - basically anything goes that could go with manual separation
robust - nothing is expected to break
conceptually simple to users:
- attributes are what users already know, use and have documented anyway
- definition is natural
doesn't tamper with the existing implementation structures

Its drawbacks stem from the same properties:

it can be a bit verbose in JS
it can't pry apart array-like palettes e.g. "Greens"
- however, this might be addressed in a subsequent iteration
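
The per-group `styles` idea can be sketched as a deep merge of each group's style object onto its split trace. `deepMerge` and `applyGroupby` are hypothetical names, and this is an illustration under simplifying assumptions (a single groupby transform, only top-level arrays are split), not the actual groupby code:

```javascript
// Hypothetical helper: recursively copy `source` into `target`, so nested style
// objects such as marker.line override only the keys they mention.
function deepMerge(target, source) {
    for (const key of Object.keys(source)) {
        const value = source[key];
        if (value && typeof value === 'object' && !Array.isArray(value)) {
            const current = target[key];
            if (!current || typeof current !== 'object' || Array.isArray(current)) target[key] = {};
            deepMerge(target[key], value);
        } else {
            target[key] = value;
        }
    }
    return target;
}

// Hypothetical sketch: split a trace by its first (groupby) transform, then apply styles.
function applyGroupby(trace) {
    const { groups, styles = {} } = trace.transforms[0];
    const names = Array.from(new Set(groups));
    return names.map((name) => {
        const out = { name: String(name) };
        for (const key of Object.keys(trace)) {
            if (key === 'transforms') continue; // split traces no longer carry the transform
            const value = trace[key];
            out[key] = Array.isArray(value)
                ? value.filter((_, i) => groups[i] === name) // per-datum array: keep this group
                : JSON.parse(JSON.stringify(value));          // anything else: copy as a default
        }
        return deepMerge(out, styles[name] || {}); // group style overrides trace-level defaults
    });
}

const styled = applyGroupby({
    mode: 'markers',
    x: [1, -1, -2, 0, 1, 2, 3],
    y: [0, 1, 2, 3, 4, 5, 6],
    transforms: [{
        type: 'groupby',
        groups: ['a', 'a', 'b', 'a', 'b', 'b', 'a'],
        styles: {
            a: { marker: { color: 'orange', size: 20 } },
            b: { mode: 'markers+lines', marker: { color: 'cyan' }, line: { width: 1 } }
        }
    }]
});
console.log(JSON.stringify(styled, null, 2));
```

Group "a" keeps the trace-level `mode: 'markers'`, while group "b" overrides it, which is the heterogeneous-attributes case noted in the example's comments.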

monfera on 15 Sep 2016

As in the related PR, one additional note: in general, scatter traces can now have ids in addition to the x and y data arrays, which can be very useful for these sorts of operations.

rreusser on 15 Sep 2016

👍1

@etpinard @rreusser Here's another example, for these things:

Define styling at a super-group level - it can work per group, or the group can override it
Arrays at the super-group level are interpreted per group element

    var mockData03 = [{
        mode: 'markers',
        x: [1, -1, -2, 0, 1, 2, 3],
        y: [0, 1, 2, 3, 5, 4, 6],
        marker: {
            color: "darkred", // general "default" color
            line: {
                width: 8,
                // a general, not overridden array will be interpreted per group
                color: ["orange", "red", "green", "cyan"]
            }
        },
        transforms: [{
            type: 'groupby',
            groups: ['a', 'a', 'b', 'a', 'b', 'b', 'a'],
            styles: {
                a: {marker: {size: 30}, mode: "markers+lines"},
                b: {marker: {size: 15, color: "lightblue"}, mode: "markers+lines"} // override general color
            }
        }]
    }];

Result:

monfera on 15 Sep 2016

👍2

I like it. Transforms in general are kinda free-form and extremely flexible, which means it's probably good to develop a set of conventions (like styles the way you've defined it) so that it's clear how to write a new transform that conforms to the conventions used in the rest of the transforms.

rreusser on 15 Sep 2016

👍1

@monfera your API looks great.

I'd vote for transforms[i].style instead of transforms[i].styles as we like to keep plurals for Array containers.

One thing that we should attempt to handle better is the findArrayAttributes step. What we need to do is something similar to what Plotly.PlotSchema.get() does, where it looks for data_array and arrayOk attributes (which e.g. correctly skips over colorscale and domain), by looking into fullData[i]._module.attributes.

The more I think about it, the more I think finding the list of all data_array + arrayOk attributes in a given trace will be very common to almost all transforms (including possible transforms written by community users). So I suggest we find that list somewhere in plots.js and pass it as an argument to the transform methods.
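
A sketch of what such a lookup might do: walk an attribute declaration tree and collect the paths that are either `valType: 'data_array'` or marked `arrayOk: true`. The `mockScatterAttributes` object is a hand-written, heavily simplified stand-in for a trace module's attributes, and `collectArrayAttributes` is a hypothetical name; neither is the real plotly.js code.

```javascript
// Hypothetical crawler: return dotted paths of all data_array and arrayOk attributes.
function collectArrayAttributes(attributes, prefix = '') {
    const paths = [];
    for (const key of Object.keys(attributes)) {
        const attr = attributes[key];
        if (attr === null || typeof attr !== 'object' || Array.isArray(attr)) continue;
        const path = prefix + key;
        if ('valType' in attr) {
            // leaf declaration: keep it only if it can hold per-datum values
            if (attr.valType === 'data_array' || attr.arrayOk === true) paths.push(path);
        } else {
            // container (e.g. marker, marker.line): recurse
            paths.push(...collectArrayAttributes(attr, path + '.'));
        }
    }
    return paths;
}

// Simplified, hand-written stand-in for a scatter-like attribute tree (assumption).
const mockScatterAttributes = {
    x: { valType: 'data_array' },
    y: { valType: 'data_array' },
    mode: { valType: 'flaglist' },
    marker: {
        color: { valType: 'color', arrayOk: true },
        size: { valType: 'number', arrayOk: true },
        colorscale: { valType: 'colorscale' }, // array-like, but skipped
        line: {
            color: { valType: 'color', arrayOk: true },
            width: { valType: 'number', arrayOk: true }
        }
    }
};

const splittable = collectArrayAttributes(mockScatterAttributes);
console.log(splittable);
```

Because the tree is crawled per trace module, the resulting list naturally differs by plot type, and colorscale is skipped without any special casing.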

etpinard on 15 Sep 2016

👍2

@etpinard @rreusser Do I understand that anything that's data_array and arrayOk must split by group just like x and y now? I.e. is it the only condition? I'd have thought there are array attributes that represent some value extent [from, to] or whatever in an array such that they must not be split by groupby trace.

Assuming the answer is yes: I can probably make (or plug into) code that crawls the entire set of attributes and distinguishes between splittable and non-splittable arrays. But there's the issue that the attribute tree can differ by plot type, and according to other values. I'm concerned that some attribute locations in a mother-of-all JSON attribute dictionary will be group-splitting arrays under some circumstances and non-splitting arrays under others.

monfera on 16 Sep 2016

> Do I understand that anything that's data_array and arrayOk must split by group just like x and y now?

Yes. When an arrayOk attribute is set to an array, it should be interpreted as per-datum specifications (e.g. just like ids[i] that @rreusser mentioned earlier).

> But there's the issue that the attribute tree can differ by plot type, and according to other values

That's correct. The list of data_array + arrayOk attributes should be given per plot type.

etpinard on 16 Sep 2016

@etpinard Awesome, thanks! With this answer, @rreusser's answers and your examples I feel there's enough nooks and crannies to continue rock climbing :-)

monfera on 16 Sep 2016

Climb on!