One thing that folks seem to use _a lot_ with ggplot is to add a regression line to a plot with something like geom_smooth(method = "lm", se = FALSE). Would be great if there was a way to do something like that in vega-lite as well.
Check this issue in vega
As of now, you have two choices in vega-lite:
1) use a calculate transform to generate the line based on the model formula fitted outside of the vega ecosystem,
2) provide the fitted points directly as data, with the model still being fitted outside of Vega-Lite.
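As a rough sketch of option 1), the line layer below derives its y values with a calculate transform. The slope (0.03) and intercept (4.5) are hypothetical placeholders standing in for coefficients fitted elsewhere, and the derived field name "fitted" is made up for illustration.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "description": "Trend line from coefficients fitted outside Vega-Lite. The slope 0.03 and intercept 4.5 are hypothetical placeholders, not a real fit.",
  "data": {"url": "data/movies.json"},
  "transform": [
    {"filter": "datum.Rotten_Tomatoes_Rating != null"}
  ],
  "layer": [
    {
      "mark": {"type": "point", "filled": true},
      "encoding": {
        "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
        "y": {"field": "IMDB_Rating", "type": "quantitative"}
      }
    },
    {
      "transform": [
        {"calculate": "0.03 * datum.Rotten_Tomatoes_Rating + 4.5", "as": "fitted"}
      ],
      "mark": {"type": "line", "color": "firebrick"},
      "encoding": {
        "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
        "y": {"field": "fitted", "type": "quantitative"}
      }
    }
  ]
}
```

The coefficients live in the expression string, so any change to the data or the model means refitting and editing the spec by hand.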
I think it would be great if this could be done in pure Vega-Lite. Yes, I can run the regression outside of Vega-Lite, but that is way more cumbersome than what you get with ggplot. Also, imagine a situation where this is combined with interactivity: one would then have to precompute potentially a lot of stuff outside of the plot.
I agree that this would be a nice feature for Vega-Lite. Maybe the best way to implement this is with a custom mark type.
Vega now has a regression transform: https://vega.github.io/vega/docs/transforms/regression/
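For orientation, a Vega data entry using that transform might look roughly like the fragment below. The dataset names "trend" and "movies" are assumptions for illustration; the field names are the ones used elsewhere in this thread.

```json
{
  "name": "trend",
  "source": "movies",
  "transform": [
    {
      "type": "regression",
      "method": "linear",
      "x": "Rotten_Tomatoes_Rating",
      "y": "IMDB_Rating"
    }
  ]
}
```

A line mark reading from "trend" would then draw the fitted line.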
Can't wait to see this in Vega-Lite!
It's coming in Vega-Lite 4.
@kanitw we can close this issue, right?
The question is do we want to provide a macro for this in Vega-Lite?
A few options to consider with some examples to begin conversation:
For example, I can see layer_point_line_regression have the following shorter form:
{
"$schema": "https://vega.github.io/schema/vega-lite/v3.json",
"data": {
"url": "data/movies.json"
},
"layer": [
{
"mark": {
"type": "point",
"filled": true
},
"encoding": {
"x": {
"field": "Rotten_Tomatoes_Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB_Rating",
"type": "quantitative"
}
}
},
{
"mark": {
"type": "line",
"color": "firebrick"
},
"encoding": {
"x": {
"field": "Rotten_Tomatoes_Rating",
"type": "quantitative"
},
"y": {
// if the regression property is on y, the regression is on x, and vice versa.
// any other non-groupby field can be used as x/y
"regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., },
"field": "IMDB_Rating",
"type": "quantitative"
}
}
}
]
}
This is clearly more concise than the full form that we currently support, and it is consistent with aggregation / timeUnit, which have a short form in encoding. That said, one danger of this approach is that regression may not support clean combination with aggregate. (It's unclear what we should do if both aggregation and regression are specified.)
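For comparison, the full form could be written with an explicit regression transform on the line layer. This is a sketch assuming the Vega-Lite 4 schema and the same movies.json fields as above.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {"url": "data/movies.json"},
  "layer": [
    {
      "mark": {"type": "point", "filled": true},
      "encoding": {
        "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
        "y": {"field": "IMDB_Rating", "type": "quantitative"}
      }
    },
    {
      "mark": {"type": "line", "color": "firebrick"},
      "transform": [
        {"regression": "IMDB_Rating", "on": "Rotten_Tomatoes_Rating"}
      ],
      "encoding": {
        "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
        "y": {"field": "IMDB_Rating", "type": "quantitative"}
      }
    }
  ]
}
```

The x and y encodings are repeated in both layers, which is the repetition a shorter form would aim to remove.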
This would be another option, which would work well for supporting regression line in polestar.
{
"$schema": "https://vega.github.io/schema/vega-lite/v3.json",
"data": {
"url": "data/movies.json"
},
"mark": {
"type": "point",
"filled": true,
"loess/regression": true | {
"mark": "line", // this should be implicit by default
"type": "linear", // default
"on": "x" // default (on: "y" would be a transpose of this)
}
},
"encoding": {
"x": {
"field": "Rotten_Tomatoes_Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB_Rating",
"type": "quantitative"
}
}
}
I agree that we want to have a high-level mark or encoding property rather than relying on transforms.
A problem with the second approach is that it's hard to just have a regression line and no points. I'm also not a huge fan of the "on" property as it creates a link to the encodings and then the question is why we don't just put the regression in the encoding.
{
"$schema": "https://vega.github.io/schema/vega-lite/v3.json",
"data": {
"url": "data/movies.json"
},
"encoding": {
"x": {
"field": "Rotten_Tomatoes_Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB_Rating",
"type": "quantitative"
}
},
"layer": [
{
"mark": {
"type": "point",
"filled": true
}
},
{
"mark": {
"type": "line",
"color": "firebrick"
},
"encoding": {
"y": {
// if the regression property is on y, the regression is on x, and vice versa.
// any other non-groupby field can be used as x/y
"regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., },
// maybe we can remove the code below since we are defining the encoding at the layer level already
"field": "IMDB_Rating",
"type": "quantitative"
}
}
}
]
}
> That said, one danger of this approach is that regression may not support clean combination with aggregate. (It's unclear what we should do if both aggregation and regression are specified.)
We already have this with binning and aggregation.
> A problem with the second approach is that it's hard to just have a regression line and no points.

That's a good point. However, we should avoid overriding encodings, since overrides make the code harder to read. (We actually throw a warning when an override exists.)
With proposal C), users have to read the outer and inner encodings separately and merge them in their heads (namely, the inner y is used for the line, while the regression still applies "on" the outer x, which doesn't get replaced). So we should definitely avoid it.
To avoid overriding y-encoding, I think it's better to put regression in mark, akin to ggplot2's geom_smooth.
{
"$schema": "https://vega.github.io/schema/vega-lite/v3.json",
"data": {
"url": "data/movies.json"
},
"encoding": {
"x": {
"field": "Rotten_Tomatoes_Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB_Rating",
"type": "quantitative"
}
},
"layer": [
{
"mark": {
"type": "point",
"filled": true
}
},
{
"mark": {
"type": "line",
"color": "firebrick",
"regression/loess": true | {
method: 'linear', // linear by default
order: ...,
extent: ...,
"on/predictor": "x" // default (on: "y" would be a transpose of this)
}
}
}
]
}
Note that the current loess does not output a confidence interval band, but it might make sense to support that in the future. So we should see how this feature would interact with the errorbar/band macro that we may add (#4131).
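For reference, a loess line with the transform form might look like the sketch below. It assumes the Vega-Lite 4 schema, and the bandwidth of 0.5 is an arbitrary illustrative value. The output is a single smoothed line with no interval fields.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Loess-smoothed line only; the bandwidth value is an arbitrary illustrative choice.",
  "data": {"url": "data/movies.json"},
  "transform": [
    {"loess": "IMDB_Rating", "on": "Rotten_Tomatoes_Rating", "bandwidth": 0.5}
  ],
  "mark": {"type": "line", "color": "firebrick"},
  "encoding": {
    "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
    "y": {"field": "IMDB_Rating", "type": "quantitative"}
  }
}
```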
> That said, one danger of this approach is that regression may not support clean combination with aggregate. (It's unclear what we should do if both aggregation and regression are specified.)

> We already have this with binning and aggregation.
To clarify, if both aggregate and regression are specified in the same encoding, that should be an error. However, there is also a case where aggregate is on one encoding (e.g., 'x') and regression is on another (e.g., 'y'). Then it's unclear what to do (while the same thing with bin+aggregate is still pretty clear).
In any case, regression/loess shows the relationship between x and y, not just x or y alone, so I'm leaning toward the regression macro that doesn't introduce an extra mark. Let's see if there are other aspects of D) that should be iterated on.
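To make the ordering question concrete: with explicit transforms the order is whatever the array says. The sketch below (assuming the Vega-Lite 4 schema; the output field name "mean_IMDB_Rating" is made up for illustration) aggregates first and then fits the regression to the aggregated values. An encoding-level shorthand would have to pick one such order implicitly.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Aggregate first, then regress on the aggregated values. The field name mean_IMDB_Rating is a hypothetical choice.",
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "aggregate": [
        {"op": "mean", "field": "IMDB_Rating", "as": "mean_IMDB_Rating"}
      ],
      "groupby": ["Rotten_Tomatoes_Rating"]
    },
    {"regression": "mean_IMDB_Rating", "on": "Rotten_Tomatoes_Rating"}
  ],
  "mark": {"type": "line", "color": "firebrick"},
  "encoding": {
    "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
    "y": {"field": "mean_IMDB_Rating", "type": "quantitative"}
  }
}
```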
What is the strong argument for including this in either an encoding channel or mark? Isn’t this unnecessarily “complecting” the API? One transform plus a standard line mark doesn’t seem so bad, limits the surface area, and maintains modularity. I’d certainly feel a bit better if this were an encoding level directive that played nice with binning, aggregation etc, but if that’s not possible I’m not sure a new mark is necessary. (I’ve always had mixed feelings about the complected ggplot geoms that mix transforms and geometries, though I think they are a reasonable usability compromise for more complex layered forms like box plots and violin plots.)
Also keep in mind that the transform might evolve in the future; for example to generate a confidence interval alongside the regression values. I’d rather have transform plus line and area than specialized “smooth” and “ribbon” marks... but I’m interested in hearing other arguments!
> What is the strong argument for including this in either an encoding channel or mark? Isn’t this unnecessarily “complecting” the API? One transform plus a standard line mark doesn’t seem so bad, limits the surface area, and maintains modularity.
I think the main argument for a macro is concision (independent of argument for a proper solution).
Consider errorbar/band, which is also not _so_ bad as separate transforms and layers (one can just make an error bar/band with rules). However, users are still not content with having to write layers with transforms and repetitive encodings. Even with the macro that we already have, users still expect to avoid manual layering, as we discuss in https://github.com/vega/vega-lite/issues/4422.
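To illustrate the concision argument, here is roughly what the existing errorbar macro looks like as a single mark. The choice of Major_Genre as the grouping field is just an example.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "description": "Confidence intervals of IMDB rating per genre via the errorbar composite mark; the grouping field is an illustrative choice.",
  "data": {"url": "data/movies.json"},
  "mark": {"type": "errorbar", "extent": "ci"},
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"},
    "y": {"field": "Major_Genre", "type": "nominal"}
  }
}
```

Without the macro, the same chart needs an aggregate transform for the interval bounds plus a rule mark with x and x2 encodings.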
I'd argue that if we do a macro for errorbar (and we already did one even for simpler things like line's point overlay), then it's a bit inconsistent to argue that regression/loess (esp. with CI output) isn't complex enough to justify considering a macro for it.
Also, requiring layering will make it hard for non-layer tools like PoleStar/Voyager/CompassQL to leverage regression features.
So I think we should consider if there is a reasonable solution at all.
(We could choose _not_ to do it if there is no good solution.)
> Also keep in mind that the transform might evolve in the future; for example to generate a confidence interval alongside the regression values
Definitely. I actually commented the same thing here.
In fact, once we have confidence intervals, the case for a macro that can do points + loess line/area (for CIs) becomes even more convincing than it is now. (At that point, it's definitely complex enough that the benefit of a macro outweighs the cost of complecting the design, just as boxplot is complex enough.)

> I’d certainly feel a bit better if this were an encoding level directive that played nice with binning, aggregation etc, but if that’s not possible I’m not sure a new mark is necessary.
That's actually possible. I'm a bit more OK with proposal C) if we don't allow encoding: {y: {regression: ...}} to stand alone without field/type, like Dom suggested in the comment.
{
"$schema": "https://vega.github.io/schema/vega-lite/v3.json",
"data": {
"url": "data/movies.json"
},
"encoding": {
"x": {
"field": "Rotten_Tomatoes_Rating",
"type": "quantitative"
}
},
"layer": [
{
"mark": {
"type": "point",
"filled": true
},
"encoding": {
"y": {
"field": "IMDB_Rating",
"type": "quantitative"
}
}
},
{
"mark": {
"type": "line",
"color": "firebrick"
},
"encoding": {
"y": {
"regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., },
"field": "IMDB_Rating",
"type": "quantitative"
}
}
}
]
}
We still need to deal with the following cases:
1) "aggregate is on one encoding (e.g., 'x') and regression is on another (e.g., 'y')". -- I guess we can either define that regression comes after aggregation or ban it entirely.
2) How to support the point + loess line/area (for CIs). This is still a big use case to consider.
I think regression as an inline transform is still a bit awkward for this case, as the line layer and the ranged area layer (once we have CIs) take different parts of the output from regression / loess. Plus, when we combine them with a raw layer, we need to repeat the transform multiple times:
{
"$schema": "https://vega.github.io/schema/vega-lite/v3.json",
"data": {
"url": "data/movies.json"
},
"encoding": {
"x": {
"field": "Rotten_Tomatoes_Rating",
"type": "quantitative"
}
},
"layer": [
{
"mark": {
"type": "point",
"filled": true
},
"encoding": {
"y": {
"field": "IMDB_Rating",
"type": "quantitative"
}
}
},
{
"mark": {
"type": "line",
"color": "firebrick"
},
"encoding": {
"y": {
"regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., },
"field": "IMDB_Rating",
"type": "quantitative"
}
}
},
{
"mark": {
"type": "area/errorband"
},
"encoding": {
"y": {
"regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., }, // with area/errorband, the regression will map CIs to the output?
"field": "IMDB_Rating",
"type": "quantitative"
}
}
}
]
}
For point + loess line/area (for CIs), a composite mark akin to geom_smooth that combines line and ranged area might actually make sense as it no longer makes sense to just augment a primitive mark with a regression/loess macro. Alternatively, we can consider how regression may interact with errorband+line macro.
{
"$schema": "https://vega.github.io/schema/vega-lite/v3.json",
"data": {
"url": "data/movies.json"
},
"encoding": {
"x": {
"field": "Rotten_Tomatoes_Rating",
"type": "quantitative"
},
"y": {
"field": "IMDB_Rating",
"type": "quantitative"
}
},
"layer": [
{"mark": "circle"},
{
"mark": {
// mark A)
"type": "regressionline/smooth", // regressionline is probably a more proper name for a mark than smooth?
"line": ..., // all line properties
"errorband": ..., // all ranged-area properties
"method": 'linear' | 'loess' | ..., // switch between regression and loess here,
... // other properties of loess / regression
// mark B) -- if we follow the proposal in https://github.com/vega/vega-lite/issues/4422#issuecomment-496313449 and don't want to introduce a new mark
"type": "errorband",
"line": true,
"regression": {
"method": 'linear' | 'loess' | ..., // switch between regression and loess here,
... // other properties of loess / regression
}
}
}
]
}
I don't think any of these are the ideal solutions yet, but we can iterate more on these different ideas.
Note: Ribbon is simply a ranged area, so we would never need it in VL.
I think a composite mark that does the line and area in one go would be my preferred solution. I think that is the most common scenario that users would want to create, so providing a concise option for that seems most valuable to me. Certainly having something roughly as short as geom_smooth is what would be most helpful for VegaLite.jl, and the reason I opened this issue in the first place :)