Vega-lite: Support violin plot and probability density plots

Created on 3 Mar 2018 · 22 Comments · Source: vega/vega-lite

From https://vega.github.io/vega/examples/violin-plot/

A violin plot visualizes a distribution of quantitative values as a continuous approximation of the probability density function, computed using kernel density estimation (KDE). The densities are additionally annotated with the median value and interquartile range, shown as black lines. Violin plots can be more informative than classical box plots.

https://vega.github.io/vega/examples/probability-density/ is another related example

[ ] Understand https://vega.github.io/vega/examples/violin-plot/ and https://vega.github.io/vega/examples/probability-density/ examples thoroughly, search online to understand other violin and density plot variants, and define the scope that we want to support.
[ ] Understand how we implement composite mark thoroughly by looking at the box-plot codebase
[ ] Design density transform in Vega-Lite and see if we can already use area mark to reproduce the density area for violin.
[ ] Design composite mark syntax for violin (and density plot?)
- [ ] First we can focus on just the violin area part: design MarkDefinition block for Violin so that we can define property of the underlying density transform and other related properties
- [ ] Decide if we need a composite mark for density plot -- (probably yes), and make sure that the syntax for violin and density are consistent. (Also think if there is a better name for density too)
- [ ] For violin plot, we need to decide if we want to include interquartile range and median as a part of the violin composite mark (which is sort of like the "box" overlay on top of violin plot). The syntax here should be very consistent with box-plot.
[ ] Implement the code. Note that there is probably a good way to share at least some part of the implementation between the violin and density plot.

Area - Macro / Composite Mark P2

kanitw

👍14 🚀3

All 22 comments

The tricky part about this is that Vega's violin plot depends on the Vega facet operator to split data into subgroups before passing it to the density transform. (Density happens inside a nested facet.)

1) Consider the solution above that suggests implementing density transform first.
Given that VL's facet also always applies layout, we can't reproduce the violin example with an axis by implementing density as a transform unless we do _one of_ the following:

a) Make Vega density support groupby (which is basically in-place faceting)
b) Support a variant of facet without layout (pure facet in the data transformation sense)

__Note:__ we can reproduce violin plot using VL facet operator, but we will then rely on row instead of y position for each violin.

2) Alternatively, we could consider implementing violin as its own special mark that produces the underlying density transform. However, this approach will be less composable. (For example, density plots https://vega.github.io/vega/examples/probability-density/ shouldn't be their own mark but rather an area mark plotting the output of density transforms.)

kanitw on 24 May 2018

❤1

We met today to talk about this and concluded that we should make the Vega density transform support groupby.

kanitw on 26 May 2018

👍3

> We met today to talk about this and concluded that we should make the Vega density transform support groupby.

@kanitw Any progress on implementation?

HarvsG on 2 Dec 2018

No update yet

kanitw on 2 Dec 2018

Thanks for Vega-lite.

I often use violin plots and I am looking forward to using them in Vega-lite/Altair.

In addition, I use a lot of ridge plots (half violin) like this one:

[Image: mcmc_areas-rstanarm]

Would you consider adding an option to the violin plot to allow similar figures to be made?

I made an implementation in python, with mark area and a custom kde function, but it is rather tedious.

Also, would similar figures with histograms be possible (for discrete variables)?

I'm sure anyone using Bayesian statistics would be grateful.

romainmartinez on 27 Mar 2019

Yes, once we have a kde transform in Vega, we can also support ridge plots.

domoritz on 27 Mar 2019

> Yes, once we have a kde transform in Vega, we can also support ridge plots.

Has it already landed in Vega 5.0 (https://vega.github.io/vega/docs/transforms/density/)?

denisshepelin on 28 Mar 2019

We've had this transform for a while but it does not support faceting and that's a deal breaker. We've come to the conclusion that we need a kde transform that has a group by key.

domoritz on 29 Mar 2019

👍5

Depends on https://github.com/vega/vega/pull/1783

domoritz on 26 Apr 2019

Once the new Vega KDE support lands, I think the first step here is probably to add a new density transform to Vega-Lite that maps to the Vega kde transform, with syntax such as:

{
  density: string; // value field to estimate density for
  groupby?: string[];
  method?: 'pdf' | 'cdf';
  extent?: [number, number];
  bandwidth?: number;
  steps?: number;
  as?: [string, string]
}

I think it should be called density rather than kde, as (1) density is a proper word, not an abbreviation, and (2) I can imagine extending the implementation in the future to fit a normal density (or log-normal, or Poisson, etc) to the input data, not just a kernel density estimate.
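To make the proposed parameters concrete, here is a minimal sketch of a grouped density transform over plain objects. It is not Vega's implementation: the Gaussian kernel, the rule-of-thumb bandwidth, the default output names, and the helper names (`densityTransform`, `ruleOfThumbBandwidth`) are all assumptions for illustration, and the `method` option is left out to keep it short.

```typescript
// Sketch only: a grouped Gaussian KDE. Parameter names mirror the proposal
// above; kernel choice and bandwidth rule are assumptions, not Vega's code.
type Row = Record<string, unknown>;

interface DensityParams {
  density: string;            // value field to estimate density for
  groupby?: string[];
  extent?: [number, number];
  bandwidth?: number;
  steps?: number;
  as?: [string, string];
}

interface Group { keys: Row; values: number[]; }

// Standard normal PDF, used as the kernel.
function gaussKernel(u: number): number {
  return Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
}

// Assumed rule of thumb: 1.06 * sd * n^(-1/5); falls back to 1 if degenerate.
function ruleOfThumbBandwidth(values: number[]): number {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const variance = values.reduce((a, b) => a + (b - mean) * (b - mean), 0) / Math.max(n - 1, 1);
  return 1.06 * Math.sqrt(variance) * Math.pow(n, -0.2) || 1;
}

function densityTransform(data: Row[], p: DensityParams): Row[] {
  const names: [string, string] = p.as ?? ['value', 'density'];
  const steps = p.steps ?? 100;
  const groupby = p.groupby ?? [];

  // Partition rows by the groupby key (in-place faceting).
  const groups: Record<string, Group> = {};
  const order: string[] = [];
  for (const row of data) {
    const raw = row[p.density];
    if (typeof raw !== 'number' || !isFinite(raw)) continue; // skip non-numeric values
    const id = JSON.stringify(groupby.map(f => row[f]));
    if (!groups[id]) {
      const keys: Row = {};
      for (const f of groupby) keys[f] = row[f];
      groups[id] = { keys, values: [] };
      order.push(id);
    }
    groups[id].values.push(raw);
  }

  // Sample each group's density on an evenly spaced grid over its extent.
  const out: Row[] = [];
  for (const id of order) {
    const { keys, values } = groups[id];
    const h = p.bandwidth ?? ruleOfThumbBandwidth(values);
    const extent: [number, number] = p.extent ?? [Math.min(...values), Math.max(...values)];
    for (let i = 0; i < steps; i++) {
      const x = steps === 1 ? extent[0] : extent[0] + (i * (extent[1] - extent[0])) / (steps - 1);
      let sum = 0;
      for (const v of values) sum += gaussKernel((x - v) / h);
      const rowOut: Row = { ...keys };
      rowOut[names[0]] = x;
      rowOut[names[1]] = sum / (values.length * h);
      out.push(rowOut);
    }
  }
  return out;
}
```

Each output row carries the group keys plus one sampled value and its estimated density, which is the shape an area mark would need.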

jheer on 26 Apr 2019

👍2

Maybe method?: 'pdf' | 'cdf'; -> cumulative?: boolean. as should not be optional in Vega-Lite.

domoritz on 26 Apr 2019

@domoritz I definitely prefer your suggestion of cumulative?: boolean.

Also, when adding violin plots we may want to support multiple scaling options. The default (at present) is that all violins share the same scale based on the sampled density estimates, which of course was a primary motivation for adding the kde transform with groupby support in Vega. We may still also want to support other forms of scaling or normalization.

The reason I'm thinking about this is that, if an explicit bandwidth parameter is not applied, each group will have its bandwidth independently set using an estimation heuristic. This means that each plot has a different kernel width, which in turn means that one could have potentially large disparities in how much of the probability mass gets "clipped" when drawing violins only over the domain of observed data values. The tails of the KDE distribution get cut off, such that the total amount of probability mass shown in each violin is unequal. (This issue can still arise with a shared bandwidth parameter, it's just not as extreme.) It may be that the "right" thing to do is add a normalization pass in the KDE transform whenever we have more than one group.

So, I think we might need to do some additional research into the "proper" scaling and trimming of violins. I don't know how carefully other tools have looked at this!
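The clipping concern can be quantified. Assuming a Gaussian kernel (an assumption; the thread does not fix the kernel), the share of a KDE's mass that falls inside an interval [lo, hi] is the average of Φ((hi - x_i)/h) - Φ((lo - x_i)/h) over the data points. The sketch below computes that share when a violin is trimmed to the observed data range; the function names are hypothetical and `erf` uses the Abramowitz and Stegun 7.1.26 approximation.

```typescript
// Approximate error function (Abramowitz and Stegun 7.1.26, abs. error ~1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal CDF.
function normalCdf(u: number): number {
  return 0.5 * (1 + erf(u / Math.SQRT2));
}

// Fraction of a Gaussian KDE's probability mass that lies inside [lo, hi].
function retainedMass(values: number[], h: number, lo: number, hi: number): number {
  let sum = 0;
  for (const v of values) {
    sum += normalCdf((hi - v) / h) - normalCdf((lo - v) / h);
  }
  return sum / values.length;
}

// Mass kept when the violin is trimmed to the observed data range.
function retainedMassOnDataRange(values: number[], h: number): number {
  return retainedMass(values, h, Math.min(...values), Math.max(...values));
}
```

Running this on two groups with different bandwidths shows how unequal the displayed mass can get: a wider kernel pushes more mass past the minimum and maximum, so that group's trimmed violin shows a smaller share of its distribution.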

jheer on 3 May 2019

The ggplot violin options page shows that these questions are largely left to end users, with the default being the same as proposed above (_without_ normalization of trimmed density areas):

From https://ggplot2.tidyverse.org/reference/geom_violin.html:

| Argument | Description |
|-----|-----|
| trim | If TRUE (default), trim the tails of the violins to the range of the data. If FALSE, don't trim the tails. |
| scale | If "area" (default), all violins have the same area (before trimming the tails). If "count", areas are scaled proportionally to the number of observations. If "width", all violins have the same maximum width. |

Note that Vega currently supports options corresponding to ggplot's area and width values for the scale parameter, based on how we configure the scale domain. Our KDE implementation normalizes (divides by the number of data points) to form a proper PDF, so we could support a count option (if desired) by multiplying the estimated density by the count of points within a group. If that is of interest we could update the kde transform accordingly.
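As a rough sketch of how the three options relate, the function below rescales per-group PDF samples on the data side. This is an illustration only: Vega handles area and width through the scale domain rather than by rewriting the data, and the `DensityGroup` shape and `rescaleViolins` name are hypothetical.

```typescript
type ViolinScale = 'area' | 'count' | 'width';

interface DensityGroup {
  key: string;
  count: number;      // number of observations in the group
  density: number[];  // sampled PDF values for the group
}

// Rescale per-group PDF samples following the three ggplot-style options.
function rescaleViolins(groups: DensityGroup[], mode: ViolinScale): DensityGroup[] {
  return groups.map(g => {
    let factor = 1; // 'area': keep the normalized PDF as is
    if (mode === 'count') {
      factor = g.count; // weight the density by the number of observations
    } else if (mode === 'width') {
      const peak = Math.max(...g.density);
      factor = peak > 0 ? 1 / peak : 1; // every violin peaks at 1
    }
    return { key: g.key, count: g.count, density: g.density.map(d => d * factor) };
  });
}
```

With `count`, a group with three times as many points gets a violin with three times the area, which matches the "multiply the estimated density by the count" idea described above.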

jheer on 3 May 2019

@jheer said about implementing violin plots with the new KDE transform in Vega:

The issue is not one of performance or extra transforms, but of correctness. (FWIW, I'd want to avoid a "density-center" option, as that strikes me as confusing and an abstraction-level violation.) The previous Vega violin plot example used stack, and it worked because all densities were scaled independently and so used the full width/height of the scale band. But this independent scaling is misleading and hampers accurate comparison.

The new KDE transform supports groupby, so we can use the output to define the domain of a scale at the top-level, which then scales all the densities in a proper fashion. The result is that different densities have different max width/height. Yet, the stack transform center option only centers the mark relative to the observed height (not the max height among all densities), causing inappropriate, non-uniform center-line offsets for the different densities.

My solution in Vega is to instead use xc/width or yc/height for the violin densities (as well as using xc or yc for the median and IQR annotations). This is simple and correct. A top-level linear scale is used to provide the width / height values.
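The xc/width idea can be sketched outside Vega as plain geometry: one linear scale, shared by all groups, maps density to width, and each slice is centered on its group's band center. The names below (`Sample`, `Slice`, `violinSlices`, `bandCenter`, `maxWidth`) are hypothetical and only illustrate the arithmetic, not Vega's encoding machinery.

```typescript
interface Sample { group: string; value: number; density: number; }
interface Slice { group: string; value: number; x0: number; x1: number; }

// Map density to width with one linear scale shared by all groups, then
// center every slice on its group's band center (the xc/width approach).
function violinSlices(
  samples: Sample[],
  bandCenter: Record<string, number>,
  maxWidth: number
): Slice[] {
  // Top-level scale domain: [0, largest density across all groups].
  const peak = Math.max(...samples.map(s => s.density));
  return samples.map(s => {
    const width = peak > 0 ? (s.density / peak) * maxWidth : 0;
    const xc = bandCenter[s.group];
    return { group: s.group, value: s.value, x0: xc - width / 2, x1: xc + width / 2 };
  });
}
```

Because the center comes from the band and not from each group's own maximum width, groups with lower peak densities get narrower violins that still share a straight center line.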

domoritz on 19 May 2019

❤1

Btw, I ran into a "split violin plot" in seaborn. It's definitely worth considering how this fits into our grammar.

kanitw on 30 May 2019

👍7

Interesting! An alternative that might be a bit better perceptually could be to directly layer (overlay) the conditional violins (or zero-baseline distribution areas) with some opacity. That would make the value and shape comparisons even more apparent. I hope new VL extensions can also support that, which should hopefully be simpler to specify (or, at least, require less new surface area).

jheer on 30 May 2019

👍2

Ridge plots are another alternative for this kind of thing and often work well.

There's a good package for ggplot for generating them.

cmcaine on 25 Jul 2019

👍1

Looks like ridge plots are supported now (can groupby in density transform), haven't figured out how to pull off violin plots yet though
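One possible workaround, sketched here as a plain object, combines the density transform (with groupby) with a centered stack and a column facet. It assumes a Vega-Lite version that includes the density transform; the data URL and the field names "measurement" and "category" are hypothetical. Note that it relies on faceting, so each violin sits in its own column, not on a shared position axis.

```typescript
// Sketch of a Vega-Lite spec written as a plain object. The data URL and
// the field names "measurement" and "category" are hypothetical.
const violinSpec = {
  data: { url: 'data/example.json' },
  transform: [
    // One density estimate per category; output fields are named explicitly.
    { density: 'measurement', groupby: ['category'], as: ['value', 'density'] }
  ],
  mark: { type: 'area', orient: 'horizontal' },
  encoding: {
    y: { field: 'value', type: 'quantitative' },
    // A centered stack mirrors the density around the middle of each facet.
    x: { field: 'density', type: 'quantitative', stack: 'center', axis: null },
    column: { field: 'category', type: 'nominal' }
  }
};

// The spec is plain JSON, so it can be serialized for any Vega-Lite runtime.
const violinSpecJson = JSON.stringify(violinSpec, null, 2);
```

Dropping `stack: 'center'` and swapping `column` for `row` gives a ridge-style layout instead.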

SamWoolerton on 9 Jan 2020

@domoritz said my comments were welcome so here you go. Do tell me if this is off topic :)

Basically my feeling about a lot of uncertainty vis these days is you break it into (1) a representation of a distribution (be it analytical or empirical) as a PDF (f(x)), CDF (F(x)), and inverse CDF (F^-1(x)); and (2) mappings of those functions onto visual channels.

Then the question is, is there a mark/geom (probably closest is area in vega-lite, though it might not be quite the right one---can you map a continuous variable onto color in an area?) that lets you use those mappings to create densities, violins, gradient plots, CDF barplots, etc. FWIW, I made a "slab" geom for doing this in tidybayes on top of ggplot (and a composite "slabinterval", which is a slab combined with an interval). All of the geoms below (except the dotplots) are just shortcuts for different variants of the underlying slab+interval geom:

It's a bit different from how area works in either ggplot or vega-lite in that, because it is not intended for stacking, it does not use the "y" aesthetic/channel for the height of the slab; rather it uses "thickness" (or I suppose you could call it "width" but that already has another meaning in ggplot). This allows you to map a different variable to the y axis to easily create ridge plots / half-eye densities / etc where you would normally use intervals, without having to screw around with creating facets (this is incredibly useful for visualizing coefficients and the like, because creating facets just for coefficients is a pain --- you have to mess with header text angle usually --- plus often you want to facet over something else). It also allows color and opacity to vary within the geom, which is useful for creating gradient plots and for creating densities with highlighted regions.

Anyway the upshot is, if you think in terms of abstract grammar-of-graphics mappings from data onto channels (so, not about the particular syntax of a given package, but a formal description of the visualization: "z -> x position" being the equivalent of aes(x = z) in ggplot or an encoding of {"x": "z"} in vega-lite), you might have a density plot for a variable z described as something like this:

z -> x position
f(z) -> thickness

or a gradient plot described as:

z -> x position
f(z) -> opacity

or a CCDF barplot described as:

z -> x position
1 - F(z) -> thickness

If you then add in the ability to do densities / CDFs / etc of analytical distributions (which is what the stat_dist_slab geom does), you can do the equivalent of:

z -> x position
f_Normal(z|mu, sigma) -> thickness

Which is how you'd do a density plot for a normal distribution. Given an implementation of the Normal and the scaled-and-shifted t distribution you'd be able to do confidence distributions for a lot of common ways of summarizing uncertainty from frequentist models (so that gets you, basically, halfeyes / gradient plots / whatever else for visualizing uncertainty).
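As a small illustration of the "z -> x position, f(z) -> thickness" mapping for an analytical distribution, the sketch below evaluates a function on a grid and emits rows that a renderer could draw. It does not reproduce tidybayes; `SlabPoint`, `normalPdf`, and `slab` are hypothetical names.

```typescript
interface SlabPoint { x: number; thickness: number; }

// f_Normal(z | mu, sigma): density of the normal distribution.
function normalPdf(z: number, mu: number, sigma: number): number {
  const u = (z - mu) / sigma;
  return Math.exp(-0.5 * u * u) / (sigma * Math.sqrt(2 * Math.PI));
}

// Evaluate "z -> x position, f(z) -> thickness" on an evenly spaced grid.
// Assumes steps >= 2 so that the grid includes both endpoints.
function slab(f: (z: number) => number, lo: number, hi: number, steps: number): SlabPoint[] {
  const points: SlabPoint[] = [];
  for (let i = 0; i < steps; i++) {
    const z = lo + (i * (hi - lo)) / (steps - 1);
    points.push({ x: z, thickness: f(z) });
  }
  return points;
}

// Density plot of a standard normal: z -> x, f_Normal(z | 0, 1) -> thickness.
const normalSlab = slab(z => normalPdf(z, 0, 1), -4, 4, 81);
```

The same rows could feed a gradient plot by sending the second field to opacity instead of thickness, which is the point of separating the function from the channel it is mapped to.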

Last bit is being able to map color within slabs means given a data table roughly like this:

| dist | theta |
|-----|-----|
| normal | [0,1] |
| student_t | [3,0,1]|

You can do stuff like:

x -> x position
dist -> y position
f_{dist}(x|theta) -> thickness
|x| < 1.5 -> fill color

Which yields something like this:

Anyway, I don't have specific suggestions for how these abstract specifications turn into syntax necessarily. What I did with slabinterval doesn't look exactly like the above abstract syntax, but I have found it helpful for thinking more formally about these visualization types.

mjskay on 30 Apr 2020

❤4

@mjskay -- Your comment is definitely very useful.

When we work more on this, we'll have to see how this interplays with the offset channel that we plan to add (#4703).

kanitw on 30 Apr 2020

That's a good point --- having a different channel for thickness (rather than x/y) was partly motivated by how dodging works in ggplot (which is what offset is for in vega-lite?) because it makes it easy to do stuff like this:

which is pretty common when visualizing estimates from groups/subgroups

mjskay on 30 Apr 2020

Although there is no dedicated mark for this yet, I noticed that #5066 has been implemented, so is it possible to manually map the area width/height to the density value instead of dedicating one of the axes to this? I would like to make a plot where the y-axis is categorical with one density per y-value and then also facet this plot, so I can't use the trick in the Altair gallery where the facets essentially replace the y-axis. Like the boxplot below, but with violins/ridges/densities:

For now I am using a binned point mark with the size set to count to approximate a stepwise distribution, which looks pretty cool but is not very formal =) At least it captures multimodality better than a box plot.
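A rough sketch of that binned-point approximation, written as a plain Vega-Lite spec object. The data URL and the field names "measurement", "category", and "site" are hypothetical, and the bin count is an arbitrary choice.

```typescript
// Sketch of the binned-point workaround. The data URL and the field names
// "measurement", "category", and "site" are hypothetical.
const binnedPointSpec = {
  data: { url: 'data/example.json' },
  mark: 'point',
  encoding: {
    // Bin the quantitative field; each bin becomes one point per category.
    x: { field: 'measurement', type: 'quantitative', bin: { maxbins: 30 } },
    y: { field: 'category', type: 'nominal' },
    // Point size encodes how many records fall into the bin.
    size: { aggregate: 'count', type: 'quantitative' },
    // The categorical y-axis stays free, so faceting is still available.
    column: { field: 'site', type: 'nominal' }
  }
};

const binnedPointSpecJson = JSON.stringify(binnedPointSpec, null, 2);
```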