Seaborn: Feature Request: Truncated Distributions for Violin Plots

Created on 28 Apr 2015 · 11 Comments · Source: mwaskom/seaborn

I would like to generate violin plots for truncated distributions, e.g. for efficiency scores which are always between 0 and 100%. My current approach is to use the parameter cut=0 when calling sns.violinplot, but I think that a more informative approach is to reflect the density at the truncation point, so that, for example, the area which would be drawn below zero in an unrestricted kde will appear above zero in the truncated version.
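For reference, the reflection idea can be written down as follows. This is only a sketch under stated assumptions: $\hat f$ denotes the ordinary unbounded kernel density estimate, and $a$ and $b$ are the lower and upper truncation points (symbols introduced here for illustration, not taken from seaborn).

$$
\hat f_{\mathrm{refl}}(x) =
\begin{cases}
\hat f(x) + \hat f(2a - x) + \hat f(2b - x), & a \le x \le b,\\
0, & \text{otherwise.}
\end{cases}
$$

With $a = 0$ the middle term is just $\hat f(-x)$: the mass that the unbounded estimate places below zero is folded back above zero. Reflecting only once at each bound is an approximation, and it should be accurate when the kernel bandwidth is small relative to $b - a$.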

Here is a little example that illustrates my concern and a potential solution:

import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
# IPython/Jupyter magic; drop the next line when running as a plain script
%matplotlib inline

np.random.seed(12345)
df = pd.DataFrame(np.random.normal(size=(10,3)).clip(0,5))

sns.violinplot(data=df)

Note the disturbing non-zero density on negative values. Fixing this with cut=0 looks like this:

sns.violinplot(data=df, cut=0)

No more positive density outside the support of the data. But this truncated normal should have maximum density at zero, and the feature I am requesting is a way to ask for that. Here is a very hacky way to get something that would satisfy me:

t = sns.categorical._ViolinPlotter.fit_kde
def reflected_once_kde(self, x, bw):
    kde, bw_used = t(self, x, bw)

    kde_evaluate = kde.evaluate

    def zero_to_five_truncated_kde_evaluate(x):
        val = kde_evaluate(x)
        val += kde_evaluate(-x)
        val += kde_evaluate(5-(x-5))
        return np.where((x<0)|(x>5), 0, val)

    kde.evaluate = zero_to_five_truncated_kde_evaluate
    return kde, bw_used

sns.categorical._ViolinPlotter.fit_kde = reflected_once_kde
sns.violinplot(data=df, cut=0)
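The same idea can be tried without touching seaborn internals. The following is a minimal sketch, not seaborn's implementation: it assumes `scipy.stats.gaussian_kde` as the underlying estimator, and `reflected_kde`, `lb`, and `ub` are names made up here for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde


def reflected_kde(data, lb, ub, bw_method=None):
    """Return a density on [lb, ub] built by reflecting a Gaussian KDE once at each bound.

    Hypothetical helper for illustration; not part of seaborn or scipy.
    """
    kde = gaussian_kde(np.asarray(data, dtype=float), bw_method=bw_method)

    def density(x):
        x = np.asarray(x, dtype=float)
        # Unbounded estimate plus its mirror images about lb and ub.
        val = kde(x) + kde(2 * lb - x) + kde(2 * ub - x)
        # The density is defined to be zero outside the support.
        return np.where((x >= lb) & (x <= ub), val, 0.0)

    return density


rng = np.random.default_rng(12345)
sample = rng.normal(size=200).clip(0, 5)
density = reflected_kde(sample, lb=0.0, ub=5.0)

# Check numerically (trapezoid rule) that the folded density still has unit area.
grid = np.linspace(0.0, 5.0, 2001)
vals = density(grid)
area = float(np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(grid)))
print(f"area on [0, 5]: {area:.4f}")
```

Because the reflections fold the out-of-bounds mass back inside, the area over the support stays close to one, and for data clipped at zero the peak sits at the lower bound rather than just above it.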

There is a previous feature request that asks for something similar at #244, which was closed when the implementation was overhauled in #410. Perhaps @PierreBdR or @mwaskom has some input on whether and how this feature should be implemented.

I am up for doing some amount of work on this if it would be a welcome addition to Seaborn.

aflaxman

All 11 comments

In general, I would prefer that this kind of complexity in statistical estimation live upstream in the actual statistics libraries. Is truncated kernel density estimation only useful for visualization? Seems like it could be a nice addition to statsmodels.

mwaskom on 28 Apr 2015

I don’t know if it is used widely, but I did find a description of the approach I’ve described in a book: Bernard W. Silverman, Density Estimation for Statistics and Data Analysis, 1986 (p. 30). This is the approach used in the benchmarking R package in the eff.dens.plot function.

Perhaps a middle road is for seaborn to expose a way to pass in a user-specified density estimator instead of the default now used in .fit_kde.
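To make the "user-specified density estimator" idea concrete, here is a rough sketch of how a violin shape could be drawn from any density callable using plain matplotlib. Nothing here is a seaborn API: `violin_outline`, `draw_violin`, and `half_normal_density` are hypothetical names, and the half-normal density merely stands in for whatever estimator the user would supply.

```python
import numpy as np


def violin_outline(density, lo, hi, num=200, width=0.4):
    """Evaluate any density callable on [lo, hi] and scale it to a violin half-width."""
    support = np.linspace(lo, hi, num)
    dens = np.asarray(density(support), dtype=float)
    peak = dens.max()
    # Scale so the widest point of the violin has half-width `width`.
    half_widths = width * dens / peak if peak > 0 else np.zeros_like(dens)
    return support, half_widths


def draw_violin(ax, density, lo, hi, position=0.0, **kwargs):
    """Draw one vertical violin on a matplotlib Axes from a user-supplied density."""
    support, half_widths = violin_outline(density, lo, hi)
    return ax.fill_betweenx(
        support, position - half_widths, position + half_widths, **kwargs
    )


def half_normal_density(x):
    """Exact density of |Z| for standard normal Z; a stand-in for a custom estimator."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, np.sqrt(2 / np.pi) * np.exp(-0.5 * x**2), 0.0)


support, half_widths = violin_outline(half_normal_density, lo=0.0, hi=4.0)
```

With matplotlib available, `fig, ax = plt.subplots()` followed by `draw_violin(ax, half_normal_density, 0.0, 4.0)` draws a violin that is widest at zero, which is the shape the truncated example is expected to have.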

aflaxman on 28 Apr 2015

> I don’t know if it is used widely, but I did find a description of the approach I’ve described in a book: Bernard W. Silverman, Density Estimation for Statistics and Data Analysis, 1986 (p. 30). This is the approach used in the benchmarking R package in the eff.dens.plot function.

Not criticizing the approach, just saying that complicated stats should live in a stats package, not a visualization package.

mwaskom on 28 Apr 2015

No worries, although it is really not too complicated. But I understand the desire to keep the stats code out of the viz package.

aflaxman on 28 Apr 2015

Sure, the stats themselves might not be too complicated, but the implementation here (with the monkey patching) certainly is a hack. Getting this working properly in seaborn itself would likely mean implementing a full kde fit in seaborn, which I wouldn't be in favor of.

mwaskom on 28 Apr 2015

But this really does seem like it would be useful in statsmodels, and I would certainly add some compatibility in seaborn to allow for plots with bounded density estimation.

mwaskom on 28 Apr 2015

It looks like this work is already under way in statsmodels, although I'm not sure how the domain bounds will be specified: https://github.com/statsmodels/statsmodels/pull/2318

aflaxman on 28 Apr 2015

Closing this issue but feel free to poke at me when the truncated KDE lands in statsmodels.

mwaskom on 30 Apr 2015

Thanks for your work on this. In case anyone needs this sort of plot before the truncated KDE is finished, here is the monkey patch madness that I used in the end:

fit_kde_func = sns.categorical._ViolinPlotter.fit_kde

def reflected_once_kde(self, x, bw):
    lb=0
    ub=1

    kde, bw_used = fit_kde_func(self, x, bw)

    kde_evaluate = kde.evaluate

    def truncated_kde_evaluate(x):
        inside = (x>=lb)&(x<=ub)
        val = np.where(inside, kde_evaluate(x), 0)
        # mirror the density about each bound: x -> 2*lb - x and x -> 2*ub - x
        val += np.where(inside, kde_evaluate(2*lb-x), 0)
        val += np.where(inside, kde_evaluate(2*ub-x), 0)
        return val

    kde.evaluate = truncated_kde_evaluate
    return kde, bw_used

sns.categorical._ViolinPlotter.fit_kde = reflected_once_kde
sns.violinplot(np.random.normal(size=10).clip(0,np.inf), cut=0, inner=None)

It made my violins look like gyro meat, which I kind of like:

aflaxman on 30 Apr 2015

@aflaxman I hope the bounded KDE estimators will be accepted very soon, as they provide more flexibility in how bounds are handled. As for your solution, although it is a (good) way to do it, it depends on the kind of data you have. You need to think about the nature of the bounds: do they exist because negative values (for example) cannot occur, or because negative values are equivalent to the positive ones? Here, you are assuming that negative values are simply equivalent to their positive counterparts, which is why a reflective boundary condition applies. If this is not the case, you might need another method.
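As one illustration of "another method" (not something named in this thread), a simple alternative to reflection is to cut the unbounded estimate at the bounds and rescale it so that it integrates to one. The sketch below assumes `scipy.stats.gaussian_kde`; `renormalized_kde`, `lb`, and `ub` are hypothetical names used only for this example.

```python
import numpy as np
from scipy.stats import gaussian_kde


def renormalized_kde(data, lb, ub):
    """Truncate a Gaussian KDE to [lb, ub] and rescale it to unit area.

    Hypothetical helper for illustration; not part of seaborn or scipy.
    """
    kde = gaussian_kde(np.asarray(data, dtype=float))
    # Probability mass that the unbounded estimate places inside the bounds.
    mass = kde.integrate_box_1d(lb, ub)

    def density(x):
        x = np.asarray(x, dtype=float)
        return np.where((x >= lb) & (x <= ub), kde(x) / mass, 0.0)

    return density


rng = np.random.default_rng(0)
sample = rng.uniform(size=300)
density = renormalized_kde(sample, lb=0.0, ub=1.0)

# Trapezoid-rule check that the rescaled density has unit area on [0, 1].
grid = np.linspace(0.0, 1.0, 2001)
vals = density(grid)
area = float(np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(grid)))
print(f"area on [0, 1]: {area:.4f}")
```

Unlike reflection, rescaling spreads the lost mass over the whole support instead of piling it up next to the bounds, so the two approaches can give visibly different shapes near a boundary.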

PierreBdR on 26 Oct 2015

Are there any updates or ways to do this? I would like to have something like the clip option in kdeplot for violinplot so that I don't have to move to ggplot.
Thanks