Seaborn: calculated stats

Created on 27 May 2014  路  16Comments  路  Source: mwaskom/seaborn

Is there any way to access the stats once the plots are done? Say for example after plotting with lmplot. You end up with all sorts of regressions and fancy stuff happening under the hood. Is it possible to access the parameters of these regressions?

If not I think this would make a very nice addition. I'm plotting a linear model now for example and I would like to access the slope, etc. for manually adding this information onto the graph. Because I don't know how to access those things (if there's a way) I'm forced to recalculate everything...

Thanks

question

Most helpful comment

is there a parameter that will display the stats in the plot?

I agree that stats embedded as text are incredibly valuable for visualizations (example). The danger with the current paradigm is that one may accidentally calculate incorrect values when trying to replicate an abstracted analysis. Additionally, much of seaborn's convenience is lost when an entire modeling process must be explicitly coded.

I think that Hadley's workflow mentioned by @mwaskom makes sense for exploratory data science. However, for publication stage graphics I believe the opposite is true: visualization should directly represent the models that statistics are reported for.

All 16 comments

No, nothing like that exists.

Thanks for the quick reply.

Since everything is connected to the FacetGrid which includes the data as well, wouldn't it be... natural to have these parameters laying around for grabbing? :) just a thought...

On the contrary, lmplot is a combination of regplot and FacetGrid. For this to work, the regplot function can't know about what kind of plot it's being drawn into, and the FacetGrid can't know about what is being drawn on it. Because the fit is calculated within regplot, that information can't (and shouldn't) get propagated out to the FacetGrid.

OK, that makes sense, I knew that lmplot is using regplot in the background. But still, these parameters should be spewed out somehow from regplot (or in this case lmplot) otherwise we end up doing the re-calculation just to get the (same) parameters...

It just seems a petty to have to plot the scatter (in this case) with lmplot and then do our own fit (as a second step). There has to be a logical way to do this...

Thanks for your time, if I should ever find the time to dig into this I'll see if I can contribute with something towards this (if you'd be interested of course).

It just seems a petty to have to plot the scatter (in this case) with lmplot and then do our own fit (as a second step). There has to be a logical way to do this...

I don't agree, because while there is a close mapping between the type of model you are fitting and the plot you are drawing, it is not an identity. Consider this simple example:

import numpy as np
from numpy.random import randn
import seaborn as sns
import pandas as pd

x1 = randn(40)
x2 = np.repeat([0, 1], 20)
x3 = np.tile([0, 1], 20)
y = 1.5 * x1 + 2 * x2 + 0 * x3 + 2 * x1 * x2 + randn(40)
df = pd.DataFrame(dict(x1=x1, x2=x2, x3=x3, y=y))

grid = sns.lmplot("x1", "y", hue="x2", col="x3", data=df)

lmplot

What slopes should we be interested in getting here? The actual slopes that get calculated when plotting are y ~ x1 | x2 == 0, x3 == 0, y ~ x1 | x2 ==0, x3 == 1, etc. But looking at the plot, it suggests that the correct model is y ~ x1 * x2 (or perhaps y ~ x1 * x2 + x3). In general there should be an iterative loop between plotting and modeling of the sort that Hadley Whickam visualizes with this diagram:

tidy loop

In my head this is how is should play out (using your example):

I have a y variable and I want to know if I can fit it with a linear model. In your simple case we're plotting y vs x1 (regardless of x2, x3) so we must be interested in this (x1, y) relationship. Even though we might suspect x2 and x3 have a role in it, we're still interested in knowing the parameters for this simple linear model based only on x1. I just see it as a 2D numpy.ndarray of dict objects with keys being the hue values, something on these lines:

params = np.array(
    [[{0: ('y_intercept', 'slope'), 1: ('y_intercept', 'slope')},
      {0: ('y_intercept', 'slope'), 1: ('y_intercept', 'slope')}]], dtype=object
)

Where params[0, 0][0] would be the first plot on the top left and hue == 0. params[0, 0][1] would be the same plot with hue == 1 etc... (this would be the line between Visualize <-----> Model above - in my mind).

As I see it now, there's this huge power underneath the bonnet and we're using just a part of it because we don't have direct access...

Hope I'm not boring you with all this :).

I still think there are problems with how you map those results to the model that you want (which is probably the interactive model, and not the conditional regressions). But the way I see it, even if there were easy access to the model parameters, you still need a line of code to get them, and you need to know where they are. In contrast, you can fit exactly the model you want also with a single line of code, e.g.

import statsmodels.formula.api as smf
sns.lmplot("x1", "y", hue="x2", col="x3", data=df)
smf.ols("y ~ x1 * x2", df).fit().params["x1"]

I only used scipy, numpy for my needs so far, and a bit of pandas (especially since I stumbled upon seaborn). So yes, I can see now why you wouldn't view this as an improvement. But what would happen say for example if you were to have 1e7 points to fit? I know I'm going to the extreme here, but it makes the point. You'd do the calculation twice...

On the other hand I now realise that, if you don't specify row_order, col_order, there's no way of mapping to the desired location in the grid plot (using my proposed solution)... but I'm pretty sure this is easily done with a pandas.DataFrame, say for example, instead. By making use of the full info given to the plotting function (row, col, etc.).

I just don't know if I have the necessary tool set to actually convince you this is an improvement.

But what would happen say for example if you were to have 1e7 points to fit? I know I'm going to the extreme here, but it makes the point. You'd do the calculation twice...

I suspect the size of data for which this would be a problem is data that is too large to get an useful visualization with lmplot. By the way, on my system it takes 2.7 seconds to fit y ~ x with 1e7 datapoints in each :).

Good, so I'll drop this. Thanks for your time :).

No worries, it was fun to chat about.

I'm going to add to this, because I think there's a larger point here than the performance hit of running a model twice.

I agree with you that if you want to run a regression in a specific way (and have low-level access to the quantitative fit data) it is a better to know exactly what you are doing and code it yourself in statsmodels, scipy, etc. And definitely, if I want to run a robust regression with a Tukey Biweight norm (and get the fitted coefficient out the backend), that's quite easy:

import statsmodels.api as sm
model = sm.RLM(Y,X, M=sm.robust.norms.TukeyBiweight())
results = model.fit()
T = results.params[0]-273.15

(In this case, the datasets have been computed such that the fitted coefficient represents a temperature.)

However, if I do this, I want a plot of _my regression_, not some other one that is internal to seaborn. Seaborn makes it quite easy to plot regressions with bootstrapped confidence and extremely high visual quality (that would take me quite a bit of code to replicate on my own). But I have to dig into the code to decide what model Seaborn uses, and there's no ability to customize it. Can we have a function (or option on regplot) that enables us to easily load a statsmodels model or model results instance for plotting using the rest of the seaborn workflow? It seems from reading seaborn/linearmodels.py that such functionality wouldn't be too hard to implement.

However, if I do this, I want a plot of my regression, not some other one that is internal to seaborn.

I totally agree with this perspective, but I would say that plotting functions that faithfully represent a statsmodels object probably belong more in statsmodels proper. And actually, a lot of these do exist! The statsmodels devs have put substantial effort into making plotting functions for their various modeling and estimator objects (often they are just an object.plot() away). Because these functions are based on matplotlib, they'll pick up most of the seaborn styling elements by default if you have seaborn imported.

While it is understandable that the statistics cannot be accessed from the plot-related object, is there a parameter that will display the stats in the plot? In the case of lmplot, I would love to have the R^2 displayed to give a bit more quantitative feel for how good the fit is.

is there a parameter that will display the stats in the plot?

I agree that stats embedded as text are incredibly valuable for visualizations (example). The danger with the current paradigm is that one may accidentally calculate incorrect values when trying to replicate an abstracted analysis. Additionally, much of seaborn's convenience is lost when an entire modeling process must be explicitly coded.

I think that Hadley's workflow mentioned by @mwaskom makes sense for exploratory data science. However, for publication stage graphics I believe the opposite is true: visualization should directly represent the models that statistics are reported for.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tritemio picture tritemio  路  3Comments

JanHomann picture JanHomann  路  3Comments

bondarevts picture bondarevts  路  3Comments

sungshine picture sungshine  路  3Comments

TDaltonC picture TDaltonC  路  3Comments