Seaborn: Unnormalized y-axis when using KDE

Created on 20 Mar 2015 · 19 Comments · Source: mwaskom/seaborn

When I do

sns.distplot(my_data, kde=False, rug=True, norm_hist=False)

I get:

[Screenshot: histogram with raw counts on the y-axis]

and when I do:

sns.distplot(my_data, rug=True, norm_hist=False)

I get:

[Screenshot: histogram with a KDE overlay and a normalized y-axis]

Is there any way to have the y-axis show raw counts (as in the first example above) when adding a KDE plot (as in the second example above)?

distributions wishlist

All 19 comments

No, the KDE by definition has to be normalized.

Thanks @mwaskom, I appreciate the answer and understand that. I also understand that this may not be something that seaborn users want as a feature. Aside from that, do you know if there is a way to do something like the following?

  1. Get the bar plot
  2. Grab the y axis labels
  3. Overlay the KDE plot
  4. Overwrite the y axis labels using the ones from (2).

I currently run (1) and (3) in a single command:

sns.distplot(my_series, rug=True, kde=True, norm_hist=False)

which prevents me from running (2) above.

Any way to get the bar and KDE plot in two steps so that I can follow the logic above?

Hi, I too was facing this problem. My solution is to call distplot twice and for each call, pass the same Axes object:

sns.distplot(my_series, ax=my_axes, rug=True, kde=True, hist=False)
sns.distplot(my_series, ax=my_axes, rug=True, kde=False, hist=True, norm_hist=False)

This will plot both the KDE and histogram on the same axes so that the y-axis will correspond to counts for the histogram (and density for the KDE).

However, it would be great if one could control how distplot normalizes the KDE so that it integrates to a value other than 1. This way, you could control the height of the KDE curve with respect to the histogram. That is, the KDE curve would simply show the shape of the probability density function.

I've also wanted this for a while. I normally do something like

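import matplotlib.pyplot as plt
import seaborn as sns

f, ax1 = plt.subplots()
sns.distplot(data, kde=False, ax=ax1)  # histogram of raw counts on the left axis
ax2 = ax1.twinx()                      # second y-axis sharing the same x-axis
ax2.set_ylim(0, 3)                     # hand-picked density range
ax2.yaxis.set_ticks([])                # hide the density ticks
sns.kdeplot(data, ax=ax2)
ax1.set_xlabel('x var')
ax1.set_ylabel('Counts')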

but it seems like a kwarg on the distplot function for this would be used frequently, or that allowing norm_hist to override the kde option would be the cleanest.

For anyone interested, I worked around this like

ax = plt.gca()
fig2, ax2 = plt.subplots()
sns.distplot(data, ax=ax, color='b')  # pass ax explicitly: plt.subplots() made ax2 the current axes
sns.distplot(data, ax=ax2, kde=False, norm_hist=False, color='b')
ax.yaxis = ax2.yaxis

In other words, plot the data once with the KDE and normalization and once without, and copy the y-axis from the latter into the former. You have to set the color manually, as otherwise it thinks the histogram and the data are separate plots and will color them differently.

I have no idea if copying axis objects like that is a good idea. It's matplotlib, so it seems like any kind of hacky behavior is kosher so long as it works.

It would be awesome if distplot(data, kde=True, norm_hist=False) just did this.

I guess my question is what are you hoping to show with the KDE in this context? Is it merely decorative?

It would be more informative than decorative. KDE and histogram summarize the data in slightly different ways. In general, when plotting a KDE, I don't really care about what the actual values of the density function are at each point in the domain. Rather, I care about the shape of the curve. This contrasts with the histogram in which the values of each bar are something much more interpretable (number of samples in each bin).

Thus, it would be great to set the normalization of the KDE so that the density function integrates to a custom value thereby allowing the curve to be overlaid on the histogram.
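One way to read this request, sketched here as an assumption rather than anything seaborn provided at the time: multiply a unit-area density estimate by the area under the count histogram (the sum of count times bin width), so the curve integrates to that value and sits on the same scale as the bars. The helper name count_scaled_kde and its parameters are hypothetical, and scipy's gaussian_kde stands in for whatever estimator the plotting library uses.

```python
import numpy as np
from scipy.stats import gaussian_kde


def count_scaled_kde(data, bins=10, grid_size=200):
    """Return a KDE curve rescaled to the area of a count histogram.

    Hypothetical helper for illustration; not part of seaborn.
    """
    data = np.asarray(data, dtype=float)
    counts, edges = np.histogram(data, bins=bins)

    # Area under the count histogram: sum of (count * bin width).
    hist_area = np.sum(counts * np.diff(edges))

    kde = gaussian_kde(data)
    # Extend the grid three kernel standard deviations past the data
    # so that almost all of the density's mass is covered.
    pad = 3 * np.sqrt(kde.covariance[0, 0])
    x = np.linspace(data.min() - pad, data.max() + pad, grid_size)

    # kde(x) integrates to about 1, so the product integrates to about hist_area.
    y = hist_area * kde(x)
    return x, y, counts, edges
```

The returned x and y can be drawn with a plain matplotlib line on the same axes as a count histogram that uses the same bins.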

I agree. I care about the shape of the KDE. Are point values (say, of things like modes) ever even useful for density functions (genuinely don't know; I don't do much stats)? Seems to me that relative areas under the curve, and the general shape are more important.

Honestly, I'm kind of growing sceptical of KDEs in general after using them for a while, because they seem to just be squiggly lines that don't correspond to the real underlying density well. Maybe I never have enough data points. The only value I've seen is sometimes it alerts me to extreme values that I otherwise would have missed because the histogram bars were too short, but the KDE ends up being more prominent. This is obviously a completely separate issue from normalization, however.

A few comments:

  • As you'll see if you look at the code, seaborn outsources the KDE fitting to either scipy or statsmodels, which return a normalized density estimate. So there would probably need to be a change in one of the stats packages to support this.
  • It's not as simple as plotting the "unnormalized KDE" because the height of the histogram bars for a given range will be entirely dependent on the number of bins in the histogram.
  • A small amount of googling suggests that there is no well-known method for scaling the height of the density estimate to best fit a histogram. There's probably some sort of single-parameter optimization that could be performed, but I have no idea what the correct/robust way of doing that would be. If someone who cares more about this wants to research whether there is a validated method in, e.g., R, I will look into it. But my guess would be that it's going to be too complicated for me to want to support.
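The second point can be illustrated with a small sketch, using synthetic data and a hypothetical helper name (histogram_area). The area under a count histogram, which is the factor a unit-area density would need to be multiplied by to sit on the same scale, changes with the number of bins even though the data do not.

```python
import numpy as np


def histogram_area(data, bins):
    """Area under a count histogram: sum of (count * bin width)."""
    counts, edges = np.histogram(data, bins=bins)
    return float(np.sum(counts * np.diff(edges)))


rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # synthetic data, for illustration only

for n_bins in (10, 20, 40):
    # With equal-width bins, doubling the bin count halves the area,
    # because each bar covers half the width and holds fewer samples.
    print(n_bins, histogram_area(data, n_bins))
```

For equal-width bins this area is simply the number of samples times the bin width, so any scaling of the curve has to be recomputed whenever the binning changes.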

No problem. Thanks for looking into it! If the normalization constant was something easy to expose to the user, then it would have been nice.

I might think about it a bit more since I create many of these KDE+histogram plots.

> No problem. Thanks for looking into it! If the normalization constant was something easy to expose to the user, then it would have been nice.

To repeat myself, the "normalization constant" is applied inside scipy or statsmodels, and therefore not something exposable by seaborn.

If you want to just modify the y data of the line with an arbitrary value, that's easy to do after calling distplot.
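A minimal sketch of that idea using only matplotlib's Line2D methods. The helper rescale_line is hypothetical, the curve is a stand-in drawn with ax.plot, and the factor 50.0 is arbitrary. With seaborn, which entry of ax.lines holds the KDE curve depends on what else was drawn (a rug plot may add lines too), so the index is an assumption to check.

```python
import numpy as np
import matplotlib.pyplot as plt


def rescale_line(ax, factor, index=0):
    """Multiply the y data of one line on `ax` by `factor`, then refit the axis limits."""
    line = ax.lines[index]
    line.set_ydata(np.asarray(line.get_ydata()) * factor)
    ax.relim()           # recompute data limits from the modified artists
    ax.autoscale_view()  # apply the new limits to the view
    return line


# Stand-in for a density curve drawn by a plotting call.
x = np.linspace(-3, 3, 100)
density = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

fig, ax = plt.subplots()
ax.plot(x, density)
rescale_line(ax, factor=50.0)  # 50.0 is an arbitrary illustrative value
```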

Here's my solution:

# Plot the histogram without the KDE
ax = sns.distplot(filtered_bookings.booking_window, kde=False)

# Create a second y-axis
second_ax = ax.twinx()

# Plot the KDE without the histogram on the second y-axis
sns.distplot(filtered_bookings.booking_window, ax=second_ax, kde=True, hist=False)

# Remove the y ticks from the second axis
second_ax.set_yticks([])

Hope this helps.

The solution of using a twin axis will give you a histogram and a squiggly line, but it will not show you a KDE that is fit to the histogram in any meaningful way, because the axis limits (and hence height of the kde) are entirely dependent on the matplotlib ticking algorithm, not anything about the data.
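One possible way to address this objection, offered as a sketch under stated assumptions rather than an endorsed method: tie the limits of the density axis to the limits of the count axis through the histogram area, so the curve's height no longer depends on independent tick placement. The function name hist_with_aligned_kde is hypothetical, and scipy's gaussian_kde with plain matplotlib calls stands in for the seaborn functions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde


def hist_with_aligned_kde(data, bins=20):
    """Count histogram plus a KDE on a twin axis whose limits are tied to the counts."""
    data = np.asarray(data, dtype=float)
    fig, ax_counts = plt.subplots()
    counts, edges, _ = ax_counts.hist(data, bins=bins, alpha=0.5)
    ax_counts.set_ylabel("Counts")

    # Area under the count histogram: sum of (count * bin width).
    hist_area = np.sum(counts * np.diff(edges))

    ax_density = ax_counts.twinx()
    x = np.linspace(edges[0], edges[-1], 200)
    ax_density.plot(x, gaussian_kde(data)(x), color="k")
    ax_density.set_ylabel("Density")

    # Make the density axis a rescaled copy of the count axis, so that a
    # density d is drawn at the same height as a count of d * hist_area.
    low, high = ax_counts.get_ylim()
    ax_density.set_ylim(low / hist_area, high / hist_area)
    return fig, ax_counts, ax_density, hist_area
```

Changing the limits of either axis afterwards breaks the correspondence, so the last step would need to be repeated.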

It is understandable that the y-values should refer to the curve and not to the bin counts. But sometimes it can be useful to force the axis to reflect the bin counts, as the density values on the y-axis may not be relevant in certain cases.

My workaround is to change two lines in the file
/python_virtualenvs/venv2_7/lib/python2.7/site-packages/seaborn/distributions.py
First line to change is 175 to:

norm_hist = norm_hist # or kde or (fit is not None)

(where I just commented out the "or" alternatives; they could be erased entirely for a lasting change).

Second line to be changed is 241, to:

        area = 1
        if not norm_hist:
            # np.histogram returns (counts, bin_edges), in that order
            counts, edges = np.histogram(a, bins=bins)
            area = np.sum(np.diff(edges) * counts)
        y = area * pdf(x)

PR will be coming soon.

This is getting in my way too. There should be a way to just multiply the height of the kde so it fits the unnormalized histogram. This should be an option. It's the behavior we all expect when we set norm_hist=False. It's intuitive. Doesn't matter if it's not technically the mathematical definition of KDE.

Sorry, in the end I forgot to PR. Feel free to do it, if you find the suggestions above useful!

> Sorry, in the end I forgot to PR. Feel free to do it, if you find the suggestions above useful!

The second part (starting from line 241) seems to be gone in the current release. Any ideas?

I second this feature.

I also think that this option would be very informative. If you have a large number of bins, the probabilities are so small anyway that they're no longer informative to us humans. With bin counts, that would be different.

Resolved with #2125
