Seaborn: distplot being very slow

Created on 8 Jun 2015 · 4 Comments · Source: mwaskom/seaborn

I noticed that sometimes, depending on the data, KDE fitting takes quite a while to run (~3 minutes per plot). Are there any options that could limit the number of iterations, or otherwise simplify the fitting process at the expense of accuracy?

All 4 comments

Do you have statsmodels installed? I think if you do, seaborn will use its FFT-based algorithm, which should be faster. Otherwise there aren't really any such options, although you could randomly subsample your own data.
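
The subsampling suggestion could look roughly like the sketch below. It uses only the standard library; the helper name `subsample`, the `max_points` limit of 10,000, and the synthetic data are assumptions made for illustration, not anything taken from seaborn.

```python
import random


def subsample(values, max_points=10_000, seed=0):
    """Return at most max_points values, drawn at random without replacement.

    Hypothetical helper: the name and the default limit are illustrative only.
    """
    values = list(values)
    if len(values) <= max_points:
        # Small inputs are returned unchanged.
        return values
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(values, max_points)


# Synthetic stand-in for a large array of observations.
data = [random.gauss(0, 1) for _ in range(100_000)]
small = subsample(data)
print(len(data), len(small))
```

The reduced `small` list could then be passed to `sns.distplot` in place of the full array, trading some accuracy in the density estimate for speed.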

Thank you @mwaskom. I have statsmodels installed, and I am actually noticing that distplot takes a long time even with rug=False, kde=False, and norm_hist=False. No idea what's to blame.

  • Here is the output of pip freeze (I did this 10 minutes ago: pip install --upgrade git+git://github.com/statsmodels/statsmodels@master)
  • Here is the array
  • And below is the code that takes ~2-3 minutes to run on a modern MacBook Pro with 16 GB of RAM

  import matplotlib.pyplot as plt
  import seaborn as sns

  f, ax = plt.subplots(figsize=(6, 6))
  sns.distplot(my_array, rug=False, kde=False, norm_hist=False)

More data (OS X Yosemite):

  • $ uname -a:
Darwin macbook-pro.my.company.net 14.3.0 Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64

If you have a huge dataset, it may just be drawing a large number of bins. In that case the delay is probably just matplotlib drawing all the bars.

It's probably not that useful to draw more than ~50 bins anyway, and in 0.6.dev the automatic bin calculation is capped at that value to avoid this.
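
To see how an automatic bin count can blow up, here is a minimal sketch that assumes the count follows the Freedman-Diaconis rule (the thread does not say which rule seaborn uses). The function names, the cap of 50 as a default argument, and the synthetic data are illustrative, not seaborn's actual code.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Estimate a bin count with the Freedman-Diaconis rule (illustrative helper)."""
    values = sorted(values)
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    width = 2 * iqr / n ** (1 / 3)  # Freedman-Diaconis bin width
    if width == 0:
        # Degenerate spread: fall back to a square-root rule.
        return int(math.sqrt(n))
    return math.ceil((values[-1] - values[0]) / width)


def capped_bins(values, cap=50):
    """Limit the automatic bin count to `cap` bins."""
    return min(freedman_diaconis_bins(values), cap)


# Mostly small values plus one extreme outlier: the data range is huge
# relative to the interquartile range, so the uncapped rule asks for
# millions of bins.
data = [i / 1000 for i in range(1000)] + [1e6]
print(freedman_diaconis_bins(data))
print(capped_bins(data))
```

On versions without the cap, passing an explicit count such as `bins=50` to `sns.distplot` should avoid drawing that many bars.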

Totally correct. Thank you @mwaskom. Should have thought about that!
