The dataset I have (downloaded in the snippet below) has dimensions (185459, 29). Rendering a histogram of it with Altair crashes the browser, while matplotlib has no trouble generating one with plt.hist(data_2015['net_value'], bins=50), for instance.
from altair import *
import pandas as pd
from urllib.request import urlretrieve

# download the xz-compressed CSV; pandas infers the compression from the extension
url = 'https://cl.ly/2T2Q0O1c2k35/download/2016-08-08-current-year.xz'
path = '/tmp/2016-08-08-current-year.xz'
urlretrieve(url, path)
data_2015 = pd.read_csv(path)

# histogram of net_value with at most 50 bins
Chart(data_2015).mark_bar().encode(
    x=X('net_value', bin=Bin(maxbins=50)),
    y='count(*)',
)
Sounds right: with the current default renderer, all data is embedded into the web page as JSON. It's not surprising that embedding that much data would crash the browser. You might be able to improve things by referencing the data by URL rather than passing the data frame; then the data itself won't be part of the JSON. To do that, export your data as a CSV, put it at a URL your browser can reach, and pass that address in place of the data frame.
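A minimal sketch of that approach, reusing the star-import API from the snippet above; the URL is a placeholder for wherever you host the exported file:

data_2015.to_csv('/tmp/data_2015.csv', index=False)  # export the frame once
# serve the CSV somewhere the browser can reach (placeholder URL below),
# then pass the address instead of the data frame:
Chart('http://localhost:8000/data_2015.csv').mark_bar().encode(
    x=X('net_value:Q', bin=Bin(maxbins=50)),  # types must be explicit with URL data
    y='count(*):Q',
)

Note the explicit :Q type shorthands: when the data is given by URL, Altair cannot inspect a data frame to infer column types, so they have to be spelled out.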
As other rendering libraries begin supporting Vega-Lite specs, there might be better solutions for visualizing large-ish data. But for now, vega.js is the only option (and it's not really designed with big data in mind).
Actually, we regularly build histograms over data of similar sizes in Vega without major issues, other than perhaps some slowdown... It would be useful to share specific replication steps so we can investigate and test!
Jeff - I think the issue is probably not Vega's handling of the data per se, but the fact that by default Altair embeds the data within the JSON spec itself.
We could have an option in Altair that saves the data to a file and loads it by URL.
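A rough sketch of what such an option could look like; chart_from_csv is hypothetical, not an existing Altair API:

def chart_from_csv(df, path):
    # hypothetical helper: persist the frame as CSV and build a Chart
    # that references it by path, so the JSON spec stays small
    df.to_csv(path, index=False)
    return Chart(path)

chart_from_csv(data_2015, 'data_2015.csv').mark_bar().encode(
    x=X('net_value:Q', bin=Bin(maxbins=50)),
    y='count(*):Q',
)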
I am also having a problem saving Jupyter notebooks with large visualizations in them:
[I 14:18:42.413 NotebookApp] Malformed HTTP message from 172.21.0.1: Content-Length too long
My dataframe has the dimensions 213665 rows × 5 columns.
Is there any way to render it in Jupyter without saving all of the data in the notebook as JSON?
> Is there any way to render it in Jupyter without saving all of the data in the notebook as JSON?
Yes: put the CSV file in the same directory as the notebook and use, e.g., Chart('data.csv'). Examples of this can be seen here: https://github.com/jakevdp/altair-examples
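For example, with the histogram from the top of this thread (again with explicit :Q types, since Altair can't inspect a CSV the way it inspects a data frame):

data_2015.to_csv('data.csv', index=False)  # saved next to the .ipynb file
Chart('data.csv').mark_bar().encode(
    x=X('net_value:Q', bin=Bin(maxbins=50)),
    y='count(*):Q',
)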
@jakevdp Thank you!
I figured out that the Jupyter notebook server serves raw files at the /files/ URL.
df.to_csv("dataframe.csv")  # written next to the notebook, so Jupyter can serve it
Chart("dataframe.csv").mark_circle().encode(
    X('time:T', timeUnit='hours'),
    Y('time:T', timeUnit='day'),
    size='count(*):Q',
)
_edited to simplify path_
I think just dataframe.csv alone should work in the current version, if it's in the same directory as your .ipynb file.
For what it's worth, I've run into a similar issue: with 8 figures in my notebook (data from a dataframe), the notebook's file size is ~60 MB, which caused an issue when trying to create a gist from it to share. Unfortunately, saving the data locally isn't really a viable option.
If you're creating a gist, you could add the dataset CSV files to the gist as well and reference them by URL.
Long term we should have a better story for this, though. I'll put some thought into it. If you have thoughts on how we might do that most effectively, please let us know.
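For example, if the gist contains data.csv alongside the notebook, the chart can reference the gist's raw-file URL; the user and gist id below are placeholders:

Chart('https://gist.githubusercontent.com/<user>/<gist-id>/raw/data.csv').mark_bar().encode(
    x=X('net_value:Q', bin=Bin(maxbins=50)),
    y='count(*):Q',
)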
I added a FAQ to the docs that addresses this; we're also tracking this issue in #249.