The dataset I have (downloaded in the snippet below) has dimensions (185459, 29). Rendering a histogram of it with Altair crashes the browser, while matplotlib has no trouble generating one with plt.hist(data_2015['net_value'], bins=50), for instance.
from altair import *
import pandas as pd
from urllib.request import urlretrieve

# download the xz-compressed CSV; pandas infers the compression from the extension
url = 'https://cl.ly/2T2Q0O1c2k35/download/2016-08-08-current-year.xz'
path = '/tmp/2016-08-08-current-year.xz'
urlretrieve(url, path)
data_2015 = pd.read_csv(path)

# histogram of net_value with at most 50 bins
Chart(data_2015).mark_bar().encode(
    x=X('net_value', bin=Bin(maxbins=50)),
    y='count(*)',
)
Sounds right: with the current default renderer, all data is embedded into the web page as JSON. It's not surprising that embedding that much data would crash the browser. You might be able to improve things by referencing the data by URL rather than passing the data frame; then the data itself won't be part of the JSON. To do that, export your data as a CSV, put it at a URL your browser can reach, and pass that address in place of the data frame.
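A minimal sketch of that approach, reusing the star-import API from the snippet above; the URL is a placeholder for wherever you host the exported file:

data_2015.to_csv('/tmp/data_2015.csv', index=False)  # export the frame once
# serve the CSV somewhere the browser can reach (placeholder URL below),
# then pass the address instead of the data frame:
Chart('http://localhost:8000/data_2015.csv').mark_bar().encode(
    x=X('net_value:Q', bin=Bin(maxbins=50)),  # types must be explicit with URL data
    y='count(*):Q',
)

Note the explicit :Q type shorthands: when the data is given by URL, Altair cannot inspect a data frame to infer column types, so they have to be spelled out.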
As other rendering libraries begin supporting Vega-Lite specs, there might be better solutions for visualizing large-ish data. But for now, vega.js is the only option (and it's not really designed with big data in mind).
Actually, we regularly build histograms over data of similar sizes in Vega without major issues, other than perhaps some slowdown... It would be useful to share specific replication steps so we can investigate and test!
Jeff - I think the issue is probably not Vega's handling of the data per se, but the fact that by default Altair embeds the data within the JSON spec itself.
We could have an option in Altair that saves the data to a file and loads it by URL.
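A rough sketch of what such an option could look like; chart_from_csv is hypothetical, not an existing Altair API:

def chart_from_csv(df, path):
    # hypothetical helper: persist the frame as CSV and build a Chart
    # that references it by path, so the JSON spec stays small
    df.to_csv(path, index=False)
    return Chart(path)

chart_from_csv(data_2015, 'data_2015.csv').mark_bar().encode(
    x=X('net_value:Q', bin=Bin(maxbins=50)),
    y='count(*):Q',
)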
I am also having a problem saving Jupyter notebooks with large visualizations in them:
[I 14:18:42.413 NotebookApp] Malformed HTTP message from 172.21.0.1: Content-Length too long
My dataframe has the dimensions 213665 rows × 5 columns.
Is there any way to render it in Jupyter without saving all of the data in the notebook as JSON?
> Is there any way to render it in Jupyter without saving all of the data in the notebook as JSON?
Yes: put the CSV file in the same directory as the notebook and use, e.g., Chart('data.csv'). Examples of this can be seen here: https://github.com/jakevdp/altair-examples
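For example, with the histogram from the top of this thread (again with explicit :Q types, since Altair can't inspect a CSV the way it inspects a data frame):

data_2015.to_csv('data.csv', index=False)  # saved next to the .ipynb file
Chart('data.csv').mark_bar().encode(
    x=X('net_value:Q', bin=Bin(maxbins=50)),
    y='count(*):Q',
)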
@jakevdp Thank you!
I figured out that the Jupyter notebook server serves raw files at the /files/ URL.
df.to_csv("dataframe.csv")  # written next to the notebook, so Jupyter can serve it
Chart("dataframe.csv").mark_circle().encode(
    X('time:T', timeUnit='hours'),
    Y('time:T', timeUnit='day'),
    size='count(*):Q',
)
_edited to simplify path_
I think just dataframe.csv alone should work in the current version, if it's in the same directory as your .ipynb file.
For what it's worth, I've run into a similar issue: with 8 figures in my notebook (data from a dataframe), the notebook's file size is ~60 MB, which caused an issue when trying to create a gist from it to share. Unfortunately, saving the data locally isn't really a viable option.
If you're creating a gist, you could add the dataset CSV files to the gist as well and reference them by URL.
Long term we should have a better story for this, though. I'll put some thought into it. If you have thoughts on how we might do that most effectively, please let us know.
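For example, if the gist contains data.csv alongside the notebook, the chart can reference the gist's raw-file URL; the user and gist id below are placeholders:

Chart('https://gist.githubusercontent.com/<user>/<gist-id>/raw/data.csv').mark_bar().encode(
    x=X('net_value:Q', bin=Bin(maxbins=50)),
    y='count(*):Q',
)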
I added a FAQ to the docs that addresses this; we're also tracking this issue in #249.