Plotly.py: Integration with Dask

Created on 6 Nov 2018  路  12Comments  路  Source: plotly/plotly.py

The current implementation of plotly.py does not support Dask.

Is there any plans on integrating this?

enhancement

Most helpful comment

Thanks for the example. Yes, plotly.py doesn't directly accept Dask Series objects so you'll need to call compute on them before passing them to plotly.py

e.g.

plt = go.Scatter(
    x=ddf.dates.compute(),
    y=ddf.floats.compute(),
    ...
)

It's possible that plotly.py could check if the input is a dask Series and call compute on it automatically. Thanks for the suggestion!

All 12 comments

Hi @pedroallenrevez, dask is a super-awesome interactive distributed computing library (I'm actually an occasional contributor!), but I don't think I understand what you mean by having plotly.py support Dask. Could you elaborate?

import plotly.graph_objs as go
import pandas as pd
import numpy as np
import datashader as ds
from datetime import datetime
import datashader.transfer_functions as tf
import dask.dataframe as dd
from math import floor

# plotly
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

def generate_df(size=100000):
    d = {    
        'dates' : pd.date_range('2015-01-01', periods=size, freq='1min'),
        'unique_id' : np.arange(0, size),
        'ints' : np.random.randint(0, size, size=size),
        'floats' : np.random.randn(size),
        'bools' : np.random.choice([0, 1], size=size),
        'int_nans' : np.random.choice([0, 1, np.nan], size=size),
        'float_nans' : np.random.choice([0.0, 1.0, np.nan], size=size),
        'constant' : 1, 
        'categorical' : np.random.choice([10, 20, 30, 40, 50], size=size) , 
        'categorical_binary' : np.random.choice(['a', 'b'], size=size), 
        #'categorical_nans' : np.random.choice(['a', 'b', np.nan], size=size)
        'categorical_nans' : np.random.choice(['a', 'b', 'c'], size=size)
    }

    df = pd.DataFrame(d)
    # df['hardbools'] = df['bools'] == 1
    df['categorical_nans'] = df['categorical_nans'].replace('c', np.nan)
    # df['hardcategorical_nans'] = df['categorical_nans'].astype('category')
    df['categorical_binary'] = df['categorical_binary'].astype('category')
    df['categorical_nans'] = df['categorical_nans'].astype('category')

    # df = df.set_index('dates')
    df['dates_int'] = df['dates'].astype('int64')
    return df

def generate_dask_df(size=100000, partitions=6):
    df = generate_df(size)
    return dd.from_pandas(df, partitions)

ddf = generate_dask_df(100000)

plt = go.Scatter(
    x=ddf.dates,
    y=ddf.floats,
    mode = "lines",
    name = "signal",
    marker = dict(color = 'rgba(16, 112, 2, 0.8)'),
    text= 'floats'
)

This doesn't seem to work!
Produces the following stack trace, which leads me to believe that plotly does not support dask.

ValueError                                Traceback (most recent call last)
<ipython-input-11-173617ab9687> in <module>
      5     name = "signal",
      6     marker = dict(color = 'rgba(16, 112, 2, 0.8)'),
----> 7     text= 'floats'
      8 )

/usr/local/lib/python3.6/site-packages/plotly/graph_objs/_scatter.py in __init__(self, arg, cliponaxis, connectgaps, customdata, customdatasrc, dx, dy, error_x, error_y, fill, fillcolor, groupnorm, hoverinfo, hoverinfosrc, hoverlabel, hoveron, hovertext, hovertextsrc, ids, idssrc, legendgroup, line, marker, mode, name, opacity, orientation, r, rsrc, selected, selectedpoints, showlegend, stackgaps, stackgroup, stream, t, text, textfont, textposition, textpositionsrc, textsrc, tsrc, uid, unselected, visible, x, x0, xaxis, xcalendar, xsrc, y, y0, yaxis, ycalendar, ysrc, **kwargs)
   2370         self['visible'] = visible if visible is not None else _v
   2371         _v = arg.pop('x', None)
-> 2372         self['x'] = x if x is not None else _v
   2373         _v = arg.pop('x0', None)
   2374         self['x0'] = x0 if x0 is not None else _v

/usr/local/lib/python3.6/site-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   2773             # ### Handle simple property ###
   2774             else:
-> 2775                 self._set_prop(prop, value)
   2776 
   2777         # Handle non-scalar case

/usr/local/lib/python3.6/site-packages/plotly/basedatatypes.py in _set_prop(self, prop, val)
   3009                 return
   3010             else:
-> 3011                 raise err
   3012 
   3013         # val is None

/usr/local/lib/python3.6/site-packages/plotly/basedatatypes.py in _set_prop(self, prop, val)
   3004         validator = self._validators.get(prop)
   3005         try:
-> 3006             val = validator.validate_coerce(val)
   3007         except ValueError as err:
   3008             if self._skip_invalid:

/usr/local/lib/python3.6/site-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v)
    339             v = to_scalar_or_list(v)
    340         else:
--> 341             self.raise_invalid_val(v)
    342         return v
    343 

/usr/local/lib/python3.6/site-packages/_plotly_utils/basevalidators.py in raise_invalid_val(self, v)
    242             typ=type_str(v),
    243             v=repr(v),
--> 244             valid_clr_desc=self.description()))
    245 
    246     def raise_invalid_elements(self, invalid_els):

ValueError: 
    Invalid value of type 'dask.dataframe.core.Series' received for the 'x' property of scatter
        Received value: Dask Series Structure:
npartitions=6
0        datetime64[ns]
16667               ...
              ...      
83335               ...
99999               ...
Name: dates, dtype: datetime64[ns]
Dask Name: getitem, 12 tasks

    The 'x' property is an array that may be specified as a tuple,
    list, numpy array, or pandas Series

Unless I'm doing something wrong, of course!
Thanks!

Thanks for the example. Yes, plotly.py doesn't directly accept Dask Series objects so you'll need to call compute on them before passing them to plotly.py

e.g.

plt = go.Scatter(
    x=ddf.dates.compute(),
    y=ddf.floats.compute(),
    ...
)

It's possible that plotly.py could check if the input is a dask Series and call compute on it automatically. Thanks for the suggestion!

The objective is not using compute (using distributed workers).
Thanks for the help!

What would you want plotly.py to do with a Dask Series other than compute it into a pandas Series and display it?

I would want plotly to handle distributed data without the need to compute and putting it in memory.
Imagine I have 50 million points I want to plot. I don't want to put them in memory.

I want plotly to have distributed computing logic, so it can handle distributed dataframes (dask).

Why would you need dask if you just compute?...
If I computed the dataframe, this issue wouldn't even make sense!

The question is, is this in the roadmap for plotly or not?

Here are a few thoughts:
1) All of the traces in the plotly.graph_objs package are direct wrappers around plotly.js trace types. Plotly.js is a pure clientside browser-based JavaScript library and so it needs to have the full data representation of what it's plotting in memory in the browser. So I don't see a path towards getting reasonably interactive performance on 50 million points in a plotly.graph_objs.Scatter trace.
2) To handle this many points, and to take advantage of dask's distributed nature for rendering them, would require a 'server-side' rendering pipeline. This is exactly the problem that the datashader project has been built for.
3) It's already possible (though not especially convenient) to use datashader for server-side rendering and then display the rendered image in plotly.py. See https://plot.ly/python/change-callbacks-datashader/.
4) I do want to build a convenient interface for doing what this example notebook does, probably as a new figure factory in the plotly.figure_factor package. This datashader-based figure factory would directly accept dask data structures and feed them into the datashader library to perform server-side rendering.

I think number 4 is the way to go.
I can now understand why "integration" with Dask might be a bit of a problem.

Effectively speaking, as you said, this is already done by Datashader, which is the final objective.

That interface would greatly help the usability of plotly for data scientists.

I'll post another issue on datetime conversions for datashader.

Alright if we close this issue in favor of #1270 as Dask support would come along with datashader integration?

Yes it is alright.

For folks finding this issue via Google, I highly recommend checking out this Dash + Datashader example that Jon Mease wrote:

Dash app: https://dash-gallery.plotly.host/dash-world-cell-towers/

Source code: https://github.com/plotly/dash-world-cell-towers

image

This app is current best practice for Plotly tech + Dask ATM IMHO.

is dask integration available now?

Was this page helpful?
0 / 5 - 0 ratings